Dataset Title - (Descriptive title of the dataset)
Dataset Description - (Description of the dataset, 75 words max)
Dataset ID - (Unique identifier for internal tracking)
Owner/Maintainer Name - (Team or individual responsible)
Owner/Maintainer Contact - (Email address of team or individual responsible)
Owner/Maintainer ORCID/OSF ID (optional) - 0000-0000-0123-4567, osf.io/abcd0
DLP Research Organization Registry (ROR) ID (optional) - https://ror.org/123ab4567
Last Updated - (YYYY-MM-DD)
Update cadence - (real-time, daily, weekly, monthly, term, annual, one-time)
Data versioning policy - semantic versioning (MAJOR.MINOR.PATCH)
(MAJOR - incompatible changes to the dataset's schema or structure,
MINOR - backward-compatible additions, such as new columns or tables,
PATCH - backward-compatible bug fixes or corrections.)
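The versioning policy above can be illustrated with a minimal sketch; the `parse_version` helper is hypothetical and not part of this dataset's tooling:

```python
# Minimal sketch of semantic version comparison (illustrative only;
# not part of this dataset's tooling).

def parse_version(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of ints."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element-wise, so a MAJOR bump outranks any MINOR/PATCH bump.
assert parse_version("2.0.0") > parse_version("1.9.9")
# A PATCH bump (a backward-compatible fix) is the smallest increment.
assert parse_version("1.2.4") > parse_version("1.2.3")
```

Comparing versions as integer tuples avoids the pitfall of string comparison, where "1.10.0" would sort before "1.9.9".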
Data Format - (Database/Parquet/CSV/Other - specify exact format)
Accessing Data within Enclave - (Instructions to be developed with the Rice SafeInsights Team)
Demo/Sample Access - (Yes/No - brief description of preview capabilities)
Documentation Quality - (Comprehensive/Basic/Minimal)
Cost Estimation for Queries - (To Do)
- (How to control your costs when you query)
- Possible example - Queries on the table are estimated at $0.05 per GB scanned.
- Cost control tips - use partitioned data, select specific columns, limit rows.
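The cost-control tips above can be made concrete with a back-of-the-envelope estimate; the $0.05/GB rate and the table sizes are illustrative assumptions only:

```python
# Hypothetical cost estimate for a column-pruned, partition-filtered query.
# The rate and data volumes below are illustrative assumptions.

RATE_PER_GB = 0.05  # example rate: $0.05 per GB scanned

def estimate_cost(gb_scanned, rate=RATE_PER_GB):
    """Estimated query cost in dollars, given the GB scanned."""
    return gb_scanned * rate

# Scanning a full 200 GB table vs. one 5 GB partition with selected columns:
full_scan = estimate_cost(200)   # 10.0
pruned_scan = estimate_cost(5)   # 0.25
```

The example shows why partition filters and column selection matter: cost scales linearly with bytes scanned, so reducing the scan footprint reduces the bill proportionally.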
Data Lineage and Dependencies
Source Systems - (Where this data originates, including institutional and geographic context)
Transformation Process - (How raw data becomes this dataset)
Dependencies - (Other datasets or systems this depends on)
Downstream Usage - (What systems/processes consume this data)
AI features - (Yes/No)
Example research questions (When possible) -
Data Characteristics
Table/File Inventory
| Table/File Name | Description | Record Count | Primary Keys | Related Tables |
|---|---|---|---|---|
| (Table/File name) | (Brief description) | (Approx count) | (Primary keys) | (Related tables) |
| (Table/File name) | (Brief description) | (Approx count) | (Primary keys) | (Related tables) |
Schema Information
Table -
Description - (Detailed description of what this table contains)
Primary Use - (Main purpose/analysis this table supports)
Sample data
- Real sample - (Yes/No)
- Synthetic sample - (Yes/No)
Link to sample data - URL
Data Schema
| Column/Entity Name | Data Type | Description | Example Values | Null handling (how does the data manage and respond to the absence of a value) | Primary Key | Foreign Key | Expected Cardinality (per parent) | Can be missing (Yes/No) | Parent entity | Missing data (percentage) |
|---|---|---|---|---|---|---|---|---|---|---|
| (Column name) | (Data type) | (Clear description; if categorical, include levels) | (Example values) | (Null handling) | (Yes/No) | (Yes/No) | (e.g., 1:N) | (Yes/No) | (Parent entity) | (N%) |
- Total Records - (Exact count)
- Unique Entities - (e.g., students, schools, districts)
- Time Period - (Specific period this table covers)
- Key Numerical Variables -
- (Variable Name) - Min - (X), Max - (Y), Mean - (Z), Median - (W), Missing - (N%)
- Key Categorical Variables -
- (Variable Name) - (Category 1 - N records (X%), Category 2 - N records (Y%))
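The numerical and categorical summaries requested above could be produced with pandas along these lines; the `score` and `grade` columns are placeholder names, not actual fields in this dataset:

```python
import pandas as pd

# Toy frame standing in for one dataset table; 'score' and 'grade'
# are placeholder column names, not actual fields in this dataset.
df = pd.DataFrame({
    "score": [72, 85, None, 90, 85],
    "grade": ["A", "B", "B", "A", "C"],
})

# Key numerical variable: min, max, mean, median, and missing percentage.
num = df["score"]
summary = {
    "min": num.min(),
    "max": num.max(),
    "mean": num.mean(),
    "median": num.median(),
    "missing_pct": num.isna().mean() * 100,
}

# Key categorical variable: record counts and percentages per level.
counts = df["grade"].value_counts()
percentages = df["grade"].value_counts(normalize=True) * 100
```

Note that pandas excludes missing values from `min`, `max`, `mean`, and `median` by default, so the missing percentage should always be reported alongside them.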
File Formats Provided -
- CSV/Excel - (Yes/No)
- JSON/XML - (Yes/No)
- Database formats - (Yes/No - specify SQL/NoSQL)
- Video formats - (Yes/No - specify formats)
- Audio formats - (Yes/No - specify formats)
- Image formats - (Yes/No - specify formats)
- Text formats - (Yes/No - specify formats)
Temporal Granularity Options -
- Real-time/continuous - (Yes/No)
- Per interaction/click - (Yes/No)
- Per session/class - (Yes/No)
- Daily - (Yes/No)
- Weekly - (Yes/No)
- Monthly - (Yes/No)
- Semester/term - (Yes/No)
- Academic year - (Yes/No)
- Multi-year - (Yes/No)
Unit of Analysis Levels -
- Individual learner - (Yes/No)
- Event level - (Yes/No)
- Learning group/team - (Yes/No)
- Classroom/cohort - (Yes/No)
- Course/program - (Yes/No)
- Institution - (Yes/No)
- District/system - (Yes/No)
- Regional/national - (Yes/No)
Technical Specifications
Dataset Size and Complexity -
- Number of Participants - (Total unique participants)
- Number of Variables - (Total variables across all tables)
- Number of Records - (Total records across all tables)
- Data Volume - (Specify MB/GB/TB)
- Contextual Notes - (Free text response of any events that might have impacted data, including feature changes, major UI changes, etc.)
Data Structure
- Total Tables/Files - (Number of tables or files in this dataset)
- Total Records - (Approximate count across all tables)
- Dataset Size - (Total size if applicable)
- Update Frequency - (Real-time/Daily/Weekly/Monthly/One-time)
- Partitioning - (How data is organized/partitioned across tables)
Temporal Coverage -
- Date Range - (YYYY-MM-DD to YYYY-MM-DD)
- Academic Years - (e.g., 2020-21 through 2023-24)
- Reporting Periods - (Annual/Semester/Quarter/Monthly/Other (Text))
Geographic Coverage -
- Geographic Scope - (National/State/District/School level)
- Number of Regions - (Count of regions covered)
- Coverage Gaps - (Any notable geographic exclusions)
Table-Specific Statistics
Data Availability
Privacy Level -
- Fully anonymized (All identifying details have been removed) - (Yes/No)
- De-identified (Direct identifiers have been removed) - (Yes/No)
- Pseudonymized (Direct identifiers have been replaced with pseudonyms) - (Yes/No)
- Identifiable (restricted) - (Yes/No)
- Other - (Yes/No) (Specify)
Query Example
Single-Table Queries
```sql
-- Example query for [table_name_1]
SELECT column1, column2
FROM table_name_1
WHERE condition = 'value';
```
Multi-Table Queries
```sql
-- Example join query across related tables
SELECT t1.column1, t2.column2
FROM table_name_1 t1
JOIN table_name_2 t2
  ON t1.key_field = t2.key_field
WHERE t1.condition = 'value';
```
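Within the enclave, the same inner join could be expressed in pandas; the table and key names below mirror the placeholders in the SQL example above and are not actual fields in this dataset:

```python
import pandas as pd

# Toy stand-ins for two related tables; 'key_field' is the shared join key
# (placeholder names mirroring the SQL example, not actual fields).
table_1 = pd.DataFrame({"key_field": [1, 2, 3], "column1": ["x", "y", "z"]})
table_2 = pd.DataFrame({"key_field": [2, 3, 4], "column2": [10, 20, 30]})

# An inner merge keeps only rows whose key_field appears in both tables,
# matching the behavior of a SQL JOIN.
joined = pd.merge(table_1, table_2, on="key_field", how="inner")
```

Switching `how` to `"left"` or `"outer"` retains unmatched rows, which is useful when checking cross-table completeness.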
Data preprocessing example
```python
import pandas as pd
from datetime import datetime

# --- Step 1 - Load the table into a DataFrame ---
df = pd.DataFrame(table_1)

# --- Step 2 - Remove missing values in the 'data' column ---
# The dropna() method removes rows with NaN values; 'data' is the column to check.
df_cleaned = df.dropna(subset=['data'])

# --- Step 3 - Select dates from January 1, 2024, to the current day ---
# First, ensure the 'date' column is in datetime format.
df_cleaned['date'] = pd.to_datetime(df_cleaned['date'])

# Define the start date (January 1, 2024) and the end date (today's date).
start_date = pd.to_datetime('2024-01-01')
end_date = pd.to_datetime(datetime.now().strftime('%Y-%m-%d'))

# Filter the DataFrame using the date range.
df_filtered = df_cleaned[(df_cleaned['date'] >= start_date) & (df_cleaned['date'] <= end_date)]
```
Table -
Description - (Detailed description of what this table contains)
Primary Use - (Main purpose/analysis this table supports)
Sample data
- Real sample - (Yes/No)
- Synthetic sample - (Yes/No)
Link to sample data - URL
Data Schema
| Column/Entity Name | Data Type | Description | Example Values | Null handling (how does the data manage and respond to the absence of a value) | Primary Key | Foreign Key |
|---|---|---|---|---|---|---|
| (Column name) | (Data type) | (Clear description; if categorical, include levels) | (Example values) | (Null handling) | (Yes/No) | (Yes/No) |
Temporal Coverage -
- Date Range - (YYYY-MM-DD to YYYY-MM-DD)
- Academic Years - (e.g., 2020-21 through 2023-24)
- Reporting Periods - (Annual/Semester/Quarter/Monthly/Other (Text))
Geographic Coverage -
- Geographic Scope - (National/State/District/School level)
- Number of Regions - (Count of regions covered)
- Coverage Gaps - (Any notable geographic exclusions)
Table-Specific Statistics
- Total Records - (Exact count)
- Unique Entities - (e.g., students, schools, districts)
- Time Period - (Specific period this table covers)
- Key Numerical Variables -
- (Variable Name) - Min - (X), Max - (Y), Mean - (Z), Median - (W), Missing - (N%)
- Key Categorical Variables -
- (Variable Name) - (Category 1 - N records (X%), Category 2 - N records (Y%))
Table Relationships
Join Patterns
- (Table A) connects to (Table B) via (key field)
- (Table B) connects to (Table C) via (key field)
Cross-Table Analysis Notes
- (Overlap information between tables)
- (Completeness patterns across tables)
- (Temporal alignment patterns)
Common Multi-Table Queries
- (Description of typical analysis spanning multiple tables)
Data Quality Indicators
- Completeness - (Percentage and description)
- Accuracy - (Assessment and known issues)
- Consistency - (Standards and constraints)
- Uniqueness - (Duplicate handling approach)
- Timeliness - (Recency of last update)
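A minimal sketch of how the completeness and uniqueness indicators above might be computed with pandas; the `student_id` primary key and the threshold-free metrics are assumptions for illustration:

```python
import pandas as pd

# Toy table; 'student_id' is a hypothetical primary key column
# used only to illustrate the quality checks.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "score": [88, None, 95, 70],
})

# Completeness: share of non-null cells per column, as a percentage.
completeness = df.notna().mean() * 100

# Uniqueness: fraction of rows that duplicate an earlier primary key.
duplicate_rate = df["student_id"].duplicated().mean()
```

Reporting these per-column figures alongside the indicators above gives downstream users a quick, quantitative read on where the data is weakest.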