Dataset Title - (Descriptive title of the dataset)

Dataset Description - (Description of the dataset, 75 words max)

Dataset ID - (Unique identifier for internal tracking)

Owner/Maintainer Name - (Team or individual responsible)

Owner/Maintainer Contact - (Email address of team or individual responsible)

Owner/Maintainer ORC ID/OSF ID (optional) - 0000-0000-0123-4567, osf.io/abcd0

DLP Research Organization Registry (ROR) ID (Optional) - https://ror.org/123ab4567

Last Updated - (YYYY-MM-DD)

Update cadence - (real-time, daily, weekly, monthly, term, annual, one-time)

Data versioning policy - semantic versioning (MAJOR.MINOR.PATCH)
(MAJOR - when you make incompatible API changes,
MINOR - when you add functionality in a backward-compatible manner,
PATCH - when you make backward-compatible bug fixes.)

Data Format - (Database/Parquet/CSV/Other - specify exact format)

Accessing Data within Enclave - (Instructions to be developed with the Rice SafeInsights Team)

Demo/Sample Access - (Yes/No - brief description of preview capabilities)

Documentation Quality - (Comprehensive/Basic/Minimal)

Cost Estimation for Queries - (To Do)

  • (how to control your costs when you query)
  • Possible example - Queries on the
    user_interactions
    table estimated at $0.05 per GB scanned.
  • Cost control tips - use partitioned data, select specific columns, limit rows.

Data Lineage and Dependencies

Source Systems - (Where this data originates, institutional, and geographic)

Transformation Process - (How raw data becomes this dataset)

Dependencies - (Other datasets or systems this depends on)

Downstream Usage - (What systems/processes consume this data)

AI features - (Yes/No)

Example research questions (When possible) -

Data Characteristics

Table/File Inventory

Table/File NameDescriptionRecord CountPrimary KeysRelated Tables
[table_name_1]
Brief descriptionApprox count
[key fields]
[Related tables]
[table_name_2]
Brief descriptionApprox count
[key fields]
[Related tables]

Schema Information

Table -
(table_name_1)

Description - (Detailed description of what this table contains)

Primary Use - (Main purpose/analysis this table supports)

Sample data

  • Real sample - (Yes/No)
  • Synthetic sample - (Yes/No)

Link to sample data - URL

Data Schema

Column/Entity NameData TypeDescriptionExample ValuesNull handling (how does the data manage and respond to the absence of a value)Primary KeyForeign KeyExpected Cardinality (per parent)Can be missing (Yes/No)Parent entityMissing data (percentage)
(field_name)
(varchar/int/date/etc)
Clear description; if categorical then include levels
(sample values)
(sample values, e.g., NA, NULL, None)
(Yes/No)
(Reference if applicable)
(many, one, –)
(Yes/No)
(session, item, school, student, district)
XX
  • Total Records - (Exact count)
  • Unique Entities - (e.g., students, schools, districts)
  • Time Period - (Specific period this table covers)
  • Key Numerical Variables -
    • (Variable Name) - Min - (X), Max - (Y), Mean - (Z), Median - (W), Missing - (N%)
  • Key Categorical Variables -
    • (Variable Name) - (Category 1 - N records (X%), Category 2 - N records (Y%))

File Formats Provided -

  • CSV/Excel - (Yes/No)
  • JSON/XML - (Yes/No)
  • Database formats - (Yes/No - specify SQL/NoSQL)
  • Video formats - (Yes/No - specify formats)
  • Audio formats - (Yes/No - specify formats)
  • Image formats - (Yes/No - specify formats)
  • Text formats - (Yes/No - specify formats)

Temporal Granularity Options -

  • Real-time/continuous - (Yes/No)
  • Per interaction/click - (Yes/No)
  • Per session/class - (Yes/No)
  • Daily - (Yes/No)
  • Weekly - (Yes/No)
  • Monthly - (Yes/No)
  • Semester/term - (Yes/No)
  • Academic year - (Yes/No)
  • Multi-year - (Yes/No)

Unit of Analysis Levels -

  • Individual learner - (Yes/No)
  • Event level - (Yes/No)
  • Learning group/team - (Yes/No)
  • Classroom/cohort - (Yes/No)
  • Course/program - (Yes/No)
  • Institution - (Yes/No)
  • District/system - (Yes/No)
  • Regional/national - (Yes/No)

Technical Specifications

Dataset Size and Complexity -

  • Number of Participants - (Total unique participants)
  • Number of Variables - (Total variables across all tables)
  • Number of Records - (Total records across all tables)
  • Data Volume - (Specify MB/GB/TB)
  • Contextual Notes - (Free text response of any events that might have impacted data, including feature changes, major UI changes, etc.)

Data Structure

  • Total Tables/Files - (Number of tables or files in this dataset)
  • Total Records - (Approximate count across all tables)
  • Dataset Size - (Total size if applicable)
  • Update Frequency - (Real-time/Daily/Weekly/Monthly/One-time)
  • Partitioning - (How data is organized/partitioned across tables)

Temporal Coverage -

  • Date Range - (YYYY-MM-DD to YYYY-MM-DD)
  • Academic Years - (e.g., 2020-21 through 2023-24)
  • Reporting Periods - (Annual/Semester/Quarter/Monthly/Other (Text))

Geographic Coverage -

  • Geographic Scope - (National/State/District/School level)

  • Number of Regions - (Count of regions covered)

  • Coverage Gaps - (Any notable geographic exclusions)

    Table-Specific Statistics

Data Availability

Privacy Level -

  • Fully anonymized (All identifying details have been removed) - (Yes/No)
  • De-identified (Direct identifiers have been removed) - (Yes/No)
  • Pseudonymized (Direct identifiers have been replaced with pseudonyms) - (Yes/No)
  • Identifiable (restricted) - (Yes/No)
  • Other - (Yes/No) (Specify)

Query Example

Single Table 1 Queries

-- Example query for [table_name_1] SELECT column1, column2 FROM table_name_1 WHERE condition = 'value';

Multi-Table Queries

-- Example join query across related tables SELECT t1.column1, t2.column2 FROM table_name_1 t1 JOIN table_name_2 t2 ON t1.key_field = t2.key_field WHERE t1.condition = 'value';

Data preprocessing example

import pandas as pd from datetime import datetime df = pd.DataFrame(table_1) # --- Step 2 - Remove missing values in the 'data' column --- # The dropna() method removes rows with NaN values. We specify 'data' as the column to check. df_cleaned = df.dropna(subset=['data']) # --- Step 3 - Select dates from January 1, 2024, to the current day --- # First, ensure the 'date' column is in datetime format. df_cleaned['date'] = pd.to_datetime(df_cleaned['date']) # Define the start date (January 1, 2024) and the end date (today's date). start_date = pd.to_datetime('2024-01-01') end_date = pd.to_datetime(datetime.now().strftime('%Y-%m-%d')) # Filter the DataFrame using the date range. df_filtered = df_cleaned[(df_cleaned['date'] >= start_date) & (df_cleaned['date'] <= end_date)]

Table -
(table_name_2)

Description - (Detailed description of what this table contains)

Primary Use - (Main purpose/analysis this table supports)

Sample data

  • Real sample - (Yes/No)
  • Synthetic sample - (Yes/No)

Link to sample data - URL

Data Schema

Column/Entity NameData TypeDescriptionExample ValuesNull handling (how does the data manage and respond to the absence of a value)Primary KeyForeign Key
(field_name)
(varchar/int/date/etc)
Clear description; if categorical then include levels
(sample values)
(sample values, e.g., NA, NULL, None)
(Yes/No)
(Reference if applicable)

Temporal Coverage -

  • Date Range - (YYYY-MM-DD to YYYY-MM-DD)
  • Academic Years - (e.g., 2020-21 through 2023-24)
  • Reporting Periods - (Annual/Semester/Quarter/Monthly/Other (TEXT))

Geographic Coverage -

  • Geographic Scope - (National/State/District/School level)
  • Number of Regions - (Count of regions covered)
  • Coverage Gaps - (Any notable geographic exclusions)

Table-Specific Statistics

  • Total Records - (Exact count)
  • Unique Entities - (e.g., students, schools, districts)
  • Time Period - (Specific period this table covers)
  • Key Numerical Variables -
    • (Variable Name) - Min - (X), Max - (Y), Mean - (Z), Median - (W), Missing - (N%)
  • Key Categorical Variables -
    • (Variable Name) - (Category 1 - N records (X%), Category 2 - N records (Y%))

Table Relationships

Join Patterns

  • [Table A]
    connects to
    [Table B]
    via
    [key_field]
  • [Table B]
    connects to
    [Table C]
    via
    [key_field]

Cross-Table Analysis Notes

  • (Overlap information between tables)
  • (Completeness patterns across tables)
  • (Temporal alignment patterns)

Common Multi-Table Queries

  • (Description of typical analysis spanning multiple tables)

Data Quality Indicators

  • Completeness - (Percentage and description)
  • Accuracy - (Assessment and known issues)
  • Consistency - (Standards and constraints)
  • Uniqueness - (Duplicate handling approach)
  • Timeliness - (Recency of last update)