Dataset Title - (Descriptive title of the dataset)
Dataset Description - (Description of the dataset, 75 words max)
Dataset ID - (Unique identifier for internal tracking)
Owner/Maintainer Name - (Team or individual responsible)
Owner/Maintainer Contact - (Email address of team or individual responsible)
Owner/Maintainer ORCID/OSF ID (optional) - 0000-0000-0123-4567, osf.io/abcd0
DLP Research Organization Registry (ROR) ID (optional) - https://ror.org/123ab4567
Last Updated - (YYYY-MM-DD)
Update cadence - (real-time, daily, weekly, monthly, term, annual, one-time)
Data versioning policy - semantic versioning (MAJOR.MINOR.PATCH)
(MAJOR - incompatible changes to the dataset's schema or structure,
MINOR - backward-compatible additions, such as new columns or tables,
PATCH - backward-compatible bug fixes or corrections.)
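The versioning policy above can be illustrated with a minimal sketch; the `parse_version` helper is hypothetical and not part of this dataset's tooling:

```python
# Minimal sketch of semantic version comparison (illustrative only;
# not part of this dataset's tooling).

def parse_version(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of ints."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element-wise, so a MAJOR bump outranks any MINOR/PATCH bump.
assert parse_version("2.0.0") > parse_version("1.9.9")
# A PATCH bump (a backward-compatible fix) is the smallest increment.
assert parse_version("1.2.4") > parse_version("1.2.3")
```

Comparing versions as integer tuples avoids the pitfall of string comparison, where "1.10.0" would sort before "1.9.9".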
Data Format - (Database/Parquet/CSV/Other - specify exact format)
Accessing Data within Enclave - (Instructions to be developed with the Rice SafeInsights Team)
Demo/Sample Access - (Yes/No - brief description of preview capabilities)
Documentation Quality - (Comprehensive/Basic/Minimal)
Cost Estimation for Queries - (To Do)
- (How to control your costs when you query)
- Possible example - Queries on the table are estimated at $0.05 per GB scanned.
- Cost control tips - use partitioned data, select specific columns, limit rows.
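The cost-control tips above can be made concrete with a back-of-the-envelope estimate; the $0.05/GB rate and the table sizes are illustrative assumptions only:

```python
# Hypothetical cost estimate for a column-pruned, partition-filtered query.
# The rate and data volumes below are illustrative assumptions.

RATE_PER_GB = 0.05  # example rate: $0.05 per GB scanned

def estimate_cost(gb_scanned, rate=RATE_PER_GB):
    """Estimated query cost in dollars, given the GB scanned."""
    return gb_scanned * rate

# Scanning a full 200 GB table vs. one 5 GB partition with selected columns:
full_scan = estimate_cost(200)   # 10.0
pruned_scan = estimate_cost(5)   # 0.25
```

The example shows why partition filters and column selection matter: cost scales linearly with bytes scanned, so reducing the scan footprint reduces the bill proportionally.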
Data Lineage and Dependencies
Source Systems - (Where this data originates, including institutional and geographic context)
Transformation Process - (How raw data becomes this dataset)
Dependencies - (Other datasets or systems this depends on)
Downstream Usage - (What systems/processes consume this data)
AI features - (Yes/No)
Example research questions (When possible) -
Data Characteristics
Table/File Inventory
| Table/File Name | Description | Record Count | Primary Keys | Related Tables |
|---|---|---|---|---|
| (Table/File name) | (Brief description) | (Approx count) | (Primary keys) | (Related tables) |
| (Table/File name) | (Brief description) | (Approx count) | (Primary keys) | (Related tables) |
Schema Information
Table -
Description - (Detailed description of what this table contains)
Primary Use - (Main purpose/analysis this table supports)
Sample data
- Real sample - (Yes/No)
- Synthetic sample - (Yes/No)
Link to sample data - URL
Data Schema
| Column/Entity Name | Data Type | Description | Example Values | Null handling (how does the data manage and respond to the absence of a value) | Primary Key | Foreign Key | Expected Cardinality (per parent) | Can be missing (Yes/No) | Parent entity | Missing data (percentage) |
|---|---|---|---|---|---|---|---|---|---|---|
| (Column name) | (Data type) | (Clear description; if categorical, include levels) | (Example values) | (Null handling) | (Yes/No) | (Yes/No) | (e.g., 1:N) | (Yes/No) | (Parent entity) | (N%) |
- Total Records - (Exact count)
- Unique Entities - (e.g., students, schools, districts)
- Time Period - (Specific period this table covers)
- Key Numerical Variables -
- (Variable Name) - Min - (X), Max - (Y), Mean - (Z), Median - (W), Missing - (N%)
- Key Categorical Variables -
- (Variable Name) - (Category 1 - N records (X%), Category 2 - N records (Y%))
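The numerical and categorical summaries requested above could be produced with pandas along these lines; the `score` and `grade` columns are placeholder names, not actual fields in this dataset:

```python
import pandas as pd

# Toy frame standing in for one dataset table; 'score' and 'grade'
# are placeholder column names, not actual fields in this dataset.
df = pd.DataFrame({
    "score": [72, 85, None, 90, 85],
    "grade": ["A", "B", "B", "A", "C"],
})

# Key numerical variable: min, max, mean, median, and missing percentage.
num = df["score"]
summary = {
    "min": num.min(),
    "max": num.max(),
    "mean": num.mean(),
    "median": num.median(),
    "missing_pct": num.isna().mean() * 100,
}

# Key categorical variable: record counts and percentages per level.
counts = df["grade"].value_counts()
percentages = df["grade"].value_counts(normalize=True) * 100
```

Note that pandas excludes missing values from `min`, `max`, `mean`, and `median` by default, so the missing percentage should always be reported alongside them.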
File Formats Provided -
- CSV/Excel - (Yes/No)
- JSON/XML - (Yes/No)
- Database formats - (Yes/No - specify SQL/NoSQL)
- Video formats - (Yes/No - specify formats)
- Audio formats - (Yes/No - specify formats)
- Image formats - (Yes/No - specify formats)
- Text formats - (Yes/No - specify formats)
Temporal Granularity Options -
- Real-time/continuous - (Yes/No)
- Per interaction/click - (Yes/No)
- Per session/class - (Yes/No)
- Daily - (Yes/No)
- Weekly - (Yes/No)
- Monthly - (Yes/No)
- Semester/term - (Yes/No)
- Academic year - (Yes/No)
- Multi-year - (Yes/No)
Unit of Analysis Levels -
- Individual learner - (Yes/No)
- Event level - (Yes/No)
- Learning group/team - (Yes/No)
- Classroom/cohort - (Yes/No)
- Course/program - (Yes/No)
- Institution - (Yes/No)
- District/system - (Yes/No)
- Regional/national - (Yes/No)
Technical Specifications
Dataset Size and Complexity -
- Number of Participants - (Total unique participants)
- Number of Variables - (Total variables across all tables)
- Number of Records - (Total records across all tables)
- Data Volume - (Specify MB/GB/TB)
- Contextual Notes - (Free text response of any events that might have impacted data, including feature changes, major UI changes, etc.)
Data Structure
- Total Tables/Files - (Number of tables or files in this dataset)
- Total Records - (Approximate count across all tables)
- Dataset Size - (Total size if applicable)
- Update Frequency - (Real-time/Daily/Weekly/Monthly/One-time)
- Partitioning - (How data is organized/partitioned across tables)
Temporal Coverage -
- Date Range - (YYYY-MM-DD to YYYY-MM-DD)
- Academic Years - (e.g., 2020-21 through 2023-24)
- Reporting Periods - (Annual/Semester/Quarter/Monthly/Other (Text))
Geographic Coverage -
- Geographic Scope - (National/State/District/School level)
- Number of Regions - (Count of regions covered)
- Coverage Gaps - (Any notable geographic exclusions)
Table-Specific Statistics
Data Availability
Privacy Level -
- Fully anonymized (All identifying details have been removed) - (Yes/No)
- De-identified (Direct identifiers have been removed) - (Yes/No)
- Pseudonymized (Direct identifiers have been replaced with pseudonyms) - (Yes/No)
- Identifiable (restricted) - (Yes/No)
- Other - (Yes/No) (Specify)
Query Example
Single-Table Queries
```sql
-- Example query for [table_name_1]
SELECT column1, column2
FROM table_name_1
WHERE condition = 'value';
```
Multi-Table Queries
```sql
-- Example join query across related tables
SELECT t1.column1, t2.column2
FROM table_name_1 t1
JOIN table_name_2 t2
  ON t1.key_field = t2.key_field
WHERE t1.condition = 'value';
```
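Within the enclave, the same inner join could be expressed in pandas; the table and key names below mirror the placeholders in the SQL example above and are not actual fields in this dataset:

```python
import pandas as pd

# Toy stand-ins for two related tables; 'key_field' is the shared join key
# (placeholder names mirroring the SQL example, not actual fields).
table_1 = pd.DataFrame({"key_field": [1, 2, 3], "column1": ["x", "y", "z"]})
table_2 = pd.DataFrame({"key_field": [2, 3, 4], "column2": [10, 20, 30]})

# An inner merge keeps only rows whose key_field appears in both tables,
# matching the behavior of a SQL JOIN.
joined = pd.merge(table_1, table_2, on="key_field", how="inner")
```

Switching `how` to `"left"` or `"outer"` retains unmatched rows, which is useful when checking cross-table completeness.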
Data preprocessing example
```python
import pandas as pd
from datetime import datetime

# --- Step 1 - Load the table into a DataFrame ---
df = pd.DataFrame(table_1)

# --- Step 2 - Remove missing values in the 'data' column ---
# The dropna() method removes rows with NaN values; 'data' is the column to check.
df_cleaned = df.dropna(subset=['data'])

# --- Step 3 - Select dates from January 1, 2024, to the current day ---
# First, ensure the 'date' column is in datetime format.
df_cleaned['date'] = pd.to_datetime(df_cleaned['date'])

# Define the start date (January 1, 2024) and the end date (today's date).
start_date = pd.to_datetime('2024-01-01')
end_date = pd.to_datetime(datetime.now().strftime('%Y-%m-%d'))

# Filter the DataFrame using the date range.
df_filtered = df_cleaned[(df_cleaned['date'] >= start_date) & (df_cleaned['date'] <= end_date)]
```
Table -
Description - (Detailed description of what this table contains)
Primary Use - (Main purpose/analysis this table supports)
Sample data
- Real sample - (Yes/No)
- Synthetic sample - (Yes/No)
Link to sample data - URL
Data Schema
| Column/Entity Name | Data Type | Description | Example Values | Null handling (how does the data manage and respond to the absence of a value) | Primary Key | Foreign Key |
|---|---|---|---|---|---|---|
| (Column name) | (Data type) | (Clear description; if categorical, include levels) | (Example values) | (Null handling) | (Yes/No) | (Yes/No) |
Temporal Coverage -
- Date Range - (YYYY-MM-DD to YYYY-MM-DD)
- Academic Years - (e.g., 2020-21 through 2023-24)
- Reporting Periods - (Annual/Semester/Quarter/Monthly/Other (Text))
Geographic Coverage -
- Geographic Scope - (National/State/District/School level)
- Number of Regions - (Count of regions covered)
- Coverage Gaps - (Any notable geographic exclusions)
Table-Specific Statistics
- Total Records - (Exact count)
- Unique Entities - (e.g., students, schools, districts)
- Time Period - (Specific period this table covers)
- Key Numerical Variables -
- (Variable Name) - Min - (X), Max - (Y), Mean - (Z), Median - (W), Missing - (N%)
- Key Categorical Variables -
- (Variable Name) - (Category 1 - N records (X%), Category 2 - N records (Y%))
Table Relationships
Join Patterns
- (Table A) connects to (Table B) via (key field)
- (Table B) connects to (Table C) via (key field)
Cross-Table Analysis Notes
- (Overlap information between tables)
- (Completeness patterns across tables)
- (Temporal alignment patterns)
Common Multi-Table Queries
- (Description of typical analysis spanning multiple tables)
Data Quality Indicators
- Completeness - (Percentage and description)
- Accuracy - (Assessment and known issues)
- Consistency - (Standards and constraints)
- Uniqueness - (Duplicate handling approach)
- Timeliness - (Recency of last update)
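A minimal sketch of how the completeness and uniqueness indicators above might be computed with pandas; the `student_id` primary key and the threshold-free metrics are assumptions for illustration:

```python
import pandas as pd

# Toy table; 'student_id' is a hypothetical primary key column
# used only to illustrate the quality checks.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3],
    "score": [88, None, 95, 70],
})

# Completeness: share of non-null cells per column, as a percentage.
completeness = df.notna().mean() * 100

# Uniqueness: fraction of rows that duplicate an earlier primary key.
duplicate_rate = df["student_id"].duplicated().mean()
```

Reporting these per-column figures alongside the indicators above gives downstream users a quick, quantitative read on where the data is weakest.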