Dataset Title - OpenStax Tutor Learning Platform Export Datasets
Dataset Description - Comprehensive datasets from OpenStax Tutor platform including student task interactions, exercise responses, content annotations, reading assignments, and performance metrics spanning multiple academic subjects and courses with temporal learning analytics.
Dataset ID - openstax_tutor_2018-2024
Owner/Maintainer Name - OpenStax Research Data Team
Owner/Maintainer Contact - research@openstax.org
Owner/Maintainer ORC ID/OSF ID (optional) - TBD
DLP Research Organization Registry (ROR) ID (Optional) - https://ror.org/05qghxh33
Last Updated - 2024-12-30
Update cadence - term (semester-based updates with daily incremental processing)
Versioning policy - semantic versioning (MAJOR.MINOR.PATCH)
(MAJOR - when schema or data structure changes incompatibly,
MINOR - when new tables/columns are added in backward-compatible manner,
PATCH - when data quality fixes or minor enhancements are made)
Data Format - Parquet files (compressed columnar format optimized for analytics)
Accessing Data within Enclave - Instructions to be developed with the Rice SafeInsights Team
Demo/Sample Access - Yes - 1000-row sample datasets available for preview and schema exploration
Documentation Quality - Comprehensive
Cost Estimation for Queries - TBD
- Control costs by using date filters on columns (data is partitioned by date)
- Cost control tips - use date ranges, select specific columns only, limit to specific courses/periods/semesters/academic years
Data Characteristics
Table/File Inventory
| Table/File Name | Description | Record Count | Primary Keys | Related Tables |
|---|---|---|---|---|
| Individual steps within tasks (exercises, readings) | ~27,996,816 | , , | ||
| Question details with tags and response options | ~ 1,956,063 | , | ||
| Highlights and notes within OpenStax content on Tutor | ~3,912,221 | , , , |
Schema Information
Table -
Description - Individual learning steps within student tasks including exercises, reading assignments, videos, and interactive content. Each step represents a granular learning activity with completion timestamps and performance tracking.
Primary Use - Learning analytics, step completion analysis, time-on-task measurement, learning sequence analysis
Sample data
- Real sample - Yes
- Synthetic sample - No
Link to sample data - Available via R/Python export scripts
Data Lineage and Dependencies
Source Systems - OpenStax Tutor production platform (tutor.openstax.org), serving K-12 and college students globally with primary usage in United States educational institutions
Transformation Process - Daily extraction from PostgreSQL production database, anonymization of PII, conversion to parquet format with date-based partitioning, quality validation, and research identifier assignment
Dependencies - OpenStax Accounts system, OpenStax CNX content repository, institutional LMS integrations, course management systems
Downstream Usage - Learning analytics research, adaptive learning algorithm development, educational effectiveness studies, institutional reporting dashboards
AI features - Yes (adaptive question selection, personalized practice recommendations)
Example research questions (When possible)
- How does time-on-task correlate with learning outcomes across different subjects?
- What content annotation patterns predict student success?
- How effective are different assignment types in promoting retention?
- What factors influence course completion rates?
File Formats Provided
- CSV/Excel - Yes (via R/Python export scripts)
- Parquet - Yes
- JSON/XML - No
- Database formats - Yes (Parquet optimized for analytics)
- Video formats - No
- Audio formats - No
- Image formats - No
- Text formats - Yes (JSON fields containing exercise content and responses)
Temporal Granularity Options
- Real-time/continuous - No
- Per interaction/click - Yes (step completions, exercise submissions)
- Per session/class - Yes (aggregated via task completion data)
- Daily - Yes (daily export and partitioning)
- Weekly - Yes (aggregatable from daily data)
- Monthly - Yes (aggregatable from daily data)
- Semester/term - Yes (course-based temporal boundaries)
- Academic year - Yes (multi-semester analysis possible)
- Multi-year - Yes (historical data spanning multiple years)
Unit of Analysis Levels
- Individual learner - Yes (student-level learning analytics)
- Event level - Yes (individual step completions, exercise responses)
- Learning group/team - Yes (via period/section membership)
- Classroom/cohort - Yes (course membership period analysis)
- Course/program - Yes (course-level aggregations)
- Institution - Yes (school-level analysis via course data)
- District/system - No (school district data not systematically tracked)
- Regional/national - No (geographic aggregation not available)
Unit-of-Analysis Map
Provide one row per entity.
| entity_name | primary_key | parent_entity | can_be_missing (Yes/No) | expected_cardinality (per parent) |
|---|---|---|---|---|
| student | entity_role_id | — | No | — |
| course | course_profile_course_id | — | No | — |
| period | course_membership_period_id | course | No | many |
| task | tasks_task_id | student | No | many |
| step | tasks_task_steps_id | task | No | many |
| exercise | tasks_tasked_exercise_id | step | Yes | one |
| content_page | content_page_id | — | No | — |
| content_book | content_book_id | — | No | — |
| content_note | content_notes_id | student | Yes | many |
| research_study | research_study_id | — | Yes | — |
Note. Research studies represent controlled experiments or A/B tests conducted within the platform for educational research purposes.
Technical Specifications
Dataset Size and Complexity
- Number of Participants - XM+ unique students across institutional and individual accounts
- Number of Variables - 800+ columns across 50+ core tables
- Number of Records - 800M+ total records across all tables
- Data Volume - X GB (compressed parquet), X GB (uncompressed)
- Missing Data Percentage - XX% overall (varies by table, higher in optional features)
- Contextual Notes - COVID-19 remote learning surge (2020-2021), adaptive algorithm updates (2022), LMS integration expansion (2023)
Data Structure
- Total Tables/Files - 88 tables in full schema, 12 core tables in research export
- Total Records - XXM+ across all tables (XXM+ task steps, XXM+ exercises)
- Dataset Size - XX GB compressed parquet format
- Update Frequency - Daily (incremental), Full refresh quarterly
- Partitioning - Date-based partitioning by created_at, clustered by course and student
Temporal Coverage
- Date Range - 2018-01-01 to 2024-12-31
- Academic Years - 201X through 2024
- Reporting Periods - Semester-based with daily granularity
Data Availability
Privacy Level
- Fully anonymized (All identifying details have been removed) - No
- De-identified (Direct identifiers have been removed) - Yes
- Pseudonymized (Direct identifiers have been replaced with pseudonyms) - Yes
- Identifiable (restricted) - No
- Other - No
Query Example
Single Table Queries
-- Task step completion analysis SELECT tasked_type, COUNT(*) as total_steps, COUNT(first_completed_at) as completed_steps, AVG(EXTRACT(EPOCH FROM (last_completed_at - created_at))/60) as avg_time_minutes FROM tasks_task_steps WHERE created_at >= '2023-01-01' GROUP BY tasked_type;
Multi-Table Queries
-- Student performance with book titles (recreated from R script logic) SELECT er.research_identifier, cb.title as book_title, cp.title as section_title, COUNT(tts.id) as total_steps, COUNT(tts.first_completed_at) as completed_steps, AVG(tte.published_grader_points) as avg_grade FROM tasks_task_steps tts INNER JOIN tasks_tasks tt ON tt.id = tts.tasks_task_id INNER JOIN tasks_taskings tsg ON tsg.tasks_task_id = tt.id INNER JOIN entity_roles er ON er.id = tsg.entity_role_id LEFT JOIN content_pages cp ON cp.id = tts.content_page_id LEFT JOIN content_books cb ON cb.id = cp.content_book_id LEFT JOIN tasks_tasked_exercises tte ON tte.id = tts.tasked_id AND tts.tasked_type = 'Tasks::Models::TaskedExercise' WHERE tt.created_at BETWEEN '2023-01-01' AND '2023-12-31' GROUP BY er.research_identifier, cb.title, cp.title;
Data preprocessing example
# R script example based on recreate_tutor_query_optimized.R library(tidyverse) library(arrow) library(lubridate) # Load task steps data with date filtering BASE_PATH <- "~/data/tutor" task_steps_ds <- open_dataset(file.path(BASE_PATH, "public.tasks_task_steps")) # Filter and process task steps data task_steps_filtered <- task_steps_ds %>% filter(created_at >= as_datetime("2023-01-01")) %>% collect() %>% mutate( completion_time_minutes = as.numeric( difftime(last_completed_at, first_completed_at, units = "mins") ), is_completed = !is.na(first_completed_at) ) # Basic completion statistics completion_stats <- task_steps_filtered %>% group_by(tasked_type) %>% summarise( total_steps = n(), completed_steps = sum(is_completed), completion_rate = mean(is_completed), avg_time_minutes = mean(completion_time_minutes, na.rm = TRUE) )