Data Profile Report

Generated on November 25, 2025 at 08:58 AM

Overview

This dataset contains processed question details extracted from OpenStax Tutor platform exercise content. The data includes detailed question text, response options, educational metadata tags, and exercise classifications that support learning analytics and educational research.


Dataset File: content_exercises_processed.parquet

Data Last Updated: 2024-12-30

Dataset Description

The processed dataset transforms raw JSON exercise content from the OpenStax Tutor platform into a structured, analysis-ready format. Each record represents a unique question within an exercise, with associated metadata, tags, and response options.

Executive Dashboard

🟢 Data Quality Grade: B (89/100)

Metric | Value | Status
Total Rows | 1,956,063 |
Total Columns | 31 |
Completeness | 82.8% | ⚠️
Columns with Issues | 2 | ⚠️
Processing Time | 2.9s |

🔍 Data Quality Alerts

❌ Critical Issues

These issues require immediate attention:

  • title: Column is completely empty (100% null)
  • preview: 98.7% null values (sparse data)
  • context: 99.1% null values (sparse data)
  • nickname: 96.3% null values (sparse data)
  • derived_from_id: Column is completely empty (100% null)

⚠️ Warnings

Review these columns for potential issues:

  • title: Only 0 unique values in 1,956,063 rows
  • preview: 484 unique categories (consider if this should be text/identifier)
  • derived_from_id: Only 0 unique values in 1,956,063 rows

📊 Column Profiles

Columns are organized by data type for easier navigation.

🔑 Identifier (6 columns)

Unique identifiers (IDs, keys, UUIDs)

Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description
response_option_id | INTEGER | identifier | 100.0% | 349,778 | 17.9% | No samples available | Identifier for multiple choice answers
content_exercise_id | INTEGER | identifier | 100.0% | 474,724 | 24.3% | No samples available | Instance of exercise inside Tutor
content_page_id | INTEGER | identifier | 100.0% | 26,217 | 1.3% | No samples available |
uuid | VARCHAR | identifier | 100.0% | 84,837 | 4.3% | 93dde152-68c9-4d0e-a685-bd3595101a34, a2410663-c6c4-4573-b18c-fbd076f71905, 2b90b18b-db5e-4a24-a853-8ee74f559364 |
group_uuid | VARCHAR | identifier | 100.0% | 24,022 | 1.2% | 7dfe5195-b996-4b80-aa8d-82e3b34c34ce, cbb8e20f-5ae4-4e85-a177-1e0ebd1add33, 5e6323b9-966c-4254-962a-289f06647aff | Exercise group identifier (corresponds to the segment before the @)
question_uid | VARCHAR | identifier | 100.0% | 84,837 | 4.3% | 14531@4, 11187@1, 23348@2 | Unique question and version identifier within Tutor, where the segment before @ is the ID and the segment after it is the version
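
The question_uid layout described above (ID, then @, then version) can be split with a few lines of Python. This is a minimal sketch, assuming both segments are always integers, which holds for the sample values shown but has not been verified for the full column. The names QuestionUid and parse_question_uid are illustrative and not part of the dataset or the profiler.

```python
from typing import NamedTuple


class QuestionUid(NamedTuple):
    question_id: int  # segment before "@"
    version: int      # segment after "@"


def parse_question_uid(uid: str) -> QuestionUid:
    """Split a value like '14531@4' into its ID and version parts."""
    question_id, sep, version = uid.partition("@")
    if not sep:
        raise ValueError(f"missing '@' in question_uid: {uid!r}")
    # Assumption: both segments are integers, as in the report's samples.
    return QuestionUid(int(question_id), int(version))


print(parse_question_uid("14531@4"))  # QuestionUid(question_id=14531, version=4)
```

Raising on a missing @ keeps malformed values visible instead of silently producing a partial record.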

📑 Categorical (10 columns)

Categorical variables with limited distinct values

Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description
correctness | VARCHAR | categorical | 100.0% | 2 | 0.0% | 0.0, 1.0 |
formats | VARCHAR[] | categorical | 100.0% | 6 | 0.0% | multiple-choice, free-response ... (6 total levels) | Question format specifications
blooms | VARCHAR | categorical | 100.0% | 14 | 0.0% | 5, -1, 6, -3, 1 ... (14 total levels) | Bloom's taxonomy classification
dok | VARCHAR | categorical | 100.0% | 6 | 0.0% | 3, 4, 1, 2 ... (6 total levels) | Depth of Knowledge level
book | VARCHAR | categorical | 100.0% | 27 | 0.0% | apush, anp, cbio, apbio, bio ... (27 total levels) | Textbook identifier
time | VARCHAR | categorical | 100.0% | 7 | 0.0% | time-short, time-long, long, time-medium, medium ... (7 total levels) | Expected completion time
type | VARCHAR | categorical | 100.0% | 55 | 0.0% | __practice__has-context, assignment-reading, conceptual-or-recall, conceptual, __practice ... (55 total levels) | Question type classification
version | INTEGER | categorical | 100.0% | 31 | 0.0% | No samples available |
preview | VARCHAR | categorical | ⚠️ 1.3% | 484 | 1.9% | Play with the Moving Man controls with the Int..., Watch the following video on social change, th..., Watch the following video on migration, then a..., Watch the following video on education, then a..., Watch the following TED Talks video on problem... ... (484 total levels) |
number_of_questions | INTEGER | categorical | 100.0% | 7 | 0.0% | No samples available |
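
Two of the categorical columns above are stored as strings in ways that may need normalizing before analysis: correctness holds the codes 0.0 and 1.0, and the time samples mix prefixed and unprefixed labels (time-long and long). The following Python sketch shows one possible cleanup. It assumes that a "time-" prefix carries no extra meaning and that 1.0 means correct; neither assumption is confirmed by this report, and both helper names are hypothetical.

```python
def normalize_time_label(label: str) -> str:
    """Strip an optional 'time-' prefix so 'time-long' and 'long' coincide.

    Assumption: prefixed and unprefixed labels mean the same thing.
    """
    prefix = "time-"
    return label[len(prefix):] if label.startswith(prefix) else label


def parse_correctness(value: str) -> bool:
    """Convert the string codes '0.0' / '1.0' to a boolean.

    Assumption: '1.0' marks a correct response option.
    """
    number = float(value)
    if number not in (0.0, 1.0):
        raise ValueError(f"unexpected correctness code: {value!r}")
    return number == 1.0


print(normalize_time_label("time-long"))  # long
print(parse_correctness("1.0"))           # True
```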

🔢 Discrete (1 column)

Discrete numeric variables (integers, counts)

Column | Data Type | Variable Type | Completeness | Unique Values | Min | Max | Mean | Sample Values | Description
number | INTEGER | discrete | 100.0% | 24,022 | 1 | 29,154 | 9,385.56 | |

☑️ Boolean (4 columns)

Boolean variables (true/false)

Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description
has_interactive | BOOLEAN | boolean | 100.0% | 2 | 0.0% | No samples available |
has_video | BOOLEAN | boolean | 100.0% | 2 | 0.0% | No samples available |
is_copyable | BOOLEAN | boolean | 100.0% | 1 | 0.0% | No samples available |
anonymize_author | BOOLEAN | boolean | 100.0% | 1 | 0.0% | No samples available |

📅 Datetime (2 columns)

Date and time columns

Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description
created_at | VARCHAR | datetime | 100.0% | 10,230 | 0.5% | 2019-08-12 15:00:52.280136, 2018-01-05 16:27:04.623406, 2016-09-16 20:39:35.302502 |
updated_at | VARCHAR | datetime | 100.0% | 371,523 | 19.0% | 2020-04-24 11:32:40.70988, 2020-04-24 10:18:09.384956, 2021-01-07 14:24:53.812729 |
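
Both datetime columns are stored as VARCHAR, so they need parsing before any date arithmetic. Below is a minimal Python sketch using only the standard library. It assumes every value follows the "YYYY-MM-DD HH:MM:SS" pattern with an optional fractional part, as in the samples; the report does not state a time zone, so the results are naive datetimes. The helper name parse_timestamp is illustrative.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS[.ffffff]' strings as naive datetimes."""
    # %f accepts one to six fractional digits, so the five-digit sample
    # '2020-04-24 11:32:40.70988' parses without padding.
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {value!r}")


print(parse_timestamp("2019-08-12 15:00:52.280136"))
print(parse_timestamp("2020-04-24 11:32:40.70988"))
```

strptime is used rather than datetime.fromisoformat because older Python versions reject fractional seconds that are not exactly three or six digits long.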

📝 Text (6 columns)

Free-form text columns

Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description
question_text | VARCHAR | text | 100.0% | 27,600 | 1.4% | Define polyspermy in your own ..., Two point sources of <span data-math="500, \te..., Define lymphocyte in your own ... | HTML content of question stem
response_options | VARCHAR | text | 100.0% | 82,051 | 4.2% | {-35.0},..., inheritance pattern in which a character shows ..., the distribution of phenotypes in a population | HTML content of answer choices
feedback_html | VARCHAR | text | 60.3% | 47,304 | 4.0% | The frequencies are slightly different for beat..., This is the boiling point of water on the Kelvi..., As in a community or ecosystem, many different ... | Text of expert authored feedback corresponding to response option
url | VARCHAR | text | 100.0% | 84,872 | 4.3% | https://exercises.openstax.org/exercises/14452@5, https://exercises.openstax.org/exercises/7917@3, https://exercises.openstax.org/exercises/14320@6 | Original exercise URL
context | VARCHAR | text | ⚠️ 0.9% | 2,483 | 13.7% | <div data-type="note" data-has-label="true" id=... | HTML
nickname | VARCHAR | text | ⚠️ 3.7% | 1,905 | 2.6% | Ch05-N-ErieCanal-RQ01, Ch14-DP-ProtBerk-RQ02, Ch09-N-PlvFerg-RQ03 | Internal identifier for question

⚠️ Empty (2 columns)

Empty or null columns

Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description
title | VARCHAR | empty | ⚠️ 0.0% | 0 | NaN% | No samples available |
derived_from_id | INTEGER | empty | ⚠️ 0.0% | 0 | NaN% | No samples available |

📈 Summary Statistics

Dataset Overview

Attribute | Value
Total Rows | 1,956,063
Total Columns | 31
Total Cells | 60,637,953
Profiling Time | 2.93 seconds
Profiling Speed | 20,692,955 cells/second
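
The overview figures are related by simple arithmetic, which the short Python check below reproduces: total cells is rows times columns, and profiling speed is cells divided by elapsed time. It is a sketch that assumes the profiler derives its figures this way. Because the reported time is rounded to 2.93 seconds, the recomputed speed lands near, but not exactly on, the reported 20,692,955 cells/second.

```python
total_rows = 1_956_063
total_columns = 31
profiling_seconds = 2.93  # rounded value as printed in the report

total_cells = total_rows * total_columns
cells_per_second = total_cells / profiling_seconds

print(f"{total_cells:,}")           # 60,637,953
print(f"{cells_per_second:,.0f}")   # close to the reported speed
```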

Column Types Distribution

Data Type | Count | Percentage
VARCHAR | 19 | 61.3%
INTEGER | 7 | 22.6%
BOOLEAN | 4 | 12.9%
VARCHAR[] | 1 | 3.2%

Variable Types Distribution

Variable Type | Count | Percentage
Categorical | 10 | 32.3%
Identifier | 6 | 19.4%
Text | 6 | 19.4%
Boolean | 4 | 12.9%
Datetime | 2 | 6.5%
Empty | 2 | 6.5%
Discrete | 1 | 3.2%

Data Completeness

Completeness Level | Column Count | Status
Complete (0% nulls) | 25 |
Mostly Complete (1-10% nulls) | 0 |
Partial (11-50% nulls) | 1 | ⚠️
Sparse (51-90% nulls) | 0 |
Mostly Empty (>90% nulls) | 5 |

Overall Data Completeness: 82.8%
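
The 82.8% figure and the level counts can be reproduced from the per-column completeness values listed in this report. The Python sketch below assumes overall completeness is the mean of per-column completeness (equivalent to non-null cells over total cells, since every column has the same row count) and that the gaps between the stated ranges (for example between 10% and 11% nulls) fall into the lower bucket. Both are assumptions about the profiler, not documented behavior.

```python
# Per-column completeness (%) as listed in this report: 25 complete columns,
# then feedback_html, preview, context, nickname, title, derived_from_id.
completeness = [100.0] * 25 + [60.3, 1.3, 0.9, 3.7, 0.0, 0.0]


def completeness_level(pct_complete: float) -> str:
    """Bucket a column by null percentage, using the report's level names."""
    null_pct = 100.0 - pct_complete
    if null_pct == 0:
        return "Complete"
    if null_pct <= 10:
        return "Mostly Complete"
    if null_pct <= 50:
        return "Partial"
    if null_pct <= 90:
        return "Sparse"
    return "Mostly Empty"


overall = sum(completeness) / len(completeness)

counts = {}
for pct in completeness:
    level = completeness_level(pct)
    counts[level] = counts.get(level, 0) + 1

print(round(overall, 1))  # 82.8
print(counts)             # {'Complete': 25, 'Partial': 1, 'Mostly Empty': 5}
```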

Cardinality Analysis

Cardinality indicates the uniqueness of values in each column.

Cardinality Level | Column Count | Description
Very High (>95% unique) | 0 | Likely identifiers
High (50-95% unique) | 0 | High variability
Medium (10-50% unique) | 4 | Moderate variability
Low (<10% unique) | 25 | Categorical/Boolean
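
Cardinality as used in this report can be recomputed as unique values divided by non-null values. The Python sketch below applies that ratio to response_option_id and maps it onto the levels in the table above. How the profiler treats exact boundary values (10%, 50%, 95%) is not stated, so the comparisons here are an assumption, as are the function names. Returning NaN for a column with no non-null values mirrors the NaN% shown for the empty columns.

```python
import math


def cardinality_pct(unique_values: int, non_null_count: int) -> float:
    """Unique values as a percentage of non-null values (NaN if none)."""
    if non_null_count == 0:
        return float("nan")
    return 100.0 * unique_values / non_null_count


def cardinality_level(pct: float) -> str:
    """Map a cardinality percentage to the report's level names."""
    if math.isnan(pct):
        return "Undefined"  # hypothetical label for fully empty columns
    if pct > 95:
        return "Very High"
    if pct >= 50:
        return "High"
    if pct >= 10:
        return "Medium"
    return "Low"


# response_option_id: 349,778 unique values in 1,956,063 non-null rows.
pct = cardinality_pct(349_778, 1_956_063)
print(round(pct, 1), cardinality_level(pct))  # 17.9 Medium
```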

📖 Glossary

Term | Definition
Cardinality | The number of unique values in a column relative to total non-null values. High cardinality means many unique values.
Completeness | Percentage of non-null values in a column. Higher is better.
Data Type | The technical storage type (e.g., INTEGER, VARCHAR, BOOLEAN).
Identifier | A column containing unique values that identify records (e.g., ID, UUID).
Missing Values | Null, empty, or placeholder values (NA, null, empty string).
Null Percentage | The proportion of null/missing values in a column.
Sample Values | Example values from the column to illustrate its contents.
Variable Type | The semantic meaning of the column (categorical, continuous, etc.).

Generated by Data Profiler v5.2.3
Report Date: 2025-11-25 08:58:36
Status: ✅ All columns profiled successfully