Data Profile Report
Generated on November 25, 2025 at 08:58 AM
Overview
This dataset contains processed question details extracted from OpenStax Tutor platform exercise content. The data includes detailed question text, response options, educational metadata tags, and exercise classifications that support learning analytics and educational research.
Dataset File:
Data Last Updated: 2024-12-30
Dataset Description
The processed dataset transforms raw JSON exercise content from the OpenStax Tutor platform into a structured, analysis-ready format. Each record represents a unique question within an exercise, with associated metadata, tags, and response options.
Executive Dashboard
🟢 Data Quality Grade: B (89/100)
| Metric | Value | Status |
|---|---|---|
| Total Rows | 1,956,063 | ✅ |
| Total Columns | 31 | ✅ |
| Completeness | 82.8% | ⚠️ |
| Columns with Issues | 2 | ⚠️ |
| Processing Time | 2.9s | ✅ |
🔍 Data Quality Alerts
❌ Critical Issues
These issues require immediate attention:
- title: Column is completely empty (100% null)
- preview: 98.7% null values (sparse data)
- context: 99.1% null values (sparse data)
- nickname: 96.3% null values (sparse data)
- derived_from_id: Column is completely empty (100% null)
⚠️ Warnings
Review these columns for potential issues:
- title: No unique values across 1,956,063 rows (column is entirely null)
- preview: 484 unique categories (consider whether this column should be classified as text or identifier instead)
- derived_from_id: No unique values across 1,956,063 rows (column is entirely null)
📊 Column Profiles
Columns are organized by variable type for easier navigation.
🔑 Identifier (6 columns)
Unique identifiers (IDs, keys, UUIDs)
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | INTEGER | identifier | 100.0% | 349,778 | 17.9% | No samples available | Identifier for multiple choice answers |
| | INTEGER | identifier | 100.0% | 474,724 | 24.3% | No samples available | Instance of exercise inside Tutor |
| | INTEGER | identifier | 100.0% | 26,217 | 1.3% | No samples available | |
| | VARCHAR | identifier | 100.0% | 84,837 | 4.3% | 93dde152-68c9-4d0e-a685-bd3595101a34, a2410663-c6c4-4573-b18c-fbd076f71905, 2b90b18b-db5e-4a24-a853-8ee74f559364 | |
| | VARCHAR | identifier | 100.0% | 24,022 | 1.2% | 7dfe5195-b996-4b80-aa8d-82e3b34c34ce, cbb8e20f-5ae4-4e85-a177-1e0ebd1add33, 5e6323b9-966c-4254-962a-289f06647aff | Exercise group identifier (corresponds to before the ) |
| | VARCHAR | identifier | 100.0% | 84,837 | 4.3% | 14531@4, 11187@1, 23348@2 | Unique question and version identifier within Tutor, where the segment before the @ is the ID and the segment after it is the version |
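The sample values in the last row (14531@4, 11187@1, 23348@2) suggest that the ID and version can be recovered with a simple split. The sketch below assumes every value contains exactly one @ and that both segments are integers; the function name split_question_uid is hypothetical and not part of any dataset tooling.

```python
from typing import Tuple


def split_question_uid(uid: str) -> Tuple[int, int]:
    """Split a value such as '14531@4' into (ID, version).

    Assumption: one '@' separator with digits on both sides.
    """
    number, sep, version = uid.partition("@")
    if not sep or not number.isdigit() or not version.isdigit():
        raise ValueError(f"unexpected identifier format: {uid!r}")
    return int(number), int(version)


print(split_question_uid("14531@4"))  # (14531, 4)
```

Raising on unexpected input, rather than returning a partial result, makes malformed identifiers visible early when the column is processed in bulk.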
📑 Categorical (10 columns)
Categorical variables with limited distinct values
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | VARCHAR | categorical | 100.0% | 2 | 0.0% | 0.0, 1.0 | |
| | VARCHAR[] | categorical | 100.0% | 6 | 0.0% | multiple-choice, free-response ... (6 total levels) | Question format specifications |
| | VARCHAR | categorical | 100.0% | 14 | 0.0% | 5, -1, 6, -3, 1 ... (14 total levels) | Bloom's taxonomy classification |
| | VARCHAR | categorical | 100.0% | 6 | 0.0% | 3, 4, 1, 2 ... (6 total levels) | Depth of Knowledge level |
| | VARCHAR | categorical | 100.0% | 27 | 0.0% | apush, anp, cbio, apbio, bio ... (27 total levels) | Textbook identifier |
| | VARCHAR | categorical | 100.0% | 7 | 0.0% | time-short, time-long, long, time-medium, medium ... (7 total levels) | Expected completion time |
| | VARCHAR | categorical | 100.0% | 55 | 0.0% | __practice__has-context, assignment-reading, conceptual-or-recall, conceptual, __practice ... (55 total levels) | Question type classification |
| | INTEGER | categorical | 100.0% | 31 | 0.0% | No samples available | |
| preview | VARCHAR | categorical | ⚠️ 1.3% | 484 | 1.9% | Play with the Moving Man controls with the Int..., Watch the following video on social change, th..., Watch the following video on migration, then a..., Watch the following video on education, then a..., Watch the following TED Talks video on problem... ... (484 total levels) | |
| | INTEGER | categorical | 100.0% | 7 | 0.0% | No samples available | |
🔢 Discrete (1 column)
Discrete numeric variables (integers, counts)
| Column | Data Type | Variable Type | Completeness | Unique Values | Min | Max | Mean | Sample Values | Description |
|---|---|---|---|---|---|---|---|---|---|
| | INTEGER | discrete | 100.0% | 24,022 | 1 | 29,154 | 9,385.56 | | |
☑️ Boolean (4 columns)
Boolean variables (true/false)
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | BOOLEAN | boolean | 100.0% | 2 | 0.0% | No samples available | |
| | BOOLEAN | boolean | 100.0% | 2 | 0.0% | No samples available | |
| | BOOLEAN | boolean | 100.0% | 1 | 0.0% | No samples available | |
| | BOOLEAN | boolean | 100.0% | 1 | 0.0% | No samples available | |
📅 Datetime (2 columns)
Date and time columns
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | VARCHAR | datetime | 100.0% | 10,230 | 0.5% | 2019-08-12 15:00:52.280136, 2018-01-05 16:27:04.623406, 2016-09-16 20:39:35.302502 | |
| | VARCHAR | datetime | 100.0% | 371,523 | 19.0% | 2020-04-24 11:32:40.70988, 2020-04-24 10:18:09.384956, 2021-01-07 14:24:53.812729 | |
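Both datetime columns are stored as VARCHAR, so values need parsing before any time-based analysis. The sketch below uses only the Python standard library and assumes every value follows the "YYYY-MM-DD HH:MM:SS.fraction" pattern seen in the samples, with one to six fractional digits and no time zone; parse_timestamp is a hypothetical helper name.

```python
from datetime import datetime

# Assumed layout, inferred from the sample values only.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp string into a naive datetime.

    %f accepts one to six fractional digits and pads on the right,
    so '.70988' is read as 709880 microseconds.
    """
    return datetime.strptime(value, TIMESTAMP_FORMAT)


ts = parse_timestamp("2020-04-24 11:32:40.70988")
print(ts.isoformat())  # 2020-04-24T11:32:40.709880
```

Values without a fractional part would raise ValueError under this format, so a full pass over the column is worth running before relying on it.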
📝 Text (6 columns)
Free-form text columns
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | VARCHAR | text | 100.0% | 27,600 | 1.4% | Define polyspermy in your own ..., Two point sources of <span data-math="500, \te..., Define lymphocyte in your own ... | HTML content of question stem |
| | VARCHAR | text | 100.0% | 82,051 | 4.2% | {-35.0},..., inheritance pattern in which a character shows ..., the distribution of phenotypes in a population | HTML content of answer choices |
| | VARCHAR | text | 60.3% | 47,304 | 4.0% | The frequencies are slightly different for beat..., This is the boiling point of water on the Kelvi..., As in a community or ecosystem, many different ... | Text of expert-authored feedback corresponding to each response option |
| | VARCHAR | text | 100.0% | 84,872 | 4.3% | https://exercises.openstax.org/exercises/14452@5, https://exercises.openstax.org/exercises/7917@3, https://exercises.openstax.org/exercises/14320@6 | Original exercise URL |
| context | VARCHAR | text | ⚠️ 0.9% | 2,483 | 13.7% | <div data-type="note" data-has-label="true" id=... | HTML |
| nickname | VARCHAR | text | ⚠️ 3.7% | 1,905 | 2.6% | Ch05-N-ErieCanal-RQ01, Ch14-DP-ProtBerk-RQ02, Ch09-N-PlvFerg-RQ03 | Internal identifier for question |
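Several text columns hold HTML rather than plain text, which matters for tasks such as length statistics or text search. One possible way to extract the visible text with the Python standard library is sketched below; html_to_text is a hypothetical helper and the input string is a made-up example, not a record from the dataset.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up fragment for illustration only.
print(html_to_text('<div data-type="note"><p>What is <b>osmosis</b>?</p></div>'))
```

Content carried in attributes is dropped by this approach. The question stem sample suggests that math may live in a data-math attribute, in which case it would need separate handling.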
⚠️ Empty (2 columns)
Empty or null columns
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | VARCHAR | empty | ⚠️ 0.0% | 0 | N/A | No samples available | |
| | INTEGER | empty | ⚠️ 0.0% | 0 | N/A | No samples available | |
📈 Summary Statistics
Dataset Overview
| Attribute | Value |
|---|---|
| Total Rows | 1,956,063 |
| Total Columns | 31 |
| Total Cells | 60,637,953 |
| Profiling Time | 2.93 seconds |
| Profiling Speed | 20,692,955 cells/second |
Column Types Distribution
| Data Type | Count | Percentage |
|---|---|---|
| VARCHAR | 19 | 61.3% |
| INTEGER | 7 | 22.6% |
| BOOLEAN | 4 | 12.9% |
| VARCHAR[] | 1 | 3.2% |
Variable Types Distribution
| Variable Type | Count | Percentage |
|---|---|---|
| Categorical | 10 | 32.3% |
| Identifier | 6 | 19.4% |
| Text | 6 | 19.4% |
| Boolean | 4 | 12.9% |
| Datetime | 2 | 6.5% |
| Empty | 2 | 6.5% |
| Discrete | 1 | 3.2% |
Data Completeness
| Completeness Level | Column Count | Status |
|---|---|---|
| Complete (0% nulls) | 25 | ✅ |
| Mostly Complete (1-10% nulls) | 0 | ✅ |
| Partial (11-50% nulls) | 1 | ⚠️ |
| Sparse (51-90% nulls) | 0 | ❌ |
| Mostly Empty (>90% nulls) | 5 | ❌ |
Overall Data Completeness: 82.8%
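The overall figure is consistent with an unweighted average of the per-column completeness values listed in the column profiles. The check below assumes that this is how the profiler derives it (the report does not say) and that the bucket boundaries in the table apply to each column's null percentage; the function names are illustrative.

```python
from collections import Counter

# Per-column completeness (%) as reported in the column profiles:
# 25 fully populated columns, one at 60.3%, three sparse columns, two empty ones.
column_completeness = [100.0] * 25 + [60.3, 1.3, 0.9, 3.7, 0.0, 0.0]


def overall_completeness(values):
    """Unweighted mean of per-column completeness, in percent."""
    return sum(values) / len(values)


def completeness_level(completeness_pct):
    """Map a column's completeness to the bucket names used in the table.

    Boundary handling (for example, exactly 10% nulls) is an assumption.
    """
    nulls = 100.0 - completeness_pct
    if nulls == 0:
        return "Complete"
    if nulls <= 10:
        return "Mostly Complete"
    if nulls <= 50:
        return "Partial"
    if nulls <= 90:
        return "Sparse"
    return "Mostly Empty"


print(round(overall_completeness(column_completeness), 1))  # 82.8
print(Counter(completeness_level(c) for c in column_completeness))
```

Because every column has the same number of rows, this column-level mean equals the share of non-null cells across the whole table.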
Cardinality Analysis
Cardinality indicates the uniqueness of values in each column.
| Cardinality Level | Column Count | Description |
|---|---|---|
| Very High (>95% unique) | 0 | Likely identifiers |
| High (50-95% unique) | 0 | High variability |
| Medium (10-50% unique) | 4 | Moderate variability |
| Low (<10% unique) | 25 | Categorical/Boolean |
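The cardinality percentages in the column profiles match the number of unique values divided by the number of non-null values. A small sketch of that ratio follows; cardinality_pct is an illustrative name, and treating None as the only null marker is an assumption.

```python
def cardinality_pct(values):
    """Unique non-null values as a percentage of all non-null values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # No non-null values: the ratio is undefined.
        return float("nan")
    return 100.0 * len(set(non_null)) / len(non_null)


# Toy column: two distinct values among three non-null entries.
print(round(cardinality_pct(["a", "b", "a", None]), 1))  # 66.7

# Reproducing a reported figure from its counts:
# 349,778 unique values in a fully populated column of 1,956,063 rows.
print(round(100.0 * 349_778 / 1_956_063, 1))  # 17.9
```

Columns that are entirely null have no non-null values, so the ratio is undefined for them and they fall outside the four levels above.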
📖 Glossary
| Term | Definition |
|---|---|
| Cardinality | The number of unique values in a column relative to total non-null values. High cardinality means many unique values. |
| Completeness | Percentage of non-null values in a column. Higher is better. |
| Data Type | The technical storage type (e.g., INTEGER, VARCHAR, BOOLEAN). |
| Identifier | A column containing unique values that identify records (e.g., ID, UUID). |
| Missing Values | Null, empty, or placeholder values (NA, null, empty string). |
| Null Percentage | The proportion of null/missing values in a column. |
| Sample Values | Example values from the column to illustrate its contents. |
| Variable Type | The semantic meaning of the column (categorical, continuous, etc.). |