Data Profile Report
Generated on November 25, 2025 at 08:58 AM
Overview
This dataset contains student-generated highlights and notes from OpenStax Tutor, an adaptive learning platform. The data captures detailed interactions where students annotate textbook content, providing insights into reading behaviors, content engagement, and learning strategies in digital educational environments.
Research Applications
This highlights and notes dataset is valuable for education research in areas including:
- Reading Analytics: Student annotation patterns, text engagement depth, reading comprehension strategies
- Content Analysis: Most highlighted sections, note-taking frequency by content type, misconception identification
- Learning Behavior: Correlation between annotation activity and performance, study habit analysis
- Engagement Metrics: Time spent on content, re-reading patterns, active vs. passive reading behaviors
- Personalization Research: Individual vs. collective annotation patterns, peer learning through shared notes
- Accessibility Studies: How different learners interact with digital text, annotation tool effectiveness
Data Structure Overview
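The dataset is organized around content annotations: each row carries a highlight or note together with identifiers for its page, textbook, course, course period, and student.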
Executive Dashboard
🟢 Data Quality Grade: A (97/100)
| Metric | Value | Status |
|---|---|---|
| Total Rows | 3,912,221 | ✅ |
| Total Columns | 18 | ✅ |
| Completeness | 93.0% | ✅ |
| Columns with Issues | 0 | ✅ |
| Processing Time | 15.4s | ✅ |
🔍 Data Quality Alerts
⚠️ Warnings
Review these columns for potential issues:
- book_location: 646 unique categories (consider whether this should be typed as text or identifier)
- course_id: 62.6% null values (sparse data)
- book_name: 62.7% null values (sparse data)
📊 Column Profiles
Columns are grouped by variable type (identifier, categorical, datetime, text) for easier navigation.
🔑 Identifier (9 columns)
Unique identifiers (IDs, keys, UUIDs)
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | INTEGER | identifier | 100.0% | 2,450,114 | 62.6% | 1749143, 1675689, 1975715 | Numeric identifier for notes |
| | INTEGER | identifier | 100.0% | 7,513 | 0.2% | 64957, 74779, 67095 | Numeric identifier for specific page in an OpenStax textbook |
| | INTEGER | identifier | 100.0% | 19,950 | 0.5% | 117960, 90124, 104477 | |
| | VARCHAR | identifier | 100.0% | 2,739 | 0.1% | 7ad44b4b-1f3a-44b3-acdf-a2931257a646, 0a0f04c9-a7ed-463a-9ac2-2611a1dbebc9, d856d831-d708-41d0-afd2-8dfe51cec7c2 | Unique identifier for the content page |
| | INTEGER | identifier | 100.0% | 50 | 0.0% | 221, 239, 299 | Numeric identifier for OpenStax textbook |
| | VARCHAR | identifier | 100.0% | 7,513 | 0.2% | 7bab5af8-f0c0-4b23-8da9-e246fce58260, 681f3248-a38e-4122-8b88-4fd99798bd99, f4e764f6-2317-4a7c-9c3a-b9239fdd0273 | Platform-specific content identifier |
| course_id | DOUBLE | identifier | ⚠️ 37.4% | 1,177 | 0.1% | 13721.0, 8029.0, 16052.0 | Unique identifier for teacher-created course on OpenStax Tutor |
| | INTEGER | identifier | 100.0% | 2,070 | 0.1% | 23896, 14048, 29461 | Course-period id in which the student was enrolled |
| | VARCHAR | identifier | 100.0% | 303 | 0.0% | 2nd, 2nd: USI H, 6th period | Teacher-created name for course period |
📑 Categorical (2 columns)
Categorical variables with limited distinct values
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| book_location | VARCHAR | categorical | 100.0% | 646 | 0.0% | [29,4], [3,16], [6,8], [8,15], [11,2] ... (646 total levels) | Chapter, Section within OpenStax textbook |
| book_name | VARCHAR | categorical | ⚠️ 37.3% | 9 | 0.0% | Biology, US History, AP Bio, Anatomy & Physiology, Entrepreneurship ... (9 total levels) | OpenStax textbook name |
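The book_location samples above look like two-element JSON arrays holding a chapter and a section. A minimal sketch of splitting them, assuming every one of the 646 levels follows that same pattern (the report does not confirm this); the function name `parse_book_location` is hypothetical:

```python
import json


def parse_book_location(raw: str) -> tuple[int, int]:
    """Split a book_location string such as "[29,4]" into (chapter, section)."""
    # Assumption: the value is a JSON array with exactly two numeric entries.
    chapter, section = json.loads(raw)
    return int(chapter), int(section)


# Sample values copied from the categorical column profile.
samples = ["[29,4]", "[3,16]", "[6,8]", "[8,15]", "[11,2]"]
parsed = [parse_book_location(s) for s in samples]
print(parsed)
```

Values that do not unpack into exactly two items raise a `ValueError`, which is a cheap way to surface levels that break the assumed pattern.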
📅 Datetime (2 columns)
Date and time columns
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | TIMESTAMP WITH TIME ZONE | datetime | 100.0% | 2,450,059 | 62.6% | 2022-04-01 22:44:48.72944+00, 2021-08-25 03:59:06.504806+00, 2020-04-05 05:31:52.884241+00 | Timestamp for when highlight/note was created |
| | VARCHAR | datetime | 100.0% | 2,450,062 | 62.6% | 2021-04-18 23:42:02.02931, 2020-10-22 01:14:51.93465, 2019-11-10 23:51:01.379711 | Timestamp for when highlight/note was modified |
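The two timestamp columns are stored differently: the creation timestamp carries an hours-only UTC offset (`+00`), while the modification timestamp is a VARCHAR with no offset. A minimal sketch of bringing both to timezone-aware values, assuming the modification timestamps are also in UTC (the report does not say so) and that every value includes fractional seconds; both function names are hypothetical:

```python
from datetime import datetime, timezone


def parse_created(raw: str) -> datetime:
    """Parse a value like '2022-04-01 22:44:48.72944+00'."""
    # strptime's %z expects +HHMM, so pad the hours-only offset with minutes.
    return datetime.strptime(raw + "00", "%Y-%m-%d %H:%M:%S.%f%z")


def parse_modified(raw: str) -> datetime:
    """Parse a value like '2021-04-18 23:42:02.02931' and attach UTC."""
    naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
    # Assumption: the offset-free strings are UTC, like the creation column.
    return naive.replace(tzinfo=timezone.utc)


created = parse_created("2022-04-01 22:44:48.72944+00")
modified = parse_modified("2021-04-18 23:42:02.02931")
print(created.isoformat(), modified.isoformat())
```

Once both columns are timezone-aware they can be compared directly, for example to check whether a note was edited after it was created.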
📝 Text (5 columns)
Free-form text columns
| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
|---|---|---|---|---|---|---|---|
| | VARCHAR | text | 100.0% | 9,399 | 0.2% | #0b8cbcb3-09db-486b-848f-fa0b8913f53f, #7e97e09d-8c26-473b-9118-9cbe049aaa97, #718b9b00-31ea-4b7c-ad6c-4d740f564164 | The specific text that was highlighted by the student |
| | VARCHAR | text | 100.0% | 190,184 | 4.9% | connected to the soma id branching extensions c..., Texas was a part of Mexico, Genome fusion: occurs when one species is taken... | Student's written note or comment about the highlighted text |
| | VARCHAR | text | 100.0% | 2,421,243 | 61.9% | {"id": "1685140254857", "rect": {"top": 689.757..., {"id": "1570817350411", "rect": {"top": 13476.0..., {"id": "1605037308758", "rect": {"top": 1486.68... | Complete JSON structure containing rich annotation data |
| | VARCHAR | text | 100.0% | 2,178 | 0.1% | The Science of Biology, 2.4<span class="..., Electric Potential Energy: Potential Difference | Title of the textbook section containing the annotation |
| | VARCHAR | text | 100.0% | 19,950 | 0.5% | r279086da, r2abc04f4, r535cd367 | Anonymized identifier for students, primary key that can be used to merge student data across datasets |
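The anonymized student identifier is described as the key for merging student data across datasets. A minimal sketch of such a merge with plain dictionaries follows. The field names (`student_id`, `annotation`, `toy_metric`) and every row are hypothetical; only the ID strings are taken from the sample values.

```python
from collections import defaultdict

# Toy rows: the IDs come from the sample values, everything else is made up.
annotations = [
    {"student_id": "r279086da", "annotation": "toy note A"},
    {"student_id": "r279086da", "annotation": "toy note B"},
    {"student_id": "r2abc04f4", "annotation": "toy note C"},
]
other_dataset = [
    {"student_id": "r279086da", "toy_metric": 1},
    {"student_id": "r535cd367", "toy_metric": 2},
]

# Count annotations per student.
notes_per_student = defaultdict(int)
for row in annotations:
    notes_per_student[row["student_id"]] += 1

# Left join: keep every row of the other dataset, 0 annotations if no match.
merged = [
    {**row, "annotation_count": notes_per_student.get(row["student_id"], 0)}
    for row in other_dataset
]
print(merged)
```

A left join keeps students who never annotated anything, which matters when annotation activity is compared against outcomes.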
📈 Summary Statistics
Dataset Overview
| Attribute | Value |
|---|---|
| Total Rows | 3,912,221 |
| Total Columns | 18 |
| Total Cells | 70,419,978 |
| Profiling Time | 15.38 seconds |
| Profiling Speed | 4,579,422 cells/second |
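The figures in this table can be cross-checked with simple arithmetic: total cells is rows times columns, and profiling speed is cells divided by elapsed time. The sketch below works backwards from the reported speed, assuming the displayed 15.38 seconds is a rounded value (which would explain why dividing by 15.38 does not reproduce the speed exactly):

```python
total_rows = 3_912_221
total_columns = 18

# Total cells = rows x columns.
total_cells = total_rows * total_columns

# Work backwards from the reported speed to the unrounded elapsed time.
reported_speed = 4_579_422  # cells/second, as shown in the table
implied_seconds = total_cells / reported_speed

print(total_cells, round(implied_seconds, 2))
```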
Column Types Distribution
| Data Type | Count | Percentage |
|---|---|---|
| VARCHAR | 11 | 61.1% |
| INTEGER | 5 | 27.8% |
| DOUBLE | 1 | 5.6% |
| TIMESTAMP WITH TIME ZONE | 1 | 5.6% |
Variable Types Distribution
| Variable Type | Count | Percentage |
|---|---|---|
| Identifier | 9 | 50.0% |
| Text | 5 | 27.8% |
| Categorical | 2 | 11.1% |
| Datetime | 2 | 11.1% |
Data Completeness
| Completeness Level | Column Count | Status |
|---|---|---|
| Complete (0% nulls) | 16 | ✅ |
| Mostly Complete (1-10% nulls) | 0 | ✅ |
| Partial (11-50% nulls) | 0 | ⚠️ |
| Sparse (51-90% nulls) | 2 | ❌ |
| Mostly Empty (>90% nulls) | 0 | ❌ |
Overall Data Completeness: 93.0%
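The overall figure is consistent with a plain average of per-column completeness. A minimal sketch, assuming that is how the profiler aggregates (the report does not state the formula); with equal row counts per column this is the same as non-null cells divided by total cells:

```python
# 16 fully complete columns plus course_id (37.4%) and book_name (37.3%).
column_completeness = [100.0] * 16 + [37.4, 37.3]

overall = sum(column_completeness) / len(column_completeness)
print(f"{overall:.1f}%")
```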
Cardinality Analysis
Cardinality indicates the uniqueness of values in each column.
| Cardinality Level | Column Count | Description |
|---|---|---|
| Very High (>95% unique) | 0 | Likely identifiers |
| High (50-95% unique) | 4 | High variability |
| Medium (10-50% unique) | 0 | Moderate variability |
| Low (<10% unique) | 14 | Categorical/Boolean |
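The bucket counts above can be reproduced from the cardinality percentages listed in the column profiles. A minimal sketch follows. The handling of exact boundary values (50% and 10%) is an assumption because the table's ranges overlap there, and the function name `cardinality_level` is hypothetical.

```python
from collections import Counter


def cardinality_level(unique_pct: float) -> str:
    """Map a cardinality percentage to the levels used in the table."""
    if unique_pct > 95:
        return "Very High"
    if unique_pct >= 50:  # assumption: 50% itself counts as High
        return "High"
    if unique_pct >= 10:  # assumption: 10% itself counts as Medium
        return "Medium"
    return "Low"


# Cardinality percentages as reported in the column profile tables.
reported = [
    62.6, 0.2, 0.5, 0.1, 0.0, 0.2, 0.1, 0.1, 0.0,  # identifier columns
    0.0, 0.0,                                      # categorical columns
    62.6, 62.6,                                    # datetime columns
    0.2, 4.9, 61.9, 0.1, 0.5,                      # text columns
]
counts = Counter(cardinality_level(p) for p in reported)

# Worked ratio for the first identifier column: unique values / non-null rows.
first_identifier_pct = round(2_450_114 / 3_912_221 * 100, 1)

print(dict(counts), first_identifier_pct)
```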
📖 Glossary
| Term | Definition |
|---|---|
| Cardinality | The number of unique values in a column relative to total non-null values. High cardinality means many unique values. |
| Completeness | Percentage of non-null values in a column. Higher is better. |
| Data Type | The technical storage type (e.g., INTEGER, VARCHAR, BOOLEAN). |
| Identifier | A column containing unique values that identify records (e.g., ID, UUID). |
| Missing Values | Null, empty, or placeholder values (NA, null, empty string). |
| Null Percentage | The proportion of null/missing values in a column. |
| Sample Values | Example values from the column to illustrate its contents. |
| Variable Type | The semantic meaning of the column (categorical, continuous, etc.). |
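The Missing Values and Completeness definitions can be combined into a small check. This is a sketch under stated assumptions: the placeholder set mirrors the glossary's examples (NA, null, empty string), matching is case-insensitive by choice, and the toy column is made up for illustration. The profiler's actual rules are not documented here.

```python
PLACEHOLDERS = {"", "na", "null"}  # assumption: mirrors the glossary examples


def is_missing(value) -> bool:
    """Treat None and placeholder strings as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in PLACEHOLDERS


def completeness(values) -> float:
    """Percentage of non-missing values in a column."""
    values = list(values)
    if not values:
        return 0.0
    present = sum(not is_missing(v) for v in values)
    return 100.0 * present / len(values)


# Toy column: three real values, five missing ones.
toy_column = [13721.0, None, "NA", 8029.0, "", 16052.0, "null", None]
print(completeness(toy_column))
```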