Data Profile Report

Generated on November 25, 2025 at 08:58 AM

Overview

This dataset contains student-generated highlights and notes from OpenStax Tutor, an adaptive learning platform. The data captures detailed interactions where students annotate textbook content, providing insights into reading behaviors, content engagement, and learning strategies in digital educational environments.

Research Applications

This highlights and notes dataset is valuable for education research in areas including:

  • Reading Analytics: Student annotation patterns, text engagement depth, reading comprehension strategies
  • Content Analysis: Most highlighted sections, note-taking frequency by content type, misconception identification
  • Learning Behavior: Correlation between annotation activity and performance, study habit analysis
  • Engagement Metrics: Time spent on content, re-reading patterns, active vs. passive reading behaviors
  • Personalization Research: Individual vs. collective annotation patterns, peer learning through shared notes
  • Accessibility Studies: How different learners interact with digital text, annotation tool effectiveness

Data Structure Overview

The dataset follows this structure centered around content annotations:

Content Notes → Content Pages → Students → Courses → Periods

Executive Dashboard

🟢 Data Quality Grade: A (97/100)

| Metric | Value | Status |
| --- | --- | --- |
| Total Rows | 3,912,221 | |
| Total Columns | 18 | |
| Completeness | 93.0% | |
| Columns with Issues | 0 | |
| Processing Time | 15.4s | |

🔍 Data Quality Alerts

⚠️ Warnings

Review these columns for potential issues:

  • book_location: 646 unique categories (consider whether this column should be treated as text or an identifier)
  • course_id: 62.6% null values (sparse data)
  • book_name: 62.7% null values (sparse data)
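
The null-value warnings above can be re-checked on any extract of the data. The following is a minimal sketch, assuming rows are loaded as Python dictionaries with missing values represented as None; the helper name null_percentages and the four sample rows are hypothetical and are not taken from the dataset.

```python
def null_percentages(rows, columns):
    """Return the percentage of None values per column, rounded to one decimal."""
    total = len(rows)
    result = {}
    for col in columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        result[col] = round(100.0 * nulls / total, 1) if total else 0.0
    return result


# Made-up sample rows for illustration only; the real dataset has 3,912,221 rows.
sample_rows = [
    {"course_id": 13721.0, "book_name": "Biology"},
    {"course_id": None, "book_name": None},
    {"course_id": None, "book_name": "US History"},
    {"course_id": 8029.0, "book_name": None},
]

print(null_percentages(sample_rows, ["course_id", "book_name"]))
```

A column missing from a row is counted as null here, which may or may not match how the profiler treats absent fields.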

📊 Column Profiles

Columns are organized by variable type (identifier, categorical, datetime, text) for easier navigation.

🔑 Identifier (9 columns)

Unique identifiers (IDs, keys, UUIDs)

| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| content_notes_id | INTEGER | identifier | 100.0% | 2,450,114 | 62.6% | 1749143, 1675689, 1975715 | Numeric identifier for notes |
| content_page_id | INTEGER | identifier | 100.0% | 7,513 | 0.2% | 64957, 74779, 67095 | Numeric identifier for a specific page in an OpenStax textbook |
| entity_role_id | INTEGER | identifier | 100.0% | 19,950 | 0.5% | 117960, 90124, 104477 | |
| page_uuid | VARCHAR | identifier | 100.0% | 2,739 | 0.1% | 7ad44b4b-1f3a-44b3-acdf-a2931257a646, 0a0f04c9-a7ed-463a-9ac2-2611a1dbebc9, d856d831-d708-41d0-afd2-8dfe51cec7c2 | Unique identifier for the content page |
| content_book_id | INTEGER | identifier | 100.0% | 50 | 0.0% | 221, 239, 299 | Numeric identifier for an OpenStax textbook |
| tutor_uuid | VARCHAR | identifier | 100.0% | 7,513 | 0.2% | 7bab5af8-f0c0-4b23-8da9-e246fce58260, 681f3248-a38e-4122-8b88-4fd99798bd99, f4e764f6-2317-4a7c-9c3a-b9239fdd0273 | Platform-specific content identifier |
| course_id | DOUBLE | identifier | ⚠️ 37.4% | 1,177 | 0.1% | 13721.0, 8029.0, 16052.0 | Unique identifier for a teacher-created course on OpenStax Tutor |
| course_membership_period_id | INTEGER | identifier | 100.0% | 2,070 | 0.1% | 23896, 14048, 29461 | Course-period ID in which the student was enrolled |
| period_id | VARCHAR | identifier | 100.0% | 303 | 0.0% | 2nd, 2nd: USI H, 6th period | Teacher-created name for the course period |
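
The course_id column is stored as DOUBLE even though its sample values (13721.0, 8029.0, 16052.0) look like integer IDs, which is a common side effect of nulls in a numeric column. A minimal sketch for normalizing such values before joining on them follows; the helper name to_optional_int is hypothetical, and treating NaN as missing is an assumption about how nulls arrive after loading.

```python
import math


def to_optional_int(value):
    """Convert a float-typed ID such as 13721.0 to int; pass missing values through as None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        # Assumption: NaN stands in for a missing course_id after loading.
        return None
    as_int = int(value)
    if as_int != value:
        # Guard against silently truncating a value that is not a whole number.
        raise ValueError(f"non-integral ID: {value!r}")
    return as_int


print(to_optional_int(13721.0))  # 13721
print(to_optional_int(None))     # None
```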

📑 Categorical (2 columns)

Categorical variables with limited distinct values

| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| book_location | VARCHAR | categorical | 100.0% | 646 | 0.0% | [29,4], [3,16], [6,8], [8,15], [11,2] ... (646 total levels) | Chapter, Section within OpenStax textbook |
| book_name | VARCHAR | categorical | ⚠️ 37.3% | 9 | 0.0% | Biology, US History, AP Bio, Anatomy & Physiology, Entrepreneurship ... (9 total levels) | OpenStax textbook name |
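
Because book_location packs chapter and section into one string, splitting it can make per-chapter analysis easier. This is a minimal sketch, assuming every value has the two-element form shown in the samples (for example [29,4]); the report does not confirm that all 646 levels follow this pattern, and parse_book_location is a hypothetical helper name.

```python
import json


def parse_book_location(value):
    """Split a book_location string such as "[29,4]" into (chapter, section)."""
    parsed = json.loads(value)  # the sample values happen to be valid JSON arrays
    if not (isinstance(parsed, list) and len(parsed) == 2):
        raise ValueError(f"unexpected book_location format: {value!r}")
    chapter, section = parsed
    return int(chapter), int(section)


print(parse_book_location("[29,4]"))  # (29, 4)
```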

📅 Datetime (2 columns)

Date and time columns

| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| content_notes_created_at | TIMESTAMP WITH TIME ZONE | datetime | 100.0% | 2,450,059 | 62.6% | 2022-04-01 22:44:48.72944+00, 2021-08-25 03:59:06.504806+00, 2020-04-05 05:31:52.884241+00 | Timestamp for when highlight/note was created |
| content_notes_updated_at | VARCHAR | datetime | 100.0% | 2,450,062 | 62.6% | 2021-04-18 23:42:02.02931, 2020-10-22 01:14:51.93465, 2019-11-10 23:51:01.379711 | Timestamp for when highlight/note was modified |
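
Note that content_notes_updated_at is stored as VARCHAR, while content_notes_created_at is a true timestamp, so the two cannot be compared directly without parsing. A minimal sketch of that parsing follows, assuming the string layout seen in the samples (date, time, and an optional fractional-seconds part of up to six digits). The samples carry no time zone offset, so the result is left as a naive datetime; parse_updated_at is a hypothetical helper name.

```python
from datetime import datetime


def parse_updated_at(value):
    """Parse strings such as "2021-04-18 23:42:02.02931" into a naive datetime."""
    for fmt in ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            # %f accepts one to six fractional digits, zero-padded on the right.
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {value!r}")


print(parse_updated_at("2021-04-18 23:42:02.02931"))
```

strptime is used rather than datetime.fromisoformat because older Python versions reject fractional seconds that are not exactly three or six digits long.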

📝 Text (5 columns)

Free-form text columns

| Column | Data Type | Variable Type | Completeness | Unique Values | Cardinality | Sample Values | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| anchor | VARCHAR | text | 100.0% | 9,399 | 0.2% | #0b8cbcb3-09db-486b-848f-fa0b8913f53f, #7e97e09d-8c26-473b-9118-9cbe049aaa97, #718b9b00-31ea-4b7c-ad6c-4d740f564164 | The specific text that was highlighted by the student |
| annotation | VARCHAR | text | 100.0% | 190,184 | 4.9% | connected to the soma id branching extensions c..., Texas was a part of Mexico, Genome fusion: occurs when one species is taken... | Student's written note or comment about the highlighted text |
| contents | VARCHAR | text | 100.0% | 2,421,243 | 61.9% | {"id": "1685140254857", "rect": {"top": 689.757..., {"id": "1570817350411", "rect": {"top": 13476.0..., {"id": "1605037308758", "rect": {"top": 1486.68... | Complete JSON structure containing rich annotation data |
| section_title | VARCHAR | text | 100.0% | 2,178 | 0.1% | The Science of Biology, 2.4<span class="..., Electric Potential Energy: Potential Difference | Title of the textbook section containing the annotation |
| research_identifier | VARCHAR | text | 100.0% | 19,950 | 0.5% | r279086da, r2abc04f4, r535cd367 | Anonymized identifier for students, primary key that can be used to merge student data across datasets |
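
The contents column holds JSON as text, so it needs decoding before individual fields can be used. The sketch below assumes only what the truncated samples show, namely a top-level "id" and a nested "rect" object with a "top" value; the rest of the schema is unknown. The example record is hypothetical (it is not a real row), and extract_rect_top is a hypothetical helper name.

```python
import json


def extract_rect_top(contents):
    """Return (id, rect.top) from a contents JSON string, or None if it cannot be decoded."""
    try:
        record = json.loads(contents)
    except (TypeError, ValueError):
        # Covers None input and malformed or truncated JSON.
        return None
    if not isinstance(record, dict):
        return None
    rect = record.get("rect")
    top = rect.get("top") if isinstance(rect, dict) else None
    return record.get("id"), top


# Hypothetical record shaped like the visible part of the samples.
example = '{"id": "example-1", "rect": {"top": 12.5}}'
print(extract_rect_top(example))  # ('example-1', 12.5)
```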

📈 Summary Statistics

Dataset Overview

| Attribute | Value |
| --- | --- |
| Total Rows | 3,912,221 |
| Total Columns | 18 |
| Total Cells | 70,419,978 |
| Profiling Time | 15.38 seconds |
| Profiling Speed | 4,579,422 cells/second |
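
The derived figures in this table can be checked with simple arithmetic: total cells is rows times columns, and profiling speed is cells divided by time. In the sketch below, dividing by the rounded 15.38 seconds gives roughly 4.58 million cells per second, a little below the reported 4,579,422; the small gap presumably comes from the profiling time being rounded for display.

```python
total_rows = 3_912_221
total_columns = 18
profiling_seconds = 15.38  # rounded value as shown in the report

total_cells = total_rows * total_columns
cells_per_second = total_cells / profiling_seconds

print(total_cells)  # 70419978
print(round(cells_per_second))
```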

Column Types Distribution

| Data Type | Count | Percentage |
| --- | --- | --- |
| VARCHAR | 11 | 61.1% |
| INTEGER | 5 | 27.8% |
| DOUBLE | 1 | 5.6% |
| TIMESTAMP WITH TIME ZONE | 1 | 5.6% |

Variable Types Distribution

| Variable Type | Count | Percentage |
| --- | --- | --- |
| Identifier | 9 | 50.0% |
| Text | 5 | 27.8% |
| Categorical | 2 | 11.1% |
| Datetime | 2 | 11.1% |

Data Completeness

| Completeness Level | Column Count | Status |
| --- | --- | --- |
| Complete (0% nulls) | 16 | |
| Mostly Complete (1-10% nulls) | 0 | |
| Partial (11-50% nulls) | 0 | ⚠️ |
| Sparse (51-90% nulls) | 2 | |
| Mostly Empty (>90% nulls) | 0 | |

Overall Data Completeness: 93.0%
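
The 93.0% figure is consistent with an unweighted mean of the per-column completeness values reported above (16 columns at 100%, course_id at 37.4%, book_name at 37.3%). The profiler's actual formula is not stated, so the following is only one plausible reconstruction. Since every column has the same number of rows, a cell-weighted average gives the same result.

```python
# Per-column completeness as reported: 16 complete columns plus the two sparse ones.
column_completeness = [100.0] * 16 + [37.4, 37.3]  # last two: course_id, book_name

overall = sum(column_completeness) / len(column_completeness)
print(round(overall, 1))  # 93.0
```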

Cardinality Analysis

Cardinality indicates the uniqueness of values in each column.

| Cardinality Level | Column Count | Description |
| --- | --- | --- |
| Very High (>95% unique) | 0 | Likely identifiers |
| High (50-95% unique) | 4 | High variability |
| Medium (10-50% unique) | 0 | Moderate variability |
| Low (<10% unique) | 14 | Categorical/Boolean |
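
The cardinality percentages in the column profiles can be reproduced as unique values divided by non-null rows, then mapped to the levels in this table. The sketch below is an illustration under assumptions: the function names are hypothetical, and how the profiler treats values exactly on a boundary (10%, 50%, 95%) is not stated, so the comparisons used here are a guess.

```python
def cardinality_pct(unique_values, non_null_rows):
    """Unique values as a percentage of non-null rows."""
    return 100.0 * unique_values / non_null_rows


def cardinality_level(pct):
    """Map a cardinality percentage to the report's levels (boundary handling assumed)."""
    if pct > 95:
        return "Very High"
    if pct >= 50:
        return "High"
    if pct >= 10:
        return "Medium"
    return "Low"


# content_notes_id: 2,450,114 unique values across 3,912,221 non-null rows.
pct = cardinality_pct(2_450_114, 3_912_221)
print(round(pct, 1), cardinality_level(pct))  # 62.6 High
```

A cardinality of 62.6% for content_notes_id means note IDs repeat across rows, so the table has more than one row per note on average.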

📖 Glossary

| Term | Definition |
| --- | --- |
| Cardinality | The number of unique values in a column relative to total non-null values. High cardinality means many unique values. |
| Completeness | Percentage of non-null values in a column. Higher is better. |
| Data Type | The technical storage type (e.g., INTEGER, VARCHAR, BOOLEAN). |
| Identifier | A column containing unique values that identify records (e.g., ID, UUID). |
| Missing Values | Null, empty, or placeholder values (NA, null, empty string). |
| Null Percentage | The proportion of null/missing values in a column. |
| Sample Values | Example values from the column to illustrate its contents. |
| Variable Type | The semantic meaning of the column (categorical, continuous, etc.). |

Generated by Data Profiler v5.2.3
Report Date: 2025-11-25 08:58:28
Status: ✅ All columns profiled successfully