Data Profile Report

Generated on November 25, 2025 at 08:58 AM

Overview

This dataset contains student-generated highlights and notes from OpenStax Tutor, an adaptive learning platform. The data captures detailed interactions where students annotate textbook content, providing insights into reading behaviors, content engagement, and learning strategies in digital educational environments.

Research Applications

This highlights and notes dataset is valuable for education research in areas including:

Reading Analytics: Student annotation patterns, text engagement depth, reading comprehension strategies
Content Analysis: Most highlighted sections, note-taking frequency by content type, misconception identification
Learning Behavior: Correlation between annotation activity and performance, study habit analysis
Engagement Metrics: Time spent on content, re-reading patterns, active vs. passive reading behaviors
Personalization Research: Individual vs. collective annotation patterns, peer learning through shared notes
Accessibility Studies: How different learners interact with digital text, annotation tool effectiveness

Data Structure Overview

The dataset is organized around content annotations, which link to the other entities as follows:

Content Notes → Content Pages → Students → Courses → Periods

Executive Dashboard

🟢 Data Quality Grade: A (97/100)

Metric	Value	Status
Total Rows	3,912,221	✅
Total Columns	18	✅
Completeness	93.0%	✅
Columns with Issues	0	✅
Processing Time	15.4s	✅

🔍 Data Quality Alerts

⚠️ Warnings

Review these columns for potential issues:

book_location: 646 unique categories (consider whether this should be treated as text/identifier)
course_id: 62.6% null values (sparse data)
book_name: 62.7% null values (sparse data)

📊 Column Profiles

Columns are organized by data type for easier navigation.

🔑 Identifier (9 columns)

Unique identifiers (IDs, keys, UUIDs)

Column	Data Type	Variable Type	Completeness	Unique Values	Cardinality	Sample Values	Description
`content_notes_id`	INTEGER	identifier	100.0%	2,450,114	62.6%	1749143, 1675689, 1975715	Numeric identifier for notes
`content_page_id`	INTEGER	identifier	100.0%	7,513	0.2%	64957, 74779, 67095	Numeric identifier for specific page in an OpenStax textbook
`entity_role_id`	INTEGER	identifier	100.0%	19,950	0.5%	117960, 90124, 104477
`page_uuid`	VARCHAR	identifier	100.0%	2,739	0.1%	7ad44b4b-1f3a-44b3-acdf-a2931257a646, 0a0f04c9-a7ed-463a-9ac2-2611a1dbebc9, d856d831-d708-41d0-afd2-8dfe51cec7c2	Unique identifier for the content page
`content_book_id`	INTEGER	identifier	100.0%	50	0.0%	221, 239, 299	Numeric identifier for OpenStax textbook
`tutor_uuid`	VARCHAR	identifier	100.0%	7,513	0.2%	7bab5af8-f0c0-4b23-8da9-e246fce58260, 681f3248-a38e-4122-8b88-4fd99798bd99, f4e764f6-2317-4a7c-9c3a-b9239fdd0273	Platform-specific content identifier
`course_id`	DOUBLE	identifier	⚠️ 37.4%	1,177	0.1%	13721.0, 8029.0, 16052.0	Unique identifier for teacher-created course on OpenStax Tutor
`course_membership_period_id`	INTEGER	identifier	100.0%	2,070	0.1%	23896, 14048, 29461	Course-period id in which the student was enrolled
`period_id`	VARCHAR	identifier	100.0%	303	0.0%	2nd, 2nd: USI H, 6th period	Teacher-created name for course period

📑 Categorical (2 columns)

Categorical variables with limited distinct values

Column	Data Type	Variable Type	Completeness	Unique Values	Cardinality	Sample Values	Description
`book_location`	VARCHAR	categorical	100.0%	646	0.0%	[29,4], [3,16], [6,8], [8,15], [11,2] ... (646 total levels)	Chapter, Section within OpenStax textbook
`book_name`	VARCHAR	categorical	⚠️ 37.3%	9	0.0%	Biology, US History, AP Bio, Anatomy & Physiology, Entrepreneurship ... (9 total levels)	OpenStax textbook name
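
The `book_location` samples look like two-element JSON arrays of chapter and section. The sketch below shows one way to split such a value into integers. It assumes every level follows the `[chapter,section]` shape seen in the five samples; the report does not confirm this for all 646 levels. The names `BookLocation` and `parse_book_location` are illustrative, not part of the dataset.

```python
import json
from typing import NamedTuple


class BookLocation(NamedTuple):
    chapter: int
    section: int


def parse_book_location(raw: str) -> BookLocation:
    """Parse a value such as "[29,4]" into (chapter, section).

    Assumption: the value is a JSON array with exactly two numbers.
    Anything else raises ValueError so odd levels are not silently lost.
    """
    values = json.loads(raw)
    if not (isinstance(values, list) and len(values) == 2):
        raise ValueError(f"unexpected book_location: {raw!r}")
    chapter, section = values
    return BookLocation(int(chapter), int(section))


print(parse_book_location("[29,4]"))
```

Raising on unexpected shapes is a deliberate choice here: with only five of 646 levels visible, it is safer to surface surprises than to guess.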

📅 Datetime (2 columns)

Date and time columns

Column	Data Type	Variable Type	Completeness	Unique Values	Cardinality	Sample Values	Description
`content_notes_created_at`	TIMESTAMP WITH TIME ZONE	datetime	100.0%	2,450,059	62.6%	2022-04-01 22:44:48.72944+00, 2021-08-25 03:59:06.504806+00, 2020-04-05 05:31:52.884241+00	Timestamp for when highlight/note was created
`content_notes_updated_at`	VARCHAR	datetime	100.0%	2,450,062	62.6%	2021-04-18 23:42:02.02931, 2020-10-22 01:14:51.93465, 2019-11-10 23:51:01.379711	Timestamp for when highlight/note was modified
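
The two timestamp columns are stored differently: `content_notes_created_at` carries a `+00` offset, while `content_notes_updated_at` is a VARCHAR with no offset. The sketch below normalizes both sample formats into Python `datetime` objects. Treating the offset-less values as UTC is an assumption, not something the report states, so it is exposed as the hypothetical `assume_utc` flag.

```python
import re
from datetime import datetime, timezone

# Matches "YYYY-MM-DD HH:MM:SS", an optional fraction of 1-6 digits,
# and an optional "+00" offset, as seen in the sample values.
_TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?(\+00)?$"
)


def parse_note_timestamp(raw: str, assume_utc: bool = True) -> datetime:
    """Parse either timestamp format from the profile's sample values."""
    match = _TIMESTAMP.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {raw!r}")
    base, fraction, offset = match.groups()
    # Right-pad the fraction so ".72944" becomes 729440 microseconds.
    microsecond = int(fraction.ljust(6, "0")) if fraction else 0
    parsed = datetime.strptime(base, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(microsecond=microsecond)
    # Assumption: values without an offset are also UTC.
    if offset or assume_utc:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


created = parse_note_timestamp("2022-04-01 22:44:48.72944+00")
updated = parse_note_timestamp("2021-04-18 23:42:02.02931")
print(created.isoformat(), updated.isoformat())
```

If the UTC assumption turns out to be wrong for the updated column, passing `assume_utc=False` returns naive datetimes that can be localized separately.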

📝 Text (5 columns)

Free-form text columns

Column	Data Type	Variable Type	Completeness	Unique Values	Cardinality	Sample Values	Description
`anchor`	VARCHAR	text	100.0%	9,399	0.2%	#0b8cbcb3-09db-486b-848f-fa0b8913f53f, #7e97e09d-8c26-473b-9118-9cbe049aaa97, #718b9b00-31ea-4b7c-ad6c-4d740f564164	The specific text that was highlighted by the student
`annotation`	VARCHAR	text	100.0%	190,184	4.9%	connected to the soma id branching extensions c..., Texas was a part of Mexico, Genome fusion: occurs when one species is taken...	Student's written note or comment about the highlighted text
`contents`	VARCHAR	text	100.0%	2,421,243	61.9%	{"id": "1685140254857", "rect": {"top": 689.757..., {"id": "1570817350411", "rect": {"top": 13476.0..., {"id": "1605037308758", "rect": {"top": 1486.68...	Complete JSON structure containing rich annotation data
`section_title`	VARCHAR	text	100.0%	2,178	0.1%	The Science of Biology, 2.4<span class="..., Electric Potential Energy: Potential Difference	Title of the textbook section containing the annotation
`research_identifier`	VARCHAR	text	100.0%	19,950	0.5%	r279086da, r2abc04f4, r535cd367	Anonymized student identifier; serves as the join key for merging student data across datasets
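
Since `research_identifier` is described as the key for merging student data across datasets, a minimal left join might look like the sketch below. The rows are toy data for illustration only: the second dataset and its `hypothetical_cohort` field are invented, and the pairing of identifiers with page IDs does not come from the report.

```python
def attach_student_rows(note_rows, student_rows, key="research_identifier"):
    """Left-join note rows to at most one student-level row per key.

    Notes without a matching student row are kept unchanged.
    """
    lookup = {row[key]: row for row in student_rows}
    merged = []
    for note in note_rows:
        student = lookup.get(note[key], {})
        # Note fields win if both rows share a column name.
        merged.append({**student, **note})
    return merged


# Toy rows (illustrative, not real records).
notes = [
    {"research_identifier": "r279086da", "content_page_id": 64957},
    {"research_identifier": "r2abc04f4", "content_page_id": 74779},
]
students = [{"research_identifier": "r279086da", "hypothetical_cohort": "A"}]

merged = attach_student_rows(notes, students)
print(merged)
```

A left join keeps every annotation row, which matters here because the profile shows far more rows (3,912,221) than distinct students (19,950).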

📈 Summary Statistics

Dataset Overview

Attribute	Value
Total Rows	3,912,221
Total Columns	18
Total Cells	70,419,978
Profiling Time	15.38 seconds
Profiling Speed	4,579,422 cells/second

Column Types Distribution

Data Type	Count	Percentage
VARCHAR	11	61.1%
INTEGER	5	27.8%
DOUBLE	1	5.6%
TIMESTAMP WITH TIME ZONE	1	5.6%

Variable Types Distribution

Variable Type	Count	Percentage
Identifier	9	50.0%
Text	5	27.8%
Categorical	2	11.1%
Datetime	2	11.1%

Data Completeness

Completeness Level	Column Count	Status
Complete (0% nulls)	16	✅
Mostly Complete (1-10% nulls)	0	✅
Partial (11-50% nulls)	0	⚠️
Sparse (51-90% nulls)	2	❌
Mostly Empty (>90% nulls)	0	❌

Overall Data Completeness: 93.0%
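
As a cross-check, the 93.0% figure can be reproduced from the per-column completeness values in the column profiles. The sketch below assumes overall completeness is the unweighted mean across columns, which equals non-null cells divided by total cells when every column has the same row count; the profiler's actual method is not stated. The boundary handling in `completeness_level` (for example, where 10.5% nulls falls) is also an assumption, since the report lists only whole-number bands.

```python
def completeness_level(null_pct: float) -> str:
    """Map a null percentage to the report's completeness bands.

    Assumption: band edges are inclusive on the upper side.
    """
    if null_pct == 0:
        return "Complete"
    if null_pct <= 10:
        return "Mostly Complete"
    if null_pct <= 50:
        return "Partial"
    if null_pct <= 90:
        return "Sparse"
    return "Mostly Empty"


# 16 columns are 100% complete; course_id is 37.4% and book_name is 37.3%.
column_completeness = [100.0] * 16 + [37.4, 37.3]
overall = sum(column_completeness) / len(column_completeness)

print(round(overall, 1))                 # 93.0
print(completeness_level(100 - 37.4))    # Sparse (course_id)
print(completeness_level(100 - 37.3))    # Sparse (book_name)
```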

Cardinality Analysis

Cardinality indicates the uniqueness of values in each column.

Cardinality Level	Column Count	Description
Very High (>95% unique)	0	Likely identifiers
High (50-95% unique)	4	High variability
Medium (10-50% unique)	0	Moderate variability
Low (<10% unique)	14	Categorical/Boolean
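
The cardinality percentages in the column profiles can be reproduced as unique values divided by non-null values. The sketch below does that for a few columns and assigns the bands from the table above. The function names are illustrative, and the treatment of exact boundary values (50%, 10%) is an assumption.

```python
def cardinality_pct(unique_values: int, non_null_values: int) -> float:
    """Unique values as a percentage of non-null values."""
    return 100.0 * unique_values / non_null_values


def cardinality_level(pct: float) -> str:
    """Map a cardinality percentage to the report's bands."""
    if pct > 95:
        return "Very High"
    if pct >= 50:
        return "High"
    if pct >= 10:
        return "Medium"
    return "Low"


TOTAL_ROWS = 3_912_221  # these three columns are 100% complete

for name, unique in [
    ("content_notes_id", 2_450_114),
    ("contents", 2_421_243),
    ("book_location", 646),
]:
    pct = cardinality_pct(unique, TOTAL_ROWS)
    print(f"{name}: {pct:.1f}% ({cardinality_level(pct)})")
```

For columns with nulls, such as `course_id`, the denominator would be the non-null count rather than the total row count.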

📖 Glossary

Term	Definition
Cardinality	The number of unique values in a column relative to total non-null values. High cardinality means many unique values.
Completeness	Percentage of non-null values in a column. Higher is better.
Data Type	The technical storage type (e.g., INTEGER, VARCHAR, BOOLEAN).
Identifier	A column containing unique values that identify records (e.g., ID, UUID).
Missing Values	Null, empty, or placeholder values (NA, null, empty string).
Null Percentage	The proportion of null/missing values in a column.
Sample Values	Example values from the column to illustrate its contents.
Variable Type	The semantic role of the column (identifier, categorical, datetime, text, etc.).