OpenStax R Research Image

https://github.com/safeinsights/openstax-research-image/r-4.5.1

Version: 4.5.1

OpenStax Dataset Connection Functions

This document provides comprehensive documentation for all R functions in openstax.R that enable access to OpenStax educational datasets for research purposes.


Available Functions

  1. query_tutor
  2. query_tutor_exercises
  3. query_tutor_notes_and_highlights
  4. query_notes_and_highlights
  5. query_notes_and_highlights_db
  6. query_event_capture

Function Reference

query_tutor

Query OpenStax Tutor dataset from S3 parquet files using DuckDB.

Purpose: Executes SQL queries against OpenStax Tutor data stored as parquet files in S3. This function provides access to Tutor learning analytics data including student interactions, exercise completions, and performance metrics.
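As a rough sketch of the mechanism (not the actual implementation), the function can be thought of as pointing DuckDB at the parquet files selected by the date range and exposing them as the tutor_data view; the bucket path, httpfs setup, and file wildcard below are assumptions for illustration only.

# Illustrative sketch only -- bucket path and S3 configuration are assumptions.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL httpfs;")   # S3 support for DuckDB
dbExecute(con, "LOAD httpfs;")

# Register the parquet files covered by the requested date range as 'tutor_data'
dbExecute(con, "
  CREATE OR REPLACE VIEW tutor_data AS
  SELECT * FROM read_parquet('s3://<tutor-bucket>/tutor/2024-*.parquet')
")

head(dbGetQuery(con, "SELECT * FROM tutor_data LIMIT 5"))
dbDisconnect(con, shutdown = TRUE)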

Parameters:

  • sql_query
    (required): SQL query string to execute against the 'tutor_data' table
  • start_date
    (optional): Start date for file selection (YYYY-MM-DD format or Date object). If not provided, it is extracted from the SQL query or defaults to 2024-01-01
  • end_date
    (optional): End date for file selection (YYYY-MM-DD format or Date object). If not provided, it is extracted from the SQL query or defaults to 2024-12-31

Returns: A data.frame containing the query results

Examples:

# Query with explicit date range
data_2024 <- query_tutor(
  "SELECT * FROM tutor_data WHERE course_id = 123 LIMIT 1000",
  start_date = "2024-01-01",
  end_date = "2024-12-31"
)

# Query with dates extracted from SQL WHERE clause
result <- query_tutor(
  "SELECT * FROM tutor_data WHERE created_at >= '2023-02-01' AND created_at <= '2024-11-30'"
)

# Get course statistics
course_stats <- query_tutor(
  "SELECT course_id, COUNT(DISTINCT user_id) as student_count, AVG(score) as avg_score
   FROM tutor_data
   WHERE created_at >= '2024-01-01'
   GROUP BY course_id",
  start_date = "2024-01-01",
  end_date = "2024-12-31"
)

# Query specific user activity
user_activity <- query_tutor(
  "SELECT * FROM tutor_data WHERE user_id = 12345 AND created_at >= '2024-01-01' ORDER BY created_at",
  start_date = "2024-01-01",
  end_date = "2024-12-31"
)

query_tutor_exercises

Query OpenStax Tutor Exercises dataset from a single S3 parquet file using DuckDB.

Purpose: Executes SQL queries against OpenStax Tutor Exercises data stored as a single parquet file in S3. This dataset contains exercise interaction data including user responses, timing, and performance metrics.

Parameters:

  • sql_query
    (required): SQL query string to execute against the 'exercises_data' table

Returns: A data.frame containing the query results

Examples:

# Get all exercises for a specific user
user_exercises <- query_tutor_exercises(
  "SELECT * FROM exercises_data WHERE user_id = 123 ORDER BY event_time DESC LIMIT 1000"
)

# Count unique users in a date range
user_count <- query_tutor_exercises(
  "SELECT COUNT(DISTINCT user_id) as unique_users
   FROM exercises_data
   WHERE event_time >= '2023-01-01' AND event_time <= '2023-12-31'"
)

# Get exercise response statistics with aggregation
response_stats <- query_tutor_exercises(
  "SELECT exercise_id,
          COUNT(*) as attempts,
          AVG(score) as avg_score,
          COUNT(DISTINCT user_id) as unique_users
   FROM exercises_data
   WHERE event_time >= '2024-01-01'
   GROUP BY exercise_id
   ORDER BY attempts DESC"
)

# Analyze exercise completion rates
completion_analysis <- query_tutor_exercises(
  "SELECT exercise_id,
          SUM(CASE WHEN completed = true THEN 1 ELSE 0 END) as completed_count,
          COUNT(*) as total_attempts,
          AVG(time_spent) as avg_time_seconds
   FROM exercises_data
   WHERE event_time >= '2024-01-01'
   GROUP BY exercise_id"
)

query_tutor_notes_and_highlights

Query OpenStax Tutor Notes and Highlights dataset from a single S3 parquet file using DuckDB.

Purpose: Executes SQL queries against OpenStax Tutor Notes and Highlights data stored as a single parquet file in S3. This dataset contains student notes and highlights from the Tutor platform.

Parameters:

  • sql_query
    (required): SQL query string to execute against the 'tutor_notes_highlights' table

Returns: A data.frame containing the query results

Examples:

# Get all highlights for a specific user
user_highlights <- query_tutor_notes_and_highlights(
  "SELECT * FROM tutor_notes_highlights WHERE user_id = 123 ORDER BY created_at DESC"
)

# Get highlights for a specific date range
date_range_highlights <- query_tutor_notes_and_highlights(
  "SELECT * FROM tutor_notes_highlights
   WHERE created_at >= '2023-01-01' AND created_at <= '2023-12-31'
   LIMIT 10000"
)

# Count notes vs highlights by user
note_type_counts <- query_tutor_notes_and_highlights(
  "SELECT user_id,
          SUM(CASE WHEN type = 'note' THEN 1 ELSE 0 END) as note_count,
          SUM(CASE WHEN type = 'highlight' THEN 1 ELSE 0 END) as highlight_count
   FROM tutor_notes_highlights
   GROUP BY user_id
   HAVING note_count > 0 OR highlight_count > 0"
)

# Analyze highlight activity over time
monthly_activity <- query_tutor_notes_and_highlights(
  "SELECT DATE_TRUNC('month', created_at) as month,
          COUNT(*) as total_annotations,
          COUNT(DISTINCT user_id) as active_users
   FROM tutor_notes_highlights
   WHERE created_at >= '2023-01-01'
   GROUP BY month
   ORDER BY month"
)

query_notes_and_highlights

Query OpenStax Notes and Highlights dataset via S3 parquet files using DuckDB.

Purpose: Executes SQL queries against OpenStax Notes and Highlights data stored as parquet files in S3. This is the recommended method for large queries and analytics workloads as it provides better performance and memory efficiency compared to direct database access.

Parameters:

  • sql_query
    (required): SQL query string to execute against the 'highlights' table

Returns: A data.frame containing the query results

Examples:

# Get highlights with basic filtering
biology_highlights <- query_notes_and_highlights(
  "SELECT * FROM highlights WHERE book_id = 'biology-2e' LIMIT 10000"
)

# Get aggregated statistics (recommended for large datasets)
summary_stats <- query_notes_and_highlights(
  "SELECT book_id,
          COUNT(*) as total_highlights,
          AVG(LENGTH(content)) as avg_length,
          COUNT(DISTINCT user_id) as unique_users
   FROM highlights
   WHERE created_at > '2023-01-01'
   GROUP BY book_id
   ORDER BY total_highlights DESC"
)

# Analyze highlight patterns by date
daily_highlights <- query_notes_and_highlights(
  "SELECT DATE(created_at) as day,
          COUNT(*) as highlight_count,
          COUNT(DISTINCT user_id) as active_users
   FROM highlights
   WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'
   GROUP BY day
   ORDER BY day"
)

# Find most popular highlighted content
popular_content <- query_notes_and_highlights(
  "SELECT content,
          COUNT(*) as times_highlighted,
          COUNT(DISTINCT user_id) as unique_users
   FROM highlights
   WHERE book_id = 'biology-2e'
   GROUP BY content
   HAVING COUNT(*) > 5
   ORDER BY times_highlighted DESC
   LIMIT 100"
)

query_notes_and_highlights_db

Query OpenStax Notes and Highlights dataset via PostgreSQL database.

Purpose: Connects to the OpenStax Notes and Highlights PostgreSQL database and executes a SQL query. This function provides direct database access for real-time queries and is suitable for smaller result sets or when you need the most current data.

Parameters:

  • sql_query
    (required): SQL query string to execute against the database tables

Returns: A data.frame containing the query results

Required Environment Variables:

  • HIGHLIGHTS_DB_HOST
    : Database hostname or IP address
  • HIGHLIGHTS_DB_PORT
    : Database port (typically 5432 for PostgreSQL)
  • HIGHLIGHTS_DB_USER
    : Database username with read access
  • HIGHLIGHTS_DB_PASSWORD
    : Database password
  • HIGHLIGHTS_DB_NAME
    : Database name
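The function reads these variables when it connects. For orientation, here is a minimal sketch of how a connection could be opened from them with RPostgres; the helper manages the connection itself, so this code is illustrative only.

# Illustrative only: query_notes_and_highlights_db builds the connection for you.
library(DBI)
library(RPostgres)

con <- dbConnect(
  RPostgres::Postgres(),
  host     = Sys.getenv("HIGHLIGHTS_DB_HOST"),
  port     = as.integer(Sys.getenv("HIGHLIGHTS_DB_PORT", "5432")),
  user     = Sys.getenv("HIGHLIGHTS_DB_USER"),
  password = Sys.getenv("HIGHLIGHTS_DB_PASSWORD"),
  dbname   = Sys.getenv("HIGHLIGHTS_DB_NAME")
)
result <- dbGetQuery(con, "SELECT COUNT(*) FROM highlights")
dbDisconnect(con)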

Examples:

# Get recent highlights (last 7 days)
recent_highlights <- query_notes_and_highlights_db(
  "SELECT * FROM highlights
   WHERE created_at > CURRENT_DATE - INTERVAL '7 days'
   ORDER BY created_at DESC
   LIMIT 1000"
)

# Count highlights by book
book_counts <- query_notes_and_highlights_db(
  "SELECT book_id, COUNT(*) as highlight_count, COUNT(DISTINCT user_id) as unique_users
   FROM highlights
   GROUP BY book_id
   ORDER BY highlight_count DESC"
)

# Get real-time user activity
current_users <- query_notes_and_highlights_db(
  "SELECT user_id, COUNT(*) as highlights_today
   FROM highlights
   WHERE created_at >= CURRENT_DATE
   GROUP BY user_id
   ORDER BY highlights_today DESC"
)

# Query specific highlight details
highlight_details <- query_notes_and_highlights_db(
  "SELECT h.* FROM highlights h WHERE h.id IN (123, 456, 789)"
)

query_event_capture

Query OpenStax Event Capture dataset from S3 with date partitioning.

Purpose: Queries OpenStax Event Capture data stored in S3 with date-based partitioning. This function constructs S3 paths for the specified date range and event type.

Parameters:

  • event
    (required): Event type to query (corresponds to S3 directory). Examples: "page_view", "highlight_created", "search_performed"
  • start_date
    (required): Start date for query range (YYYY-MM-DD format or Date object)
  • end_date
    (required): End date for query range (YYYY-MM-DD format or Date object)
  • sql_query
    (required): SQL query string to execute against the event data

Returns: Currently returns NULL (implementation is incomplete)

Note: This function is currently under development. The query execution portion is not yet implemented.
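While the query step is pending, the date-partitioned path construction described above can be sketched as follows. The bucket name and partition layout (one directory per event type with per-day subfolders) are assumptions for illustration, not the dataset's actual layout.

# Hypothetical sketch of the path construction described above.
# The bucket name and the "<event>/dt=YYYY-MM-DD/" layout are assumptions.
build_event_paths <- function(event, start_date, end_date,
                              bucket = "s3://<event-capture-bucket>") {
  days <- seq(as.Date(start_date), as.Date(end_date), by = "day")
  file.path(bucket, event, paste0("dt=", format(days, "%Y-%m-%d")), "*.parquet")
}

build_event_paths("page_view", "2023-01-01", "2023-01-03")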

Examples:

# Query page view events for a specific week
page_views <- query_event_capture(
  event = "page_view",
  start_date = "2023-01-01",
  end_date = "2023-01-07",
  sql_query = "SELECT * FROM events WHERE book_id = 'biology-2e'"
)

# Get daily highlight creation counts for a month
daily_highlights <- query_event_capture(
  event = "highlight_created",
  start_date = "2023-01-01",
  end_date = "2023-01-31",
  sql_query = "SELECT DATE(timestamp) as day, COUNT(*) as count FROM events GROUP BY DATE(timestamp) ORDER BY day"
)

# Analyze search activity
search_stats <- query_event_capture(
  event = "search_performed",
  start_date = as.Date("2024-01-01"),
  end_date = as.Date("2024-01-31"),
  sql_query = "SELECT search_term, COUNT(*) as search_count FROM events GROUP BY search_term ORDER BY search_count DESC LIMIT 100"
)

Example Research Workflow

Complete workflow showing how to use these functions in a research container:

source("libraries/openstax.R") source("libraries/safeinsights_common.R") ############################################################################### # Initialize Research Container # DO NOT TOUCH initialize() ############################################################################### # Researcher: Insert query code here # 1. Get Tutor course data tutor_data <- query_tutor( "SELECT course_id, user_id, score, created_at FROM tutor_data WHERE created_at >= '2024-01-01'", start_date = "2024-01-01", end_date = "2024-12-31" ) # 2. Get exercise responses exercises <- query_tutor_exercises( "SELECT user_id, exercise_id, score, event_time FROM exercises_data WHERE event_time >= '2024-01-01' LIMIT 50000" ) # 3. Get highlights data highlights <- query_notes_and_highlights( "SELECT book_id, user_id, content, created_at FROM highlights WHERE created_at >= '2024-01-01' LIMIT 50000" ) ############################################################################### # Analyze and manipulate data # Merge datasets library(dplyr) combined_data <- tutor_data %>% left_join(exercises, by = "user_id") %>% left_join(highlights, by = "user_id") # Perform analysis summary_stats <- combined_data %>% group_by(course_id) %>% summarize( avg_score = mean(score, na.rm = TRUE), total_exercises = n_distinct(exercise_id), total_highlights = n_distinct(content), student_count = n_distinct(user_id) ) # Save results write.csv(summary_stats, "results.csv", row.names = FALSE) ############################################################################### # Upload results toa_results_upload("results.csv")

Installed R Packages and Versions

Core packages

  • httr 1.4.7
  • RPostgres 1.4.7
  • paws 0.7.0
  • sqldf 0.4-11
  • duckdb 1.1.1
  • DBI 1.2.3
  • here 1.0.1
  • qualtRics 1.4.0

Visualization and manipulation packages

  • ggplot2 3.4.3
  • dplyr 1.1.4
  • tidyr 1.3.1
  • readr 2.1.5
  • purrr 1.0.1
  • tibble 3.2.1
  • forcats 1.0.0
  • lubridate 1.9.2
  • jsonlite 1.8.8
  • XML 3.99-0.12
  • stringr 1.5.0
  • stringi 1.8.4

Additional visualization packages

  • ggvis 0.4.8
  • rgl 1.2.8
  • patchwork 1.2.0
  • janitor 2.2.0
  • magrittr 2.0.3
  • vcd 1.4-12
  • maps 3.4.2
  • ggmap 4.0.0
  • sp 2.1-4
  • maptools 1.1-8

Modeling packages

  • tidymodels 1.0.0
  • future 1.33.2
  • furrr 0.3.1
  • lme4 1.1-35.5
  • nlme 3.1-164
  • caret 6.0-94
  • survival 3.5-8
  • car 3.1-2
  • psych 2.4.6.26
  • mice 3.16.0
  • miceadds 3.17-44
  • glmnet 4.1-8

Reporting packages

  • rmarkdown 2.26
  • shiny 1.10.0
  • xtable 1.8-4
  • knitr 1.47
  • kableExtra 1.4.0
  • excluder 0.5.1
  • careless 1.2.2

Time series packages

  • zoo 1.8-12
  • xts 0.13-1
  • quantmod 0.4-20
  • fable 0.3.4
  • tsibble 1.1.4
  • urca 1.3-4
  • feasts 0.3.2

High performance and development packages

  • Rcpp 1.0.11
  • data.table 1.14.8
  • testthat 3.2.1.1
  • devtools 2.4.5
  • roxygen2 7.3.2
  • lintr 3.1.2
  • languageserver 0.3.16