OpenStax Python Research Image

https://github.com/safeinsights/openstax-research-image/python-3.12

Version: 3.12

OpenStax Dataset Connection Functions

This document provides comprehensive documentation for all Python functions in openstax.py that enable access to OpenStax educational datasets for research purposes.


Available Functions

  1. query_tutor
  2. query_tutor_exercises
  3. query_tutor_notes_and_highlights
  4. query_notes_and_highlights
  5. query_notes_and_highlights_db
  6. query_event_capture

Function Reference

query_tutor

Query OpenStax Tutor dataset from S3 parquet files using DuckDB.

Parameters:

  • sql_query (str, required): SQL query string to execute against the 'tutor_data' table
  • start_date (str, optional): Start date for file selection (YYYY-MM-DD format). If not provided, it is extracted from the SQL query or defaults to 2024-01-01
  • end_date (str, optional): End date for file selection (YYYY-MM-DD format). If not provided, it is extracted from the SQL query or defaults to 2024-12-31

Returns: pandas.DataFrame containing the query results

Examples:

from openstax import query_tutor

# Query with explicit date range
data_2024 = query_tutor(
    "SELECT * FROM tutor_data WHERE course_id = 123 LIMIT 1000",
    start_date="2024-01-01",
    end_date="2024-12-31"
)

# Query with dates extracted from SQL WHERE clause
result = query_tutor("""
    SELECT * FROM tutor_data
    WHERE created_at >= '2023-02-01'
      AND created_at <= '2024-11-30'
""")

# Get course statistics
course_stats = query_tutor("""
    SELECT course_id,
           COUNT(DISTINCT user_id) as student_count,
           AVG(score) as avg_score
    FROM tutor_data
    WHERE created_at >= '2024-01-01'
    GROUP BY course_id
""", start_date="2024-01-01", end_date="2024-12-31")

# Query specific user activity
user_activity = query_tutor("""
    SELECT * FROM tutor_data
    WHERE user_id = 12345
      AND created_at >= '2024-01-01'
    ORDER BY created_at
""", start_date="2024-01-01", end_date="2024-12-31")

# Get daily engagement metrics
daily_metrics = query_tutor("""
    SELECT DATE(created_at) as day,
           COUNT(DISTINCT user_id) as active_users,
           COUNT(*) as total_activities,
           AVG(score) as avg_score
    FROM tutor_data
    WHERE created_at >= '2024-01-01'
    GROUP BY DATE(created_at)
    ORDER BY day
""", start_date="2024-01-01", end_date="2024-12-31")

query_tutor_exercises

Query OpenStax Tutor Exercises dataset from S3 parquet file using DuckDB.

Parameters:

  • sql_query (str, required): SQL query string to execute against the 'exercises_data' table

Returns: pandas.DataFrame containing the query results

Examples:

from openstax import query_tutor_exercises

# Get all exercises for a specific user
user_exercises = query_tutor_exercises("""
    SELECT * FROM exercises_data
    WHERE user_id = 123
    ORDER BY event_time DESC
    LIMIT 1000
""")

# Count unique users in a date range
user_count = query_tutor_exercises("""
    SELECT COUNT(DISTINCT user_id) as unique_users
    FROM exercises_data
    WHERE event_time >= '2023-01-01'
      AND event_time <= '2023-12-31'
""")

# Get exercise response statistics with aggregation
response_stats = query_tutor_exercises("""
    SELECT exercise_id,
           COUNT(*) as attempts,
           AVG(score) as avg_score,
           COUNT(DISTINCT user_id) as unique_users
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    GROUP BY exercise_id
    ORDER BY attempts DESC
""")

# Analyze exercise completion rates
completion_analysis = query_tutor_exercises("""
    SELECT exercise_id,
           SUM(CASE WHEN completed = true THEN 1 ELSE 0 END) as completed_count,
           COUNT(*) as total_attempts,
           AVG(time_spent) as avg_time_seconds
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    GROUP BY exercise_id
""")

# Get exercise performance by difficulty
difficulty_stats = query_tutor_exercises("""
    SELECT difficulty_level,
           COUNT(*) as total_attempts,
           AVG(score) as avg_score,
           SUM(CASE WHEN score >= 0.7 THEN 1 ELSE 0 END) as passing_count
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    GROUP BY difficulty_level
    ORDER BY difficulty_level
""")

query_tutor_notes_and_highlights

Query OpenStax Tutor Notes and Highlights dataset from S3 parquet file using DuckDB.

Parameters:

  • sql_query (str, required): SQL query string to execute against the 'tutor_notes_highlights' table

Returns: pandas.DataFrame containing the query results

Examples:

from openstax import query_tutor_notes_and_highlights

# Get all highlights for a specific user
user_highlights = query_tutor_notes_and_highlights("""
    SELECT * FROM tutor_notes_highlights
    WHERE user_id = 123
    ORDER BY created_at DESC
""")

# Get highlights for a specific date range
date_range_highlights = query_tutor_notes_and_highlights("""
    SELECT * FROM tutor_notes_highlights
    WHERE created_at >= '2023-01-01'
      AND created_at <= '2023-12-31'
    LIMIT 10000
""")

# Count notes vs highlights by user
note_type_counts = query_tutor_notes_and_highlights("""
    SELECT user_id,
           SUM(CASE WHEN type = 'note' THEN 1 ELSE 0 END) as note_count,
           SUM(CASE WHEN type = 'highlight' THEN 1 ELSE 0 END) as highlight_count
    FROM tutor_notes_highlights
    GROUP BY user_id
    HAVING note_count > 0 OR highlight_count > 0
""")

# Analyze highlight activity over time
monthly_activity = query_tutor_notes_and_highlights("""
    SELECT DATE_TRUNC('month', created_at) as month,
           COUNT(*) as total_annotations,
           COUNT(DISTINCT user_id) as active_users
    FROM tutor_notes_highlights
    WHERE created_at >= '2023-01-01'
    GROUP BY month
    ORDER BY month
""")

# Find most highlighted content sections
popular_sections = query_tutor_notes_and_highlights("""
    SELECT section_id,
           COUNT(*) as highlight_count,
           COUNT(DISTINCT user_id) as unique_users,
           AVG(LENGTH(content)) as avg_highlight_length
    FROM tutor_notes_highlights
    WHERE type = 'highlight'
    GROUP BY section_id
    HAVING COUNT(*) > 10
    ORDER BY highlight_count DESC
    LIMIT 50
""")

query_notes_and_highlights

Query OpenStax Notes and Highlights dataset via S3 parquet files using DuckDB.

Parameters:

  • sql_query (str, required): SQL query string to execute against the 'highlights' table

Returns: pandas.DataFrame containing the query results

Examples:

from openstax import query_notes_and_highlights

# Get highlights with basic filtering
biology_highlights = query_notes_and_highlights("""
    SELECT * FROM highlights
    WHERE book_id = 'biology-2e'
    LIMIT 10000
""")

# Get aggregated statistics (recommended for large datasets)
summary_stats = query_notes_and_highlights("""
    SELECT book_id,
           COUNT(*) as total_highlights,
           AVG(LENGTH(content)) as avg_length,
           COUNT(DISTINCT user_id) as unique_users
    FROM highlights
    WHERE created_at > '2023-01-01'
    GROUP BY book_id
    ORDER BY total_highlights DESC
""")

# Analyze highlight patterns by date
daily_highlights = query_notes_and_highlights("""
    SELECT DATE(created_at) as day,
           COUNT(*) as highlight_count,
           COUNT(DISTINCT user_id) as active_users
    FROM highlights
    WHERE created_at >= '2024-01-01'
      AND created_at < '2024-02-01'
    GROUP BY day
    ORDER BY day
""")

# Find most popular highlighted content
popular_content = query_notes_and_highlights("""
    SELECT content,
           COUNT(*) as times_highlighted,
           COUNT(DISTINCT user_id) as unique_users
    FROM highlights
    WHERE book_id = 'biology-2e'
    GROUP BY content
    HAVING COUNT(*) > 5
    ORDER BY times_highlighted DESC
    LIMIT 100
""")

query_notes_and_highlights_db

Query OpenStax Notes and Highlights dataset via PostgreSQL database.

Purpose: Connects to the OpenStax Notes and Highlights PostgreSQL database and executes a SQL query. This function provides direct database access for real-time queries and is suitable for smaller result sets or when you need the most current data.

Parameters:

  • sql_query (str, required): SQL query string to execute against the database tables

Returns: pandas.DataFrame containing the query results

Required Environment Variables:

  • HIGHLIGHTS_DB_HOST: Database hostname or IP address
  • HIGHLIGHTS_DB_PORT: Database port (typically 5432 for PostgreSQL)
  • HIGHLIGHTS_DB_USER: Database username with read access
  • HIGHLIGHTS_DB_PASSWORD: Database password
  • HIGHLIGHTS_DB_NAME: Database name
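Before calling query_notes_and_highlights_db, it can help to verify that all of the variables above are set and fail fast with a clear message rather than hitting an opaque connection error later. A minimal sketch (the variable names come from the list above; the check itself is not part of openstax.py):

```python
import os

# Variables required by the database connection, per the list above.
REQUIRED_VARS = [
    "HIGHLIGHTS_DB_HOST",
    "HIGHLIGHTS_DB_PORT",
    "HIGHLIGHTS_DB_USER",
    "HIGHLIGHTS_DB_PASSWORD",
    "HIGHLIGHTS_DB_NAME",
]

# Collect any variables that are unset or empty.
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    print("Missing environment variables:", ", ".join(missing))
```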

Examples:

from openstax import query_notes_and_highlights_db

# Get recent highlights (last 7 days)
recent_highlights = query_notes_and_highlights_db("""
    SELECT * FROM highlights
    WHERE created_at > CURRENT_DATE - INTERVAL '7 days'
    ORDER BY created_at DESC
    LIMIT 1000
""")

# Count highlights by book
book_counts = query_notes_and_highlights_db("""
    SELECT book_id,
           COUNT(*) as highlight_count,
           COUNT(DISTINCT user_id) as unique_users
    FROM highlights
    GROUP BY book_id
    ORDER BY highlight_count DESC
""")

# Get real-time user activity
current_users = query_notes_and_highlights_db("""
    SELECT user_id, COUNT(*) as highlights_today
    FROM highlights
    WHERE created_at >= CURRENT_DATE
    GROUP BY user_id
    ORDER BY highlights_today DESC
""")

# Query specific highlight details
highlight_details = query_notes_and_highlights_db("""
    SELECT h.*
    FROM highlights h
    WHERE h.id IN (123, 456, 789)
""")

query_event_capture

Query OpenStax Event Capture dataset from S3 with date partitioning.

Parameters:

  • event (str, required): Event type to query (corresponds to an S3 directory). Examples: "page_view", "highlight_created", "search_performed"
  • start_date (str, required): Start date for the query range (YYYY-MM-DD format)
  • end_date (str, required): End date for the query range (YYYY-MM-DD format)
  • sql_query (str, required): SQL query string to execute against the event data

Returns: Currently returns None (implementation is incomplete)

Note: This function is currently under development. The query execution portion is not yet implemented.
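Because the function currently returns None, calling code should guard the result before treating it as a DataFrame. A defensive sketch (the stub below merely mimics the current None-returning behavior and is hypothetical, not the real implementation):

```python
# Hypothetical stand-in for query_event_capture's current behavior:
# the real function accepts the same arguments but does not yet
# execute the query, so it returns None.
def query_event_capture(event, start_date, end_date, sql_query):
    return None  # implementation incomplete

result = query_event_capture(
    event="page_view",
    start_date="2023-01-01",
    end_date="2023-01-07",
    sql_query="SELECT * FROM events",
)

# Guard against the None result before using len(), .empty, etc.
rows = 0 if result is None else len(result)
```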

Examples:

from openstax import query_event_capture

# Query page view events for a specific week
page_views = query_event_capture(
    event="page_view",
    start_date="2023-01-01",
    end_date="2023-01-07",
    sql_query="SELECT * FROM events WHERE book_id = 'biology-2e'"
)

# Get daily highlight creation counts for a month
daily_highlights = query_event_capture(
    event="highlight_created",
    start_date="2023-01-01",
    end_date="2023-01-31",
    sql_query="""
        SELECT DATE(timestamp) as day, COUNT(*) as count
        FROM events
        GROUP BY DATE(timestamp)
        ORDER BY day
    """
)

# Analyze search activity
search_stats = query_event_capture(
    event="search_performed",
    start_date="2024-01-01",
    end_date="2024-01-31",
    sql_query="""
        SELECT search_term, COUNT(*) as search_count
        FROM events
        GROUP BY search_term
        ORDER BY search_count DESC
        LIMIT 100
    """
)

Example Research Workflow

Complete workflow showing how to use these functions in a research container:

# Import required libraries
from openstax import (
    query_tutor,
    query_tutor_exercises,
    query_tutor_notes_and_highlights,
    query_notes_and_highlights
)
from safeinsights_common import initialize, toa_results_upload
import pandas as pd

###############################################################################
# Initialize Research Container
# DO NOT TOUCH
initialize()

###############################################################################
# Researcher: Insert query code here

# 1. Get Tutor course data
tutor_data = query_tutor("""
    SELECT course_id, user_id, score, created_at
    FROM tutor_data
    WHERE created_at >= '2024-01-01'
""", start_date="2024-01-01", end_date="2024-12-31")

# 2. Get exercise responses
exercises = query_tutor_exercises("""
    SELECT user_id, exercise_id, score, event_time
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    LIMIT 50000
""")

# 3. Get highlights data from S3 (recommended for large queries)
highlights = query_notes_and_highlights("""
    SELECT book_id, user_id, content, created_at
    FROM highlights
    WHERE created_at >= '2024-01-01'
    LIMIT 50000
""")

###############################################################################
# Analyze and manipulate data

# Both tutor_data and exercises have a 'score' column; rename the exercise
# score before merging so pandas does not split them into score_x/score_y
# and the 'score' aggregation below still refers to the tutor score.
exercises = exercises.rename(columns={'score': 'exercise_score'})

# Merge datasets
combined_data = (
    tutor_data
    .merge(exercises, on='user_id', how='left')
    .merge(highlights, on='user_id', how='left')
)

# Perform analysis
summary_stats = combined_data.groupby('course_id').agg({
    'score': 'mean',
    'exercise_id': 'nunique',
    'content': 'nunique',
    'user_id': 'nunique'
}).rename(columns={
    'score': 'avg_score',
    'exercise_id': 'total_exercises',
    'content': 'total_highlights',
    'user_id': 'student_count'
}).reset_index()

# Save results
summary_stats.to_csv('results.csv', index=False)

###############################################################################
# Upload results
toa_results_upload('results.csv')
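One caveat about the workflow above: merging three tables on user_id alone is a many-to-many join, so each user's tutor rows are duplicated for every matching exercise and highlight row, which can multiply the row count dramatically and skew naive averages. A tiny sketch of the effect, using toy DataFrames (not the real datasets):

```python
import pandas as pd

# Two rows per user on each side: a many-to-many merge on user_id
# yields the per-key cross product (2 x 2 = 4 rows for user 1).
left = pd.DataFrame({"user_id": [1, 1], "score": [0.8, 0.9]})
right = pd.DataFrame({"user_id": [1, 1], "exercise_id": [10, 11]})

merged = left.merge(right, on="user_id", how="left")
```

Aggregating each table to one row per user (or per user and course) before merging avoids the blow-up.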

Error Handling

from openstax import query_tutor

try:
    df = query_tutor(
        "SELECT * FROM tutor_data WHERE course_id = 123",
        start_date="2024-01-01",
        end_date="2024-12-31"
    )
    if df.empty:
        print("No results found")
    else:
        print(f"Retrieved {len(df)} rows")
except Exception as e:
    print(f"Query failed: {e}")
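Transient failures (S3 throttling, brief database outages) can often be resolved by retrying with backoff. A generic sketch, not part of openstax.py; the flaky_query stub below is hypothetical and only simulates a query that fails once before succeeding:

```python
import time

def with_retries(fn, attempts=3, delay=0.01):
    # Call fn, retrying on any exception with a doubling delay;
    # re-raise the error after the final attempt.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

# Hypothetical stand-in for a query that fails once, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient error")
    return "ok"

result = with_retries(flaky_query)
```

In practice the wrapped callable would be a lambda around one of the query functions, e.g. `with_retries(lambda: query_tutor(sql))`.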

Installed Python Packages and Versions

Core Data Science Packages

  • pandas 2.2.2
  • numpy 1.26.4
  • scipy 1.13.0
  • matplotlib 3.8.4
  • seaborn 0.13.2

Database and Data Access

  • duckdb 1.1.1
  • psycopg2-binary 2.9.9
  • SQLAlchemy 2.0.30
  • pyarrow 16.1.0
  • boto3 1.34.103
  • botocore 1.34.103

Machine Learning

  • scikit-learn 1.5.0
  • statsmodels 0.14.2
  • xgboost 2.0.3
  • lightgbm 4.3.0
  • catboost 1.2.5

Deep Learning

  • torch 2.3.0
  • tensorflow 2.16.1
  • keras 3.3.3

Natural Language Processing

  • nltk 3.8.1
  • spacy 3.7.4
  • transformers 4.40.2
  • gensim 4.3.2

Network Analysis

  • networkx 3.3
  • igraph 0.11.5

Utilities

  • requests 2.32.2
  • beautifulsoup4 4.12.3
  • lxml 5.2.1
  • openpyxl 3.1.2
  • xlrd 2.0.1
  • python-dotenv 1.0.1
  • tqdm 4.66.4
  • joblib 1.4.2

Jupyter and Notebook

  • jupyter 1.0.0
  • jupyterlab 4.2.0
  • notebook 7.2.0
  • ipython 8.24.0
  • ipykernel 6.29.4
  • ipywidgets 8.1.2

Testing and Development

  • pytest 8.2.1
  • pytest-cov 5.0.0
  • black 24.4.2
  • flake8 7.0.0
  • mypy 1.10.0
  • pylint 3.1.0

Visualization

  • plotly 5.22.0
  • bokeh 3.4.1
  • altair 5.3.0
  • folium 0.16.0

Time Series

  • prophet 1.1.5
  • pmdarima 2.0.4
  • arch 6.3.0

Performance

  • numba 0.59.1
  • cython 3.0.10
  • polars 0.20.23