OpenStax Python Research Image
https://github.com/safeinsights/openstax-research-image/python-3.12
Version: 3.12
OpenStax Dataset Connection Functions
This document provides comprehensive documentation for all Python functions in the OpenStax research image used to connect to and query OpenStax datasets.
Available Functions
- query_tutor
- query_tutor_exercises
- query_tutor_notes_and_highlights
- query_notes_and_highlights
- query_notes_and_highlights_db
- query_event_capture
Function Reference
query_tutor
Query OpenStax Tutor dataset from S3 parquet files using DuckDB.
Parameters:
- (str, required): SQL query string to execute against the 'tutor_data' table
- start_date (str, optional): Start date for file selection (YYYY-MM-DD format). If not provided, extracted from the SQL query or defaults to 2024-01-01
- end_date (str, optional): End date for file selection (YYYY-MM-DD format). If not provided, extracted from the SQL query or defaults to 2024-12-31
Returns: pandas.DataFrame containing the query results
Examples:
from openstax import query_tutor

# Query with explicit date range
data_2024 = query_tutor(
    "SELECT * FROM tutor_data WHERE course_id = 123 LIMIT 1000",
    start_date="2024-01-01",
    end_date="2024-12-31"
)

# Query with dates extracted from SQL WHERE clause
result = query_tutor("""
    SELECT * FROM tutor_data
    WHERE created_at >= '2023-02-01'
      AND created_at <= '2024-11-30'
""")

# Get course statistics
course_stats = query_tutor("""
    SELECT
        course_id,
        COUNT(DISTINCT user_id) as student_count,
        AVG(score) as avg_score
    FROM tutor_data
    WHERE created_at >= '2024-01-01'
    GROUP BY course_id
""", start_date="2024-01-01", end_date="2024-12-31")

# Query specific user activity
user_activity = query_tutor("""
    SELECT * FROM tutor_data
    WHERE user_id = 12345
      AND created_at >= '2024-01-01'
    ORDER BY created_at
""", start_date="2024-01-01", end_date="2024-12-31")

# Get daily engagement metrics
daily_metrics = query_tutor("""
    SELECT
        DATE(created_at) as day,
        COUNT(DISTINCT user_id) as active_users,
        COUNT(*) as total_activities,
        AVG(score) as avg_score
    FROM tutor_data
    WHERE created_at >= '2024-01-01'
    GROUP BY DATE(created_at)
    ORDER BY day
""", start_date="2024-01-01", end_date="2024-12-31")
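Every query function returns a plain pandas DataFrame, so ordinary pandas inspection and export tools apply directly to the result. A minimal sketch (the selected columns simply mirror those used in the examples above):

from openstax import query_tutor

# Pull a small sample and look at it with standard pandas calls
df = query_tutor(
    "SELECT course_id, user_id, score, created_at FROM tutor_data LIMIT 100",
    start_date="2024-01-01",
    end_date="2024-03-31"
)
print(df.shape)        # (rows, columns)
print(df.dtypes)       # column types as inferred from the parquet data
print(df.head())       # first few rows
print(df.describe())   # summary statistics for numeric columns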
query_tutor_exercises
Query OpenStax Tutor Exercises dataset from S3 parquet file using DuckDB.
Parameters:
- (str, required): SQL query string to execute against the 'exercises_data' table
Returns: pandas.DataFrame containing the query results
Examples:
from openstax import query_tutor_exercises

# Get all exercises for a specific user
user_exercises = query_tutor_exercises("""
    SELECT * FROM exercises_data
    WHERE user_id = 123
    ORDER BY event_time DESC
    LIMIT 1000
""")

# Count unique users in a date range
user_count = query_tutor_exercises("""
    SELECT COUNT(DISTINCT user_id) as unique_users
    FROM exercises_data
    WHERE event_time >= '2023-01-01'
      AND event_time <= '2023-12-31'
""")

# Get exercise response statistics with aggregation
response_stats = query_tutor_exercises("""
    SELECT
        exercise_id,
        COUNT(*) as attempts,
        AVG(score) as avg_score,
        COUNT(DISTINCT user_id) as unique_users
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    GROUP BY exercise_id
    ORDER BY attempts DESC
""")

# Analyze exercise completion rates
completion_analysis = query_tutor_exercises("""
    SELECT
        exercise_id,
        SUM(CASE WHEN completed = true THEN 1 ELSE 0 END) as completed_count,
        COUNT(*) as total_attempts,
        AVG(time_spent) as avg_time_seconds
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    GROUP BY exercise_id
""")

# Get exercise performance by difficulty
difficulty_stats = query_tutor_exercises("""
    SELECT
        difficulty_level,
        COUNT(*) as total_attempts,
        AVG(score) as avg_score,
        SUM(CASE WHEN score >= 0.7 THEN 1 ELSE 0 END) as passing_count
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    GROUP BY difficulty_level
    ORDER BY difficulty_level
""")
query_tutor_notes_and_highlights
Query OpenStax Tutor Notes and Highlights dataset from S3 parquet file using DuckDB.
Parameters:
- (str, required): SQL query string to execute against the 'tutor_notes_highlights' table
Returns: pandas.DataFrame containing the query results
Examples:
from openstax import query_tutor_notes_and_highlights

# Get all highlights for a specific user
user_highlights = query_tutor_notes_and_highlights("""
    SELECT * FROM tutor_notes_highlights
    WHERE user_id = 123
    ORDER BY created_at DESC
""")

# Get highlights for a specific date range
date_range_highlights = query_tutor_notes_and_highlights("""
    SELECT * FROM tutor_notes_highlights
    WHERE created_at >= '2023-01-01'
      AND created_at <= '2023-12-31'
    LIMIT 10000
""")

# Count notes vs highlights by user
note_type_counts = query_tutor_notes_and_highlights("""
    SELECT
        user_id,
        SUM(CASE WHEN type = 'note' THEN 1 ELSE 0 END) as note_count,
        SUM(CASE WHEN type = 'highlight' THEN 1 ELSE 0 END) as highlight_count
    FROM tutor_notes_highlights
    GROUP BY user_id
    HAVING note_count > 0 OR highlight_count > 0
""")

# Analyze highlight activity over time
monthly_activity = query_tutor_notes_and_highlights("""
    SELECT
        DATE_TRUNC('month', created_at) as month,
        COUNT(*) as total_annotations,
        COUNT(DISTINCT user_id) as active_users
    FROM tutor_notes_highlights
    WHERE created_at >= '2023-01-01'
    GROUP BY month
    ORDER BY month
""")

# Find most highlighted content sections
popular_sections = query_tutor_notes_and_highlights("""
    SELECT
        section_id,
        COUNT(*) as highlight_count,
        COUNT(DISTINCT user_id) as unique_users,
        AVG(LENGTH(content)) as avg_highlight_length
    FROM tutor_notes_highlights
    WHERE type = 'highlight'
    GROUP BY section_id
    HAVING COUNT(*) > 10
    ORDER BY highlight_count DESC
    LIMIT 50
""")
query_notes_and_highlights
Query OpenStax Notes and Highlights dataset via S3 parquet files using DuckDB.
Parameters:
- (str, required): SQL query string to execute against the 'highlights' table
Returns: pandas.DataFrame containing the query results
Examples:
from openstax import query_notes_and_highlights

# Get highlights with basic filtering
biology_highlights = query_notes_and_highlights("""
    SELECT * FROM highlights
    WHERE book_id = 'biology-2e'
    LIMIT 10000
""")

# Get aggregated statistics (recommended for large datasets)
summary_stats = query_notes_and_highlights("""
    SELECT
        book_id,
        COUNT(*) as total_highlights,
        AVG(LENGTH(content)) as avg_length,
        COUNT(DISTINCT user_id) as unique_users
    FROM highlights
    WHERE created_at > '2023-01-01'
    GROUP BY book_id
    ORDER BY total_highlights DESC
""")

# Analyze highlight patterns by date
daily_highlights = query_notes_and_highlights("""
    SELECT
        DATE(created_at) as day,
        COUNT(*) as highlight_count,
        COUNT(DISTINCT user_id) as active_users
    FROM highlights
    WHERE created_at >= '2024-01-01'
      AND created_at < '2024-02-01'
    GROUP BY day
    ORDER BY day
""")

# Find most popular highlighted content
popular_content = query_notes_and_highlights("""
    SELECT
        content,
        COUNT(*) as times_highlighted,
        COUNT(DISTINCT user_id) as unique_users
    FROM highlights
    WHERE book_id = 'biology-2e'
    GROUP BY content
    HAVING COUNT(*) > 5
    ORDER BY times_highlighted DESC
    LIMIT 100
""")
query_notes_and_highlights_db
Query OpenStax Notes and Highlights dataset via PostgreSQL database.
Purpose: Connects to the OpenStax Notes and Highlights PostgreSQL database and executes a SQL query. This function provides direct database access for real-time queries and is suitable for smaller result sets or when you need the most current data.
Parameters:
- (str, required): SQL query string to execute against the database tables
Returns: pandas.DataFrame containing the query results
Required Environment Variables:
- : Database hostname or IP address
- : Database port (typically 5432 for PostgreSQL)
- : Database username with read access
- : Database password
- : Database name
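The exact environment variable names are deployment-specific and are not listed here. The sketch below only illustrates one common way to supply them, using python-dotenv (installed in this image) to load a local .env file and verify the values are present before querying; the variable names shown are placeholders, not the real ones.

import os
from dotenv import load_dotenv
from openstax import query_notes_and_highlights_db

# Load database credentials from a local .env file. The names below are
# hypothetical placeholders -- substitute the names your deployment requires.
load_dotenv()
for var in ("DB_HOST", "DB_PORT", "DB_USER", "DB_PASSWORD", "DB_NAME"):
    if not os.environ.get(var):
        raise RuntimeError(f"Missing required environment variable: {var}")

df = query_notes_and_highlights_db("SELECT COUNT(*) AS n FROM highlights")
print(df)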
Examples:
from openstax import query_notes_and_highlights_db

# Get recent highlights (last 7 days)
recent_highlights = query_notes_and_highlights_db("""
    SELECT * FROM highlights
    WHERE created_at > CURRENT_DATE - INTERVAL '7 days'
    ORDER BY created_at DESC
    LIMIT 1000
""")

# Count highlights by book
book_counts = query_notes_and_highlights_db("""
    SELECT
        book_id,
        COUNT(*) as highlight_count,
        COUNT(DISTINCT user_id) as unique_users
    FROM highlights
    GROUP BY book_id
    ORDER BY highlight_count DESC
""")

# Get real-time user activity
current_users = query_notes_and_highlights_db("""
    SELECT
        user_id,
        COUNT(*) as highlights_today
    FROM highlights
    WHERE created_at >= CURRENT_DATE
    GROUP BY user_id
    ORDER BY highlights_today DESC
""")

# Query specific highlight details
highlight_details = query_notes_and_highlights_db("""
    SELECT h.*
    FROM highlights h
    WHERE h.id IN (123, 456, 789)
""")
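A rule of thumb that follows from the description above: use query_notes_and_highlights_db for small, up-to-the-minute result sets, and query_notes_and_highlights (the S3 parquet path) for large historical aggregates, pushing the aggregation into SQL so only the summary comes back. A sketch assuming the same 'highlights' columns used in the examples:

from openstax import query_notes_and_highlights, query_notes_and_highlights_db

# Small, current result set: query the live PostgreSQL database
todays_highlights = query_notes_and_highlights_db("""
    SELECT book_id, COUNT(*) AS n
    FROM highlights
    WHERE created_at >= CURRENT_DATE
    GROUP BY book_id
""")

# Large historical aggregate: use the S3 parquet path and aggregate in SQL
yearly_summary = query_notes_and_highlights("""
    SELECT book_id, COUNT(*) AS total_highlights
    FROM highlights
    WHERE created_at >= '2023-01-01'
    GROUP BY book_id
""")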
query_event_capture
Query OpenStax Event Capture dataset from S3 with date partitioning.
Parameters:
- event (str, required): Event type to query (corresponds to an S3 directory). Examples: "page_view", "highlight_created", "search_performed"
- start_date (str, required): Start date for the query range (YYYY-MM-DD format)
- end_date (str, required): End date for the query range (YYYY-MM-DD format)
- sql_query (str, required): SQL query string to execute against the event data
Returns: Currently returns None (implementation is incomplete)
Note: This function is currently under development. The query execution portion is not yet implemented.
Examples:
from openstax import query_event_capture

# Query page view events for a specific week
page_views = query_event_capture(
    event="page_view",
    start_date="2023-01-01",
    end_date="2023-01-07",
    sql_query="SELECT * FROM events WHERE book_id = 'biology-2e'"
)

# Get daily highlight creation counts for a month
daily_highlights = query_event_capture(
    event="highlight_created",
    start_date="2023-01-01",
    end_date="2023-01-31",
    sql_query="""
        SELECT DATE(timestamp) as day, COUNT(*) as count
        FROM events
        GROUP BY DATE(timestamp)
        ORDER BY day
    """
)

# Analyze search activity
search_stats = query_event_capture(
    event="search_performed",
    start_date="2024-01-01",
    end_date="2024-01-31",
    sql_query="""
        SELECT search_term, COUNT(*) as search_count
        FROM events
        GROUP BY search_term
        ORDER BY search_count DESC
        LIMIT 100
    """
)
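Because query_event_capture currently returns None (see the note above), calling code that expects a DataFrame should guard against that; a minimal defensive sketch:

from openstax import query_event_capture

page_views = query_event_capture(
    event="page_view",
    start_date="2023-01-01",
    end_date="2023-01-07",
    sql_query="SELECT * FROM events WHERE book_id = 'biology-2e'"
)

# Until the query execution portion is implemented, the return value is None
if page_views is None:
    print("query_event_capture returned no data; the function is still under development")
else:
    print(f"Retrieved {len(page_views)} rows")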
Example Research Workflow
Complete workflow showing how to use these functions in a research container:
# Import required libraries
from openstax import (
    query_tutor,
    query_tutor_exercises,
    query_tutor_notes_and_highlights,
    query_notes_and_highlights
)
from safeinsights_common import initialize, toa_results_upload
import pandas as pd

###############################################################################
# Initialize Research Container
# DO NOT TOUCH
initialize()

###############################################################################
# Researcher: Insert query code here

# 1. Get Tutor course data
tutor_data = query_tutor("""
    SELECT course_id, user_id, score, created_at
    FROM tutor_data
    WHERE created_at >= '2024-01-01'
""", start_date="2024-01-01", end_date="2024-12-31")

# 2. Get exercise responses
exercises = query_tutor_exercises("""
    SELECT user_id, exercise_id, score, event_time
    FROM exercises_data
    WHERE event_time >= '2024-01-01'
    LIMIT 50000
""")

# 3. Get highlights data from S3 (recommended for large queries)
highlights = query_notes_and_highlights("""
    SELECT book_id, user_id, content, created_at
    FROM highlights
    WHERE created_at >= '2024-01-01'
    LIMIT 50000
""")

###############################################################################
# Analyze and manipulate data

# Merge datasets. Explicit suffixes keep the Tutor 'score' and 'created_at'
# column names unchanged where the joined tables share a column name.
combined_data = (
    tutor_data
    .merge(exercises, on='user_id', how='left', suffixes=('', '_exercise'))
    .merge(highlights, on='user_id', how='left', suffixes=('', '_highlight'))
)

# Perform analysis
summary_stats = combined_data.groupby('course_id').agg({
    'score': 'mean',
    'exercise_id': 'nunique',
    'content': 'nunique',
    'user_id': 'nunique'
}).rename(columns={
    'score': 'avg_score',
    'exercise_id': 'total_exercises',
    'content': 'total_highlights',
    'user_id': 'student_count'
}).reset_index()

# Save results
summary_stats.to_csv('results.csv', index=False)

###############################################################################
# Upload results
toa_results_upload('results.csv')
Error Handling
from openstax import query_tutor

try:
    df = query_tutor(
        "SELECT * FROM tutor_data WHERE course_id = 123",
        start_date="2024-01-01",
        end_date="2024-12-31"
    )
    if df.empty:
        print("No results found")
    else:
        print(f"Retrieved {len(df)} rows")
except Exception as e:
    print(f"Query failed: {e}")
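For long date ranges it can help to split a query into smaller windows so a single failure only loses one chunk. The sketch below is one way to do that with query_tutor and pandas; the monthly boundaries and column list are illustrative:

import pandas as pd
from openstax import query_tutor

# Query one month at a time and keep whatever succeeds
chunks = []
for start in pd.date_range("2024-01-01", "2024-12-31", freq="MS"):
    end = start + pd.offsets.MonthEnd(0)
    try:
        chunk = query_tutor(
            f"""
            SELECT course_id, user_id, score, created_at
            FROM tutor_data
            WHERE created_at >= '{start.date()}' AND created_at <= '{end.date()}'
            """,
            start_date=str(start.date()),
            end_date=str(end.date()),
        )
        chunks.append(chunk)
    except Exception as e:
        print(f"Chunk starting {start.date()} failed: {e}")

# Combine the successful chunks into one DataFrame
tutor_year = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
print(f"Retrieved {len(tutor_year)} rows across {len(chunks)} successful chunks")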
Installed Python Packages and Versions
Core Data Science Packages
- pandas 2.2.2
- numpy 1.26.4
- scipy 1.13.0
- matplotlib 3.8.4
- seaborn 0.13.2
Database and Data Access
- duckdb 1.1.1
- psycopg2-binary 2.9.9
- SQLAlchemy 2.0.30
- pyarrow 16.1.0
- boto3 1.34.103
- botocore 1.34.103
Machine Learning
- scikit-learn 1.5.0
- statsmodels 0.14.2
- xgboost 2.0.3
- lightgbm 4.3.0
- catboost 1.2.5
Deep Learning
- torch 2.3.0
- tensorflow 2.16.1
- keras 3.3.3
Natural Language Processing
- nltk 3.8.1
- spacy 3.7.4
- transformers 4.40.2
- gensim 4.3.2
Network Analysis
- networkx 3.3
- igraph 0.11.5
Utilities
- requests 2.32.2
- beautifulsoup4 4.12.3
- lxml 5.2.1
- openpyxl 3.1.2
- xlrd 2.0.1
- python-dotenv 1.0.1
- tqdm 4.66.4
- joblib 1.4.2
Jupyter and Notebook
- jupyter 1.0.0
- jupyterlab 4.2.0
- notebook 7.2.0
- ipython 8.24.0
- ipykernel 6.29.4
- ipywidgets 8.1.2
Testing and Development
- pytest 8.2.1
- pytest-cov 5.0.0
- black 24.4.2
- flake8 7.0.0
- mypy 1.10.0
- pylint 3.1.0
Visualization
- plotly 5.22.0
- bokeh 3.4.1
- altair 5.3.0
- folium 0.16.0
Time Series
- prophet 1.1.5
- pmdarima 2.0.4
- arch 6.3.0
Performance
- numba 0.59.1
- cython 3.0.10
- polars 0.20.23