OpenStax Sample Research Container

https://github.com/safeinsights/openstax-sample-research-container

The OpenStax Sample Research Container is designed to improve the researcher experience by giving researchers a starting point for writing research code that accesses OpenStax data in the Secure Enclave.

A research container needs to take four actions:

  1. Connect to the desired data sources for the study
  2. Read the desired data
  3. Perform analysis on the retrieved data
  4. Send the results to the Trusted Output App for review

The goal of this repo is to perform steps 1, 2, and 4 for the researcher so that they can focus only on their specific analysis code.
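The division of labor described above can be sketched in Python, one of the two languages the base directory supports. This is a minimal illustration, not the repo's code: the function names (connect, read_data, send_results, run_study) are hypothetical, the data is an in-memory placeholder, and send_results only returns the CSV text instead of contacting the Trusted Output App. The one part the researcher supplies is the analyze callable for step 3.

```python
import csv
import io
from typing import Callable

Row = dict  # one record of study data


def connect() -> dict:
    """Step 1 (hypothetical): connect to the study's data source."""
    # Placeholder: a real container would open an S3 or Postgres connection.
    return {"source": "in-memory demo"}


def read_data(connection: dict) -> list:
    """Step 2 (hypothetical): read the desired data."""
    # Placeholder rows standing in for data read from the enclave.
    return [{"student": "a", "score": 1}, {"student": "b", "score": 3}]


def to_csv(rows: list) -> str:
    """Serialize a list of dict rows to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def send_results(csv_text: str) -> str:
    """Step 4 (hypothetical): hand the results over for review."""
    # Placeholder: returns the payload instead of sending it anywhere.
    return csv_text


def run_study(analyze: Callable[[list], list]) -> str:
    """Run steps 1, 2, and 4 around the researcher's step 3."""
    connection = connect()
    rows = read_data(connection)
    results = analyze(rows)  # step 3: the researcher's own code
    return send_results(to_csv(results))
```

With this shape, a researcher would only write the function passed to run_study, for example one that returns a single row with the number of records read.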

Today, high level

The repo currently contains two directories: base and example.

Base Directory

Most programming languages publish official container images for their supported versions. These images contain only a minimal installation of the language, so end users can build a container specific to their needs by installing only the dependencies they require and nothing else. The result is a container that is more secure and easier to maintain over time.

In this OpenStax Sample Research Container repo, the base directory contains the code needed to start from the language's published image and install the additional dependencies and files that will be needed in the enclave. This gives each example research container a better starting point and makes building a research container faster.

The base directory contains a subdirectory for each currently supported language: R and Python. Each language's directory contains only the Dockerfile used to create the image. The Dockerfile starts from the versioned base image published for the language and installs the dependencies.

Base R Container

Version: 4.4.1

R packages installed from cran.rstudio.com:

  • httr
  • jsonlite
  • future
  • RPostgres
  • readr
  • arrow
  • paws
  • furrr
  • dplyr

Operating system tools installed:

  • curl
  • pg

Example Directory

The example directory contains working example analysis code that can run in the OpenStax Secure Enclave. Each subdirectory in example corresponds to a type of data source that the container connects to in the OpenStax Secure Enclave: Parquet files or a Postgres database.

Each subdirectory contains two files: the Dockerfile required to build the image, and the R file that performs the analysis on the dataset. The Dockerfile for each example starts from the previously created base R container and copies the R analysis code into the container.

Parquet Analysis

The Parquet analysis is designed to connect to an AWS S3 bucket, navigate to the specified directory, and descend through each subdirectory, listing and then reading all the Parquet files. The files are processed in chunks of furrr.batch_size=1000 so that they can be read in parallel. Once the data has been read, the analysis can be run against it.
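The list, batch, and read-in-parallel pattern described above can be sketched in Python using only the standard library. Several things here are assumptions made for illustration: a local directory stands in for the S3 bucket, read_file is a placeholder that returns a byte count instead of parsing Parquet, and a thread pool stands in for the parallel backend that the R example uses. The batch size of 1000 mirrors the value mentioned in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

BATCH_SIZE = 1000  # mirrors the chunk size mentioned in the text


def list_parquet_files(root: Path) -> list:
    """Descend through every subdirectory and list the .parquet files."""
    return sorted(p for p in root.rglob("*.parquet") if p.is_file())


def batches(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def read_file(path: Path) -> int:
    """Placeholder reader: returns the file size in bytes.

    A real container would parse the Parquet file here.
    """
    return len(path.read_bytes())


def read_all(root: Path, batch_size: int = BATCH_SIZE) -> dict:
    """Read every Parquet file under `root`, one batch at a time."""
    results = {}
    with ThreadPoolExecutor() as pool:
        for batch in batches(list_parquet_files(root), batch_size):
            # Files within a batch are read concurrently by the pool.
            for path, size in zip(batch, pool.map(read_file, batch)):
                results[path] = size
    return results
```

Batching bounds how much work is in flight at once, which matters when a directory tree holds many thousands of files.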

Today the analysis does nothing except convert some data from Parquet format to CSV format.

The CSV file containing the results is then sent to the Trusted Output App to be reviewed.

Postgres Database Analysis

The Postgres Database analysis is designed to connect to an AWS RDS instance and allow the researcher to run SQL queries against the database. The response from each query can then be used in the analysis.

Today, the analysis does nothing except select the number of rows of a specific table and return that as a CSV.
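As a rough illustration of this count-and-export step, the sketch below runs a row-count query and formats the answer as CSV. It is written in Python and, as a loudly labeled assumption, uses an in-memory SQLite database as a stand-in for the Postgres RDS instance so that it runs without any external service. The table name events and the helpers demo_connection and count_rows_csv are hypothetical.

```python
import csv
import io
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory SQLite database (stand-in for Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?)", [(1,), (2,), (3,)])
    return conn


def count_rows_csv(conn: sqlite3.Connection, table: str) -> str:
    """Count the rows of `table` and return the result as CSV text."""
    # Table names cannot be bound as query parameters, so accept only
    # plain identifiers before interpolating the name into the SQL.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["table", "row_count"])
    writer.writerow([table, count])
    return buffer.getvalue()
```

Calling count_rows_csv(demo_connection(), "events") yields a two-line CSV: a header row and one row with the table name and its count.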

The CSV file containing the results is then sent to the Trusted Output App to be reviewed.

To Do

  • Create an R connection library for different data sources
  • Create an R data-reading library for different data types
  • Create an R Trusted Output App library to push data to a member's specific TOA