OpenStax Sample Research Container
https://github.com/safeinsights/openstax-sample-research-container
The OpenStax Sample Research Container is designed to improve the researcher experience by giving researchers a starting point for writing research code that accesses the OpenStax data in their Secure Enclave.
A research container needs to take four actions:
- Connect to the desired data sources for the study
- Read the desired data
- Perform analysis on the retrieved data
- Send the results to the Trusted Output App for review
The goal of this repo is to perform steps 1, 2, and 4 for the researcher so that they can focus only on their specific analysis code.
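The division of labor described above can be sketched as a small R function in which the container supplies the connection, read, and send steps and the researcher supplies only the analysis. This is a minimal illustration, not the repo's actual code: `run_study` and its four arguments are hypothetical names, and the usage example substitutes in-memory stand-ins for the real data source and the Trusted Output App.

```r
# Sketch of the four actions a research container performs.
# All names here are hypothetical and do not come from the repo.
run_study <- function(connect, read_data, analyze, send_results) {
  con <- connect()            # 1. connect to the data source
  data <- read_data(con)      # 2. read the desired data
  results <- analyze(data)    # 3. researcher-supplied analysis
  send_results(results)       # 4. hand the results over for review
}

# Usage with in-memory stand-ins for steps 1, 2, and 4.
outcome <- run_study(
  connect      = function() "fake-connection",
  read_data    = function(con) data.frame(score = c(70, 85, 90)),
  analyze      = function(data) data.frame(mean_score = mean(data$score)),
  send_results = function(results) results
)
```

Passing the steps as functions keeps the analysis testable without access to the enclave, because steps 1, 2, and 4 can be swapped for local stand-ins.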
Today, high level
In the repo today, there are two directories:
Base Directory
Most programming languages maintain an official container image for each supported version of the language. These images contain only a minimal installation of the language, so end users can build a container specific to their needs by installing only the dependencies they require. The result is a container with a high level of security that is easier to maintain over time.
In this OpenStax Example Research Container repo, the
The
Base R Container
Version: 4.4.1
R packages installed from cran.rstudio.com:
- httr
- jsonlite
- future
- RPostgres
- readr
- arrow
- paws
- furrr
- dplyr
Operating system tools installed:
- curl
- pg
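When building an image on top of the base container, it can be useful to confirm that the R packages listed above are actually available before running an analysis. The following is a small sketch of such a check; `missing_packages` is a hypothetical helper and is not part of the repo.

```r
# R packages the base container is documented to include.
required_packages <- c(
  "httr", "jsonlite", "future", "RPostgres", "readr",
  "arrow", "paws", "furrr", "dplyr"
)

# Return the subset of `packages` that cannot be loaded.
# Hypothetical helper, shown for illustration only.
missing_packages <- function(packages) {
  installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!installed]
}

# Example (not run): stop early if anything is missing.
# missing <- missing_packages(required_packages)
# if (length(missing) > 0) stop("Missing packages: ", toString(missing))
```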
Example Directory
The
In each subdirectory, there are two files: the Dockerfile required to build the image, and the R file that performs the analysis on the dataset. The Dockerfile for each example starts from the previously created base R container and copies the R analysis code into it.
Parquet Analysis
The Parquet analysis is designed to connect to an AWS S3 bucket, navigate to the specified directory, and then descend through each subdirectory, listing and then reading all the Parquet files. Once they are read, they are processed in
Today, the analysis does nothing except convert some data from Parquet format to CSV format.
The CSV file containing the results is then sent to the Trusted Output App to be reviewed.
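The list-then-read flow described above might look roughly like the sketch below, which uses the `paws` and `arrow` packages from the base container. It is an illustration under assumptions, not the repo's implementation: the function names, bucket, and prefix are hypothetical, credentials are assumed to come from the environment, and the upload to the Trusted Output App is omitted because its interface is not described here.

```r
# Keep only object keys that point at Parquet files.
parquet_keys <- function(keys) {
  keys[grepl("\\.parquet$", keys, ignore.case = TRUE)]
}

# List every key under a prefix, following pagination. Without a
# delimiter, S3 returns keys from all nested "subdirectories".
list_keys <- function(s3, bucket, prefix) {
  keys <- character(0)
  token <- NULL
  repeat {
    args <- list(Bucket = bucket, Prefix = prefix)
    if (!is.null(token)) args$ContinuationToken <- token
    page <- do.call(s3$list_objects_v2, args)
    keys <- c(keys, vapply(page$Contents, function(obj) obj$Key, character(1)))
    if (!isTRUE(page$IsTruncated)) break
    token <- page$NextContinuationToken
  }
  keys
}

# Read every Parquet file under the prefix into one data frame.
read_parquet_prefix <- function(bucket, prefix) {
  s3 <- paws::s3()  # assumes credentials are set in the environment
  keys <- parquet_keys(list_keys(s3, bucket, prefix))
  tables <- lapply(keys, function(key) {
    body <- s3$get_object(Bucket = bucket, Key = key)$Body  # raw vector
    arrow::read_parquet(body)
  })
  dplyr::bind_rows(tables)
}

# Example (not run): bucket and prefix are placeholders.
# data <- read_parquet_prefix("example-bucket", "example/prefix/")
# readr::write_csv(data, "results.csv")
```

Accepting the S3 client as an argument to `list_keys` lets the listing logic be exercised with a stand-in object, without any AWS access.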
Postgres Database Analysis
The Postgres Database analysis is designed to connect to an AWS RDS instance and allow the researcher to run SQL queries against the database. The response from each query can then be used in the analysis.
Today, the analysis does nothing except select the number of rows of a specific table and return that as a CSV.
The CSV file containing the results is then sent to the Trusted Output App to be reviewed.
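A row-count query of the kind described above could be sketched as follows, using `DBI` with the `RPostgres` driver and `readr` for the CSV output. The environment variable names, function names, and output file name are assumptions made for illustration; the enclave may provide connection details differently, and the upload to the Trusted Output App is not shown.

```r
# Collect connection settings from environment variables.
# The variable names are hypothetical, not taken from the repo.
db_settings <- function(env = Sys.getenv) {
  list(
    host     = env("DB_HOST"),
    port     = as.integer(env("DB_PORT", "5432")),
    dbname   = env("DB_NAME"),
    user     = env("DB_USER"),
    password = env("DB_PASSWORD")
  )
}

# Count the rows of one table, quoting the table name safely.
count_rows <- function(con, table) {
  sql <- paste0(
    "SELECT COUNT(*) AS row_count FROM ",
    DBI::dbQuoteIdentifier(con, table)
  )
  DBI::dbGetQuery(con, sql)
}

# Connect, run the count, and write the result as a CSV file.
run_row_count <- function(table, out_file = "results.csv") {
  args <- c(
    list(RPostgres::Postgres()),
    db_settings(),
    list(bigint = "numeric")  # return COUNT(*) as a plain numeric
  )
  con <- do.call(DBI::dbConnect, args)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  readr::write_csv(count_rows(con, table), out_file)
  out_file
}

# Example (not run): the table name is a placeholder.
# run_row_count("example_table")
```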
To Do
- Create an R connection library for different data sources
- Create an R data-reading library for different data types
- Create an R Trusted Output App library to push data to a member's specific TOA