
agora-data-tools

Intro

A place for Agora's ETL, data testing, and data analysis

This configuration-driven data pipeline uses a config file - which is easy for engineers, analysts, and project managers to understand - to drive the entire ETL process. The code in src/agoradatatools uses parameters defined in a config file to determine which kinds of extraction and transformations a particular dataset needs to go through before the resulting data is serialized as json files that can be loaded into Agora's data repository.

In the spirit of importing datasets with a minimal amount of transformation, one can simply add a dataset to the config file and run the pipeline.
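As an illustration of the config-driven approach, a stripped-down sketch of such a dispatch loop might look like the following. This is not the actual agoradatatools implementation; the dataset entry and field names are placeholders for the pattern only.

```python
# Hypothetical sketch of a config-driven ETL loop; NOT the actual
# agoradatatools implementation, just an illustration of the pattern.
config = {
    "datasets": [
        {
            "name": "genes",  # placeholder dataset name
            "files": [{"id": "syn123", "format": "csv"}],  # placeholder synID
        },
    ]
}

def run_pipeline(config):
    """Walk the configured datasets and produce one json file per dataset."""
    outputs = []
    for dataset in config["datasets"]:
        # In the real pipeline: extract the source files, apply the
        # configured transformations, then serialize to <dataset>.json.
        # Here we only compute the output file name.
        outputs.append(dataset["name"] + ".json")
    return outputs

print(run_pipeline(config))  # ['genes.json']
```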

This src/agoradatatools implementation was influenced by the "Modern Config Driven ELT Framework for Building a Data Lake" talk given at the Data + AI Summit of 2021.

Python notebooks that describe the custom logic for various datasets are located in /data_analysis/notebooks.

Running the pipeline

The json files generated by src/agoradatatools are written to folders in the Agora Synapse project by default, although you can modify the destination Synapse folder in the config file.

Note that running the pipeline does not automatically update the Agora database in any environment. Ingestion of generated json files into the Agora databases is handled by agora-data-manager.

You can run the pipeline in any of the following ways:

  1. Nextflow Tower is the simplest, but least flexible, way to run the pipeline; it does not require Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
  2. Locally requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
  3. Docker requires installing Docker, obtaining the required Synapse permissions, and creating a Synapse PAT.

When running the pipeline, you must specify the config file that will be used. There are two config files that are checked into this repo:

  • test_config.yaml places the transformed datasets in the Agora Testing Data folder in Synapse; write files to this folder to perform data validation.
  • config.yaml places the transformed datasets in the Agora Live Data Synapse folder; write files to this folder once you've validated that the ETL process is generating files suitable for release. Note that files in the Agora Live Data folder are not automatically released, so writing a 'bad' file version to this folder is recoverable. A releasable manifest file can be generated by a subsequent ETL processing run into the folder, or manually if necessary.

You may also create a custom config file to use locally to target specific dataset(s) or transforms of interest, and/or to write the generated json files to a different Synapse location. See the config file section for additional information.

Nextflow Tower

This pipeline can be executed without any local installation, permissions, or credentials; the Sage Bionetworks Nextflow Tower workspace is configured to use Agora's Synapse credentials, which can be found in LastPass in the "Shared-Agora" Folder.

The instructions to trigger the workflow can be found at Sage-Bionetworks-Workflows/nf-agora

Configuring Synapse Credentials

  1. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend here. If you see a green unlocked lock icon, then you should be good to go.
  2. Obtain write access to the destination Synapse project, e.g. Agora Synapse project
  3. Create a Synapse personal access token (PAT)
  4. Set up your Synapse Python client locally
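Once you have a PAT, the Synapse Python client can pick it up from a ~/.synapseConfig file. A minimal sketch of that file, with the PAT placeholder left for your actual token, looks like this:

```ini
[authentication]
authtoken = <your PAT>
```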

Your configured Synapse credentials can be used to run this package both locally and using Docker, as outlined below.

Locally

Perform the following one-time steps to set up your local environment and obtain the required Synapse permissions:

  1. This package uses Python. If you have not already done so, install pyenv to manage your Python versions; this package supports all Python versions >=3.7 and <3.11. If you do not install pyenv, make sure that Python and pip are installed correctly and have been added to your PATH by running python3 --version and pip3 --version. If your installation was successful, your terminal will return the versions of Python and pip that you installed. Note: pyenv will install a specific version of Python for you.

  2. Install pipenv by running pip install pipenv.

  3. Install git if you have not done so already using these instructions

  4. Clone this GitHub repository to your local machine by opening your terminal, navigating to the directory into which you want this repository to be cloned, and running git clone https://github.com/Sage-Bionetworks/agora-data-tools.git. After cloning is complete, navigate into the newly created agora-data-tools directory.

  5. Install agoradatatools locally using pipenv:

      pipenv install
      # To develop locally, add --dev:
      # pipenv install --dev
      pipenv shell
  6. You can check whether the package was installed correctly by running adt --help in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired config file as an argument. The following example command will execute the pipeline using test_config.yaml:

    adt test_config.yaml

Docker

There is a publicly available GHCR repository automatically built via GitHub Actions. That said, you may want to develop using Docker locally on a feature branch.

If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time step to set up your Docker environment and obtain the required Synapse permissions:

  1. Install Docker.

Once you have completed the one-time setup step outlined above, execute the pipeline by running the following command and providing your PAT and the desired config file as an argument. The following example command will execute the pipeline in Docker using test_config.yaml:

# This creates a local Docker image
docker build -t agora-data-tools .
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools adt test_config.yaml

Testing Github Workflow

In order to test the GitHub Actions workflow locally:

  • install act and Docker
  • create a .secrets file in the root directory of the repository with a SYNAPSE_USER and a SYNAPSE_PASS value*
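act reads the --secret-file in dotenv format, so a .secrets file along these lines should work (placeholders shown, not real credentials):

```
SYNAPSE_USER=<your Synapse username>
SYNAPSE_PASS=<your Synapse password>
```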

Then run:

act -v --secret-file .secrets

The repository is currently using Agora's credentials for Synapse. Those can be found in LastPass in the "Shared-Agora" Folder.

Unit Tests

Unit tests can be run by calling pytest from the command line.

python -m pytest
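For orientation, a pytest-style unit test looks like the following sketch. The function under test here is hypothetical, not part of the real agoradatatools API; it only shows the assert-based convention pytest discovers and runs.

```python
# Hypothetical example of a pytest-style test; standardize_column_names
# is an illustrative helper, not a real agoradatatools function.
def standardize_column_names(columns):
    """Lowercase, strip, and snake_case a list of column names."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

def test_standardize_column_names():
    # pytest collects any function named test_* and runs its asserts
    assert standardize_column_names(["Gene Name", " HGNC_Symbol"]) == [
        "gene_name",
        "hgnc_symbol",
    ]
```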

Config

Parameters:

  • destination: Defines the default target location (folder) that the generated json files are written to; this value can be overridden on a per-dataset basis
  • staging_path: Defines the location of the staging folder that the generated json files are written to
  • gx_folder: Defines the Synapse ID of the folder that generated GX reports are written to
  • datasets/<dataset>: Each generated json file is named <dataset>.json
  • datasets/<dataset>/files: A list of source files for the dataset
    • name: The name of the source file (this name is the reference the code will use to retrieve a file from the configuration)
    • id: Synapse id of the file
    • format: The format of the source file
  • datasets/<dataset>/provenance: The Synapse id of each entity that the dataset is derived from, used to populate the generated file's Synapse provenance. (The Synapse API calls this "Activity")
  • datasets/<dataset>/destination: Override the default destination for a specific dataset by specifying a synID, or use *dest to use the default destination
  • datasets/<dataset>/column_rename: Columns to be renamed prior to data transformation
  • datasets/<dataset>/agora_rename: Columns to be renamed after data transformation, but prior to json serialization
  • datasets/<dataset>/custom_transformations: The list of additional transformations to apply to the dataset; a value of 1 indicates the default transformation
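Putting the parameters above together, a hypothetical config fragment might look like the following. The dataset name, Synapse IDs, and column names are placeholders, not entries from the real config.yaml:

```yaml
destination: &dest syn00000000   # placeholder synID for the default output folder
staging_path: ./staging
gx_folder: syn00000001           # placeholder synID for GX reports
datasets:
  - example_dataset:             # output file will be example_dataset.json
      files:
        - name: example_source   # name the code uses to reference this file
          id: syn00000002        # placeholder synID
          format: csv
      provenance:
        - syn00000002
      destination: *dest         # use the default destination
      column_rename:
        old_column: new_column
      custom_transformations: 1  # default transformation
```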

agora-data-tools's People

Contributors

bwmac, j-hendrickson-sage, jaclynbeck-sage, jessterb, mfazza, thomasyu888
