
agora-data-tools

Intro

A place for Agora's ETL, data testing, and data analysis

This configuration-driven data pipeline uses a config file - which is easy for engineers, analysts, and project managers to understand - to drive the entire ETL process. The code in src/agoradatatools uses parameters defined in a config file to determine which kinds of extraction and transformations a particular dataset needs to go through before the resulting data is serialized as json files that can be loaded into Agora's data repository.

In the spirit of importing datasets with a minimal amount of transformation, one can simply add a dataset to the config file and run the pipeline.
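As an illustration of the config-driven approach, a stripped-down sketch of such a dispatch loop might look like the following. This is not the actual agoradatatools implementation; the dataset entry and field names are placeholders for the pattern only.

```python
# Hypothetical sketch of a config-driven ETL loop; NOT the actual
# agoradatatools implementation, just an illustration of the pattern.
config = {
    "datasets": [
        {
            "name": "genes",  # placeholder dataset name
            "files": [{"id": "syn123", "format": "csv"}],  # placeholder synID
        },
    ]
}

def run_pipeline(config):
    """Walk the configured datasets and produce one json file per dataset."""
    outputs = []
    for dataset in config["datasets"]:
        # In the real pipeline: extract the source files, apply the
        # configured transformations, then serialize to <dataset>.json.
        # Here we only compute the output file name.
        outputs.append(dataset["name"] + ".json")
    return outputs

print(run_pipeline(config))  # ['genes.json']
```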

This src/agoradatatools implementation was influenced by the "Modern Config Driven ELT Framework for Building a Data Lake" talk given at the Data + AI Summit of 2021.

Python notebooks that describe the custom logic for various datasets are located in /data_analysis/notebooks.

Running the pipeline

The json files generated by src/agoradatatools are written to folders in the Agora Synapse project by default, although you can modify the destination Synapse folder in the config file.

Note that running the pipeline does not automatically update the Agora database in any environment. Ingestion of generated json files into the Agora databases is handled by agora-data-manager.

You can run the pipeline in any of the following ways:

  1. Nextflow Tower is the simplest, but least flexible, way to run the pipeline; it does not require Synapse permissions, creating a Synapse PAT, or setting up the Synapse Python client.
  2. Locally requires installing Python and Pipenv, obtaining the required Synapse permissions, creating a Synapse PAT, and setting up the Synapse Python client.
  3. Docker requires installing Docker, obtaining the required Synapse permissions, and creating a Synapse PAT.

When running the pipeline, you must specify the config file that will be used. There are two config files that are checked into this repo:

  • test_config.yaml places the transformed datasets in the Agora Testing Data folder in Synapse; write files to this folder to perform data validation.
  • config.yaml places the transformed datasets in the Agora Live Data Synapse folder; write files to this folder once you've validated that the ETL process is generating files suitable for release. Note that files in the Agora Live Data folder are not automatically released, so writing a 'bad' file version to this folder is recoverable. A releasable manifest file can be generated by a subsequent ETL processing run into the folder, or manually if necessary.

You may also create a custom config file to use locally to target specific dataset(s) or transforms of interest, and/or to write the generated json files to a different Synapse location. See the config file section for additional information.

Nextflow Tower

This pipeline can be executed without any local installation, permissions, or credentials; the Sage Bionetworks Nextflow Tower workspace is configured to use Agora's Synapse credentials, which can be found in LastPass in the "Shared-Agora" Folder.

The instructions to trigger the workflow can be found at Sage-Bionetworks-Workflows/nf-agora

Configuring Synapse Credentials

  1. Obtain download access to all required source files in Synapse, including accepting the terms of use on the AD Knowledge Portal backend here. If you see a green unlocked lock icon, then you should be good to go.
  2. Obtain write access to the destination Synapse project, e.g. Agora Synapse project
  3. Create a Synapse personal access token (PAT)
  4. Set up your Synapse Python client locally
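Once you have a PAT, the Synapse Python client can pick it up from a ~/.synapseConfig file. A minimal sketch of that file, with the PAT placeholder left for your actual token, looks like this:

```ini
[authentication]
authtoken = <your PAT>
```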

Your configured Synapse credentials can be used to run this package both locally and using Docker, as outlined below.

Locally

Perform the following one-time steps to set up your local environment and obtain the required Synapse permissions:

  1. This package uses Python. If you have not already done so, install pyenv to manage your Python versions; this package supports all Python versions >=3.7 and <3.11. If you do not install pyenv, make sure that Python and pip are installed correctly and have been added to your PATH by running python3 --version and pip3 --version. If your installation was successful, your terminal will return the versions of Python and pip that you installed. Note: pyenv will install a specific version of Python for you.

  2. Install pipenv by running pip install pipenv.

  3. Install git if you have not done so already using these instructions

  4. Clone this GitHub repository to your local machine by opening your terminal, navigating to the directory into which you want this repository to be cloned, and running git clone https://github.com/Sage-Bionetworks/agora-data-tools.git. After cloning is complete, navigate into the newly created agora-data-tools directory.

  5. Install agoradatatools locally using pipenv:

      pipenv install
      # To develop locally, add --dev:
      # pipenv install --dev
      pipenv shell
  6. You can check whether the package was installed correctly by running adt --help in the terminal. If it returns instructions about how to use the CLI, installation was successful and you can run the pipeline by providing the desired config file as an argument. The following example command will execute the pipeline using test_config.yaml:

    adt test_config.yaml

Docker

There is a publicly available GHCR repository automatically built via GitHub Actions. That said, you may want to develop using Docker locally on a feature branch.

If you don't want to deal with Python paths and dependencies, you can use Docker to run the pipeline. Perform the following one-time step to set up your Docker environment and obtain the required Synapse permissions:

  1. Install Docker.

Once you have completed the one-time setup step outlined above, execute the pipeline by running the following command and providing your PAT and the desired config file as an argument. The following example command will execute the pipeline in Docker using test_config.yaml:

# This creates a local Docker image
docker build -t agora-data-tools .
docker run -e SYNAPSE_AUTH_TOKEN=<your PAT> agora-data-tools adt test_config.yaml

Testing Github Workflow

In order to test the GitHub Actions workflow locally:

  • install act and Docker
  • create a .secrets file in the root directory of the repository with a SYNAPSE_USER and a SYNAPSE_PASS value*
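act reads the --secret-file in dotenv format, so a .secrets file along these lines should work (placeholders shown, not real credentials):

```
SYNAPSE_USER=<your Synapse username>
SYNAPSE_PASS=<your Synapse password>
```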

Then run:

act -v --secret-file .secrets

The repository is currently using Agora's credentials for Synapse. Those can be found in LastPass in the "Shared-Agora" Folder.

Unit Tests

Unit tests can be run by calling pytest from the command line.

python -m pytest
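For orientation, a pytest-style unit test looks like the following sketch. The function under test here is hypothetical, not part of the real agoradatatools API; it only shows the assert-based convention pytest discovers and runs.

```python
# Hypothetical example of a pytest-style test; standardize_column_names
# is an illustrative helper, not a real agoradatatools function.
def standardize_column_names(columns):
    """Lowercase, strip, and snake_case a list of column names."""
    return [c.strip().lower().replace(" ", "_") for c in columns]

def test_standardize_column_names():
    # pytest collects any function named test_* and runs its asserts
    assert standardize_column_names(["Gene Name", " HGNC_Symbol"]) == [
        "gene_name",
        "hgnc_symbol",
    ]
```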

Config

Parameters:

  • destination: Defines the default target location (folder) that the generated json files are written to; this value can be overridden on a per-dataset basis
  • staging_path: Defines the location of the staging folder that the generated json files are written to
  • gx_folder: Defines the Synapse ID of the folder that generated GX reports are written to
  • datasets/<dataset>: Each generated json file is named <dataset>.json
  • datasets/<dataset>/files: A list of source files for the dataset
    • name: The name of the source file (this name is the reference the code will use to retrieve a file from the configuration)
    • id: Synapse id of the file
    • format: The format of the source file
  • datasets/<dataset>/provenance: The Synapse id of each entity that the dataset is derived from, used to populate the generated file's Synapse provenance. (The Synapse API calls this "Activity")
  • datasets/<dataset>/destination: Override the default destination for a specific dataset by specifying a synID, or use *dest to use the default destination
  • datasets/<dataset>/column_rename: Columns to be renamed prior to data transformation
  • datasets/<dataset>/agora_rename: Columns to be renamed after data transformation, but prior to json serialization
  • datasets/<dataset>/custom_transformations: The list of additional transformations to apply to the dataset; a value of 1 indicates the default transformation
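Putting the parameters above together, a hypothetical config fragment might look like the following. The dataset name, Synapse IDs, and column names are placeholders, not entries from the real config.yaml:

```yaml
destination: &dest syn00000000   # placeholder synID for the default output folder
staging_path: ./staging
gx_folder: syn00000001           # placeholder synID for GX reports
datasets:
  - example_dataset:             # output file will be example_dataset.json
      files:
        - name: example_source   # name the code uses to reference this file
          id: syn00000002        # placeholder synID
          format: csv
      provenance:
        - syn00000002
      destination: *dest         # use the default destination
      column_rename:
        old_column: new_column
      custom_transformations: 1  # default transformation
```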

agora-data-tools's People

Contributors

bwmac, j-hendrickson-sage, jaclynbeck-sage, jessterb, mfazza, thomasyu888
