d3b-lib-pfb-exporter's Introduction


🏭 PFB Exporter

Transform and export data from a relational database into a PFB (Portable Format for Bioinformatics) file.

A PFB file is special kind of Avro file, suitable for capturing and reconstructing relational data. Read Background for more information.

NOTE: This is still a 🚧 prototype, as it has only been tested on the Kids First PostgreSQL database.

Quickstart

$ git clone git@github.com:d3b-center/d3b-lib-pfb-exporter.git
$ cd d3b-lib-pfb-exporter
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .
$ pfbe -h

Try it out:

# List commands and docs
$ pfbe --help

# Create a PFB file from the given data and SQLAlchemy models
$ pfbe export tests/data/input -m tests/data/models.py -o tests/data/pfb_export

# Create a PFB file from the given data and generate SQLAlchemy models from db
$ pfbe export tests/data/input -d $MY_DB_CONN_URL -m tests/data/models.py -o tests/data/pfb_export

# Create just the PFB schema from the given SQLAlchemy models
$ pfbe create_schema -m tests/data/models.py -o tests/data/pfb_export

# Create just the PFB schema but first generate the SQLAlchemy models from db
$ pfbe create_schema -d $MY_DB_CONN_URL -m tests/data/models.py -o tests/data/pfb_export

Outputs

The output directory contains the generated PFB file, logs, and other files for debugging:

tests/data/pfb_export
├── logs
│   └── pfb_export.log  -> Log file containing log statements from console
├── metadata.json       -> PFB Metadata Entity
├── models.py           -> Generated SQLAlchemy model classes if run with -d CLI option
├── orm_models.json     -> Serialized SQLAlchemy model classes
├── pfb.avro            -> The PFB file
└── pfb_schema.json     -> The PFB file schema

Supported Databases

Theoretically, any database supported by SQLAlchemy, but so far this has only been tested on a single PostgreSQL database.

Developers

Follow Quickstart instructions first. Then install dev requirements:

$ pip install -r dev-requirements.txt

Background

What is an Avro File?

A file with data records (JSON) and a schema (JSON) to describe each data record. Avro files can be serialized into a binary format and compressed.

Read more about Avro.
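
As a concrete (non-PFB) illustration, here is a minimal round trip with the fastavro package; the record name and fields are made up for the example:

# Minimal Avro example: a schema, a few records, one binary file.
# Assumes `pip install fastavro`; record/field names are illustrative.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "Biospecimen",
    "fields": [
        {"name": "kf_id", "type": "string"},
        {"name": "analyte_type", "type": ["null", "string"], "default": None},
    ],
})

records = [
    {"kf_id": "BS_00000001", "analyte_type": "DNA"},
    {"kf_id": "BS_00000002", "analyte_type": None},
]

# Serialize the schema plus the records into a single binary Avro file
with open("example.avro", "wb") as f:
    writer(f, schema, records)

# Read them back; the schema travels with the data
with open("example.avro", "rb") as f:
    for rec in reader(f):
        print(rec)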

What is a PFB File?

A PFB file is special kind of Avro file, suitable for capturing and reconstructing biomedical relational data.

A PFB file is an Avro file with a particular Avro schema that represents a relational database. We call this schema the PFB Schema.

The data in a PFB file is a list of JSON objects called PFB Entities. There are two types of PFB Entity: one (Metadata) captures information about the relational database, and the other (Table Row) captures a row of data from a particular table in the database.

The data records in a PFB file are produced by transforming the original data from a relational database into PFB Entity objects. Each PFB Entity object conforms to its Avro schema.

Vanilla Avro vs PFB

Let's say a client receives an Avro file and reads in the Avro data. The client now has the Avro schema and all of the data that conforms to that schema in one big JSON blob, and it can do what it wants with it. Maybe it wants to construct some data input forms; it has everything it needs to do this, since the schema defines all of the entities, their attributes, and the types of those attributes.

Now what happens if the client wants to reconstruct a relational database from the data? How does it know what tables to create and what the relationships are between those tables? Which relationships are required and which are not? This is one of the problems PFB addresses.

How PFB Exporter CLI Works


PFB File Creation

  1. Create the Avro schemas for PFB Entity types and the PFB File
  2. Transform the JSON objects representing rows of data from the relational database into PFB Entities
  3. Add the Avro schemas to the PFB Avro file
  4. Add the PFB Entities to the Avro file
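
A much simplified sketch of what these steps amount to is below. The entity layout is illustrative only (it is not the exact PFB schema), the table and field names are made up, and fastavro stands in for whatever Avro library is used:

# Illustrative sketch of steps 1-4 above; not the exporter's real code.
from fastavro import parse_schema, writer

# 1. An Avro schema for a generic, PFB-style entity (simplified)
pfb_schema = parse_schema({
    "type": "record",
    "name": "Entity",
    "fields": [
        {"name": "name", "type": "string"},       # source table name
        {"name": "object",                        # the row's column values
         "type": {"type": "map", "values": ["null", "string"]}},
    ],
})

# 2. Transform JSON row objects from the database into PFB-style entities
rows_by_table = {
    "biospecimen": [{"kf_id": "BS_00000001", "analyte_type": "DNA"}],
}
entities = [
    {"name": table, "object": {k: str(v) for k, v in row.items()}}
    for table, rows in rows_by_table.items()
    for row in rows
]

# 3 & 4. Write the schema and the entities into one Avro file
with open("pfb.avro", "wb") as f:
    writer(f, pfb_schema, entities)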

PFB Schema Creation

The PFB File schema is created from SQLAlchemy declarative base classes in a file or directory. If the classes are not provided, they are generated by inspecting the database's schema using the sqlacodegen library.
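
For reference, sqlacodegen can also be run on its own (something like sqlacodegen $MY_DB_CONN_URL --outfile models.py), and the model-to-schema step boils down to a mapping along these lines. This is a sketch, not the library's actual implementation, and the type map is trimmed to a few common cases:

# Sketch: derive an Avro record schema from a SQLAlchemy declarative class
from sqlalchemy import Boolean, Float, Integer, String, inspect

AVRO_TYPES = [
    (Boolean, "boolean"),
    (Integer, "long"),
    (Float, "double"),
    (String, "string"),
]

def model_to_avro_record(model_cls):
    fields = []
    for col in inspect(model_cls).columns:
        avro_type = next(
            (avro for sa_type, avro in AVRO_TYPES
             if isinstance(col.type, sa_type)),
            "string",                    # fall back to string for other types
        )
        if col.nullable:
            avro_type = ["null", avro_type]
        fields.append({"name": col.name, "type": avro_type})
    return {
        "type": "record",
        "name": model_cls.__tablename__,
        "fields": fields,
    }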

d3b-lib-pfb-exporter's People

Contributors

znatty22

d3b-lib-pfb-exporter's Issues

Update README

  • Add badges and logo
  • Add doc on create-schema command
  • Add doc on where outputs go
  • Prototype disclaimer: only tested with the Kids First Postgres db

Reconstruct database from PFB file

Running this:

$ pfbe reconstruct <path to pfb_export dir> --db_conn_url=<url>

Would recreate the database schema from the previously generated SQLAlchemy models, apply it to the database, parse the original data out of the PFB file, and insert it into the database.

OR, if it's possible (but likely not, since PFB -> SQLAlchemy models might be lossy):

$ pfbe reconstruct <path to PFB avro file> --db_conn_url=<url>

Recreate the SQLAlchemy models from the PFB file, recreate the database schema from the models, apply it to the database, parse the original data out of the PFB file, and insert it into the database.
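
A rough sketch of what the first variant might look like, assuming fastavro, the simplified entity shape used in the sketches above, and that the generated models.py exposes a declarative Base (paths and the connection URL are placeholders):

# Hypothetical sketch of `pfbe reconstruct`; ignores foreign-key insert order
from fastavro import reader
from sqlalchemy import create_engine

from models import Base                  # models.py from the pfb_export dir

engine = create_engine("postgresql://user:pass@host/db")   # placeholder URL

# Recreate the database schema from the previously generated models
Base.metadata.create_all(engine)

# Parse the original rows back out of the PFB file and insert them
with engine.begin() as conn, open("pfb_export/pfb.avro", "rb") as f:
    for entity in reader(f):
        table = Base.metadata.tables[entity["name"]]        # source table
        conn.execute(table.insert().values(**entity["object"]))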

Fetch/download data from database

User should be able to provide a SQL script that fetches the data they want from the database.

$ pfbe download --type=biospecimen --db_conn_url=<url> --sql_file=biospecimens_by_study.sql --output_dir=<output dir>

Would output a newline-delimited JSON (NDJSON) file <output dir>/biospecimen.ndjson containing:

"biospecimen"
{ biospecimen JSON object 0 }
...
{ biospecimen JSON object N }

And running this:

$ pfbe download --db_conn_url=<url> --output_dir=<output dir>

Would download the entire database and output a set of NDJSON files, one for each table in the database.
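
The command and flag names above are proposals from this issue; the SQL-to-NDJSON step itself could be as simple as this sketch (the download function name is hypothetical):

# Sketch: run a SQL file against the database and dump rows as NDJSON
import json

from sqlalchemy import create_engine, text

def download(db_conn_url, sql_file, entity_type, output_dir):
    engine = create_engine(db_conn_url)
    with open(sql_file) as f:
        query = text(f.read())
    out_path = f"{output_dir}/{entity_type}.ndjson"
    with engine.connect() as conn, open(out_path, "w") as out:
        out.write(json.dumps(entity_type) + "\n")   # first line names the type
        for row in conn.execute(query):             # SQLAlchemy 1.4+ Row objects
            out.write(json.dumps(dict(row._mapping), default=str) + "\n")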

Refactor how many-to-many relationships are captured

The PFB file is currently a direct representation of a relational database schema, so all tables are captured, including the intermediate (association) tables that form the many-to-many relationships between two tables.

Refactor the PFB file construction so that intermediate tables are not captured directly. Instead, any entities with many-to-many relationships should be linked directly rather than through the intermediate table.

For example, there is a many-to-many relationship between the biospecimen table and the genomic_file table. The biospecimen_genomic_file association table is the intermediate table with foreign keys to the biospecimen and genomic_file tables.

When building the PFB file, do not build PFB Entities to represent the rows in the biospecimen_genomic_file table. Instead, just link the appropriate biospecimen PFB Entities to the genomic_file PFB Entities.
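
One possible way to detect association tables, assuming the SQLAlchemy models declare their many-to-many relationships with secondary= (a sketch, not the planned implementation):

# Sketch: find association (intermediate) tables so their rows can be skipped,
# and record the many-to-many links they imply instead.
from sqlalchemy import inspect

def many_to_many_links(model_classes):
    links, association_tables = [], set()
    for cls in model_classes:
        for rel in inspect(cls).relationships:
            if rel.secondary is not None:            # many-to-many relationship
                association_tables.add(rel.secondary.name)
                links.append((cls.__tablename__, rel.target.name))
    return links, association_tables

# Entities would not be built for tables in `association_tables`
# (e.g. biospecimen_genomic_file); the link list would drive direct
# biospecimen <-> genomic_file relations instead.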

Import data from PFB file into database

Running this:

$ pfbe import <path to pfb avro file> --db_conn_url=<url> --operation=[insert, update, upsert]

Would parse out the original source data from the PFB file and persist it in the database using the supplied operation.
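
A sketch of the upsert case using SQLAlchemy's PostgreSQL dialect (pfb_import is a hypothetical name, the entity shape is the simplified one from the sketches above, and insert/update would follow the same pattern):

# Sketch: upsert rows parsed from the PFB file; PostgreSQL dialect only
from fastavro import reader
from sqlalchemy import MetaData, create_engine
from sqlalchemy.dialects.postgresql import insert

def pfb_import(pfb_file, db_conn_url):
    engine = create_engine(db_conn_url)
    metadata = MetaData()
    metadata.reflect(bind=engine)                 # target tables must exist
    with engine.begin() as conn, open(pfb_file, "rb") as f:
        for entity in reader(f):                  # one entity per table row
            table = metadata.tables[entity["name"]]
            row = entity["object"]
            stmt = insert(table).values(**row).on_conflict_do_update(
                index_elements=[c.name for c in table.primary_key.columns],
                set_=row,
            )
            conn.execute(stmt)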
