
etlite's Introduction

etlite's People

Contributors

billchang711, charlescpan, danny38888, elau1004, jamesyhuang, maryzenjim, palaciom1, penggithubsite2019


etlite's Issues

Add an ETL control database to track job executions and statistics

We need a standalone SQLite DB with the option to swap it out for a client-server DB such as PostgreSQL or MS SQL Server Express.

This database shall have the job definition and the following functionalities:

  • Manage the job configuration.
  • Manage the sequential run ID for all the jobs.
  • Manage the logs for all executed jobs.
  • Manage dataset statistics from the profiled data.
  • Manage validation rules.

Use Case:

  1. etlite init db
    This step shall run the SQLAlchemy DB migration scripts. It shall read the DB connection string from a config file and connect to the host server. SQLite should be the default DB and the first supported DB.
    Initially, we will have to use SQL directly to manage the data until the console application is online.
  2. etlite run Code
    This step shall use the unique code to look up the configuration in the DB. Internally, it will determine and collect metrics about the data that flowed through and store them in the DB. The collected metrics shall be validated against a set of rules that are also stored in the DB, and the results of the validation shall be stored in the DB as well.
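A minimal sketch of what the control-DB schema could look like, assuming SQLAlchemy's declarative ORM (1.4+); the table and column names below are illustrative placeholders rather than a final design, and create_all() stands in for the migration scripts.

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Job(Base):
    """Job configuration, keyed by a unique code."""
    __tablename__ = "job"                       # illustrative table name
    id          = Column(Integer, primary_key=True)
    code        = Column(String, unique=True, nullable=False)
    description = Column(String)

class JobRun(Base):
    """One row per execution, carrying the sequential run ID and its status."""
    __tablename__ = "job_run"                   # illustrative table name
    id          = Column(Integer, primary_key=True)   # sequential run ID
    job_id      = Column(Integer, ForeignKey("job.id"), nullable=False)
    started_at  = Column(DateTime)
    finished_at = Column(DateTime)
    status      = Column(String)

# "etlite init db" would read the connection string from a config file;
# SQLite is the default and first supported engine.
engine = create_engine("sqlite:///etlite.db")
Base.metadata.create_all(engine)                # stand-in for the migration scripts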

Build the core engine

The core engine shall execute a directed acyclic graph (DAG) of jobs. It needs to provide the best possible throughput.

The engine shall have the following features:

  1. Able to execute parallel jobs.
  2. Minimal memory pressure that is proportional to the number of concurrent jobs.
  3. Able to support REST APIs, DB connections, and file transfers. REST is the priority.
  4. Monitor the duration of an executing job and alert if it is slower than usual.
  5. Alert notification of failures.
  6. Able to be run in a serverless environment.

Use Case:

  • A single job is synonymous with a single-entry list.
  • Each job entry at the root level shall be spawned out to a dedicated CPU core.
    • Typically, this means dedicating a CPU core to a vendor.
  • Each job's child jobs shall be co-executed on the same CPU core asynchronously.
  • Auto-discover the job objects from the Python scripts in the sub-directories.
  • Have a dialog with the instantiated object to conditionally execute the choreographed workflow.
  • Each job's REST API call must handle:
    • The returned response as a stream. Do NOT materialize the entire returned data into memory.
    • Support pagination of the data if the vendor supports it. If offset and limit are provided, make concurrent calls to the endpoint for the data.
    • Throttle up or down for different data sizes if the request times out.
    • Retry after failure, using a Fibonacci back-off increment.
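As a rough illustration of the streaming and retry requirements above, here is a minimal sketch using the requests library; the function name, retry count, and timeout are assumptions, not the engine's actual API.

import time
import requests

def fibonacci_delays():
    """Yield the Fibonacci increments (1, 1, 2, 3, 5, ...) used between retries."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b

def fetch_stream(url, max_retries=5, timeout=30):
    """Stream a REST response without materializing it in memory, retrying on failure."""
    delays = fibonacci_delays()
    for attempt in range(max_retries):
        try:
            # stream=True keeps the body on the wire instead of loading it all at once.
            resp = requests.get(url, stream=True, timeout=timeout)
            resp.raise_for_status()
            for chunk in resp.iter_lines():
                yield chunk
            return
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(next(delays))    # Fibonacci back-off before the next attempt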

Certify the profiled data

After the profile result is persisted, we need to validate its shape to determine if the profile is within a configured threshold.
A threshold can be one of the following limits:

  • Absolute scalar
  • Average scalar
  • Standard deviation
  • The rules shall be inheritable to override default settings.
  • The following are some base assertion rules to be inherited and implemented by each specific job:
    • Missed schedule rule.
    • Average files count
    • Average file size
    • Average record count
    • Error count
    • Expect versus Actual
    • Average Job run count
    • Average Job run duration
    • No gap in chronological data
    • Period over period average
    • Foreign key existence
    • Expected versus imported
    • Count data in column
    • Null data in column
    • Distinct data in column
    • Median data in column
    • Average data in column
    • Minimum data in column
    • Maximum data in column

If the profiled data fails any of the certification rules, an alert needs to be sent.

Use Case:

  1. Use the job run ID to profile the batch of data.
    • A previously profiled data set can be recertified via the CLI.
  2. Query stored SQL validation rules.
    • Spawn a concurrent task to run each validation rule.
  3. The queries used should be able to execute across different DB engines.
  4. Persist the result into the DB.
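A minimal sketch of how a certification rule might compare a profiled metric against its configured absolute/average/standard-deviation limits; the class and field names are illustrative, and real rules would be loaded from the control DB.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Threshold:
    absolute: Optional[float] = None     # hard limit on the value itself
    average:  Optional[float] = None     # expected historical average
    stddev:   Optional[float] = None     # allowed deviation around the average

@dataclass
class CertificationRule:
    name: str
    threshold: Threshold

    def check(self, value: float) -> bool:
        """Return False (i.e. trigger an alert) when the value breaks the threshold."""
        t = self.threshold
        if t.absolute is not None and value > t.absolute:
            return False
        if t.average is not None and t.stddev is not None:
            if abs(value - t.average) > t.stddev:
                return False
        return True

# Specific jobs would inherit from CertificationRule to override the defaults,
# e.g. an "average record count" rule whose failed check() sends an alert.
rule = CertificationRule("avg_record_count", Threshold(average=10_000, stddev=1_500))
print(rule.check(12_000))   # False: more than one standard deviation above the average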

Deployment

This framework needs to be packaged and deployed to the internet for the world to download and install. It should be installable using pip.

Use Case:

  1. Run the unit tests to make sure nothing breaks.
    Read: https://code.visualstudio.com/docs/python/testing
  2. Package the project. Create setup.py and wheel files.
    Read: https://packaging.python.org/overview/
    Read: https://packaging.python.org/tutorials/installing-packages/
    Read: https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56

If all goes well, I, as a Data Engineer, should be able to install the package using:
pip install etlite
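As an illustration, a minimal setup.py could look like the sketch below; the version, dependency list, and console-script entry point are placeholders rather than decided project metadata.

from setuptools import setup, find_packages

setup(
    name="etlite",
    version="0.0.1",                    # placeholder version
    description="A lightweight ETL framework",
    packages=find_packages(),
    install_requires=["sqlalchemy"],    # assumed dependency for the control DB
    entry_points={
        # assumed CLI entry point; the module path is illustrative
        "console_scripts": ["etlite=etlite.cli:main"],
    },
)

Building and uploading would then follow the packaging guides linked above, e.g. python setup.py sdist bdist_wheel followed by twine upload dist/*.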

Add REST API workflow methods to base class

Each medium, especially the REST API, should have default workflow methods to be invoked by the framework. Returning a None value means the method is to be skipped.

  1. get_authentication_uri()
  2. set_authentication_resp()
  3. get_data_request_uri()
  4. set_data_request_resp()
  5. get_next_data_page_uri()
  6. set_next_data_page_resp()
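A sketch of how these defaults could live on the base class; the method names come from the list above, while the class name and signatures are assumptions. Returning None tells the framework to skip that step.

class RestApiBase:
    def get_authentication_uri(self):          # URI to authenticate against, or None
        return None

    def set_authentication_resp(self, resp):   # consume the authentication response
        pass

    def get_data_request_uri(self):            # URI for the initial data request
        return None

    def set_data_request_resp(self, resp):     # consume the data response
        pass

    def get_next_data_page_uri(self):          # URI for the next page, or None when done
        return None

    def set_next_data_page_resp(self, resp):   # consume the next page's response
        pass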

Add a thick GUI interface to manage the ETL control database

We should have an interactive GUI front end to interface with the database. We should NOT implement this as a web application, since wide user reach is not required.

Use the following colors for status:

  • Blue: Pending
  • Light Blue: Processing
  • Orange: Warning with minor issues
  • Red: Failed processing with a major issue
  • Green: Successfully processed
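For illustration, the mapping could be kept as a simple constant in the GUI code; the status keys below are assumptions.

STATUS_COLORS = {
    "PENDING":    "blue",
    "PROCESSING": "lightblue",
    "WARNING":    "orange",   # minor issues
    "FAILED":     "red",      # major issue
    "SUCCEEDED":  "green",
}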

Add Command Line Interface (CLI)

We should be able to invoke this framework from the command line, from a scheduler, or from a serverless trigger.

Use Case:
As a data engineer, for me to develop a new ETL, I would typically perform the following steps:

  1. pipenv shell
    Start my virtual environment.

  2. pip install etlite
    The above will install this framework we are building from PyPI.

  3. etlite init
    Once the Python framework is installed, we initialize a new default directory structure (which should be very similar to our project directory structure). If a directory exists, it should warn and prompt before overwriting.

  4. etlite init db
    This step needs to be run once to initialize the Job Control DB, where we store our job configuration and other metrics collected as we run our jobs. If we change the config file to point to a different DB, a new set of DB objects shall be created there. Re-running this step should not corrupt the DB.

  5. etlite new Code "Job Description"
    This step registers a new unique job code and a description in the DB. Furthermore, templated code for our new job shall be created that requires further modification. If a file exists, it should warn and prompt before overwriting.

  6. etlite list TABLE
    This step lists the rows in the ETL Control DB. The default is to list the rows in ascending ID order. The value for TABLE needs to be case-insensitive. This command can take optional parameters that paginate the display. The defaults are:

    • Offset: 0
    • Limit: 20
  7. etlite run Code
    This step runs the "jobCode" ETL. The default generated ETL script should run without error but does nothing until we enhance it. This command can take many optional parameters that will be validated, such as:

    • From Date: YYYYMMDD-hhmm
    • Upto Date: YYYYMMDD-hhmm
    • Frequency: MONTHLY, WEEKLY, DAILY, HOURLY

This command should also provide task-based options (a task being a certain part of the job), such as:

  • extractData Code
  • profileData Code
  • certifyData Code
  • loadData Code
  8. etlite run '[ J1,J2,J3 ,{C1,C2} ]'
    We should also be able to provide a directed acyclic graph (DAG) represented in a JSON structure.
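A minimal sketch of the CLI surface described above, using argparse; the sub-command and option names mirror the use case, while the handlers are stubs and the exact flag spellings are assumptions.

import argparse

def main():
    parser = argparse.ArgumentParser(prog="etlite")
    sub = parser.add_subparsers(dest="command", required=True)

    init = sub.add_parser("init", help="create the default directory structure")
    init.add_argument("target", nargs="?", help="'db' to initialize the Job Control DB")

    new = sub.add_parser("new", help="register a new job code")
    new.add_argument("code")
    new.add_argument("description")

    lst = sub.add_parser("list", help="list rows in the ETL Control DB")
    lst.add_argument("table")
    lst.add_argument("--offset", type=int, default=0)
    lst.add_argument("--limit", type=int, default=20)

    run = sub.add_parser("run", help="run a job code or a DAG of job codes")
    run.add_argument("code")
    run.add_argument("--from-date")    # e.g. YYYYMMDD-hhmm
    run.add_argument("--upto-date")
    run.add_argument("--frequency", choices=["MONTHLY", "WEEKLY", "DAILY", "HOURLY"])

    args = parser.parse_args()
    print(args)    # stub: dispatch to the real command handlers here

if __name__ == "__main__":
    main()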

Add profiling of the dataset

A fresh set of data that is processed should be profiled. Each field should be aggregated with:

  • COUNT()
  • COUNT( DISTINCT )
  • MIN()
  • MAX()
  • AVG()

The profiled data results should be persisted into the database.
Ideally, we should collect these stats as we process the data rather than post-ingestion.

Use Case:

  1. Use the job run ID to profile the batch of data.
    • A previously profiled data set can be reprofiled via the CLI.
  2. Query the ETL object for the data columns to profile.
  3. Spawn concurrent SQL queries to aggregate against each of the columns of interest.
    • String fields should be based on its length.
    • The queries used should be able to execute across different DB engines.
  4. Persist the result into the DB.
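A minimal sketch of generating the per-column aggregates with portable SQL so the same query can run across different DB engines; the function name and example table are illustrative.

def profile_query(table: str, column: str) -> str:
    """Build one profiling query for a single column of interest."""
    return (
        f"SELECT COUNT({column}) AS cnt, "
        f"COUNT(DISTINCT {column}) AS distinct_cnt, "
        f"MIN({column}) AS min_val, "
        f"MAX({column}) AS max_val, "
        f"AVG({column}) AS avg_val "
        f"FROM {table}"
    )

# String columns would be profiled on their length instead, e.g. by passing
# "LENGTH(name)" as the column expression. One query is spawned per column of
# interest and the results are persisted to the control DB keyed by the job run ID.
print(profile_query("orders", "order_total"))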
