
etlite's Introduction

etlite's People

Contributors

billchang711, charlescpan, danny38888, elau1004, jamesyhuang, maryzenjim, palaciom1, penggithubsite2019


etlite's Issues

Add an ETL control database to track job executions and statistics

We need a standalone SQLite DB with the option to swap it out for a client-server DB such as PostgreSQL or MS SQL Server Express.

This database shall have the job definition and the following functionalities:

  • Manage the job configuration.
  • Manage the sequential run ID for all the jobs.
  • Manage the logs for all executed jobs.
  • Manage dataset statistics from the profiled data.
  • Manage validation rules.

Use Case:

  1. etlite init db
    This step shall run the SQLAlchemy DB migration scripts. It shall read the DB connection string from a config file and connect to the host server. SQLite should be the default DB and the first supported DB.
    Initially, we will have to use SQL directly to manage the data until the console application is online.
  2. etlite run Code
    This step shall use the unique code to look up the configuration in the DB. Internally, it will determine and collect metrics about the data that flowed through and store them in the DB. The collected metrics shall be validated against a set of rules that are also stored in the DB, and the results of the validation shall be stored in the DB as well.
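A minimal sketch of what the control-DB schema could look like, assuming SQLAlchemy's declarative ORM (1.4+); the table and column names below are illustrative placeholders rather than a final design, and create_all() stands in for the migration scripts.

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Job(Base):
    """Job configuration, keyed by a unique code."""
    __tablename__ = "job"                       # illustrative table name
    id          = Column(Integer, primary_key=True)
    code        = Column(String, unique=True, nullable=False)
    description = Column(String)

class JobRun(Base):
    """One row per execution, carrying the sequential run ID and its status."""
    __tablename__ = "job_run"                   # illustrative table name
    id          = Column(Integer, primary_key=True)   # sequential run ID
    job_id      = Column(Integer, ForeignKey("job.id"), nullable=False)
    started_at  = Column(DateTime)
    finished_at = Column(DateTime)
    status      = Column(String)

# "etlite init db" would read the connection string from a config file;
# SQLite is the default and first supported engine.
engine = create_engine("sqlite:///etlite.db")
Base.metadata.create_all(engine)                # stand-in for the migration scripts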

Build the core engine

The core engine shall execute a directed acyclic graph (DAG) of jobs. It needs to provide the best possible throughput.

The engine shall have the following features:

  1. Able to execute parallel jobs.
  2. Minimal memory pressure that is proportional to the number of concurrent jobs.
  3. Able to support REST APIs, DB connections, and file transfers. REST is the priority.
  4. Monitor the duration of an executing job and alert if it is slower than usual.
  5. Alert notification of failures.
  6. Able to be run in a serverless environment.

Use Case:

  • A single job is synonymous with a single-entry list.
  • Each job entry at the root level shall be spawned out to a dedicated CPU core.
    • Typically, this means dedicating a CPU core to a vendor.
  • Each job's child jobs shall be co-executed on the same CPU core asynchronously.
  • Auto-discover the job objects from the Python scripts in the sub-directories.
  • Have a dialog with the instantiated object to conditionally execute the choreographed workflow.
  • Each job's REST API call must handle:
    • The returned response as a stream. Do NOT materialize the entire returned data into memory.
    • Support pagination of the data if the vendor supports it. If offset and limit are provided, make concurrent calls to the endpoint for the data.
    • Throttle up or down for different data sizes if the request times out.
    • Retry after failure, using a Fibonacci back-off increment.
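As a rough illustration of the streaming and retry requirements above, here is a minimal sketch using the requests library; the function name, retry count, and timeout are assumptions, not the engine's actual API.

import time
import requests

def fibonacci_delays():
    """Yield the Fibonacci increments (1, 1, 2, 3, 5, ...) used between retries."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b

def fetch_stream(url, max_retries=5, timeout=30):
    """Stream a REST response without materializing it in memory, retrying on failure."""
    delays = fibonacci_delays()
    for attempt in range(max_retries):
        try:
            # stream=True keeps the body on the wire instead of loading it all at once.
            resp = requests.get(url, stream=True, timeout=timeout)
            resp.raise_for_status()
            for chunk in resp.iter_lines():
                yield chunk
            return
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(next(delays))    # Fibonacci back-off before the next attempt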

Certify the profiled data

After the profile result is persisted, we need to validate its shape to determine if the profile is within a configured threshold.
A threshold can be one of the following limits:

  • Absolute scalar
  • Average scalar
  • Standard deviation
  • The rules shall be inheritable to override default settings.
  • The following are some base assertion rules to be inherited and implemented by each specific job:
    • Missed schedule rule.
    • Average files count
    • Average file size
    • Average record count
    • Error count
    • Expect versus Actual
    • Average Job run count
    • Average Job run duration
    • No gap in chronological data
    • Period over period average
    • Foreign key existence
    • Expected versus imported
    • Count data in column
    • Null data in column
    • Distinct data in column
    • Median data in column
    • Average data in column
    • Minimum data in column
    • Maximum data in column

If the profiled data fails any of the certification rules, an alert needs to be sent.

Use Case:

  1. Use the job run ID to profile the batch of data.
    • A previously profiled data set can be recertified via the CLI.
  2. Query stored SQL validation rules.
    • Spawn a concurrent task to run each validation rule.
  3. The queries used should be able to execute across different DB engines.
  4. Persist the result into the DB.
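A minimal sketch of how a certification rule might compare a profiled metric against its configured absolute/average/standard-deviation limits; the class and field names are illustrative, and real rules would be loaded from the control DB.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Threshold:
    absolute: Optional[float] = None     # hard limit on the value itself
    average:  Optional[float] = None     # expected historical average
    stddev:   Optional[float] = None     # allowed deviation around the average

@dataclass
class CertificationRule:
    name: str
    threshold: Threshold

    def check(self, value: float) -> bool:
        """Return False (i.e. trigger an alert) when the value breaks the threshold."""
        t = self.threshold
        if t.absolute is not None and value > t.absolute:
            return False
        if t.average is not None and t.stddev is not None:
            if abs(value - t.average) > t.stddev:
                return False
        return True

# Specific jobs would inherit from CertificationRule to override the defaults,
# e.g. an "average record count" rule whose failed check() sends an alert.
rule = CertificationRule("avg_record_count", Threshold(average=10_000, stddev=1_500))
print(rule.check(12_000))   # False: more than one standard deviation above the average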

Deployment

This framework needs to be packaged and deployed to the internet for the world to download and install. It should be installable using pip.

Use Case:

  1. Run the unit tests to make sure nothing breaks.
    Read: https://code.visualstudio.com/docs/python/testing
  2. Package the project. Create setup.py and wheel files.
    Read: https://packaging.python.org/overview/
    Read: https://packaging.python.org/tutorials/installing-packages/
    Read: https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56

If all goes well, I, as a Data Engineer, should be able to install the package using:
pip install etlite
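As an illustration, a minimal setup.py could look like the sketch below; the version, dependency list, and console-script entry point are placeholders rather than decided project metadata.

from setuptools import setup, find_packages

setup(
    name="etlite",
    version="0.0.1",                    # placeholder version
    description="A lightweight ETL framework",
    packages=find_packages(),
    install_requires=["sqlalchemy"],    # assumed dependency for the control DB
    entry_points={
        # assumed CLI entry point; the module path is illustrative
        "console_scripts": ["etlite=etlite.cli:main"],
    },
)

Building and uploading would then follow the packaging guides linked above, e.g. python setup.py sdist bdist_wheel followed by twine upload dist/*.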

Add REST API workflow methods to base class

Each medium, especially the REST API, should have default workflow methods to be invoked by the framework. Returning a None value means the method is to be skipped.

  1. get_authentication_uri()
  2. set_authentication_resp()
  3. get_data_request_uri()
  4. set_data_request_resp()
  5. get_next_data_page_uri()
  6. set_next_data_page_resp()
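A sketch of how these defaults could live on the base class; the method names come from the list above, while the class name and signatures are assumptions. Returning None tells the framework to skip that step.

class RestApiBase:
    def get_authentication_uri(self):          # URI to authenticate against, or None
        return None

    def set_authentication_resp(self, resp):   # consume the authentication response
        pass

    def get_data_request_uri(self):            # URI for the initial data request
        return None

    def set_data_request_resp(self, resp):     # consume the data response
        pass

    def get_next_data_page_uri(self):          # URI for the next page, or None when done
        return None

    def set_next_data_page_resp(self, resp):   # consume the next page's response
        pass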

Add a thick GUI interface to manage the ETL control database

We should have an interactive GUI front end to interface with the database. We should NOT implement this as a web application, since wide user reach is not required.

Use the following colors for status:

  • Blue: Pending
  • Light Blue: Processing
  • Orange: Warning with minor issues
  • Red: Failed processing with a major issue
  • Green: Successfully processed
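For illustration, the mapping could be kept as a simple constant in the GUI code; the status keys below are assumptions.

STATUS_COLORS = {
    "PENDING":    "blue",
    "PROCESSING": "lightblue",
    "WARNING":    "orange",   # minor issues
    "FAILED":     "red",      # major issue
    "SUCCEEDED":  "green",
}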

Add Command Line Interface (CLI)

We should be able to invoke this framework from the command line, from a scheduler, or from a serverless trigger.

Use Case:
As a data engineer, for me to develop a new ETL, I would typically perform the following steps:

  1. pipenv shell
    Start my virtual environment.

  2. pip install etlite
    The above will install this framework we are building from PyPI.

  3. etlite init
    Once the Python framework is installed, we initialize a new default directory structure (which should be very similar to our project directory structure). If a directory exists, it should warn and prompt before overwriting.

  4. etlite init db
    This step needs to be run once to initialize the Job Control DB, where we store our job configuration and other metrics collected as we run our jobs. If we change the config file to point to a different DB, a new set of DB objects shall be created there. Re-running this step should not corrupt the DB.

  5. etlite new Code "Job Description"
    This step registers a new unique job code and a description in the DB. Furthermore, templated code for our new job shall be created that requires further modification. If a file exists, it should warn and prompt before overwriting.

  6. etlite list TABLE
    This step lists the rows in the ETL Control DB. The default is to list the rows in ascending ID order. The value for TABLE needs to be case-insensitive. This command can take optional parameters that paginate the display. The defaults are:

    • Offset: 0
    • Limit: 20
  7. etlite run Code
    This step runs the "jobCode" ETL. The default generated ETL script should run without error but does nothing until we enhance it. This command can take many optional parameters that will be validated, such as:

    • From Date: YYYYMMDD-hhmm
    • Upto Date: YYYYMMDD-hhmm
    • Frequency: MONTHLY, WEEKLY, DAILY, HOURLY

This command should also provide task-based options (a task being a certain part of the job), such as:

  • extractData Code
  • profileData Code
  • certifyData Code
  • loadData Code
  8. etlite run '[ J1,J2,J3 ,{C1,C2} ]'
    We should also be able to provide a directed acyclic graph (DAG) represented in a JSON structure.
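A minimal sketch of the CLI surface described above, using argparse; the sub-command and option names mirror the use case, while the handlers are stubs and the exact flag spellings are assumptions.

import argparse

def main():
    parser = argparse.ArgumentParser(prog="etlite")
    sub = parser.add_subparsers(dest="command", required=True)

    init = sub.add_parser("init", help="create the default directory structure")
    init.add_argument("target", nargs="?", help="'db' to initialize the Job Control DB")

    new = sub.add_parser("new", help="register a new job code")
    new.add_argument("code")
    new.add_argument("description")

    lst = sub.add_parser("list", help="list rows in the ETL Control DB")
    lst.add_argument("table")
    lst.add_argument("--offset", type=int, default=0)
    lst.add_argument("--limit", type=int, default=20)

    run = sub.add_parser("run", help="run a job code or a DAG of job codes")
    run.add_argument("code")
    run.add_argument("--from-date")    # e.g. YYYYMMDD-hhmm
    run.add_argument("--upto-date")
    run.add_argument("--frequency", choices=["MONTHLY", "WEEKLY", "DAILY", "HOURLY"])

    args = parser.parse_args()
    print(args)    # stub: dispatch to the real command handlers here

if __name__ == "__main__":
    main()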

Add profiling of the dataset

A fresh set of data that is processed should be profiled. Each field should be aggregated with:

  • COUNT()
  • COUNT( DISTINCT )
  • MIN()
  • MAX()
  • AVG()

The profiled data results should be persisted into the database.
Ideally, we should collect these stats as we process the data rather than post-ingestion.

Use Case:

  1. Use the job run ID to profile the batch of data.
    • A previously profiled data set can be reprofiled via the CLI.
  2. Query the ETL object for the data columns to profile.
  3. Spawn concurrent SQL queries to aggregate against each of the columns of interest.
    • String fields should be based on its length.
    • The queries used should be able to execute across different DB engines.
  4. Persist the result into the DB.
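A minimal sketch of generating the per-column aggregates with portable SQL so the same query can run across different DB engines; the function name and example table are illustrative.

def profile_query(table: str, column: str) -> str:
    """Build one profiling query for a single column of interest."""
    return (
        f"SELECT COUNT({column}) AS cnt, "
        f"COUNT(DISTINCT {column}) AS distinct_cnt, "
        f"MIN({column}) AS min_val, "
        f"MAX({column}) AS max_val, "
        f"AVG({column}) AS avg_val "
        f"FROM {table}"
    )

# String columns would be profiled on their length instead, e.g. by passing
# "LENGTH(name)" as the column expression. One query is spawned per column of
# interest and the results are persisted to the control DB keyed by the job run ID.
print(profile_query("orders", "order_total"))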
