betl

Better ETL like "beatle"

An initial take on the skeleton of a small "Extract-Transform-Load" (ETL) application that can:

  • Retrieve data from various locations using standard or custom workflows
  • Perform some amount of processing on that data
  • Load the processed data to various locations using standard or custom workflows

The interfaces for defining the extraction, transformation, and loading workflows are:

  • Extractor
  • Transformer
  • Loader
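
As a rough illustration of how these three interfaces might fit together, here is a minimal sketch using abstract base classes. The method names (extract, transform, load), the constructor arguments, and the dict return type are assumptions made for this example, not betl's confirmed API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Extractor(ABC):
    """Retrieves one dataset from a location (hypothetical signature)."""

    def __init__(self, location: str) -> None:
        self.location = location

    @abstractmethod
    def extract(self) -> Any:
        """Return the extracted data, e.g. a Path or a DataFrame."""


class Transformer(ABC):
    """Turns the named extracted datasets into named output datasets."""

    @abstractmethod
    def transform(self, **datasets: Any) -> Dict[str, Any]:
        """Accept datasets by name and return processed datasets by name."""


class Loader(ABC):
    """Writes one processed dataset to a location (hypothetical signature)."""

    def __init__(self, location: str) -> None:
        self.location = location

    @abstractmethod
    def load(self, data: Any) -> None:
        """Persist the data at self.location."""
```

A concrete job would subclass these and implement the abstract methods; the base classes themselves cannot be instantiated.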

Usage

Writing a new ETL job requires the following:

  • A config file specifying which extraction and loading processes to use
  • A callable (e.g. a function or method) that accepts all the extracted data sources and returns one or more datasets

We interact with the tooling through a simple command line interface (CLI):

betl -j <name-of-job>
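
For illustration, a CLI of this shape could be parsed with the standard library's argparse. This is only a sketch: the long option name --job, the help text, and the parse_args helper are assumptions; the source shows only the -j flag.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse betl-style arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(prog="betl", description="Run an ETL job by name.")
    # "--job" is a hypothetical long form of the documented "-j" flag.
    parser.add_argument("-j", "--job", required=True, help="name of the job to run")
    return parser.parse_args(argv)


# Equivalent to running: betl -j iris_melt
args = parse_args(["-j", "iris_melt"])
print(f"Running job: {args.job}")
```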

Configuration File:

The configuration file defines the extraction and loading processes to use and the names of the datasets at each step in the process:

extract:
    iris:
        type: CsvExtractor
        location: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv

load:
    processed_data:
        type: CsvLoader
        location: output/iris_melt.csv

The dataset names (e.g. "iris" above) become the argument names passed to the next step of the ETL job, so the Transformer for this job must accept iris as an argument.
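
A transform callable matching the configuration above might look like the following sketch. It assumes the extracted iris dataset arrives as a pandas.DataFrame with a species column, as in the seaborn iris CSV, and that returning a dict keyed by the load-section name (processed_data) is how outputs are matched to loaders. That return convention, the function name, and the measurement/value column names are assumptions for this example.

```python
import pandas as pd


def transform(iris: pd.DataFrame) -> dict:
    """Reshape iris from wide to long format.

    The parameter name "iris" matches the dataset name under "extract".
    """
    melted = iris.melt(
        id_vars="species",        # keep species as the identifier column
        var_name="measurement",   # hypothetical name for the former column headers
        value_name="value",       # hypothetical name for the measured values
    )
    # Key matches the dataset name under "load" (assumed convention).
    return {"processed_data": melted}
```

Each numeric column of the input becomes a set of rows in the output, so a frame with N rows and four measurement columns yields 4N rows.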

Data Interfaces

Data passed between steps should be one of the following:

  • Path
  • Iterable[Path]
  • pandas.DataFrame
  • Iterable[pandas.DataFrame]

This supports the common ways of processing data: working from files, or loading data into memory immediately with pandas. Iterables support pipelined processing, where the extracted data does not all need to be loaded into memory at once.
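
To make the Iterable[Path] case concrete, here is a small sketch of a lazy pipeline built from generators. The function names are hypothetical, and the row count assumes each CSV file starts with a header row; the point is only that each file is handled one at a time rather than all being read up front.

```python
from pathlib import Path
from typing import Iterable, Iterator


def extract_csv_paths(directory: Path) -> Iterator[Path]:
    """Yield CSV files in a directory one at a time, in sorted order."""
    yield from sorted(directory.glob("*.csv"))


def count_data_rows(paths: Iterable[Path]) -> Iterator[int]:
    """Yield the number of data rows per file, assuming one header line each."""
    for path in paths:
        with path.open() as handle:
            # Stream the file line by line instead of reading it whole.
            yield sum(1 for _ in handle) - 1
```

Because both functions are generators, nothing is read until the consumer iterates, e.g. `for n in count_data_rows(extract_csv_paths(Path("data"))): ...`.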
