zachaysan / activewarehouse-etl

This project forked from activewarehouse/activewarehouse-etl

Extract-Transform-Load library from ActiveWarehouse

Home Page: http://activewarehouse.rubyforge.org/

License: MIT License

Ruby Extract-Transform-Load (ETL) tool.

== Requirements

* Ruby 1.8.5 or higher
* RubyGems

== Online Documentation

Available at http://activewarehouse.rubyforge.org/docs/activewarehouse-etl.html

== Features

Currently supported features:

* ETL Domain Specific Language (DSL) - Control files are specified in a Ruby-based DSL
* Multiple source types. Currently supported types:
  * Fixed-width and delimited text files
  * XML files through SAX
  * Apache combined log format
* Multiple destination types - file and database destinations
* Support for extracting from multiple sources in a single job
* Support for writing to multiple destinations in a single job
* A variety of built-in transformations are included:
  * Date-to-string, string-to-date, string-to-datetime, string-to-timestamp
  * Type transformation supporting strings, integers, floats and big decimals
  * Trim
  * SHA-1
  * Decode from an external decode file
  * Default replacement for empty values
  * Ordinalize
  * Hierarchy lookup
  * Foreign key lookup
  * Ruby blocks
  * Any custom transformation class
* A variety of built-in row-level processors:
  * Check exists processor to determine if the record already exists in the destination database
  * Check unique processor to determine whether a matching record was processed during this job execution
  * Copy field
  * Rename field
  * Hierarchy exploder which takes a tree structure defined through a parent id and explodes it into a hierarchy bridge table
  * Surrogate key generator including support for looking up the last surrogate key from the target table using a custom query
  * Sequence generator including support for context-sensitive sequences where the context can be defined as a combination of fields from the source data
  * New row-level processors can easily be defined and applied
* Pre-processing
  * Truncate processor
* Post-processing
  * Bulk import using native RDBMS bulk loader tools
* Virtual fields - Add a field to the destination data which doesn't exist in the source data
* Built-in job and record metadata
* Support for type 1 and type 2 slowly changing dimensions
  * Automated effective date and end date time stamping for type 2
  * CRC checking
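
To make the last feature more concrete, here is a minimal plain-Ruby sketch of how a type 2 slowly changing dimension with CRC-based change detection can work. It is not the library's implementation and does not use its DSL. The method names (<tt>row_crc</tt>, <tt>apply_scd2</tt>), the field names, the in-memory array standing in for the dimension table, and the far-future end date are all assumptions made for illustration. The sketch also uses keyword arguments, so it needs a much newer Ruby than the minimum listed under Requirements.

```ruby
require 'zlib'
require 'date'

# Hypothetical far-future end date marking the current version of a row.
OPEN_END_DATE = Date.new(9999, 12, 31)

# CRC32 over the tracked (non-key) fields, used to detect changes cheaply.
def row_crc(row, fields)
  Zlib.crc32(fields.map { |f| row[f].to_s }.join('|'))
end

# Apply one incoming source row to an in-memory dimension (array of hashes).
# Returns :inserted, :unchanged or :versioned.
def apply_scd2(dimension, incoming, natural_key:, fields:, today: Date.today)
  crc = row_crc(incoming, fields)

  # The current version is the row for this natural key that is still open.
  current = dimension.find do |r|
    r[natural_key] == incoming[natural_key] && r[:end_date] == OPEN_END_DATE
  end

  # Same CRC means no tracked field changed, so nothing is written.
  return :unchanged if current && current[:crc] == crc

  # Close the existing version before adding the new one.
  current[:end_date] = today if current

  dimension << incoming.merge(
    id: dimension.size + 1, # naive surrogate key for the sketch
    crc: crc,
    effective_date: today,
    end_date: OPEN_END_DATE
  )
  current ? :versioned : :inserted
end
```

A type 1 dimension would instead overwrite the tracked fields of the existing row in place, keeping no history.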

== Dependencies
ActiveWarehouse ETL depends on the following gems:
* ActiveSupport Gem
* ActiveRecord Gem
* FasterCSV Gem
* AdapterExtensions Gem

== Usage
Once the ActiveWarehouse ETL gem is installed, jobs can be invoked using the
included <tt>etl</tt> script. The script accepts several command line options
and can process multiple control files in a single run.

Command line options:
* <tt>--help, -h</tt>: Display the usage message.
* <tt>--config, -c</tt>: Specify a database.yml configuration file to use.
* <tt>--limit, -l</tt>: Specify a limit to the number of rows to process. This option is currently only applicable to database sources.
* <tt>--offset, -o</tt>: Specify the start offset for reading from the source. This option is currently only applicable to database sources.
* <tt>--newlog, -n</tt>: Instruct the engine to create a new ETL log rather than append to the last ETL log.
* <tt>--skip-bulk-import, -s</tt>: Skip any bulk imports.
* <tt>--read-locally</tt>: Read from the local cache (skip source extraction).

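As a rough illustration of how these options and the control file arguments fit together, the following sketch parses the documented flags with Ruby's standard <tt>OptionParser</tt>. It is not the code of the <tt>etl</tt> script itself. The method name <tt>parse_etl_args</tt>, the option hash keys, and the banner text are assumptions made for this example.

```ruby
require 'optparse'

# Hypothetical parser mirroring the documented flags; not the etl script's own code.
# Returns [options_hash, remaining_control_files].
def parse_etl_args(argv)
  options = { newlog: false, skip_bulk_import: false, read_locally: false }

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: etl [options] CONTROL_FILE [CONTROL_FILE ...]'
    opts.on('-h', '--help') { options[:help] = true }
    opts.on('-c', '--config FILE', String) { |v| options[:config] = v }
    opts.on('-l', '--limit N', Integer) { |v| options[:limit] = v }
    opts.on('-o', '--offset N', Integer) { |v| options[:offset] = v }
    opts.on('-n', '--newlog') { options[:newlog] = true }
    opts.on('-s', '--skip-bulk-import') { options[:skip_bulk_import] = true }
    opts.on('--read-locally') { options[:read_locally] = true }
  end

  # parse (without a bang) leaves argv untouched and returns the non-option
  # arguments, which are treated here as the control files to process.
  control_files = parser.parse(argv)
  [options, control_files]
end
```

With this sketch, the hypothetical argument list <tt>-c database.yml --limit 10 first.ctl second.ctl</tt> yields a config path, a row limit of 10, and two control files to process in order.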
== Control File Examples
Control file examples can be found in the examples directory.

== Running Tests

Current state:
- 11 failures on MySQL
- 1 failure on Postgres

The tests require:
- gem install shoulda
- gem install flexmock
- gem install pg (if you want to run the tests on pg)
- gem install spreadsheet

The tests subfolder contains example database.yml files for MySQL and Postgres.

To run the tests:
- rake test DB=postgresql (for postgres)
- otherwise just rake test

== Feedback
This is a work in progress. For now, please send comments to the
activewarehouse-discuss mailing list. Contributions are always welcome.

== Contributors
aeden, cdimartino, jayzes, jlecour, mainej, sasikumargn, smeyfroi, thbar
