Giter Site home page Giter Site logo

rodalgon / beam-nuggets Goto Github PK

View Code? Open in Web Editor NEW

This project forked from mohaseeb/beam-nuggets

0.0 1.0 0.0 5.87 MB

Collection of transforms for the Apache beam python SDK.

Home Page: http://mohaseeb.com/beam-nuggets/

License: MIT License

Python 98.65% Shell 1.35%

beam-nuggets's Introduction

PyPI PyPI - Downloads

About

A collection of random transforms for the Apache beam python SDK . Many are simple transforms. The most useful ones are those for reading/writing from/to relational databases.

Installation

  • Using pip
pip install beam-nuggets
  • From source
git clone [email protected]:mohaseeb/beam-nuggets.git
cd beam-nuggets
pip install .

Supported transforms

IO

Others

Documentation

See here.

Usage

Write data to an SQLite table using beam-nugget's relational_db.Write transform.

# write_sqlite.py contents
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import relational_db

records = [
    {'name': 'Jan', 'num': 1},
    {'name': 'Feb', 'num': 2}
]

source_config = relational_db.SourceConfiguration(
    drivername='sqlite',
    database='/tmp/months_db.sqlite',
    create_if_missing=True  # create the database if not there 
)

table_config = relational_db.TableConfiguration(
    name='months',
    create_if_missing=True,  # automatically create the table if not there
    primary_key_columns=['num']  # and use 'num' column as primary key
)
    
with beam.Pipeline(options=PipelineOptions()) as p:  # Will use local runner
    months = p | "Reading month records" >> beam.Create(records)
    months | 'Writing to DB' >> relational_db.Write(
        source_config=source_config,
        table_config=table_config
    )

Execute the pipeline

python write_sqlite.py 

Examine the contents

sqlite3 /tmp/months_db.sqlite 'select * from months'
# output:
# 1.0|Jan
# 2.0|Feb

To write the same data to a PostgreSQL table instead, just create a suitable relational_db.SourceConfiguration as follows.

source_config = relational_db.SourceConfiguration(
    drivername='postgresql+pg8000',
    host='localhost',
    port=5432,
    username='postgres',
    password='password',
    database='calendar',
    create_if_missing=True  # create the database if not there 
)

Click here for more examples, including writing to PostgreSQL in Google Cloud Platform using the DataFlowRunner.

An example showing how you can use beam-nugget's relational_db.ReadFromDB transform to read from a PostgreSQL database table.

from __future__ import print_function
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import relational_db

with beam.Pipeline(options=PipelineOptions()) as p:
    source_config = relational_db.SourceConfiguration(
        drivername='postgresql+pg8000',
        host='localhost',
        port=5432,
        username='postgres',
        password='password',
        database='calendar',
    )
    records = p | "Reading records from db" >> relational_db.ReadFromDB(
        source_config=source_config,
        table_name='months',
        query='select num, name from months'  # optional. When omitted, all table records are returned. 
    )
    records | 'Writing to stdout' >> beam.Map(print)

See here for more examples.

Development

  • Install
git clone [email protected]:mohaseeb/beam-nuggets.git
cd beam-nuggets
export BEAM_NUGGETS_ROOT=`pwd`
pip install -e .[dev]
  • Make changes on dedicated dev branches
  • Run tests
cd $BEAM_NUGGETS_ROOT
python -m unittest discover -v
  • Generate docs
cd $BEAM_NUGGETS_ROOT
docs/generate_docs.sh
  • Create a PR against master.
  • After merging the accepted PR and updating the local master, upload a new build to pypi.
cd $BEAM_NUGGETS_ROOT
scripts/build_test_deploy.sh

Backlog

  • versioned docs?
  • Summarize the investigation of using Source/Sink Vs ParDo(and GroupBy) for IO
  • more nuggets: WriteToCsv
  • Investigate readiness of SDF ParDo, and possibility to use for relational_db.ReadFromDB
  • integration tests
  • DB transforms failures handling on IO transforms
  • more nuggets: Elasticsearch, Mongo
  • WriteToRelationalDB, logging

Contributers

mohaseeb, astrocox, 2514millerj

Licence

MIT

beam-nuggets's People

Contributors

mohaseeb avatar astrocox avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.