PostgresToRedshift

This gem copies data from PostgreSQL to Amazon Redshift. It's especially useful for copying data from a Postgres database to Redshift on Heroku.

Installation

Add this line to your application's Gemfile:

gem 'postgres_to_redshift'

And then execute:

$ bundle

Or install it yourself as:

$ gem install postgres_to_redshift

Usage

Set your source and target databases, as well as your S3 intermediary.

export POSTGRES_TO_REDSHIFT_SOURCE_URI='postgres://username:password@host:port/database-name'
export POSTGRES_TO_REDSHIFT_TARGET_URI='postgres://username:password@host:port/database-name'
export POSTGRES_TO_REDSHIFT_TARGET_SCHEMA='testing-data'
export S3_DATABASE_EXPORT_ID='yourid'
export S3_DATABASE_EXPORT_KEY='yourkey'
export S3_DATABASE_EXPORT_BUCKET='some-bucket-to-use'

postgres_to_redshift

Contributing

  1. Fork it ( https://github.com/kitchensurfing/postgres_to_redshift/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors

anark, ashawley, dkaufman, ericgross, jhuangtw, toothrot

Issues

failed to allocate memory (NoMemoryError)

I have a very large table in Postgres, about 33 GB. After an hour of copying the data down, the process runs out of memory. This seems to be because a StringIO buffer is used with Zlib::GzipWriter:

$ postgres_to_redshift
Downloading scores
~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:92:in `write': failed to allocate memory (NoMemoryError)
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:92:in `write'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:92:in `block in copy_table'
        from /usr/local/share/ruby/gems/2.2/gems/pg-0.17.1/lib/pg/connection.rb:138:in `copy_data'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:90:in `copy_table'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:23:in `block in update_tables'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:20:in `each'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:20:in `update_tables'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/bin/postgres_to_redshift:5:in `<top (required)>'
        from ~/bin/postgres_to_redshift:23:in `load'
        from ~/bin/postgres_to_redshift:23:in `<main>'
zlib(finalizer): Zlib::GzipWriter object must be closed explicitly.
zlib(finalizer): the stream was freed prematurely.
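One way to avoid buffering a whole table in memory is to stage the compressed COPY output on disk rather than in a StringIO. A minimal sketch under that assumption (`compress_rows` is a hypothetical helper, not the gem's actual code; chunking and the S3 upload are left out):

```ruby
require 'tempfile'
require 'zlib'

# Hypothetical replacement for the in-memory StringIO buffer in copy_table:
# stream each COPY row through GzipWriter into a Tempfile on disk, so memory
# use stays bounded regardless of table size.
def compress_rows(rows)
  tmp = Tempfile.new('postgres_to_redshift')
  tmp.binmode
  gz = Zlib::GzipWriter.new(tmp)
  rows.each { |row| gz.write(row) }
  gz.finish # finalize the gzip stream without closing the Tempfile
  tmp.rewind
  tmp # caller uploads the Tempfile to S3, then unlinks it
end
```

`Zlib::GzipFile#finish` is used instead of `close` so the Tempfile handle stays open for the subsequent upload.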

Invalid connection! (PG::Error)

I get Invalid connection! (PG::Error) from the Postgres source database with a large table of about 33 GB. After about 40 minutes, it fails with:

$ postgres_to_redshift
INFO:  Relation "submissions" already exists and will be skipped
Downloading submissions
Uploading submissions.1
Uploading submissions.2
Uploading submissions.3
Uploading submissions.4
Uploading submissions.5
Uploading submissions.6
Uploading submissions.7
Uploading submissions.8
Uploading submissions.9
/usr/local/share/ruby/gems/2.2/gems/pg-0.17.1/lib/pg/connection.rb:140:in `cancel': Invalid connection! (PG::Error)
        from /usr/local/share/ruby/gems/2.2/gems/pg-0.17.1/lib/pg/connection.rb:140:in `rescue in copy_data'
        from /usr/local/share/ruby/gems/2.2/gems/pg-0.17.1/lib/pg/connection.rb:137:in `copy_data'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:97:in `copy_table'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:27:in `block in update_tables'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:24:in `each'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:24:in `update_tables'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/bin/postgres_to_redshift:5:in `<top (required)>'
        from ~/bin/postgres_to_redshift:23:in `load'
        from ~/bin/postgres_to_redshift:23:in `<main>'

This happens with a 16 GB table as well.

NOTE: I am using the code from #15 to split files up and the code from #14 to avoid running out of memory.

Is the database connection getting lost? I've modified the following settings on the Postgres 9.3 RDS instance at AWS:

tcp_keepalives_count      10
tcp_keepalives_idle     1200
tcp_keepalives_interval 1200

That should allow 20 minutes of idle time and up to 10 probes, but the error keeps happening.
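The RDS settings above only govern keepalive probes sent by the server; the client can request its own TCP keepalives through libpq connection parameters. A sketch, assuming the gem passes the URI straight through to libpq (the parameter values are illustrative):

```shell
# Hypothetical: ask libpq for client-side TCP keepalives on the source
# connection, so long COPY operations are probed from the client side as well.
export POSTGRES_TO_REDSHIFT_SOURCE_URI='postgres://username:password@host:port/database-name?keepalives=1&keepalives_idle=60&keepalives_interval=60&keepalives_count=5'
```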

Configurable target schema

Hi, since we are importing several apps into the same Redshift DB, we want to import them into separate schemas.

If you're interested, I've worked on a quick implementation: carwow@688040b

Getting a zlib error when running update_tables

Zlib::GzipFile::Error: closed gzip stream

Here is the stack trace:

/home/deploy/apps/vydia/shared/bundle/ruby/2.1.0/bundler/gems/postgres_to_redshift-1cf00b46e712/lib/postgres_to_redshift.rb:121:in `close'
/home/deploy/apps/vydia/shared/bundle/ruby/2.1.0/bundler/gems/postgres_to_redshift-1cf00b46e712/lib/postgres_to_redshift.rb:121:in `ensure in copy_table'
/home/deploy/apps/vydia/shared/bundle/ruby/2.1.0/bundler/gems/postgres_to_redshift-1cf00b46e712/lib/postgres_to_redshift.rb:122:in `copy_table'
/home/deploy/apps/vydia/shared/bundle/ruby/2.1.0/bundler/gems/postgres_to_redshift-1cf00b46e712/lib/postgres_to_redshift.rb:27:in `block in update_tables'
/home/deploy/apps/vydia/shared/bundle/ruby/2.1.0/bundler/gems/postgres_to_redshift-1cf00b46e712/lib/postgres_to_redshift.rb:24:in `each'
/home/deploy/apps/vydia/shared/bundle/ruby/2.1.0/bundler/gems/postgres_to_redshift-1cf00b46e712/lib/postgres_to_redshift.rb:24:in `update_tables'

This is on Ruby 2.1.9, using the master branch code directly; PostgreSQL is 9.4.
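The `closed gzip stream` error from the ensure block looks like a double close: the stream is closed on the normal path and then closed again during cleanup. A minimal sketch of a guard for that pattern (not the gem's actual code):

```ruby
require 'stringio'
require 'zlib'

buffer = StringIO.new
zip = Zlib::GzipWriter.new(buffer)

begin
  zip.write("some row data\n")
  zip.close # normal path: the stream is closed here
ensure
  # Guard against the double close that raises Zlib::GzipFile::Error
  zip.close unless zip.closed?
end
```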

ERROR: cannot drop table TABLE_NAME_updating because other objects depend on it

Version 0.1.2

We have been using the PostgresToRedshift.update_tables method to copy tables from PG to Redshift every day since August 2017 and never had a problem. Suddenly we got this error:

ERROR: cannot drop table TABLE_NAME_updating because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too.

We are a bit lost, since no code changes were made for a long time, and we can't figure out what is causing this error.

Has anyone encountered this?

ERROR: syntax error at or near "permissions" (PG::SyntaxError) LINE 1: CREATE TABLE IF NOT EXISTS public.permissions ("id" integer,...

I have a table in my schema named permissions that causes the command to choke on a SQL syntax error:

~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:21:in `exec': ERROR:  syntax error at or near "permissions" (PG::SyntaxError)
LINE 1: CREATE TABLE IF NOT EXISTS public.permissions ("id" integer,...
                                          ^
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:21:in `block in update_tables'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:20:in `each'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/lib/postgres_to_redshift.rb:20:in `update_tables'
        from ~/.gem/ruby/2.2/gems/postgres_to_redshift-0.1.1/bin/postgres_to_redshift:5:in `<top (required)>'
        from ~/bin/postgres_to_redshift:23:in `load'
        from ~/bin/postgres_to_redshift:23:in `<main>'
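A likely fix is to quote identifiers when building the DDL, so that a table name like permissions is never parsed as a keyword. The pg gem exposes PG::Connection.quote_ident for this; below is a dependency-free sketch of the same quoting rule (`quote_ident` here is a hypothetical local helper):

```ruby
# Hypothetical helper following the PostgreSQL quoting rule: wrap the name in
# double quotes and double any embedded quotes, so a name like "permissions"
# is parsed as a plain identifier rather than a keyword.
def quote_ident(name)
  '"' + name.gsub('"', '""') + '"'
end

sql = %(CREATE TABLE IF NOT EXISTS public.#{quote_ident('permissions')} ("id" integer))
```

The same treatment would apply to the table name that the gem interpolates; the column names in the generated DDL (e.g. "id") are evidently already quoted.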

Handle money fields as decimals

Right now, money fields are imported as character varying(max). These should be cast to decimal on export from Postgres, then imported as decimal into Redshift.

Since the current export is a COPY * FROM table_name, this is nontrivial.
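One way to make this tractable is to replace the bare COPY of the table with a COPY of an explicit SELECT that casts the money columns. A sketch under that assumption (`copy_sql`, the column lists, and the numeric(19, 2) precision are all illustrative, not the gem's API):

```ruby
# Hypothetical helper: build a COPY statement with an explicit column list,
# casting money columns to numeric so Redshift can load them as DECIMAL.
def copy_sql(table, columns, money_columns)
  select_list = columns.map do |col|
    money_columns.include?(col) ? "#{col}::numeric(19, 2) AS #{col}" : col
  end.join(', ')
  "COPY (SELECT #{select_list} FROM #{table}) TO STDOUT WITH CSV"
end
```

With the columns enumerated from information_schema.columns, money columns could be detected by their data_type.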
