uswitch / blueshift Goto Github PK

View Code? Open in Web Editor NEW

119.0 119.0 19.0 183 KB

Automate copying data from S3 into Amazon Redshift

License: Eclipse Public License 1.0

Clojure 99.18% Dockerfile 0.82%

blueshift's People

Contributors

Stargazers

Watchers

Forkers

satishbasetty sravankreddy datasnap-io aleksandersumowski aaelony lucagrulla vgeshel attribution marseille adikteev cohere-io bbnsumanth flsafe jamshedmelik samvin anilktechie tradeswell

blueshift's Issues

App does not recognize 'manifest.edn' file

I had this software working a few weeks ago, but now it does not work.

The S3 bucket scanner finds my two directories with 'manifest.edn' files, but does not add them to the
"watched directories" list. The list stays length 0, and no files are uploaded to Postgres (Redshift).

The DEBUG printout shows that it finds the manifest.edn file. There is no printout about "I found a manifest.edn file but did not like it".

'No suitable driver' error raised when using Redshift's URL in the manifest

The JDBC URL that appears in Redshift's dashboard starts with jdbc:redshift://... . However, Blueshift expects jdbc:postgresql://... in the manifest file.

When using the first one, Blueshift raises this error when trying to connect to Redshift:
Error loading into blueshift java.sql.SQLException: No suitable driver found for jdbc:redshift://...

This is because Blueshift uses the Postgresql driver instead of the Redshift driver.

I think that this is a common problem that users might have. So it may be a good idea to check this in the manifest validation and alert the user with a custom error instead of raising the generic SQLException.
What do you think?

Security issues - does the JDBC password need to be in the S3 Manifest file(s)?

Great project. One concern about passwords. According to the design.md documentation, the manifest.edn file containing the JDBC connection username and password sits in S3.

Shouldn't this info be better placed in the config.edn file ?

-A

Handle manifest validation errors

Ensure blueshift handles manifests that fail validation errors. Should log the error and try again after the usual delay.

clojure.lang.ExceptionInfo: Invalid manifest. Check map for more details.
    at clojure.core$ex_info.invoke(core.clj:4403)
    at uswitch.blueshift.s3$validate.invoke(s3.clj:26)
    at uswitch.blueshift.s3.KeyWatcher$fn__8238$fn__8239.invoke(s3.clj:98)
    at uswitch.blueshift.s3.KeyWatcher$fn__8238.invoke(s3.clj:95)
    at clojure.core.async$thread_call$fn__5297$fn__5298.invoke(async.clj:390)
    at clojure.core.async$thread_call$fn__5297.invoke(async.clj:390)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

getting Serializable isolation violation on table

i seem to get these alot lately since our app is running alot of SELECT statement now and its always against the table and the time that blueshift is loading data against the table.

org.postgresql.util.PSQLException: ERROR: 1023 Detail: Serializable isolation violation on table

i have an open ticket with amazon and see suggestions on locking exclusive the table but blueshift is not doing that i noticed. does it need to or am i missing something?

Odd error and Data is not loaded yet data files are removed!

I am getting this error sometimes when running blueshift. It loads data "sometimes" but when it gets this error is does not BUT it still removes the data files form the bucket so the data does not get loaded.

Can someone help me troubleshoot this or point me to how to get more forensics info?

Thanks!

_rules_time_trigger_sunday,campaign_targeting_rules_communication_ids_idx_0,campaign_targeti
ng_rules_communication_ids_idx_1,campaign_targeting_rules_communication_ids_idx_2,campaign_t
argeting_rules_communication_ids_idx_3,campaign_targeting_rules_communication_ids_idx_4,data
snap_timestamp,datasnap_created,datasnap_txn_id,datasnap_customer_txn_id,datasnap_device_use
r_agent,datasnap_device_ip_address,datasnap_device_platform,datasnap_device_os_version,datas
nap_device_model,datasnap_device_manufacturer,datasnap_device_name,datasnap_device_vendor_id
,datasnap_device_carrier_name,datasnap_device_country_code,datasnap_device_network_code) FRO
M 's3://datasnap-events-staging-latest-flattened-redshift/c954b50a-29d4-4ce7-a623-91afe752e7
07.manifest' CREDENTIALS 'aws_access_key_id=AKIAJLGOZADWN3CH5NSA;aws_secret_access_key=EUxsq
wnJzdJUmSzfZ+ZlhupPkfaden23CL27mVXX' TIMEFORMAT 'auto' EMPTYASNULL BLANKSASNULL manifest
[async-thread-macro-3] ERROR uswitch.blueshift.redshift - ROLLBACK
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.j
ava:1501)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:12
83)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:3
92)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Stateme
nt.java:330)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:3
21)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java
:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
at uswitch.blueshift.redshift$execute$fn__10871.invoke(redshift.clj:103)
at uswitch.blueshift.redshift$execute.doInvoke(redshift.clj:103)
at clojure.lang.RestFn.invoke(RestFn.java:482)
at uswitch.blueshift.redshift$merge_table.invoke(redshift.clj:117)
at uswitch.blueshift.redshift$load_table.invoke(redshift.clj:138)
at uswitch.blueshift.redshift.Loader$fn__10919$fn__10922$fn__10923.invoke(redshift.$lj:160)
at uswitch.blueshift.redshift.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
at com.codahale.metrics.Timer.time(Timer.java:99)
at uswitch.blueshift.redshift.Loader$fn__10919$fn__10922.invoke(redshift.clj:159)
at uswitch.blueshift.redshift.Loader$fn__10919.invoke(redshift.clj:159)
at clojure.core.async$thread_call$fn__6407$fn__6408.invoke(async.clj:390)
at clojure.core.async$thread_call$fn__6407.invoke(async.clj:390)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[async-thread-macro-3] ERROR uswitch.blueshift.redshift - Error loading into raw_events_fin$l_temp
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.$ava:1501)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1$83)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:$92)

Getting following error while importing into redshift

Getting following error while importing into redshift, although that it seems that all data were imported well.

Any clues why is this happens ?

[14:30:24.130] ERROR uswitch.blueshift.redshift - ROLLBACK
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1501)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:392)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:330)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:321)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at uswitch.blueshift.redshift$execute$fn__9063.invoke(redshift.clj:103)
    at uswitch.blueshift.redshift$execute.doInvoke(redshift.clj:103)
    at clojure.lang.RestFn.invoke(RestFn.java:482)
    at uswitch.blueshift.redshift$merge_table.invoke(redshift.clj:117)
    at uswitch.blueshift.redshift$load_table.invoke(redshift.clj:138)
    at uswitch.blueshift.redshift.Loader$fn__9111$fn__9114$fn__9115.invoke(redshift.clj:160)
    at uswitch.blueshift.redshift.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:99)
    at uswitch.blueshift.redshift.Loader$fn__9111$fn__9114.invoke(redshift.clj:159)
    at uswitch.blueshift.redshift.Loader$fn__9111.invoke(redshift.clj:159)
    at clojure.core.async$thread_call$fn__5301$fn__5302.invoke(async.clj:390)
    at clojure.core.async$thread_call$fn__5301.invoke(async.clj:390)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[14:30:24.292] ERROR uswitch.blueshift.redshift - Error loading into CM_rev
java.sql.SQLException: ERROR: Mandatory url is not present in manifest file.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1501)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:186)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:392)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:330)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:321)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
    at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
    at uswitch.blueshift.redshift$execute$fn__9063.invoke(redshift.clj:103)
    at uswitch.blueshift.redshift$execute.doInvoke(redshift.clj:103)
    at clojure.lang.RestFn.invoke(RestFn.java:482)
    at uswitch.blueshift.redshift$merge_table.invoke(redshift.clj:117)
    at uswitch.blueshift.redshift$load_table.invoke(redshift.clj:138)
    at uswitch.blueshift.redshift.Loader$fn__9111$fn__9114$fn__9115.invoke(redshift.clj:160)
    at uswitch.blueshift.redshift.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
    at com.codahale.metrics.Timer.time(Timer.java:99)
    at uswitch.blueshift.redshift.Loader$fn__9111$fn__9114.invoke(redshift.clj:159)
    at uswitch.blueshift.redshift.Loader$fn__9111.invoke(redshift.clj:159)
    at clojure.core.async$thread_call$fn__5301$fn__5302.invoke(async.clj:390)
    at clojure.core.async$thread_call$fn__5301.invoke(async.clj:390)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Collect import errors from stl_load_errors

When an import fails query the stl_load_errors table for the errors during that transaction, store them in a file on S3 so its easier to see when/why imports don't process- most likely due to a schema problem or a problem with the data itself.

Thoughts around new usage "modes" (?)

Wanted to get your thoughts.

In my use case, I actually need two usage modes; one is like yours, which I affectionately term Upsert/Clobber mode and the other I term Upsert/Storage mode. I'm open to other names as well.

The current behavior is the Clobber mode. It watches s3 buckets that match the patterns in the config and bucket manifest, performs the upsert when file changes are found, and then deletes the s3 files leaving the manifest behind untouched.

Storage mode would be designed for s3 buckets that keep semi-permanent data and do something entirely different. Storage mode would also watch s3 buckets that match the patterns in the config and bucket manifest, but then perform the upsert, not delete the s3 data files but delete the s3 bucket's manifest (upon successful data load into redshift) such that the bucket is no longer watched to avoid loading this data again. This is useful for loading data buckets that contain dated subfolders (e.g. 2015-01-13) and files for that date therein. That way the data is not deleted but loaded into redshift, and need not be loaded again.

Other modes could potentially be created if necessary or desired.

I haven't implemented this yet, but am considering it.

My questions are:
1. What are your thoughts on this in general?
2. Do you desire this kind of behavior (or something similar)?
3. Would you eventually want something like this merged back into Blueshift?
4. What are your thoughts around semantics of how to specify these Modes? Config file semantics? Manifest file semantics?

Thanks for your thoughts.

-Avram

java.lang.StackOverflowError problems at startup

I'm on Java 7 (1.0.79) on Linux Ubuntu 14.4 (Mint). I built Blueshift a few weeks ago and set it up. It ran fine.

But, now it does not. At startup I get this:
[2015-08-28 20:36:41,430] INFO uswitch.blueshift.main - Starting Blueshift with configuration elb.config.edn
[2015-08-28 20:36:41,447] INFO uswitch.blueshift.telemetry - Loading reporter uswitch.blueshift.telemetry/log-metrics-reporter for registry #<MetricRegistry com.codahale.metrics.MetricRegistry@c909572>
[2015-08-28 20:36:41,474] INFO uswitch.blueshift.s3 - Starting BucketWatcher. Polling edmodo.hadoop.export every 30 seconds for keys matching #"elblogs.*"
[2015-08-28 20:36:41,480] INFO uswitch.blueshift.s3 - Starting KeyWatcherSpawner
[2015-08-28 20:36:41,489] INFO uswitch.blueshift.telemetry - Starting Slf4j metrics reporter

This happens without any JVM stack memory arguments. Also, have tested various combinations of memory arguments. Here is a sample startup line:

java -Dlogback.configurationFile=./etc/logback.xml -Xss325k -Xmx10g -jar target/blueshift-0.1.0-SNAPSHOT-standalone.jar --config elb.config.edn

Do you have a time-tested combination of memory flags that will support 5-10 polled directories?

(I am filing an issue as there is no mailing list/google groups etc.)

multithreaded delete possible?

We have alot of small s3 files and the delete operation ins blueshift is the biggest bottleneck right now.
Any suggestions since i dont know implications of using something like pmap offhand?

(defrecord Cleaner [credentials bucket cleaner-ch]
Lifecycle
(start this
(thread
(loop [](when-let [m %28<!! cleaner-ch%29]
%28doseq [key %28:files m%29]
%28info "Deleting" %28str "s3://" bucket "/" key))
(delete-object credentials bucket key))
(recur))))
this)
(stop this
this))

s3 bucket eventual consistency errors

so over the weekend s3 eventual consistency on list operations because really slow and it seems the s3 list index was retuning files that were actually deleted, therefore i was getting COPY errors.

So i think ill need to run through the list and do a HEAD GET for each file to make sure its 100% there before adding to the manifest correct?

select recordtime,query, sliceid, substring(key from 1 for 20) as file,
substring(error from 1 for 300) as error,bucket,key
from stl_s3client_error

2015-04-27 20:12:18.131621 | 1498948 | 31 | staging/y=2015_m=4_d | s3-us-west-2.amazonaws.com 54.231.161.48 S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid C61A58E48CCA899D,ExtRid p5mfdOb0otnryj7yxur/iM/TS9RI+DfjqA58WL5GSlMZhEJsPJENIvHDRUFfCEkEA7VkQ7frq9U=,CanRetry 1,Abort: 5 5 5. | datasnap-events-prod-flattened-redshift-04-13-2015 | staging/y=2015_m=4_d=27_H=18_M=15_80deb2ee-2bc9-4c44-9a4f-b5f4b1e07e87.gz

Slow deletes during merge insert

Blueshift's default strategy is to merge new data with the existing. This follows Amazon's recommended method (see http://docs.aws.amazon.com/redshift/latest/dg/merge-examples.html for examples).

The delete statement seems to be very slow when run with tables that have primary keys that aren't also sort or distribution keys.

Import is making duplicate values instead of upsert

If I have multiple unimported csv files, and each has one change of same record, that record is written multiple times in database. Is there a way to tell blueshift to import file by file?

Posting Large number of files in s3 seem to get blueshift acting strange and ignoring large manifest file

we have a continuous stream of events coming into our API and placing small amounts of files in the s3 bucket and they are being processed fine.

Sometimes we import form customers very large amount of smaller files and when we do this we see a large manifest file being created and places at the root location.. but as the continous stream of events keep coming we notice that the initial COPY ANALYZE gets aborted and then it just starts importing the small files in a separate manifest and sortof "forgets" about the large one. Only when i restart blueshift does it reread the folder and create a new manifest and kick of the proper COPY.

is blueshift somehow aborting the large copy when it processes another smaller manifest and then forgets about it?

I am going to dig into the code a little myself and see if I can follow whats going on there.

Sometimes after restarting it will run the COPY then it aborts:

https://www.dropbox.com/s/a1qdmnc6dsnyjfk/Screenshot%202015-02-03%2012.14.06.png?dl=0