Comments (6)
Here are my current thoughts on a new specification for the manifest and config files to support the two modes of operation described below. This is a spec written before I code it up; the code in my repo doesn't necessarily support it yet.
Modes of Operation
Normal Mode
- Watches S3 buckets that contain a Manifest file and match the key pattern, looking for files that match the data pattern.
- Performs an upsert on new data when found.
- Deletes from the bucket the S3 files that were upserted into Redshift.
- Keeps watching the S3 bucket as per the Manifest file in the bucket.
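As a rough illustration of the watching step above, here is a minimal Python sketch (Blueshift itself is written in Clojure, so this is not its implementation). It assumes that the key pattern is matched against a directory prefix, that the manifest is a file named `manifest.edn` in that directory, and that the data pattern is searched within the file name. The function name and its arguments are hypothetical.

```python
import re
from collections import defaultdict

MANIFEST_NAME = "manifest.edn"  # assumed manifest file name


def watched_data_files(keys, key_pattern, data_patterns):
    """Map each watched directory to the data keys that should be imported.

    keys          -- every object key listed in the bucket
    key_pattern   -- regex string from the config's :key-pattern
    data_patterns -- {directory: regex string}, one per manifest's :data-pattern
    """
    # Group keys by their directory prefix.
    by_dir = defaultdict(list)
    for key in keys:
        directory, _, name = key.rpartition("/")
        by_dir[directory].append((name, key))

    watched = {}
    for directory, entries in by_dir.items():
        names = {name for name, _ in entries}
        if MANIFEST_NAME not in names:
            continue  # no manifest, so the directory is not watched
        if not re.fullmatch(key_pattern, directory):
            continue  # directory is excluded by the config
        data_re = re.compile(data_patterns[directory])
        watched[directory] = sorted(
            key for name, key in entries
            if name != MANIFEST_NAME and data_re.search(name)
        )
    return watched
```

Directories without a manifest are skipped entirely, which is what lets the Alternate mode below stop watching a directory simply by deleting its manifest.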
Example config file (Normal mode)
{:s3 {:credentials {:access-key "***"
                    :secret-key "***"}
      :bucket "your-bucket"
      :key-pattern ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256" ;; optional
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }
Example manifest.edn file (Normal mode)
{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp"]
 :database :dw1 ;; must match the config to resolve the jdbc-url
 :options ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 :keep-data-pattern-files-on-import false
 :keep-manifest-upon-import true
 }
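The `:database` key in the manifest has to be resolved against `:redshift-connections` in the config, so the JDBC URL never needs to live in the bucket. A minimal Python sketch of that lookup follows, assuming both files have already been parsed into plain dicts that mirror the EDN above; the function name is hypothetical and this is not Blueshift's own code.

```python
def resolve_jdbc_url(config, manifest):
    """Return the JDBC URL for the connection named by the manifest.

    config   -- dict with a "redshift-connections" list of single-entry dicts
    manifest -- dict with a "database" entry naming one of those connections
    """
    name = manifest["database"]
    for connection in config["redshift-connections"]:
        if name in connection:
            return connection[name]["jdbc-url"]
    # Fail loudly rather than guessing a default connection.
    raise KeyError(f"manifest database {name!r} is not defined in the config")


config = {
    "redshift-connections": [
        {"dw1": {"jdbc-url": "jdbc:postgresql://foobar...."}},
        {"dw2": {"jdbc-url": "jdbc:postgresql://blahblah.fake.com..."}},
    ]
}
manifest = {"table": "mydata_fact", "database": "dw1"}
url = resolve_jdbc_url(config, manifest)
```

Raising on an unknown name is a design choice in this sketch: a manifest that points at an undefined connection is treated as a validation error instead of being silently ignored.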
Alternate Mode
- Watches S3 buckets that contain a Manifest file and match the key pattern, looking for files that match the data pattern.
- Performs an upsert on new data when found.
- Does not delete the S3 data files.
- If the data loads successfully, deletes the Blueshift manifest file in the S3 bucket to stop watching the bucket.
Example config file (Alternate mode)
{:s3 {:credentials {:access-key "***"
                    :secret-key "***"}
      :bucket "your-bucket"
      :key-pattern ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256" ;; optional
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }
Example manifest.edn file (Alternate mode)
{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp"]
 :database :dw1 ;; must match the config to resolve the jdbc-url
 :options ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 :keep-data-pattern-files-on-import true
 :keep-manifest-upon-import false
 }
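The two example manifests differ only in `:keep-data-pattern-files-on-import` and `:keep-manifest-upon-import`, so the mode can be read straight from those flags. Here is a small Python sketch of the cleanup decision after an import attempt. The function name is hypothetical, and the defaults used when a flag is missing (delete data, keep manifest, i.e. Normal mode) are an assumption, not something the spec states.

```python
def cleanup_plan(manifest, manifest_key, imported_keys, load_succeeded):
    """Return the S3 keys to delete after an import attempt.

    manifest       -- parsed manifest as a dict
    manifest_key   -- S3 key of the manifest file itself
    imported_keys  -- S3 keys of the data files that were just loaded
    load_succeeded -- True only if the load into Redshift committed
    """
    if not load_succeeded:
        return []  # never delete anything after a failed load

    to_delete = []
    # Assumed default: data files are removed unless the flag says otherwise.
    if not manifest.get("keep-data-pattern-files-on-import", False):
        to_delete.extend(imported_keys)
    # Assumed default: the manifest stays, so the bucket keeps being watched.
    if not manifest.get("keep-manifest-upon-import", True):
        to_delete.append(manifest_key)
    return to_delete


normal = {"keep-data-pattern-files-on-import": False,
          "keep-manifest-upon-import": True}
alternate = {"keep-data-pattern-files-on-import": True,
             "keep-manifest-upon-import": False}
data = ["events/blah-1.tsv.gz", "events/blah-2.tsv.gz"]

normal_plan = cleanup_plan(normal, "events/manifest.edn", data, True)
alternate_plan = cleanup_plan(alternate, "events/manifest.edn", data, True)
```

With the Normal manifest only the data files are scheduled for deletion; with the Alternate manifest only the manifest is, which stops further watching of that directory.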
It would be great to get uswitch's thoughts on this.
-A
from blueshift.
Did you implement this? I don't see it in your fork.
Hello, yes I have this implemented and I've used it for at least a year. I will definitely make it available once I get the chance to look into it (it's been a while) :)
update: found the code, will carve out time to create a branch (hopefully today)...
Okay, it's merged into the master branch now. Let me know how it goes... ;)
Thank you! Have you looked into a mode that doesn't upsert?
If the data doesn't have a primary key, I'd like Blueshift to just do a simple insert.
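To make the request concrete, here is a hedged Python sketch of how the upsert and plain-insert cases could differ once new rows sit in a staging table. It uses the common Redshift delete-then-insert merge pattern; it is not taken from Blueshift's source, and the function name, the staging table name, and the idea of treating an empty `:pk-columns` as "insert only" are all assumptions.

```python
def merge_statements(table, staging, pk_columns):
    """Build the SQL that moves rows from a staging table into the target.

    table      -- target table name, e.g. the manifest's :table
    staging    -- name of the staging table the files were copied into
    pk_columns -- the manifest's :pk-columns; empty means plain insert
    """
    statements = []
    if pk_columns:
        # Upsert: remove target rows that are about to be replaced.
        join = " AND ".join(
            f"{table}.{column} = {staging}.{column}" for column in pk_columns
        )
        statements.append(f"DELETE FROM {table} USING {staging} WHERE {join}")
    # In both cases the staged rows are appended to the target.
    statements.append(f"INSERT INTO {table} SELECT * FROM {staging}")
    return statements


upsert_sql = merge_statements("mydata_fact", "staging", ["id", "blah"])
insert_sql = merge_statements("mydata_fact", "staging", [])
```

Without primary key columns the delete step is simply skipped, so repeated loads of the same file would produce duplicate rows; that trade-off is inherent to a plain insert.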