Comments (19)
Hi,
We did think about putting passwords into either configuration on the Blueshift instance or in environment variables. In the end we decided against it: all our data on S3 is readable only by our user, and it makes it easier to have a single Blueshift instance interoperating with many different Redshift clusters.
Do you have a different use case that would make this more valuable/necessary :)?
from blueshift.
Hi, it may just be our use case, but I'm told it is not uncommon for another team to accidentally flip the switch that makes all S3 buckets globally readable (perhaps only temporarily), and any credentials embedded in bucket folders would exacerbate the situation. The alternate line of thought says that it's all Amazon anyway (S3, Redshift, EC2, ...), but I tend to prefer the more modular, siloed approach, with layers that would limit the scope of any such breach, just in case.
Was I persuasive?
On Dec 17, 2014, at 2:25 AM, Paul Ingles [email protected] wrote:
I completely get what you mean. I agree, S3's GUI makes it very easy to do fatal things like that. When we built Blueshift, we didn't expect any other teams to tinker with the bucket we use, so we considered ourselves safe :)
Would it make sense to add support for system variables in the manifest files (such as $PRODUCTION_REDSHIFT_1_JDBC_URL), or were you thinking that having a :default-jdbc-url in the manifest file would be a better solution for your team?
I'm open to pull-requests. Paul might have another opinion :)
I'm not sure what the best approach would be, but I'm currently partial to a :redshift-db-name "my_db" entry in the manifest, with the config.edn file containing the conceptual link, perhaps something like :jdbc-connections [{:redshift-db-name "my_db" :jdbc-url "urlgoeshere"}] ...
Then the redshift password would be as secure as (and in the same place as) the credentials mechanism.
Any drawbacks to this suggestion?
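For concreteness, here is a minimal sketch of that layout (the table name and URL are made up, and the key names are just the ones proposed above, not an existing Blueshift schema):

```clojure
;; manifest.edn in the S3 bucket: names the logical database only,
;; no credentials (key names as proposed above; illustrative)
{:table            "events_fact"
 :redshift-db-name "my_db"}

;; config.edn on the Blueshift instance: maps the name to the real URL
{:jdbc-connections [{:redshift-db-name "my_db"
                     :jdbc-url "jdbc:postgresql://example-host:5439/my_db"}]}
```

That way the JDBC URL (and its password) lives only on the instance, alongside the S3 credentials.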
On Dec 17, 2014, at 12:03 PM, "Thomas G. Kristensen" [email protected] wrote:
Interesting! How about using {:redshift-connection :some-identifier ...} inside the manifest, then inside config.edn:
{:redshift-connections {:some-identifier {:jdbc-url "" ...}}}
Blueshift can then use the specified connection, or one configured inside the manifest file?
To be honest, this would probably also make it a little easier to clean up some of the repetition in manifests. I'd definitely welcome a pull request if I/we don't get around to implementing this soon.
Very elegant! :) This sounds great.
Just so I understand correctly: you mean (keys (read-string (slurp config))) would return (:s3 :redshift-connections :telemetry), right? And keys on :redshift-connections would return all the keywords referring to existing Redshift database JDBC URLs, e.g. (:redshift-db-1 :redshift-db-2), etc. Then manifests in each S3 bucket that contain something like {:table "table1" :database :redshift-db-1} ... would be eligible for loading.
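A quick self-contained REPL sketch of that lookup (the map is inlined here instead of slurped from a file, and its contents are made up; Blueshift's real config has more keys):

```clojure
;; Illustrative config; a real config.edn would hold real URLs.
(def config
  {:s3                   {:bucket "my-bucket"}
   :redshift-connections {:redshift-db-1 {:jdbc-url "jdbc:postgresql://host-1:5439/db1"}
                          :redshift-db-2 {:jdbc-url "jdbc:postgresql://host-2:5439/db2"}}
   :telemetry            {}})

(keys config)
;; => (:s3 :redshift-connections :telemetry)

(keys (:redshift-connections config))
;; => (:redshift-db-1 :redshift-db-2)

;; Resolving the URL for a manifest that says {:database :redshift-db-1}:
(get-in config [:redshift-connections :redshift-db-1 :jdbc-url])
;; => "jdbc:postgresql://host-1:5439/db1"
```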
I've actually never done a pull request (I'm embarrassed to say). I hope to put something together, and you can accept/reject/improve on the proposed changes.
Cheers,
Avram
On Dec 18, 2014, at 8:46 AM, Paul Ingles [email protected] wrote:
You've got it :) We're eagerly awaiting your pull request ;)
Oh, so sorry: I was still wrapping my mind around the existing code base when the holidays hit.
I still have limited computer access until past the new year, and there are a couple of things I hadn't yet fully grasped or had time to tinker with when I left off:
- use of core.async, which I've never studied or actually used,
- confirming what happens when two files with exactly the same data but different filenames occur (hopefully the upsert behavior means no insert or update changes), and
- confirming what happens when a new file is intended to clobber existing data.
Long story short, it will take me some time to get back to it, but I hope to within the next 2-3 weeks.
On Dec 31, 2014, at 2:38 AM, "Thomas G. Kristensen" [email protected] wrote:
A quick update... I'm back at this now and got Blueshift running out-of-the-box, with a few changes and thoughts to present. Again, this is a great project and I'm learning quite a bit by digging into the code; I had not used core.async, schema, or component until now.
One big reason things didn't work initially was that I needed to add {:server-side-encryption "AES256"} to the put-object call in the put-manifest function. Perhaps that should be an option in the future? That, and my misconceptions about what exactly was meant by :key-pattern and :data-pattern (for an S3 path like s3://my-bucket-name/key/part/goes/here/file.txt, the key is the key/part/goes/here/file.txt portion), took me a while to figure out. I used to think of the S3 bucket concept as the full S3 directory path (not anymore).
Now that I've resolved the SSE issue and my misunderstanding above, I'm about to go ahead with the jdbc-url changes discussed previously.
Component took me a lot of getting used to, especially when trying to understand why things didn't initially work as expected. Not passing a big map around in function arguments will likely be a bit annoying when particular args aren't available locally in the upcoming changes, but that's ok.
I like your aws-censor and wrote a similar jdbc-url-censor function.
Also, another issue I'm wrangling with: files in S3 that are successfully imported into Redshift are currently deleted by Blueshift. We use S3 buckets for storage, so we would want to disable that default deletion behavior. What are your thoughts on this?
cheers, -A
Hi, I have a number of cool changes ready for review.
Here is a glimpse... by the way, I have tested this on my setup and everything works (!!):
Config file
{:s3 {:credentials {:access-key "***"
                    :secret-key "***"}
      :bucket "your-bucket"
      :key-pattern ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"
      :keep-s3-files-on-import true}
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}]}
Manifest file
{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp"]
 :database :dw1 ;; must match a key in :redshift-connections to resolve the jdbc-url
 :options ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"}
What is the best way to accomplish the pull request? A new branch? I'm new to pull requests, so feel free to send git commands that I can execute on my end if need be.
cheers,
Avram
I ended up forking to https://github.com/aaelony/blueshift and creating a branch called "aaa". Please review and let me know... :)
Awesome, thanks for all your hard work! We're going to try and get some time today/tomorrow to take a look and get it merged. We'll keep you posted :)
Hi - Just wanted to let you know that I'm still cleaning up my code, but started thinking about other "modes" of use that I'm considering. I will open a new ticket to discuss this so as not to conflate this thread.
Hi there, just letting you know that the "aaa" branch (aaelony repo) now contains a number of enhancements, allowing more control over whether the data files and/or the manifest in the bucket should be kept or deleted.
This is now in the bucket manifest so it can differ for each bucket! :)
We've also experienced a nicety from the recent change of putting an alias to the jdbc-url in the bucket manifest rather than hardcoding the URL there: when we upgraded our Redshift cluster, the only thing we needed to edit was the jdbc-url in the config file :)
Sounds good! Sorry for not going through it yet - things have been a bit busy at work I'm afraid :-) I'm preparing to talk about Blueshift at the first Clojure experience report meetup in London on Tuesday, but after that I hope to have some time to go through what you have done.
Would it be possible to structure your changes as smaller individual improvements? It makes for a cleaner integration. I appreciate all the hard work you've put in, so I totally get it if you'd rather not ;-)
Cool that you'll be talking about Blueshift! It's been busy on my end as well. At some point I would like a workflow that makes it easier to submit smaller individual improvements, but for now that's not the case; I apologize for this. Hopefully at some point.
The main differences are that the bucket manifest needs two keys specifying whether or not to keep (1) the data files that match the data pattern, and (2) the manifest in the bucket itself (which signals whether to keep or stop watching the bucket). I was also made aware that begin/end transaction wrappers and autocommit set to on would improve Redshift load performance. (I ran into an issue where, on a smallish cluster, Blueshift was spamming Redshift with loads faster than it could handle, heheh.) I also made it easier to track down the Redshift manifests (by naming them less randomly) and plan to name the staging tables with a UUID (not implemented yet). I want to get to a lot of things, but many other projects rely on the data once it's in Redshift, which Blueshift is enabling. Very cool stuff, this. :)
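As a sketch, the two bucket-manifest keys described above might look something like this (the key names here are illustrative, not necessarily the exact ones used in the aaa branch):

```clojure
;; bucket manifest: key names are illustrative
{:table    "mydata_fact"
 :database :dw1
 ;; (1) keep the data files matching :data-pattern after a successful import?
 :keep-data-files true
 ;; (2) keep this manifest in the bucket? (deleting it stops the watch)
 :keep-manifest   true}
```

Because these live in the bucket manifest rather than in config.edn, the behavior can differ per bucket.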
Wow! You seem to get even more use out of it than we do ;-)
maybe so... :) Clojure is awesome.
On Jan 23, 2015, at 11:51 AM, "Thomas G. Kristensen" [email protected] wrote: