Giter Site home page Giter Site logo

Comments (19)

pingles avatar pingles commented on June 14, 2024

Hi,

We did think about putting passwords into either configuration on the blueshift instance or in environment variables. In the end we decided against it given all our data on S3 is readable only by our user and it makes it easier to have a single Blueshift instance interoperating with many different Redshift clusters.

Do you have a different use case that would make this more valuable/necessary :)?

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Hi, it may be our use case, but I'm told that it is not uncommon that another team accidentally flips the switch that makes all S3 buckets globally readable - perhaps temporarily - and any embedded credentials in bucket folders might exacerbate the situation. The alternate line of thought says that well it's all Amazon anyways (s3, redshift, ec2,..) but I tend to prefer the more modular siloed approach with layers that would limit the scope of any such breach - just in case.

Was I persuasive?

On Dec 17, 2014, at 2:25 AM, Paul Ingles [email protected] wrote:

Hi,

We did think about putting passwords into either configuration on the blueshift instance or in environment variables. In the end we decided against it given all our data on S3 is readable only by our user and it makes it easier to have a single Blueshift instance interoperating with many different Redshift clusters.

Do you have a different use case that would make this more valuable/necessary :)?


Reply to this email directly or view it on GitHub.

from blueshift.

tgk avatar tgk commented on June 14, 2024

I completely get what you mean. I agree, S3's GUI can make it very easy to do fatal things like that very easily. When we built Blueshift, we didn't expect any other teams to tinker with the bucket we use, so we considered ourselves safe :)

Would it make sense to add support for system variables in the manifest-files (such as $PRODUCTION_REDSHIFT_1_JDBC_URL) or were you thinking having a :default-jdbc-url in the manifest file would be a better solution for your team?

I'm open to pull-requests. Paul might have another opinion :)

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

I'm not sure what the best approach would be, but am currently partial to a :redshift-db-name "my_db" in the manifest, and the config.edn file containing the conceptual link, perhaps something like :jdbc-connections [ {:redshift-db-name "my_db" :jdbc-url "urlgoeshere" }] ...

Then the redshift password would be as secure as (and in the same place as) the credentials mechanism.

Any drawbacks to this suggestion?

On Dec 17, 2014, at 12:03 PM, "Thomas G. Kristensen" [email protected] wrote:

I completely get what you mean. I agree, S3's GUI can make it very easy to do fatal things like that very easily. When we built Blueshift, we didn't expect any other teams to tinker with the bucket we use, so we considered ourselves safe :)

Would it make sense to add support for system variables in the manifest-files (such as $PRODUCTION_REDSHIFT_1_JDBC_URL) or were you thinking having a :default-jdbc-url in the manifest file would be a better solution for your team?

I'm open to pull-requests. Paul might have another opinion :)


Reply to this email directly or view it on GitHub.

from blueshift.

pingles avatar pingles commented on June 14, 2024

interesting! how about using {:redshift-connection :some-identifier ... inside the manifest, then inside config.edn:

{:redshift-connections {:some-identifier {:jdbc-url "" ...}}

Blueshift can then use the specified connection or one configured inside the manifest file?

from blueshift.

pingles avatar pingles commented on June 14, 2024

To be honest, this would probably also make it a little easier to cleanup some of the repetition in manifests. I'd definitely welcome a pull request if I/we don't get around to implementing this soon.

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Very elegant!:) This sounds great.

Just so I understand correctly, you mean (keys (read-string (slurp config))) would return (:s3 :redshift-connections :telemetry), right? and keys on :redshift-connections would return all keywords refering to existing redshift database jdbc urls... e.g. ( :redshift-db-1 :redshift-db-2) etc... Then, manifests in each s3 bucket that contain something like {:table "table1" :database :redshift-db-1 } ... would be eligible for loading.

I've actually never done a pull request (I'm embarassed to say). I hope to put something together and you can accept/reject/improve-on proposed changes.

Cheers,
Avram

On Dec 18, 2014, at 8:46 AM, Paul Ingles [email protected] wrote:

To be honest, this would probably also make it a little easier to cleanup some of the repetition in manifests. I'd definitely welcome a pull request if I/we don't get around to implementing this soon.


Reply to this email directly or view it on GitHub.

from blueshift.

tgk avatar tgk commented on June 14, 2024

You've got it :) We're eagerly awaiting your pull request ;)

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Oh so sorry, I was still wrapping my mind around the existing code base when the holidays hit.

Still with limited computer access for a bit until past the new year but there are still a couple things I didn't yet fully grasp and get time to tinker with when I left off:

  1. use of core.async, having never studied or actually used it,
    1. confirming what happens when 2 files with the same exact data but different filenames occur (hopefully upsert behavior means no update or insert changes), and
    2. confirming what happens when a new file is intended to clobber existing data.

Long story short it will take me some time to get back to it but hope to within the next 2-3 weeks.

sent from my iphone

On Dec 31, 2014, at 2:38 AM, "Thomas G. Kristensen" [email protected] wrote:

You've got it :) We're eagerly awaiting your pull request ;)


Reply to this email directly or view it on GitHub.

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

A quick update... I'm back at this now and got blueshift running out-of-the-box with a few changes and thoughts to present. Again, this is a great project and I'm learning quite a bit by digging into the code. Had not used core.async, schema, or component until now.

One big reason things didn't work initially was that I needed to add {:server-side-encryption "AES256"} to put-object call in put-manifest function. Perhaps that should be an option in the future? That and my misconceptions as to what exactly was meant by :key-pattern and :data-pattern (which for an s3 path is understood to be s3://my-bucket-name/key/part/goes/here/file.txt) took me a while to figure out. I used to think of S3 bucket concept as the full S3 directory path (not anymore).

Now that I've resolved the issues with sse and my understanding above, I'm about to go ahead with the jdbc-url changes discussed previously.

Component takes/took me a lot of getting used to... especially when trying to understand why things didn't work initially as expected. That we're not passing a big map around in function arguments will likely be a bit annoying when particular args aren't available locally in the upcoming changes but that's ok.

I like your aws-censor and wrote a similar jdbc-url-censor function.

Also, another issue I'm wrangling with is that files in s3 that are successfully imported into Redshift are deleted (currently) by Blueshift. We use s3 buckets for storage, so we would want to disable that default deletion behavior. What are your thoughts on this?

cheers, -A

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Hi, I have a number of cool changes ready for review.

Here is a glimpse... by the way I have tested this on my setup and everything works (!!):

Config file

{:s3 {:credentials   {:access-key "***"
                      :secret-key "***"}
      :bucket        "your-bucket"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"
      :keep-s3-files-on-import true
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }

Manifest file

{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp" ]
 :database :dw1    ;; must match the config to resolve the jdbc-url
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 }

What is the best way to accomplish the pull request? new branch? I'm new to pull requests so feel free to send git commands that I can execute on my end if need be.

cheers,
Avram

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

I ended up forking to https://github.com/aaelony/blueshift and creating a branch called "aaa". Please review and let me know... :)

from blueshift.

pingles avatar pingles commented on June 14, 2024

Awesome- thanks for all your hard work! We're going to try and get some time today/tomorrow to take a look and get it merged. We'll keep you posted :)

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Hi - Just wanted to let you know that I'm still cleaning up my code, but started thinking about other "modes" of use that I'm considering. I will open a new ticket to discuss this so as not to conflate this thread.

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Hi there, just letting you know that the "aaa" branch (aaelony repo) now contains a number of enhancements, allowing more control as to whether the data files in the bucket and/or the manifest in the bucket are desired to be kept or deleted.

This is now in the bucket manifest so it can differ for each bucket! :)

We've also experienced a nicety from the recent change of putting the the alias to the jdbc-url in the bucket manifest rather than hardcorded in the manifest when we upgraded our Redshift cluster and the only thing we needed to edit was the config file jdbc-url :)

from blueshift.

tgk avatar tgk commented on June 14, 2024

Sounds good! Sorry for not going through it yet - things have been a bit busy at work I'm afraid :-) I'm preparing to talk about Blueshift at the first Clojure experience report meetup in London on Tuesday, but after that I hope to have some time to go through what you have done.

Would it be possible to structure your changes as smaller individual improvements? It makes for a cleaner integration. I appreciate all the hard work you've put in, so I totally get if you'd rather now ;-)

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

Cool that you'll be talking about Blueshift! It's been busy on my end as well. At some point I would like to have a work flow that makes it easier to have smaller individual improvements, for now though that's not the case. I apologize for this. Hopefully at some point. The main differences are that the bucket manifest needs to have the 2 keys specifying whether or not to keep (1) the data files that match the data pattern and (2) the manifest in the bucket itself (which signals whether to keep or stop watching the bucket). I was also made aware that begin/end transaction wrappers and autocommit set to on would improve redshift load performance. (Ran into an issue where on a smallish cluster, Blueshift was spamming redshift with loads it couldn't handle faster enough, heheh.) I also made it easier to track down redshift manifests (by naming them less randomly) and also plan to name the staging tables with a UUID (not implemented yet). Want to get to a lot of things, but many other projects which rely on the data once it's in redshift which blueshift is enabling. Very cool stufff this. :)

from blueshift.

tgk avatar tgk commented on June 14, 2024

Wow! You seem to get even more use out of it than we do ;-)

from blueshift.

aaelony avatar aaelony commented on June 14, 2024

maybe so... :) Clojure is awesome.

On Jan 23, 2015, at 11:51 AM, "Thomas G. Kristensen" [email protected] wrote:

Wow! You seem to get even more use out of it than we do ;-)


Reply to this email directly or view it on GitHub.

from blueshift.

Related Issues (15)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.