
Comments (4)

iconara commented on June 16, 2024

Currently the loading strategy is not configurable. We've prepared for it to be configurable, which is why it's in ConnectionConfiguration. We just haven't figured out whether it should be configurable and, if so, whether it should only be possible to switch between the current default (bypass GetQueryResults and load straight from S3) and the GetQueryResults strategy, or whether you should be able to plug in your own strategies.

The same goes for the polling strategy, which could be useful to be able to configure, but also useful to be able to replace with your own implementation in some cases.

What's your use case? Do you want to switch to using GetQueryResults or do you want to provide your own loading strategy?

from athena-jdbc.

DALDEI commented on June 16, 2024

First, it is still unclear to me what the 'current' loading strategy is -- the docs and the implementation seem misaligned. When I went to look, I discovered it's not changeable.
What led to this was trying to debug why results did not seem to get cached even though I had set the query token.

`By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.`

I was rerunning the same query and specifying the query token, but it still took as long.
What I believe now is that I (and maybe you?) misunderstood what this does.
It's not, as it first seemed, a way to retrieve the results of a prior query from cache. Rather, it seems more like a deduplication solution, much like SQS receipt tokens, to prevent accidentally issuing the same query again when the client side failed but the server side had succeeded.
In none of the cases I tried does it do what I hoped, which is to bypass the query and go right to the results from the last one that worked.

So, on the path to see if I could implement that, I thought that if I could get the S3 URL to the results then I could fetch them myself. And yes, I can, but that's not so useful without the metadata, and I don't want to parse that when you have a parser already.
But I can't use your parser without also issuing a new query.

So back to step one: there is no obvious way to cache query results without making another copy first. I was hoping to deduce a way to reuse the polling or results code via the various configurable features, but found that they are in fact not configurable.
That's where I stopped. I don't know if it is a short step or a long hike from there to being able to reuse previous query results, and I didn't see an easy way to tell.


So to answer your question: my use case is neither, although it might be both if they were configurable. Hard to tell.


grddev commented on June 16, 2024

> First it is unclear to me still what the 'current' loadings strategy is -- the docs and the implementation seem misaligned.

If there is a misalignment, it would be great to fix it. As far as I can see, though, the documentation specifies that the default loading strategy is to load from S3, bypassing the Athena API, and the implementation selects the S3 loading strategy, which implements this behavior.

> By setting a client request token on a query execution you can make Athena reuse a previous result set if the exact same query has already been run. If you run the same query multiple times this can save money and improve performance.

> I was rerunning the same query and specifying the query token but it still took as long. What I believe now is I (and maybe you?) misunderstood what this does.
> It's not as it at first seemed -- a way to retrieve the results from cache from a prior query -- rather it seems more like a deduplication solution much like SQS receipt tokens -- to prevent accidentally issuing the same query due to failure on the client side when the server side had succeeded.
> In none of the cases I tried does it do what I hoped and bypass the query and go right to the results from the last one that worked.

The feature is indeed intended and described as a way to ensure exactly-once processing, but we are actively using it for the caching benefit described in the documentation. Arguably, the README could be clearer about what the original purpose of the token is, and that it can also be used for caching. We could probably also make it clearer that you must provide the same token every time you execute the query, including the first time. The way it is written now, it sounds as if you could provide a token to gain access to a previously executed query, which is not true.
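
To illustrate the point about providing the same token on every execution, here is a minimal sketch of one way to get a stable token: derive it from the SQL text itself. The class name `QueryTokens` and the method `tokenFor` are hypothetical and not part of athena-jdbc, and hashing the query is only an assumption about how one might choose tokens. How the token is then handed to the driver is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of athena-jdbc.
public class QueryTokens {
    // Derives a stable token from the SQL text: the same query always yields
    // the same token, so the first run and every rerun share one token.
    public static String tokenFor(String sql) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(sql.trim().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString(); // 64 lowercase hex characters
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sql = "SELECT COUNT(*) FROM events";
        // Both calls print the same token because the SQL text is identical.
        System.out.println(tokenFor(sql));
        System.out.println(tokenFor(sql));
    }
}
```

A token derived this way never changes for a given query, so if reuse works as described above, a rerun would keep returning the earlier result set even after the underlying data has changed. Mixing something like a date into the hashed text would be one way to bound how long a result is reused.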


dbarvitsky commented on June 16, 2024

Here is a specific use case for overriding the configuration:

The application must assume a specific role to access Athena and S3 (one that is different from the default role the process is running with).

The way to make it sort of work with 4.0:

  • Create a custom class io.burt.athena.configuration.CustomConnectionConfigurationFactory extending ConnectionConfigurationFactory, overriding the createConnectionConfiguration method, and inlining the ConnectionConfiguration interface there.
  • Create a custom class io.burt.athena.CustomDataSource extending AthenaDataSource that takes a ConnectionConfigurationFactory as an argument and passes it to the super constructor.
  • Now you can create a CustomDataSource instead of an AthenaDataSource and pass your custom connection configuration to it.
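
The shape of that workaround can be sketched roughly as follows. This is a self-contained illustration of the pattern only. Every type here (`SketchConfiguration`, `SketchConfigurationFactory`, `AssumedRoleConfigurationFactory`, `SketchDataSource`) is a hypothetical stand-in, not the real athena-jdbc class, and the real method signatures may well differ.

```java
// Illustration of injecting a custom configuration factory into a data source.
// All types are hypothetical stand-ins, NOT the athena-jdbc classes.
public class ConfigInjectionSketch {
    // Stand-in for a connection configuration.
    public interface SketchConfiguration {
        String describeCredentials();
    }

    // Stand-in for the default factory, which uses the process's own role.
    public static class SketchConfigurationFactory {
        public SketchConfiguration create(String database) {
            return () -> "default credentials for " + database;
        }
    }

    // Custom factory that overrides creation to use an assumed role.
    public static class AssumedRoleConfigurationFactory extends SketchConfigurationFactory {
        private final String roleArn;

        public AssumedRoleConfigurationFactory(String roleArn) {
            this.roleArn = roleArn;
        }

        @Override
        public SketchConfiguration create(String database) {
            return () -> "assumed role " + roleArn + " for " + database;
        }
    }

    // Stand-in for the data source; the extra constructor is the injection point.
    public static class SketchDataSource {
        private final SketchConfigurationFactory factory;

        public SketchDataSource() {
            this(new SketchConfigurationFactory());
        }

        public SketchDataSource(SketchConfigurationFactory factory) {
            this.factory = factory;
        }

        public String connect(String database) {
            return factory.create(database).describeCredentials();
        }
    }

    public static void main(String[] args) {
        SketchDataSource dataSource = new SketchDataSource(
            new AssumedRoleConfigurationFactory("arn:aws:iam::123456789012:role/example"));
        System.out.println(dataSource.connect("analytics"));
    }
}
```

The point of the sketch is that the data source accepts the factory through its constructor, so no driver-level registration is involved.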

The default Athena driver, unfortunately, auto-registers itself with the default configuration upon class load and therefore leaves no opportunity to inject a custom configuration. The original non-open-source Athena driver sort of dealt with this problem by having a configuration parameter holding the fully-qualified name of a class that would do the configuration work. I'd argue this is pretty nasty and not a good way to do these things. There are many ways of dealing with configuration injection here, but none of them are decent. I'd say a half-bad solution would be to have a base driver that does not register itself.

LMK if you want an MR for this.

