Giter Site home page Giter Site logo

awr.athena's Introduction

HIA RYLAH

awr.athena's People

Contributors

daroczig avatar nfultz avatar tschutte avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

awr.athena's Issues

Issue with expired token

First of all: thanks a lot for this super useful package!

In general everything works well, but I have one issue that I have been trying to fix for a while, but couldn't figure out how...

I have aws credentials that expire every hour. As soon as they are refreshed, I get the following error in R when trying to query/reconnect:

Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.sql.SQLException: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. The security token included in the request is expired [Execution ID not available]

The only work around I have now is to restart the R session, which is quite annoying. I was hoping to somehow refresh the credentials the AWR.Athena() function is using without having to restart the session. Is there a way to do this?

Dealing with fetchSize internally

It's probably a good idea to include the fix proposed at https://stackoverflow.com/questions/44379318/the-requested-fetchsize-is-more-than-the-allowed-value-in-athena-with-jdbc-dri to deal with "The requested fetchSize is more than the allowed value in Athena." issue.

The solution itself is quite trivial, however incorporating it into the package is a bit trickier, since one needs to reexport the appropriate S4 method. I have a working solution at hand, so let me know if you'd like a PR.

CRAN removal

Hi,
I note that the package was removed from CRAN:

Archived on 2023-04-13 at the request of the maintainer.

Is there a reason for this? If there's some fix required to keep it on CRAN, are you open to reviewing PR(s) that would address the issue(s)?

Thank you

we shld def collab

I had started https://github.com/hrbrmstr/metis a cpl weeks ago since one of the more overtly gnarly bits of athena is the auth (if you're not doing basic creds). i have no particular need to be the keeper of an Athena pkg (but i'd lobby for a quirky name like the greek god who was athena's helper :-). Maintaining my Apache Drill wrapper pkg is plenty of work on it's own. You folks jumping in here to fill a need is ++gd.

Package naming bias aside (it rly doesn't matter what the pkg is called) i do think the auth needs to be a seamless part of the pkg since there are so many ways to do so with AWS. it should also, likely, have support for non-default connection parameters (for some stuff at work I need to change some of them, hence my inclusion). Feel free to steal as little or as much from my scant populated nascent pkg start as you like.

CRAN is going to have a hissy fit abt the auto-dl of the driver. I'd suggest making (i was going to do this, too) either an athenajars pkg and keep it updated with the latest aws athena jar or an explicit message on failure to find driver paired with a downloader function.

i was also going to head down a path of dplyr integration, but one thing that needs to be taken care of for that is the schema naming inclusion and i'm not sure it's worth the effort in the short term.

when i'm back from vacation i'll gladly join in here with any PRs/help/etc.

Notes from a newbie regarding credentials configuration

Heya here, thanks for the package.

I'm just writing down notes regarding configuration in case it can be added somewhere in the docs / can help another newbie (I read issues to understand that environment variables didn't mean R environment variables ๐Ÿ™‚ )

First note for newbies: to get the key and key ID go into your "My security credentials" settings from the drop-down menu at the top right of the console where your username is.

I was hoping to be able to use keyring and Sys.setenv() but since environment variables are not read by R and I seem to be too unaware of Java/Ubuntu I ended up installing AWS CLI and running aws configure which was fairly straightforward and now I can open a connection. :-)

I suppose that because R doesn't handle the credentials there's no way to check the configuration worked from R without trying to establish a connection?

Provide for AWR::Athena() a path variable to AthenaJDBC41_2.0.7.jar

Description

Thanks for this great tool. I have the problem, that I can't figure out where to put the AthenaJDBC41_2.0.7.jar to avoid downloading it on Docker run.

Rationale

I would like to use my Docker Container on AWS Fargate. AWS Fargate is natively stateless Docker service. So, I would like to provide the AthenaJDBC41_2.0.7.jar by default within the Docker Container to increase the spin up time of my Docker Container. Therefore, it would be nice to have a path variable within AWR::Athena() which points to the file.

Use case

I need a high scalable BI app. Until now it takes up to 5 seconds to download the AthenaJDBC41_2.0.7.jar within my Container, which is definitely a long time. If the up app gets terminated, this happen every time.

Driver doesn't accept environment variables

Trying out the package on Windows 10x64 and I can't initialize a connection using system environment variables. I'm just goofing around right now so I don't have a file with these pre-stored. I installed from github directly. I've replaced my personal info with dummy levels.

Sys.setenv("AWS_ACCESS_KEY_ID"="id",
                 "AWS_SECRET_ACCESS_KEY"="key")
library(DBI)
con <- dbConnect(AWR.Athena::Athena(), region='us-east1', s3_staging_dir='s3://bucket', schema_name='schema')

Yields this error:

Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1],  : 
  java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.athena.jdbc.shaded.com.amazonaws.ClientConfiguration

Using IAM Role with MFA for secure access

I am trying now for hours to access my Athena database (which works charmingly via PyAthena in Python).

My setup looks as follows:

  • I use AWR.Athena version 2.0.7.0
  • I use a user profile in ~/.aws/credentials
  • I use a role defined in IAM. This role has full Athena access
  • I use the R package aws.iam to set up a session for the user with the role. As the user needs an MFA, i provde the MFA device ARN together with the code
  • I setup an environment Variable AWS_SESSION_TOKEN to hold the session token I got.
  • I call then dbConnect with the Athena driver and get the following error message: Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.sql.SQLException: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. The security token included in the request is invalid [Execution ID not available]

Here is my code:

library( DBI )
library( AWR.Athena )
library( aws.iam )
aws.signature::use_credentials( profile = 'research' )
session = get_session_token( id  = 'arn:aws:iam::<YourAccountNumber>:mfa/<YourIAMUserName>', 
                             code = 258028,
                             role = 'arn:aws:iam::<YourAccountNumber>:role/<YourRoleName>', 
                             duration = 3600 )
Sys.setenv( AWS_SESSION_TOKEN = session$SessionToken )

con <- dbConnect( AWR.Athena::Athena(), 
                  region = 'eu-central-1',
                  S3OutputLocation = 's3://your-output-location/', 
                  Schema = 'default',
                  Workgroup = '<YourWorkGroup>' )
# Query a list of all tables.
dbListTables( con )

I also tried the dbConnect() command with aws_credentials_provider_class = "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider", but I had no success, either.

I think this is an important topic as secure access to Athena from R will be needed as soon as a production setting is considered.

For reference: In Python I use the following to set up a connection with Role and MFA using a profile:

conn = connect( s3_staging_dir        = 's3://your-output-location/',
                region_name           = 'eu-central-1',
                work_group            = '<YourWorkGroup>',
                profile_name          = '<YourProfileName>'  )

Does anyone has an idea on this? Very appreciated :)

Use async queries API instead

So I've been doing some googling about PyAthenaJDBC while trying to triage #16, came across this by @davoscollective:

https://medium.com/@davedecahedron/ive-tested-both-and-pyathenajdbc-is-a-lot-slower-i-suppose-partly-because-it-is-using-the-athena-fdf56a9b715

I have tested with various clients (Tableau, DBeaver, and basic Java app) and retrieving data is a lot slower than it should be. When you run a query with the AWS Athena console, a results csv is written very quickly to S3. I did a test with a single table of 15m rows. The Athena query and csv file to S3 completes in less than 2 minutes. I can then download it to my local machine in less than a minute. To do the same operation via the JDBC driver takes over 2 hours. I am now thinking a better strategy is to query the metadata then start the Athena query asynchronously, poll it until completion and then download the csv file directly from the s3 staging directory and combine with metadata for correct types.

this is just one anecdote, and he is writing about the 1.x version of the driver, but it may be worth exploring.

Handling multiple AWS credentials

I need to frequently switch between multiple AWS accounts for querying Athena.

Right now my workaround is to create multiple environment variables in my .Renviron file and then overwrite the default environment variables like so

Sys.setenv(AWS_SECRET_ACCESS_KEY = Sys.getenv("AWS_SECRET_ACCESS_KEY_2"),
           AWS_ACCESS_KEY_ID = Sys.getenv("AWS_ACCESS_KEY_ID_2"))

con <- dbConnect(AWR.Athena::Athena(), 
                 region='eu-west-1', 
                 Schema='my_schema')

It's obviously not a big issue obviously but I'm wondering if there is a way to directly pass the credentials to the driver? I'm guessing the ~/.aws/credentials file is also not an option for me as it will always go for the default one? Happy to help working on a feature if you think it worthwhile, just need some pointers in the right direction

Problem with credentials

Hi, thanks a lot for the package!
I've tried to play around it recenlty, but I keep getting error:

con <- dbConnect(AWR.Athena::Athena(), region='eu-west-2',
                 s3_staging_dir='s3://aws-athena-query-results-XXXXX',
                 schema_name='default')

Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1],  : 
  com.amazonaws.athena.jdbc.shaded.com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain

Even though I tried to set my credentials in many different ways. In ~/.aws/credentials I've got:

AWS_ACCESS_KEY_ID="accessaccess"
AWS_SECRET_ACCESS_KEY="secretsecret"

Similarly in R:

Sys.setenv("AWS_ACCESS_KEY_ID" = "accessaccess",
           "AWS_SECRET_ACCESS_KEY" = "secretsecret")

None of above worked for me. Any guesses what could go wrong?

Can't execute "MSK Repair table" with AWR.athena version 2.0.6.1

Hi,
i try to update library AWR.athena version 2.0.6-1 but some R commands don't work anymore :
With the oldest version i can execute this command : dbExecute(con, "MSCK REPAIR TABLE db.mytable") without errors
With the new version i got this error message : Unable to retrieve JDBC result set for MSCK REPAIR TABLE db.mytable([Simba]JDBC A ResultSet was expected but not generated from query "MSCK REPAIR TABLE db.mytable". Query not executed. )
I try also dbSendQuery, dbSendStatement but it doesn't work.
i use AthenaJDBC42_2.0.7.jar driver.
Thanks for your help.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.