nfultz / awr.athena Goto Github PK
View Code? Open in Web Editor NEWShort R Wrapper for Athena JDBC connections
Short R Wrapper for Athena JDBC connections
First of all: thanks a lot for this super useful package!
In general everything works well, but I have one issue that I have been trying to fix for a while, but couldn't figure out how...
I have aws credentials that expire every hour. As soon as they are refreshed, I get the following error in R when trying to query/reconnect:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.sql.SQLException: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. The security token included in the request is expired [Execution ID not available]
The only work around I have now is to restart the R session, which is quite annoying. I was hoping to somehow refresh the credentials the AWR.Athena() function is using without having to restart the session. Is there a way to do this?
It's probably a good idea to include the fix proposed at https://stackoverflow.com/questions/44379318/the-requested-fetchsize-is-more-than-the-allowed-value-in-athena-with-jdbc-dri to deal with "The requested fetchSize is more than the allowed value in Athena." issue.
The solution itself is quite trivial, however incorporating it into the package is a bit trickier, since one needs to reexport the appropriate S4 method. I have a working solution at hand, so let me know if you'd like a PR.
Hi,
I note that the package was removed from CRAN:
Archived on 2023-04-13 at the request of the maintainer.
Is there a reason for this? If there's some fix required to keep it on CRAN, are you open to reviewing PR(s) that would address the issue(s)?
Thank you
I had started https://github.com/hrbrmstr/metis a cpl weeks ago since one of the more overtly gnarly bits of athena is the auth (if you're not doing basic creds). i have no particular need to be the keeper of an Athena pkg (but i'd lobby for a quirky name like the greek god who was athena's helper :-). Maintaining my Apache Drill wrapper pkg is plenty of work on it's own. You folks jumping in here to fill a need is ++gd.
Package naming bias aside (it rly doesn't matter what the pkg is called) i do think the auth needs to be a seamless part of the pkg since there are so many ways to do so with AWS. it should also, likely, have support for non-default connection parameters (for some stuff at work I need to change some of them, hence my inclusion). Feel free to steal as little or as much from my scant populated nascent pkg start as you like.
CRAN is going to have a hissy fit abt the auto-dl of the driver. I'd suggest making (i was going to do this, too) either an athenajars
pkg and keep it updated with the latest aws athena jar or an explicit message on failure to find driver paired with a downloader function.
i was also going to head down a path of dplyr integration, but one thing that needs to be taken care of for that is the schema naming inclusion and i'm not sure it's worth the effort in the short term.
when i'm back from vacation i'll gladly join in here with any PRs/help/etc.
Heya here, thanks for the package.
I'm just writing down notes regarding configuration in case it can be added somewhere in the docs / can help another newbie (I read issues to understand that environment variables didn't mean R environment variables ๐ )
First note for newbies: to get the key and key ID go into your "My security credentials" settings from the drop-down menu at the top right of the console where your username is.
I was hoping to be able to use keyring
and Sys.setenv()
but since environment variables are not read by R and I seem to be too unaware of Java/Ubuntu I ended up installing AWS CLI and running aws configure
which was fairly straightforward and now I can open a connection. :-)
I suppose that because R doesn't handle the credentials there's no way to check the configuration worked from R without trying to establish a connection?
Thanks for this great tool. I have the problem, that I can't figure out where to put the AthenaJDBC41_2.0.7.jar to avoid downloading it on Docker run.
I would like to use my Docker Container on AWS Fargate. AWS Fargate is natively stateless Docker service. So, I would like to provide the AthenaJDBC41_2.0.7.jar by default within the Docker Container to increase the spin up time of my Docker Container. Therefore, it would be nice to have a path variable within AWR::Athena()
which points to the file.
I need a high scalable BI app. Until now it takes up to 5 seconds to download the AthenaJDBC41_2.0.7.jar within my Container, which is definitely a long time. If the up app gets terminated, this happen every time.
Trying out the package on Windows 10x64 and I can't initialize a connection using system environment variables. I'm just goofing around right now so I don't have a file with these pre-stored. I installed from github directly. I've replaced my personal info with dummy levels.
Sys.setenv("AWS_ACCESS_KEY_ID"="id",
"AWS_SECRET_ACCESS_KEY"="key")
library(DBI)
con <- dbConnect(AWR.Athena::Athena(), region='us-east1', s3_staging_dir='s3://bucket', schema_name='schema')
Yields this error:
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.athena.jdbc.shaded.com.amazonaws.ClientConfiguration
I am trying now for hours to access my Athena database (which works charmingly via PyAthena in Python).
My setup looks as follows:
AWR.Athena
version 2.0.7.0
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], : java.sql.SQLException: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. The security token included in the request is invalid [Execution ID not available]
Here is my code:
library( DBI )
library( AWR.Athena )
library( aws.iam )
aws.signature::use_credentials( profile = 'research' )
session = get_session_token( id = 'arn:aws:iam::<YourAccountNumber>:mfa/<YourIAMUserName>',
code = 258028,
role = 'arn:aws:iam::<YourAccountNumber>:role/<YourRoleName>',
duration = 3600 )
Sys.setenv( AWS_SESSION_TOKEN = session$SessionToken )
con <- dbConnect( AWR.Athena::Athena(),
region = 'eu-central-1',
S3OutputLocation = 's3://your-output-location/',
Schema = 'default',
Workgroup = '<YourWorkGroup>' )
# Query a list of all tables.
dbListTables( con )
I also tried the dbConnect()
command with aws_credentials_provider_class = "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
, but I had no success, either.
I think this is an important topic as secure access to Athena from R will be needed as soon as a production setting is considered.
For reference: In Python I use the following to set up a connection with Role and MFA using a profile:
conn = connect( s3_staging_dir = 's3://your-output-location/',
region_name = 'eu-central-1',
work_group = '<YourWorkGroup>',
profile_name = '<YourProfileName>' )
Does anyone has an idea on this? Very appreciated :)
So I've been doing some googling about PyAthenaJDBC while trying to triage #16, came across this by @davoscollective:
I have tested with various clients (Tableau, DBeaver, and basic Java app) and retrieving data is a lot slower than it should be. When you run a query with the AWS Athena console, a results csv is written very quickly to S3. I did a test with a single table of 15m rows. The Athena query and csv file to S3 completes in less than 2 minutes. I can then download it to my local machine in less than a minute. To do the same operation via the JDBC driver takes over 2 hours. I am now thinking a better strategy is to query the metadata then start the Athena query asynchronously, poll it until completion and then download the csv file directly from the s3 staging directory and combine with metadata for correct types.
this is just one anecdote, and he is writing about the 1.x version of the driver, but it may be worth exploring.
I am looking for a parameter or an option wherein while executing the query or while creating the connection I can pass the workgroup using which queries should be executed using a particular workgroup. For reference:
https://docs.aws.amazon.com/athena/latest/ug/workgroups.html
Can someone please suggest or advise how this can be accomplished?
I need to frequently switch between multiple AWS accounts for querying Athena.
Right now my workaround is to create multiple environment variables in my .Renviron file and then overwrite the default environment variables like so
Sys.setenv(AWS_SECRET_ACCESS_KEY = Sys.getenv("AWS_SECRET_ACCESS_KEY_2"),
AWS_ACCESS_KEY_ID = Sys.getenv("AWS_ACCESS_KEY_ID_2"))
con <- dbConnect(AWR.Athena::Athena(),
region='eu-west-1',
Schema='my_schema')
It's obviously not a big issue obviously but I'm wondering if there is a way to directly pass the credentials to the driver? I'm guessing the ~/.aws/credentials file is also not an option for me as it will always go for the default one? Happy to help working on a feature if you think it worthwhile, just need some pointers in the right direction
Hi, thanks a lot for the package!
I've tried to play around it recenlty, but I keep getting error:
con <- dbConnect(AWR.Athena::Athena(), region='eu-west-2',
s3_staging_dir='s3://aws-athena-query-results-XXXXX',
schema_name='default')
Error in .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], :
com.amazonaws.athena.jdbc.shaded.com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain
Even though I tried to set my credentials in many different ways. In ~/.aws/credentials
I've got:
AWS_ACCESS_KEY_ID="accessaccess"
AWS_SECRET_ACCESS_KEY="secretsecret"
Similarly in R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "accessaccess",
"AWS_SECRET_ACCESS_KEY" = "secretsecret")
None of above worked for me. Any guesses what could go wrong?
Hi,
i try to update library AWR.athena version 2.0.6-1 but some R commands don't work anymore :
With the oldest version i can execute this command : dbExecute(con, "MSCK REPAIR TABLE db.mytable") without errors
With the new version i got this error message : Unable to retrieve JDBC result set for MSCK REPAIR TABLE db.mytable([Simba]JDBC A ResultSet was expected but not generated from query "MSCK REPAIR TABLE db.mytable". Query not executed. )
I try also dbSendQuery, dbSendStatement but it doesn't work.
i use AthenaJDBC42_2.0.7.jar driver.
Thanks for your help.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.