awsdocs / amazon-athena-user-guide

The open source version of the Amazon Athena documentation. To submit feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.

License: Other

Topics: aws, athena, documentation


amazon-athena-user-guide's People

Contributors

adamchainz, arambadk, cloudingla, gordonhgraham, harshpatel991, honiewelle, iconara, jelder, joshbean, jpeddicord, jsperson, juliamalkin, kanga333, mcekeith, pahtoe, playphil, rajasgujarathi, richiverse, roeezab, sandeepveldi, seiji, seittema, sengelbert, seunggabi, shabe, szymonkochanski, taammann, theghosthucodes, zabullet, zviklausner


amazon-athena-user-guide's Issues

Help wanted: examples of setting up Glue Crawler to work with Athena (folder structure)

Hello dear community,
We know from this documentation https://docs.aws.amazon.com/en_us/athena/latest/ug/glue-best-practices.html#schema-crawlers-data-sources, that:
"When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler. "
Many of us who use Athena with Glue need to know how to set up folder structures so that the integration with the AWS Glue crawler works seamlessly. If you've tackled this and have a recommendation, please share it and help out! Much appreciated.
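For illustration, here is the kind of layout that tends to keep tables separate: one root prefix per table, with each root added as its own crawler data store (the bucket and table names are made up):

s3://my-bucket/tables/orders/year=2022/month=01/part-00000.parquet
s3://my-bucket/tables/orders/year=2022/month=02/part-00000.parquet
s3://my-bucket/tables/customers/year=2022/month=01/part-00000.parquet

Registering s3://my-bucket/tables/orders/ and s3://my-bucket/tables/customers/ as separate data stores helps the crawler create two tables (each partitioned by year and month) instead of merging them into one.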

Getting started is missing needed workgroup information

I may be missing something, but I'm creating this issue anyway so it doesn't get forgotten.

I believe the Getting Started guide is missing a step with the newly launched Workgroup feature:
https://docs.aws.amazon.com/en_pv/athena/latest/ug/getting-started.html

When using Athena for the first time, one now needs to use the Workgroup settings to define an S3 output location, or else any SQL query will fail with a 400 error:

No output location provided. An output location is required either through the Workgroup result configuration setting or as an API input. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: a15349df-ef33-4753-97b7-xxxxx).

I will try to reproduce this with a completely clean account; it might also be a leftover of my account's earlier setup.

Help Wanted – Athena Troubleshooting, Tips, and Tricks

Please share your experience! If you've hit a problem and solved it, I’d love to hear from you. I am also interested in any tips or tricks you’ve learned along the way.

Submit a pull request (instructions here) or send me a note via the Comments below. If your suggestion is likely to help others, I'll add it to the Athena User Guide.

Thank you for your time. I look forward to hearing from you!

Athena CTAS - incremental load

I'd like to be able to use CTAS and incrementally load more partitions into the resulting table.

An example would be:

CREATE TABLE event_data
WITH (
      external_location = 's3://bucket/event_data/',
      format = 'Parquet',
      parquet_compression = 'SNAPPY',
      partitioned_by = ARRAY['year', 'month', 'day'])
AS SELECT stock_item_id, advertiser_id, event, site, channel, referrer, year, month, day
FROM "raw_events"."event_data" 
WHERE year='2018'
AND month='11'
AND day='01'

As more "raw" data comes in, this would give me the ability to add additional partitions via a scheduler.
Is this possible?
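One approach that seems to fit this use case (a sketch, not something the current guide shows): after the initial CTAS creates the partitioned table, subsequent partitions can be appended with INSERT INTO, which a scheduler could run as new raw data arrives:

INSERT INTO event_data
SELECT stock_item_id, advertiser_id, event, site, channel, referrer, year, month, day
FROM "raw_events"."event_data"
WHERE year='2018'
AND month='11'
AND day='02'

As in the CTAS above, the partition columns (year, month, day) must come last in the SELECT list.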

Likely missing permission "glue:GetDatabases"

Description

We are trying to use Connecting to Amazon Athena with ODBC to connect to Amazon DynamoDB via Amazon Athena.

When trying to grant limited permissions to the AWS IAM user, we follow Fine-grained access to databases and tables in the AWS Glue Data Catalog, but we encounter the following error:

Error

FAILED!

[Simba][Athena] (1040) An error has been thrown from the AWS Athena client. Error Message: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:User: AWS_IAM_User_ARN is not authorized to perform: glue:GetDatabases on resource: arn:aws:glue:*:*:catalog because no identity-based policy allows the glue:GetDatabases action (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: ; Proxy: null)) [Execution ID: ]

Proposal

In the documentation, Fine-grained access to databases and tables in the AWS Glue Data Catalog, the glue:GetDatabases permission needs to be listed in addition to the existing permissions.

Can someone help confirm?

Compression formats: Firehose compatibility correction

Currently, compression-formats.md says:

Use the GZIP compression in Athena for querying Amazon Kinesis Data Firehose logs. Athena and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only compatible format.

This seems to be incorrect. We are writing ORC files with Snappy compression from Firehose, and Athena can read them. I'm not sure whether the same applies to Parquet, though.

Querying ALB logs with partition

The guide at https://github.com/awsdocs/amazon-athena-user-guide/blob/main/doc_source/application-load-balancer-logs.md refers to a third-party solution for partitioning a table for ALB logs, which involves data transformation via a custom Lambda function. However, it is possible to add partitioning to the schema in the same manner as the "Querying AWS WAF Logs" guide at https://github.com/awsdocs/amazon-athena-user-guide/blob/main/doc_source/waf-logs.md describes:

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
  ...
)

PARTITIONED BY
(
 day STRING
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  ...
)
LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'

TBLPROPERTIES
(
 "projection.enabled" = "true",
 "projection.day.type" = "date",
 "projection.day.range" = "2022/01/01,NOW",
 "projection.day.format" = "yyyy/MM/dd",
 "projection.day.interval" = "1",
 "projection.day.interval.unit" = "DAYS",
 "storage.location.template" = "s3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${day}"
)

In this query, the PARTITIONED BY and TBLPROPERTIES clauses are added to the "create table" query from the guide; the rest is the same. Querying the table with, for example, select * from alb_logs where day = '2022/04/08' then works as expected.

Is there a reason this is not mentioned in the guide? Why does it refer to a more complicated solution that involves a Lambda function?

Thanks!

Use Partition Projection in the AWS Service Logs examples

The examples in "Querying AWS Service Logs" either create a non-partitioned table or use manual partitioning (e.g. CloudTrail and VPC flow logs). It would be great to have copy-pasteable templates for AWS service logs with commonly used partitions (region, date, ELB name/prefix).
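As a rough sketch of the kind of template being requested (the bucket, path, region list, and column list below are placeholders, following the same partition projection pattern as the ALB example above):

CREATE EXTERNAL TABLE service_logs (
  ...
)
PARTITIONED BY (region STRING, day STRING)
LOCATION 's3://your-log-bucket/AWSLogs/<ACCOUNT-ID>/service-name/'
TBLPROPERTIES (
 "projection.enabled" = "true",
 "projection.region.type" = "enum",
 "projection.region.values" = "us-east-1,eu-west-1",
 "projection.day.type" = "date",
 "projection.day.range" = "2022/01/01,NOW",
 "projection.day.format" = "yyyy/MM/dd",
 "projection.day.interval" = "1",
 "projection.day.interval.unit" = "DAYS",
 "storage.location.template" = "s3://your-log-bucket/AWSLogs/<ACCOUNT-ID>/service-name/${region}/${day}"
)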

Upper limit on the number of partitions in a single "DROP PARTITION" request?

When I try dropping several partitions (= 35) in a single query, the operation fails with the following error:

Error message:
FAILED: SemanticException [Error 10006]: Partition not found

The drop request does have the "IF EXISTS" clause in it.
ALTER TABLE alpha_events_tracker DROP IF EXISTS PARTITION

Sending a smaller number of partitions in a single request succeeds.

Sample Query Id: e7ef9cf5-46a4-4c62-8695-f0f8506f8d3a
What could be the reason? Is there an upper limit on the bytes that can be sent in a single request?
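For context, the full form of the kind of multi-partition drop described above looks roughly like this (the partition column and values are made up):

ALTER TABLE alpha_events_tracker DROP IF EXISTS
  PARTITION (dt='2020-01-01'),
  PARTITION (dt='2020-01-02'),
  PARTITION (dt='2020-01-03')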

Using PrivateLink endpoints with ODBC

When using ODBC, the documentation mentions that you need to make sure that port 444 is open outbound for QueryResultStreaming. When connecting to Athena using a PrivateLink endpoint, the security group attached to that PrivateLink endpoint also needs to allow port 444 inbound, or the query will fail with a "Connection Reset" error. I would recommend mentioning the need for the inbound rule on the PrivateLink endpoint on the JDBC page:

https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html

Typo in documentation for setting up ELB

doc_source/elasticloadbalancer-classic-logs.md

input.regex has an extra character in the documentation. I spent a couple of hours troubleshooting the problem. Please remove the extra character.

Actual value for the input.regex:
'input.regex' = '([^ ]) ([^ ]) ([^ ]):([0-9]) ([^ ]):- ([-.0-9]) ([-.0-9]) ([-.0-9]) (|[-0-9]) (-|[-0-9]) ([-0-9]) ([-0-9]) \"([^ ]) ([^ ]) (- |[^ ])\" ("[^\"]") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )

It should be the following, without the extra ")" at the end of the value:
'input.regex' = '([^ ]) ([^ ]) ([^ ]):([0-9]) ([^ ]):- ([-.0-9]) ([-.0-9]) ([-.0-9]) (|[-0-9]) (-|[-0-9]) ([-0-9]) ([-0-9]) \"([^ ]) ([^ ]) (- |[^ ])\" ("[^\"]") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$'

WAF logs partitioning doesn't match Firehose partitioning scheme

It says here that Kinesis Data Firehose stores data in Amazon S3 in the following path format:

s3://bucket/folder/yyyy/MM/dd/HH/file.extension

This coincides with my experience.

Correct me if I am wrong, but this seems to be the default partitioning scheme for Firehose, and Firehose is the default (the only?) way for WAF logs to be ingested into S3. So, if my assumptions are correct, the partitioning scheme in the CREATE EXTERNAL TABLE query should match the Kinesis Firehose example:

CREATE EXTERNAL TABLE my_table
(
...
)
...
PARTITIONED BY
(
 datehour STRING
)
LOCATION "s3://bucket/table-name/"
TBLPROPERTIES
(
 "projection.enabled" = "true",
 "projection.datehour.type" = "date",
 "projection.datehour.range" = "2018/01/01/00,NOW",
 "projection.datehour.format" = "yyyy/MM/dd/HH",
 "projection.datehour.interval" = "1",
 "projection.datehour.interval.unit" = "HOURS",
 "storage.location.template" = "s3://bucket/table-name/${datehour}"
)

ST_Point does not work with WKT

According to the list of supported geospatial functions, you can instantiate a point with the ST_POINT function either from a pair of doubles or from a WKT string.

However, when I try to run the example,

SELECT ST_POINT('point (-74.006801 40.705220)');

I get the following error:

An error has been thrown from the AWS Athena client. SYNTAX_ERROR: line 1:8: Unexpected parameters (varchar(28)) for function st_point. Expected: st_point(double, double)

On the other hand, if I try to run the same with ST_POLYGON,

SELECT ST_POLYGON('point (-74.006801 40.705220)');

it seems to work fine.

So I don't know whether this is a problem in the documentation or Athena is not behaving as expected.
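For what it's worth, the two-argument form that the error message expects does run (same coordinates as in the example, passed as doubles):

SELECT ST_POINT(-74.006801, 40.705220)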

Querying Classic Load Balancer Logs: 504 log entries are omitted

When following the instructions for Querying Classic Load Balancer Logs, log entries with a 504 elb_response_code are not included in the table. This is because the single log field backend:port is split on the colon character into two table fields (backend_ip and backend_port), but that field does not always contain a colon.

From the documentation:

Field: backend:port
Description: The IP address and port of the registered instance that processed this request.

If the load balancer can't send the request to a registered instance, or if the instance closes the connection before a response can be sent, this value is set to -.

This value can also be set to - if the registered instance does not respond before the idle timeout.

So, for requests where the ELB responds with a 504, the log entries are not included in the table: the backend:port field does not contain a colon and therefore does not match the regular expression.

ST_Point does not work with column names

I have the columns a_latitude, a_longitude, b_latitude, and b_longitude, where each row represents the locations of a and b. It is not obvious to me how to pass these columns to the functions provided by Athena to calculate ST_Distance.

Here is what I tried:

ST_Distance(ST_Point(a_latitude, a_longitude), ST_Point(b_latitude, b_longitude))

Here is my error:

Unexpected parameters (varchar, varchar) for function st_point. Expected: st_point(double, double)

How can I pass columns to ST_Distance?
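A minimal sketch of one possible workaround, assuming the columns are stored as strings (which the varchar error suggests): cast them to double before calling ST_Point. Note that ST_Point takes (x, y), so for geographic coordinates the longitude goes first; the table name my_table is a placeholder:

SELECT ST_Distance(
  ST_Point(CAST(a_longitude AS double), CAST(a_latitude AS double)),
  ST_Point(CAST(b_longitude AS double), CAST(b_latitude AS double)))
FROM my_table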

Unable to run the example at geospatial-example-queries

https://docs.aws.amazon.com/athena/latest/ug/geospatial-example-queries.html

I create the two tables and then run the query, but it fails with the following error:

Your query has the following error(s):

SYNTAX_ERROR: line 5:7: Unexpected parameters (varbinary, Geometry) for function st_contains. Expected: st_contains(Geometry, Geometry) , st_contains(varchar, varchar) , st_contains(Geometry, varchar) , st_contains(varchar, Geometry)

This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 0b7713dd-77db-45aa-b7cc-5d5e83017245.

I will do some debugging, but I imagine this to be a bug in the documentation of the example.

Thanks!

[docs] Add port open guide in JDBC & ODBC

I connected Tableau to the Athena connector using the JDBC drivers. To make this work, I had to open ports 443 and 444 for Athena, Glue, and STS.

So, I added a guide about this. If my wording is unclear, please feel free to edit my changes for readability.

Documentation misrepresents syntax for bucketed_by & partitioned_by in CREATE TABLE AS

Extant documentation looks like this:

partitioned_by = ARRAY( [col_name,…])
Optional. An array list of columns by which the CTAS table will be partitioned. Verify that the names of partitioned columns are listed last in the list of columns in the SELECT statement.

bucketed_by( [bucket_name,…])
An array list of buckets to bucket data. If omitted, Athena does not bucket your data in this query.

But in fact the syntax for the partitioned_by and bucketed_by clauses is the same, and neither uses parentheses; both take a bracketed array of column names (ARRAY['col_name', ...]).
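For illustration, a CTAS using both clauses with the bracket syntax might look like the following (the table, location, and column names are made up; bucket_count accompanies bucketed_by, and the partition column comes last in the SELECT):

CREATE TABLE my_ctas_table
WITH (
      external_location = 's3://bucket/my_ctas_table/',
      format = 'PARQUET',
      partitioned_by = ARRAY['dt'],
      bucketed_by = ARRAY['user_id'],
      bucket_count = 10)
AS SELECT user_id, event, dt
FROM source_table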
