awsdocs / amazon-athena-user-guide

The open source version of the Amazon Athena documentation. To submit feedback & requests for changes, submit issues in this repository, or make proposed changes & submit a pull request.

License: Other

Topics: aws, athena, documentation


amazon-athena-user-guide's People

Contributors

adamchainz, arambadk, cloudingla, gordonhgraham, harshpatel991, honiewelle, iconara, jelder, joshbean, jpeddicord, jsperson, juliamalkin, kanga333, mcekeith, pahtoe, playphil, rajasgujarathi, richiverse, roeezab, sandeepveldi, seiji, seittema, sengelbert, seunggabi, shabe, szymonkochanski, taammann, theghosthucodes, zabullet, zviklausner


amazon-athena-user-guide's Issues

Help wanted: examples of setting up Glue Crawler to work with Athena (folder structure)

Hello dear community,
We know from this documentation https://docs.aws.amazon.com/en_us/athena/latest/ug/glue-best-practices.html#schema-crawlers-data-sources, that:
"When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler. "
Many of us who use Athena with Glue need to know how to set up folder structures so that the integration with the AWS Glue crawler works seamlessly. If you've tackled this and have a recommendation, please share it and help out! Much appreciated.
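For illustration, here is the kind of layout that tends to keep tables separate: one root prefix per table, with each root added as its own crawler data store (the bucket and table names are made up):

s3://my-bucket/tables/orders/year=2022/month=01/part-00000.parquet
s3://my-bucket/tables/orders/year=2022/month=02/part-00000.parquet
s3://my-bucket/tables/customers/year=2022/month=01/part-00000.parquet

Registering s3://my-bucket/tables/orders/ and s3://my-bucket/tables/customers/ as separate data stores helps the crawler create two tables (each partitioned by year and month) instead of merging them into one.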

Getting started is missing needed workgroup information

I may be missing something, but I'm creating this issue anyway so it doesn't get forgotten.

I believe the Getting Started guide is missing a step with the newly launched Workgroup feature:
https://docs.aws.amazon.com/en_pv/athena/latest/ug/getting-started.html

When using Athena for the first time, one now needs to use the Workgroup settings to define an S3 output location, or else any SQL query will fail with a 400 error:

No output location provided. An output location is required either through the Workgroup result configuration setting or as an API input. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: a15349df-ef33-4753-97b7-xxxxx).

I will try to reproduce this with a completely clean account; it might also be a leftover of my account's earlier setup.

Help Wanted – Athena Troubleshooting, Tips, and Tricks

Please share your experience! If you've hit a problem and solved it, I’d love to hear from you. I am also interested in any tips or tricks you’ve learned along the way.

Submit a pull request (instructions here) or send me a note via the Comments below. If your suggestion is likely to help others, I'll add it to the Athena User Guide.

Thank you for your time. I look forward to hearing from you!

Athena CTAS - incremental load

I'd like to be able to use CTAS and incrementally load more partitions into the resulting table.

An example would be:

CREATE TABLE event_data
WITH (
      external_location = 's3://bucket/event_data/',
      format = 'Parquet',
      parquet_compression = 'SNAPPY',
      partitioned_by = ARRAY['year', 'month', 'day'])
AS SELECT stock_item_id, advertiser_id, event, site, channel, referrer, year, month, day
FROM "raw_events"."event_data" 
WHERE year='2018'
AND month='11'
AND day='01'

As more "raw" data comes in, this would give me the ability to add additional partitions via a scheduler.
Is this possible?
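One approach that seems to fit this use case (a sketch, not something the current guide shows): after the initial CTAS creates the partitioned table, subsequent partitions can be appended with INSERT INTO, which a scheduler could run as new raw data arrives:

INSERT INTO event_data
SELECT stock_item_id, advertiser_id, event, site, channel, referrer, year, month, day
FROM "raw_events"."event_data"
WHERE year='2018'
AND month='11'
AND day='02'

As in the CTAS above, the partition columns (year, month, day) must come last in the SELECT list.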

Likely missing permission "glue:GetDatabases"

Description

We are trying to use Connecting to Amazon Athena with ODBC to connect to Amazon DynamoDB via Amazon Athena.

When trying to grant limited permissions to the AWS IAM user, we follow Fine-grained access to databases and tables in the AWS Glue Data Catalog, but we encounter the following error:

Error

FAILED!

[Simba][Athena] (1040) An error has been thrown from the AWS Athena client. Error Message: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:User: AWS_IAM_User_ARN is not authorized to perform: glue:GetDatabases on resource: arn:aws:glue:*:*:catalog because no identity-based policy allows the glue:GetDatabases action (Service: AmazonDataCatalog; Status Code: 400; Error Code: AccessDeniedException; Request ID: ; Proxy: null)) [Execution ID: ]

Proposal

In the documentation, Fine-grained access to databases and tables in the AWS Glue Data Catalog, the glue:GetDatabases permission needs to be listed in addition to the existing permissions.

Can someone help confirm?

Compression formats: Firehose compatibility correction

Currently, compression-formats.md says:

Use the GZIP compression in Athena for querying Amazon Kinesis Data Firehose logs. Athena and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only compatible format.

This seems to be incorrect. We are writing ORC files with Snappy compression from Firehose, and Athena can read them. I'm not sure whether the same applies to Parquet, though.

Querying ALB logs with partition

The guide at https://github.com/awsdocs/amazon-athena-user-guide/blob/main/doc_source/application-load-balancer-logs.md refers to a third-party solution for partitioning a table for ALB logs, which involves data transformation via a custom Lambda function. However, it is possible to add partitioning to the schema in the same manner as the "Querying AWS WAF Logs" guide at https://github.com/awsdocs/amazon-athena-user-guide/blob/main/doc_source/waf-logs.md describes:

CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
  ...
)

PARTITIONED BY
(
 day STRING
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  ...
)
LOCATION 's3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/'

TBLPROPERTIES
(
 "projection.enabled" = "true",
 "projection.day.type" = "date",
 "projection.day.range" = "2022/01/01,NOW",
 "projection.day.format" = "yyyy/MM/dd",
 "projection.day.interval" = "1",
 "projection.day.interval.unit" = "DAYS",
 "storage.location.template" = "s3://your-alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/${day}"
)

In this query, the PARTITIONED BY and TBLPROPERTIES clauses are added to the "create table" query from the guide; the rest is the same. Querying the table with, for example, select * from alb_logs where day = '2022/04/08' then works as expected.

Is there a reason this is not mentioned in the guide? Why does it refer to a more complicated solution that involves a Lambda function?

Thanks!

Use Partition Projection in the AWS Service Logs examples

The examples in "Querying AWS Service Logs" either create a non-partitioned table or use manual partitioning (e.g. CloudTrail and VPC flow logs). It would be great to have copy-pasteable templates for AWS service logs with commonly used partitions (region, date, ELB name/prefix).
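As a rough sketch of the kind of template being requested (the bucket, path, region list, and column list below are placeholders, following the same partition projection pattern as the ALB example above):

CREATE EXTERNAL TABLE service_logs (
  ...
)
PARTITIONED BY (region STRING, day STRING)
LOCATION 's3://your-log-bucket/AWSLogs/<ACCOUNT-ID>/service-name/'
TBLPROPERTIES (
 "projection.enabled" = "true",
 "projection.region.type" = "enum",
 "projection.region.values" = "us-east-1,eu-west-1",
 "projection.day.type" = "date",
 "projection.day.range" = "2022/01/01,NOW",
 "projection.day.format" = "yyyy/MM/dd",
 "projection.day.interval" = "1",
 "projection.day.interval.unit" = "DAYS",
 "storage.location.template" = "s3://your-log-bucket/AWSLogs/<ACCOUNT-ID>/service-name/${region}/${day}"
)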

Upper limit on the number of partitions in a single "DROP PARTITION" request?

When I try dropping several partitions (= 35) in a single query, the operation fails with the following error:

Error message:
FAILED: SemanticException [Error 10006]: Partition not found

The drop request does have the "IF EXISTS" clause in it.
ALTER TABLE alpha_events_tracker DROP IF EXISTS PARTITION

Sending a smaller number of partitions in a single request succeeds.

Sample Query Id: e7ef9cf5-46a4-4c62-8695-f0f8506f8d3a
What could be the reason? Is there an upper limit on the bytes that can be sent in a single request?
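For context, the full form of the kind of multi-partition drop described above looks roughly like this (the partition column and values are made up):

ALTER TABLE alpha_events_tracker DROP IF EXISTS
  PARTITION (dt='2020-01-01'),
  PARTITION (dt='2020-01-02'),
  PARTITION (dt='2020-01-03')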

Using PrivateLink endpoints with ODBC

When using ODBC, the documentation mentions that you need to make sure that port 444 is open outbound for QueryResultStreaming. When connecting to Athena using a PrivateLink endpoint, the security group attached to that PrivateLink endpoint also needs to allow port 444 inbound, or the query will fail with a "Connection Reset" error. I would recommend mentioning the need for the inbound rule on the PrivateLink endpoint on the JDBC page:

https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html

Typo in documentation for setting up ELB

doc_source/elasticloadbalancer-classic-logs.md

input.regex has an extra character in the documentation. I spent a couple of hours troubleshooting the problem. Please remove the extra character.

Actual value for the input.regex:
'input.regex' = '([^ ]) ([^ ]) ([^ ]):([0-9]) ([^ ]):- ([-.0-9]) ([-.0-9]) ([-.0-9]) (|[-0-9]) (-|[-0-9]) ([-0-9]) ([-0-9]) \"([^ ]) ([^ ]) (- |[^ ])\" ("[^\"]") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )

It should be the following, without the extra ")" at the end of the value:
'input.regex' = '([^ ]) ([^ ]) ([^ ]):([0-9]) ([^ ]):- ([-.0-9]) ([-.0-9]) ([-.0-9]) (|[-0-9]) (-|[-0-9]) ([-0-9]) ([-0-9]) \"([^ ]) ([^ ]) (- |[^ ])\" ("[^\"]") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$'

WAF logs partitioning doesn't match Firehose partitioning scheme

It says here that Kinesis Data Firehose stores data in Amazon S3 in the following path format:

s3://bucket/folder/yyyy/MM/dd/HH/file.extension

This coincides with my experience.

Correct me if I am wrong, but this seems to be the default partitioning scheme for Firehose, and Firehose is the default (the only?) way for WAF logs to be ingested into S3. So, if my assumptions are correct, the partitioning scheme in the CREATE EXTERNAL TABLE query should match the Kinesis Firehose example:

CREATE EXTERNAL TABLE my_table
(
...
)
...
PARTITIONED BY
(
 datehour STRING
)
LOCATION "s3://bucket/table-name/"
TBLPROPERTIES
(
 "projection.enabled" = "true",
 "projection.datehour.type" = "date",
 "projection.datehour.range" = "2018/01/01/00,NOW",
 "projection.datehour.format" = "yyyy/MM/dd/HH",
 "projection.datehour.interval" = "1",
 "projection.datehour.interval.unit" = "HOURS",
 "storage.location.template" = "s3://bucket/table-name/${datehour}"
)

ST_Point does not work with WKT

According to the list of supported geospatial functions, you can instantiate a point with the ST_POINT function either from a pair of doubles or from a WKT string.

However, when I try to run the example,

SELECT ST_POINT('point (-74.006801 40.705220)');

I get the following error:

An error has been thrown from the AWS Athena client. SYNTAX_ERROR: line 1:8: Unexpected parameters (varchar(28)) for function st_point. Expected: st_point(double, double)

On the other hand, if I try to run the same with ST_POLYGON,

SELECT ST_POLYGON('point (-74.006801 40.705220)');

it seems to work fine.

So I don't know whether this is a problem in the documentation or Athena is not behaving as expected.
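For what it's worth, the two-argument form that the error message expects does run (same coordinates as in the example, passed as doubles):

SELECT ST_POINT(-74.006801, 40.705220)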

Querying Classic Load Balancer Logs: 504 log entries are omitted

When following the instructions for Querying Classic Load Balancer Logs, log entries with a 504 elb_response_code are not included in the table. This is because the single log field backend:port is split on the colon character into two table fields (backend_ip and backend_port), but that field does not always contain a colon.

From the documentation:

Field: backend:port
Description: The IP address and port of the registered instance that processed this request.

If the load balancer can't send the request to a registered instance, or if the instance closes the connection before a response can be sent, this value is set to -.

This value can also be set to - if the registered instance does not respond before the idle timeout.

So, for requests where the ELB responds with a 504, the log entries are not included in the table: the backend:port field does not contain a colon and therefore does not match the regular expression.

ST_Point does not work with column names

I have the columns a_latitude, a_longitude, b_latitude, and b_longitude, where each row represents the locations of a and b. It is not obvious to me how to pass these columns to the functions provided by Athena to calculate ST_Distance.

Here is what I tried:

ST_Distance(ST_Point(a_latitude, a_longitude), ST_Point(b_latitude, b_longitude))

Here is my error:

Unexpected parameters (varchar, varchar) for function st_point. Expected: st_point(double, double)

How can I pass columns to ST_Distance?
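A minimal sketch of one possible workaround, assuming the columns are stored as strings (which the varchar error suggests): cast them to double before calling ST_Point. Note that ST_Point takes (x, y), so for geographic coordinates the longitude goes first; the table name my_table is a placeholder:

SELECT ST_Distance(
  ST_Point(CAST(a_longitude AS double), CAST(a_latitude AS double)),
  ST_Point(CAST(b_longitude AS double), CAST(b_latitude AS double)))
FROM my_table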

Unable to run the example at geospatial-example-queries

https://docs.aws.amazon.com/athena/latest/ug/geospatial-example-queries.html

I create the two tables and then run the query, but it fails with the following error:

Your query has the following error(s):

SYNTAX_ERROR: line 5:7: Unexpected parameters (varbinary, Geometry) for function st_contains. Expected: st_contains(Geometry, Geometry) , st_contains(varchar, varchar) , st_contains(Geometry, varchar) , st_contains(varchar, Geometry)

This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 0b7713dd-77db-45aa-b7cc-5d5e83017245.

I will do some debugging, but I imagine this to be a bug in the documentation of the example.

Thanks!

[docs] Add port open guide in JDBC & ODBC

I connected Tableau to the Athena connector using the JDBC drivers. To make this work, I had to open ports 443 and 444 for Athena, Glue, and STS.

So, I added a guide about this. If my wording is unclear, please feel free to edit my changes for readability.

Documentation misrepresents syntax for bucketed_by & partitioned_by in CREATE TABLE AS

Extant documentation looks like this:

partitioned_by = ARRAY( [col_name,…])
Optional. An array list of columns by which the CTAS table will be partitioned. Verify that the names of partitioned columns are listed last in the list of columns in the SELECT statement.

bucketed_by( [bucket_name,…])
An array list of buckets to bucket data. If omitted, Athena does not bucket your data in this query.

But in fact the syntax for the partitioned_by and bucketed_by clauses is the same, and neither uses parentheses; both take a bracketed array of column names (ARRAY['col_name', ...]).
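For illustration, a CTAS using both clauses with the bracket syntax might look like the following (the table, location, and column names are made up; bucket_count accompanies bucketed_by, and the partition column comes last in the SELECT):

CREATE TABLE my_ctas_table
WITH (
      external_location = 's3://bucket/my_ctas_table/',
      format = 'PARQUET',
      partitioned_by = ARRAY['dt'],
      bucketed_by = ARRAY['user_id'],
      bucket_count = 10)
AS SELECT user_id, event, dt
FROM source_table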
