dbt-athena's Issues

Source with different database than target database not working

Hi, I am wondering whether the database property for sources is supported by dbt-athena. After setting the database property in my sources and running dbt run, it tries to build tables using my target database and not the source database. I hope you can help with this; I have attached my sources file. Thanks!

To select from the source I call:
from {{source('fleetdog', 'raw_public_trips')}}

My source is configured like this:

version:  2

sources:
  - name: fleetdog
    database: dataalpha
    quoting:
      database: true
    tables:
      - name: raw_public_trips

Error:

07:41:39  Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
07:41:39    SYNTAX_ERROR: line 15:6: Table awsdatacatalog.fleetdog.raw_public_trips does not exist. 

Zero-downtime for table recalculation within dbt run

Hi
I'm sorry if I'm missing something - I'm new to dbt and just trying to switch our project from custom data model management to dbt.

Does dbt-athena support recalculating a table without downtime while it is being rebuilt?

The workaround I used in my own script: give every CTAS a unique name (i.e. <name>_<random>) and keep a helper view with a fixed name (<name>) whose query is simply select * from <name>_<random>. Once a new version of the CTAS is calculated, I switch the view to the new version and drop the old one. All usages reference only the view, and the whole thing works without downtime.

Will appreciate your thoughts and comments!
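
For reference, a minimal sketch of that view-swap pattern using pyathena directly; the connection settings, table, and view names below are placeholders, and this is not what dbt-athena generates today:

import uuid
from pyathena import connect

# Placeholder connection settings; adjust to your environment.
cursor = connect(
    s3_staging_dir="s3://my-athena-results/",
    region_name="eu-west-1",
).cursor()

suffix = uuid.uuid4().hex[:8]
new_table = f"analytics.my_model_{suffix}"

# 1. Build the new version of the table under a unique name.
cursor.execute(
    f"CREATE TABLE {new_table} WITH (format = 'PARQUET') AS "
    "SELECT * FROM analytics.source_data"
)

# 2. Atomically repoint the stable view at the new table.
cursor.execute(f"CREATE OR REPLACE VIEW analytics.my_model AS SELECT * FROM {new_table}")

# 3. Drop the previous table (tracked elsewhere) once nothing references it.
# cursor.execute("DROP TABLE IF EXISTS analytics.my_model_<old_suffix>")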

How to pass IAM credentials?

Is there any possibility for dbt/dbt-athena to run using an IAM role instead of the aws_profile_name setting (which requires an access/secret key)? The dbt/dbt-athena we are running is inside a dagster @op with proper kube2iam annotations. However, the invocation of dbt using context.resources.dbt.run(vars=payload) from dagster doesn't seem to pick up the associated IAM role (we are sure the role works, as we've tested it directly with PyAthena queries and simple S3 actions at the Dagster @op level). It just doesn't work when we invoke dbt/dbt-athena from the @op ... Ideas?
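
For what it's worth, if no access keys and no aws_profile_name are supplied, pyathena/boto3 should fall back to the default credential chain (environment variables, shared config, then the instance/pod role that kube2iam provides). A quick sanity check from inside the @op, with placeholder settings, could look like:

from pyathena import connect

# No explicit keys and no profile: boto3's default credential chain is used,
# which is where the kube2iam-provided role should be picked up.
cursor = connect(
    s3_staging_dir="s3://my-athena-results/",  # placeholder
    region_name="eu-west-1",                   # placeholder
).cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())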

Compression algorithm '' is not supported for file format 'parquet'

We're getting an InvalidRequestException when specifying format 'parquet' without specifying the compression algorithm, in which case the default compression algorithm should be used (gzip in the case of parquet).

Exception:

botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Compression algorithm '' is not supported for file format 'parquet'.

It seems that in create_table_as.sql no default for write_compression is defined, which results in an empty string '' (instead of not specifying write_compression at all, which would let the default, gzip, be used).
Should be fixed by this PR

Stacktrace:

Failed to execute query.
Traceback (most recent call last):
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/pyathena/common.py", line 307, in _execute
    query_id = retry_api_call(
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/pyathena/util.py", line 84, in retry_api_call
    return retry(func, *args, **kwargs)
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/tenacity/__init__.py", line 409, in call
    do = self.iter(retry_state=retry_state)
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/tenacity/__init__.py", line 356, in iter
    return fut.result()
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/tenacity/__init__.py", line 412, in call
    result = fn(*args, **kwargs)
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Compression algorithm '' is not supported for file format 'parquet'.

dbt generate docs hangs when database has > 100 tables

While adopting dbt-athena we noticed that for some of our databases dbt docs generate hangs indefinitely.

After some investigation we found that it happens when using the athena__get_catalog macro for a db with more than 100 tables. This is not a dbt-athena issue itself, but rather an AWS Athena internal issue. We ended up talking with AWS support, and they mentioned that they are working on an issue most likely related to this, but have no timeline for when it's going to be fixed.

In the meantime, the workaround that we were advised to use is to also filter by tables. After some manual testing we found that splitting the catalog query to filter by databases AND tables (in batches of 100) solves the issue. Sample of the query:

select *
from (
		(
			with tables as (
				...
			),
			columns as (
				...
			),
			catalog as (
				...
			)
			select *
			from catalog
			where lower("table_schema") = lower('very_large_database')
				and (
					"table_name" = lower('table_1')
					or "table_name" = lower('table_2')
                                         ....
					or "table_name" = lower('table_100')
					)
			union all
			select *
			from catalog
			where lower("table_schema") = lower('very_large_database')
				and (
					"table_name" = lower('table_101')
					or "table_name" = lower('table_102')
                                         ....
					or "table_name" = lower('table_200')
					)
.....
	)

These tables can be retrieved from the manifest as part of the adapter get_catalog().
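
A rough sketch of how that batching could look on the Python side, assuming the table names have already been pulled from the manifest (names and batch size are illustrative):

def batched_table_filters(schema, tables, batch_size=100):
    """Build the unioned catalog filters, one branch per batch of table names."""
    clauses = []
    for i in range(0, len(tables), batch_size):
        batch = tables[i:i + batch_size]
        table_filter = " or ".join(f"\"table_name\" = lower('{t}')" for t in batch)
        clauses.append(
            "select * from catalog\n"
            f"where lower(\"table_schema\") = lower('{schema}')\n"
            f"  and ({table_filter})"
        )
    return "\nunion all\n".join(clauses)

# Example: batched_table_filters("very_large_database", ["table_1", "table_2"])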

AWS profile not respected by boto3 calls

If an AWS profile is configured, it is passed to pyathena as expected, but all direct calls to boto3 ignore it. This issue got exacerbated by 83f7756, since boto3 is now going to be called on every dbt run.

Introducing tagged releases and uploading them to pypi

Hi @Tomme !

Great work on the adapter, it works like a charm and it's awesome to see you're continuing to add new features!

However, the non-standard way of installing your adapter directly from GitHub makes deterministic and verified installs of the adapter a little cumbersome, and makes us hesitant to bring it into production, even though our Analytics Engineers would love to use dbt to build and maintain Athena tables.

I'm wondering, do you have plans to introduce tagged releases and publish the package to pypi?
This would make using and integrating the adapter into our Python environments much easier, since users can easily specify pinned sets of Python dependencies and test against them, or revert to older version sets if things no longer work.

It seems the pypi package that @lsaletti published under the name dbt-athena has been abandoned, and its git repository has been deleted; however, the info page on pypi links to your project with a recommendation to use it.
Have the two of you been in contact or talked about a potential transfer of the pypi project?

Even if a transfer of the pypi project is not possible or desired, I believe most users would be totally fine with installing your adapter using a new name, if it becomes possible to install versioned releases from pypi. For example dbt-athena-adapter is still free as of now.

Please let me know if I can support in any way.

Assume lowercase database object names

From https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html

Athena accepts mixed case in DDL and DML queries, but lower cases the names when it executes the query.

Macros athena__list_relations_without_caching and athena__get_columns_in_relation wrap references to information_schema.table_schema and information_schema.table_name in LOWER(), which I think is unnecessary.

In my testing, the queries executed by these macros perform significantly faster when LOWER() is removed from the information_schema references.

Macro                            Seconds with LOWER()    Seconds without LOWER()
list_relations_without_caching   29.6                    12.3
get_columns_in_relation          17.5                    0.5

Support dbt 1.0

Hello, dbt-core 1.0 has been rolled out and I'd like to make this adapter compatible.

I'm starting this issue to document what should be done for v1 readiness. Do not hesitate to comment, I can update this issue.

I'm wondering if we should create a branch v1.0 to ease testing?

Failed on the second run

Hi,

This is my model sql.

{{
    config(
        materialized="view",
    )
}}

SELECT org_feature_name
FROM test.org_feature_v1
GROUP BY org_feature_name;

The first run works.
However, on the second run, it fails to replace the view, with this log.
Am I missing something?

19:24:08  Running with dbt=1.0.3
19:24:08  Found 1 model, 0 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
19:24:08  
19:24:14  Concurrency: 1 threads (target='dev')
19:24:14  
19:24:14  1 of 1 START view model ft_analytics_dev.org_feature_name....................... [RUN]
19:24:14  1 of 1 ERROR creating view model ft_analytics_dev.org_feature_name.............. [ERROR in 0.02s]
19:24:15  
19:24:15  Finished running 1 view model in 6.52s.
19:24:15  
19:24:15  Completed with 1 error and 0 warnings:
19:24:15  
19:24:15  Compilation Error in model org_feature_name (models/exp/org_feature_name.sql)
19:24:15    When searching for a relation, dbt found an approximate match. Instead of guessing 
19:24:15    which relation to use, dbt will move on. Please delete ft_analytics_dev.org_feature_name, or rename it to be less ambiguous.
19:24:15    Searched for: ft_analytics_dev.org_feature_name
19:24:15    Found: ft_analytics_dev.org_feature_name
19:24:15    
19:24:15    > in macro create_or_replace_view (macros/materializations/models/view/create_or_replace_view.sql)
19:24:15    > called by macro materialization_view_athena (macros/materializations/models/view/view.sql)
19:24:15    > called by model org_feature_name (models/exp/org_feature_name.sql)
19:24:15  
19:24:15  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

dbt found more than one package

Hi everyone, first of all, Thank you so much for developing this plugin to connect DBT with Athena.
I am facing an issue which says -

Compilation Error
DBT found more than one package with the name "my_first_project" included in this project. Package names must be unique in a project. Please rename one of these packages.

I have checked the entire project and I have only mentioned this name once.

Does anyone know why I am facing this issue, or what part of the setup I am missing?

Thanks in advance!

Seeds fail to update when already applied

If seeds have already been applied, then subsequent runs fail with an unhelpful error.

The error that DBT shows is

14:51:05  1 of 4 START seed file dbt_test.cohorts ........................................ [RUN]
14:51:06  1 of 4 ERROR loading seed file dbt_test.cohorts ................................ [ERROR in 0.28s]
14:51:06  2 of 4 START seed file dbt_test.diagnosis ...................................... [RUN]
14:51:06  2 of 4 ERROR loading seed file dbt_test.diagnosis .............................. [ERROR in 0.15s]
14:51:06  3 of 4 START seed file dbt_test.parameters ..................................... [RUN]
14:51:06  3 of 4 ERROR loading seed file dbt_test.parameters ............................. [ERROR in 0.13s]
14:51:06  4 of 4 START seed file dbt_test.valid_ranges ................................... [RUN]
14:51:06  4 of 4 ERROR loading seed file dbt_test.valid_ranges ........................... [ERROR in 0.13s]
14:51:06  
14:51:06  Finished running 4 seeds in 12.70s.
14:51:06  
14:51:06  Completed with 4 errors and 0 warnings:
14:51:06  
14:51:06  Runtime Error in seed cohorts (seeds/meta/cohorts.csv)
14:51:06    local variable 'table' referenced before assignment
14:51:06  
14:51:06  Runtime Error in seed diagnosis (seeds/meta/diagnosis.csv)
14:51:06    local variable 'table' referenced before assignment
14:51:06  
14:51:06  Runtime Error in seed parameters (seeds/meta/parameters.csv)
14:51:06    local variable 'table' referenced before assignment
14:51:06  
14:51:06  Runtime Error in seed valid_ranges (seeds/meta/valid_ranges.csv)
14:51:06    local variable 'table' referenced before assignment

and if you dig into the logs you can find

13:42:45.927212 [info ] [Thread-1  ]: 1 of 4 START seed file dbt.cohorts ............................................. [RUN]
13:42:45.927659 [debug] [Thread-1  ]: Acquiring new athena connection "seed.emr.cohorts"
13:42:45.928175 [debug] [Thread-1  ]: Began compiling node seed.emr.cohorts
13:42:45.928367 [debug] [Thread-1  ]: finished collecting timing info
13:42:45.928537 [debug] [Thread-1  ]: Began executing node seed.emr.cohorts
13:42:45.959518 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
13:42:46.214376 [debug] [Thread-1  ]: Athena adapter: Error running SQL: macro drop_relation
13:42:46.214740 [debug] [Thread-1  ]: finished collecting timing info
13:42:46.214926 [debug] [Thread-1  ]: On seed.emr.cohorts: Close
13:42:46.215248 [debug] [Thread-1  ]: Runtime Error in seed cohorts (seeds/meta/cohorts.csv)
  local variable 'table' referenced before assignment

The issue, as far as I can tell, is this line: Athena adapter: Error running SQL: macro drop_relation

DBT hangs when awsdatacatalog contains many databases

Hi!
Whenever any dbt command first runs, the following query is executed to check the existence of views and tables across the entire catalog.
However, when the default catalog is large, it hangs for a long time.
Is there a way to speed this up or avoid this?
Is there a reason that it has to run over the whole catalog rather than just the relevant schema?

Thanks

WITH views AS (
      select
        table_catalog as database,
        table_name as name,
        table_schema as schema
      from "awsdatacatalog".INFORMATION_SCHEMA.views
      where table_schema = LOWER('*******')
    ), tables AS (
      select
        table_catalog as database,
        table_name as name,
        table_schema as schema

      from "awsdatacatalog".INFORMATION_SCHEMA.tables
      where table_schema = LOWER('********')

      -- Views appear in both `tables` and `views`, so excluding them from tables
      EXCEPT 

      select * from views
    )
    select views.*, 'view' AS table_type FROM views
    UNION ALL
    select tables.*, 'table' AS table_type FROM tables

`aws_profile_name` not used for glue, s3 calls

What I expected to happen:

With aws_profile_name in my DBT profile, all connections to AWS would use the specified SSO credentials (assuming I have the necessary permissions).

What happened

Everything works fine for seeds, tests, materializing tables and views.

When trying to materialize an incremental, partitioned table in insert_overwrite mode, the task fails part way through with the error Unable to locate credentials. This appears to happen after the temporary table for the incremental update has been created, but before it has been inserted into the original table.

Cause

It looks like the profile is not used when making calls directly to Glue/S3 (rather than Athena). This happens in clean_up_table and clean_up_partitions (which construct a boto3 client directly), which expect to find static credentials in a config file or the environment.

This can probably be fixed by constructing a boto3.Session with the profile, passing it to pyathena, and using it to construct the S3/Glue clients.

I'll try and write a PR for this.
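
For reference, a rough sketch of the Session-based approach described above; the profile name, region, and staging bucket are placeholders, not the adapter's current code:

import boto3
from pyathena import connect

profile = "my-sso-profile"  # whatever aws_profile_name is set to

# One session carries the profile for every AWS call.
session = boto3.Session(profile_name=profile)
glue_client = session.client("glue")
s3_resource = session.resource("s3")

# pyathena accepts the same profile directly.
cursor = connect(
    profile_name=profile,
    s3_staging_dir="s3://my-athena-results/",
    region_name="eu-west-1",
).cursor()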

[BUG] Error from boto3 API is not caught and create duplicated data

Hello all,

I have a dbt model with this config.

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partitioned_by=('dummy_partition',),
  )
}}

With this config, the expected behavior is:

  • delete data ( S3 objects ) under the partition ( e.g. dummy_partition=default )
  • insert data

The problem I noticed is that there can be duplicated data if deleting the S3 objects fails (e.g. caused by a permission issue).

This is the code and return value from boto3 S3. I think this line should check whether the returned value's Errors list is empty or not.
https://github.com/Tomme/dbt-athena/blob/master/dbt/adapters/athena/impl.py#L105

s3_bucket.objects.filter(Prefix=prefix).delete()
[{'ResponseMetadata': {'RequestId': 'S2M8Z423GKH25JZW', 'HostId': '+igjv+OblFmcETz1WWHkuhcaW0muPZef9bI=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '+igjv+OblFmcETz1WWHkuhcaW0muPZef9bI=', 'x-amz-request-id': 'S2M8Z423GJZW', 'date': 'Sat, 02 Apr 2022 00:53:39 GMT', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3', 'connection': 'close'}, 'RetryAttempts': 0}, 'Errors': [{'Key': 'dbt_v1/tables/3338-4afd-b918-ec7dbc7a2225/dummy_partition=default/20220402_003612_00013_t88vg_1685448d-3936-47ac-9a58-6640103ed4f3', 'Code': 'AccessDenied', 'Message': 'Access Denied'}, {'Key': 'dbt_v1/tables/3338-4afd-b918-ec7dbc7a2225/dummy_partition=default/20220402_003445_00029_9mcfm_a79200a7-f61e-476d-90ca-529139ec48bc', 'Code': 'AccessDenied', 'Message': 'Access Denied'}]}]
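
A sketch of the kind of check that could catch this, assuming the same bucket/prefix objects are in scope (the exception raised here is just illustrative):

import boto3

s3_bucket = boto3.resource("s3").Bucket("my-data-bucket")  # placeholder bucket
prefix = "path/to/partition/"                              # placeholder prefix

responses = s3_bucket.objects.filter(Prefix=prefix).delete()

# delete() returns one response per batch of keys; partial failures show up
# in each response's 'Errors' list rather than as an exception.
errors = [err for response in responses for err in response.get("Errors", [])]
if errors:
    raise RuntimeError(f"Failed to delete {len(errors)} object(s): {errors}")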

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the community's progress on releasing 1.2.0.

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

  • migrate necessary cross-db macros into adapter and ensure they're tested accordingly
  • remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
  • add new basic tests BaseDocsGenerate and BaseDocsGenReferences
  • consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case you might also want to:

  • implement method and tests for connection retry logic

pre and post hook execution

Hi there! Thanks so much for this project, it looks like it might be a perfect fit for my use-case: I have some Glue catalogs mounted in Redshift as external schemas. Some of the transforms I do on these schemas are compute intensive and are spilling to disk. I'd like to execute them with Athena (and this library!) instead.

I'm running into a few issues ...

Executing is comparatively slow: dbt run spends a lot of time executing numerous queries to get a list of views and tables from "awsdatacatalog".INFORMATION_SCHEMA.tables (i.e. it's executing this macro). It takes a list of all my Redshift schemas and executes an Athena query on each; all of them return no results because those schemas don't exist in Athena.

  • what is the purpose of this?
  • can I disable it?

As part of my dbt project, I'm running an on-run-end hook to set various schema permissions; when it executes (it runs numerous SQL statements), I get an error like:

An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Only one sql statement is allowed

  • is there a way to disable execution of my hook?

Implement support for endpoint_url

Athena supports an optional "endpoint_url" parameter, in case the connection should go through a reverse proxy, or otherwise not the default URL.
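
For context, this is the standard boto3/botocore parameter, so supporting it would presumably just mean forwarding it when the clients are created; for example (the URL and region are placeholders):

import boto3

athena_client = boto3.client(
    "athena",
    region_name="eu-west-1",                       # placeholder
    endpoint_url="https://athena-proxy.internal",  # reverse proxy / non-default endpoint
)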

Query isn't properly formatted when table is partitioned

Hi,
I'm trying to run

{{ config(partitioned_by='day') }}
select 
    a,
    b,
    c
 from random_table

I can see that in target/run the query is being compiled to

create table
    dev_datawarehouse.test

    with (
        partitioned_by=ARRAY'day',
        format='parquet'
    )
  as
    
select 
    a,
    b,
    c
 from random_table

Which gives the following error: SYNTAX_ERROR: Unknown type: ARRAY.
The compiled query should instead be:

create table
    dev_datawarehouse.test

    with (
        partitioned_by=ARRAY['day'],
        format='parquet'
    )
  as
    
select 
    a,
    b,
    c
 from random_table

Which works just fine. Am I missing something obvious?

Retries on OperationalError

Currently dbt-athena retries all exceptions, even those that are guaranteed to keep failing, such as syntax errors in your model's SQL. I would expect operational errors to be retried, but not user errors.
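
A sketch of what retrying only operational errors could look like with tenacity, which the adapter already depends on; the wrapped function and retry policy here are illustrative, not the adapter's actual retry code:

from pyathena.error import OperationalError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(OperationalError),  # only transient Athena/API failures
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, max=30),
    reraise=True,
)
def run_query(cursor, sql):
    # Syntax errors surface as a different DB-API exception class
    # (e.g. ProgrammingError), so they fail immediately instead of retrying.
    cursor.execute(sql)
    return cursor.fetchall()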

BUG: dbt seed fails with CSV having column named "date"

Using dbt-athena from commit 192136a we get an error running dbt seed with the following seed data:

dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:20:58  Running with dbt=1.1.0
23:20:58  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:20:58  
23:21:04  Concurrency: 1 threads (target='dev')
23:21:04  
23:21:04  1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:22:27  1 of 1 ERROR loading seed file taxi_data.trip_counts ........................... [ERROR in 83.08s]
23:22:27  
23:22:27  Finished running 1 seed in 88.47s.
23:22:27  
23:22:27  Completed with 1 error and 0 warnings:
23:22:27  
23:22:27  Runtime Error in seed trip_counts (seeds/trip_counts.csv)
23:22:27    FAILED: ParseException line 2:67 cannot recognize input near 'integer' ')' 'stored' in column type
23:22:27  
23:22:27  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv 
date,trip_count
2020-01-01,1
2021-01-01,2

Renaming column date to trip_date causes it to work:

dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:23:39  Running with dbt=1.1.0
23:23:39  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:23:39  
23:23:45  Concurrency: 1 threads (target='dev')
23:23:45  
23:23:45  1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:23:49  1 of 1 OK loaded seed file taxi_data.trip_counts ............................... [INSERT 2 in 4.00s]
23:23:49  
23:23:49  Finished running 1 seed in 10.17s.
23:23:49  
23:23:49  Completed successfully
23:23:49  
23:23:49  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv 
trip_date,trip_count
2020-01-01,1
2021-01-01,2

Using different data for the second column but still using date as the first column name we get a new error message:

dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:26:51  Running with dbt=1.1.0
23:26:51  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:26:51  
23:26:57  Concurrency: 1 threads (target='dev')
23:26:57  
23:26:57  1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:28:24  1 of 1 ERROR loading seed file taxi_data.trip_counts ........................... [ERROR in 87.06s]
23:28:24  
23:28:24  Finished running 1 seed in 92.63s.
23:28:24  
23:28:24  Completed with 1 error and 0 warnings:
23:28:24  
23:28:24  Runtime Error in seed trip_counts (seeds/trip_counts.csv)
23:28:24    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support date. See HIVE-6384
23:28:24  
23:28:24  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv 
date,value
2020-01-01,hello
2021-01-01,world

Again renaming date to trip_date fixes the error:

dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:29:16  Running with dbt=1.1.0
23:29:16  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:29:16  
23:29:22  Concurrency: 1 threads (target='dev')
23:29:22  
23:29:22  1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:29:26  1 of 1 OK loaded seed file taxi_data.trip_counts ............................... [INSERT 2 in 4.02s]
23:29:26  
23:29:26  Finished running 1 seed in 9.55s.
23:29:26  
23:29:26  Completed successfully
23:29:26  
23:29:26  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv 
trip_date,value
2020-01-01,hello
2021-01-01,world

Error running model with external_location when table already exists

I'm experiencing a strange error when running a model with materialized='table' and external_location. It only happens when the table already exists in Athena. I thought #73 would fix it but it doesn't appear to.

portfolio_account.sql:

{{ config(
   materialized = 'table',
   external_location = 's3://<my bucket>/test/' + target.name + '/portfolio_account/',
   partitioned_by = ['_company_id'],
   format = 'orc',
   write_compression = 'zlib'
) }}
...

dbt run --models portfolio_account (relevant dbt.log snippet):

14:38:54.901965 [debug] [Thread-1 (]: Began running node model.lake.portfolio_account
14:38:54.901965 [info ] [Thread-1 (]: 1 of 1 START table model lake_dev.portfolio_account ............................ [RUN]
14:38:54.901965 [debug] [Thread-1 (]: Acquiring new athena connection "model.lake.portfolio_account"
14:38:54.901965 [debug] [Thread-1 (]: Began compiling node model.lake.portfolio_account
14:38:54.917542 [debug] [Thread-1 (]: Compiling model.lake.portfolio_account
14:38:54.917542 [debug] [Thread-1 (]: Writing injected SQL for node "model.lake.portfolio_account"
14:38:54.917542 [debug] [Thread-1 (]: finished collecting timing info
14:38:54.917542 [debug] [Thread-1 (]: Began executing node model.lake.portfolio_account
14:38:54.982451 [debug] [Thread-1 (]: Opening a new connection, currently in state init
14:38:55.492034 [debug] [Thread-1 (]: Athena adapter: Error running SQL: macro drop_relation
14:38:55.507660 [debug] [Thread-1 (]: finished collecting timing info
14:38:55.507660 [debug] [Thread-1 (]: On model.lake.portfolio_account: Close
14:38:55.507660 [debug] [Thread-1 (]: Runtime Error in model portfolio_account (models\portfolio_account.sql)
  local variable 'table' referenced before assignment
14:38:55.507660 [debug] [Thread-1 (]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'fedef137-2292-43ea-9d8f-27a910d27f59', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000027BE7F09F60>]}
14:38:55.507660 [error] [Thread-1 (]: 1 of 1 ERROR creating table model lake_dev.portfolio_account ................... [ERROR in 0.61s]
14:38:55.507660 [debug] [Thread-1 (]: Finished running node model.lake.portfolio_account

Frequent nondeterministic errors in dbt run and dbt seed when using multiple threads

Using multiple threads to rebuild models is a huge performance win with Athena because it just scales up like crazy, but when I use it with this adapter I randomly get strange, nondescript errors that say things like credential_provider, which sound like component or variable names in the AWS client library:

06:35:55 | 6 of 52 ERROR creating table model covid19_puerto_rico_model.hhs_hospitals [ERROR in 0.53s]

[further down]

Completed with 1 error and 0 warnings:

Runtime Error in model hhs_hospitals (models/hhs/staging/hhs_hospitals.sql)
  'credential_provider'

Workarounds

  1. Don't use multithreading (which can be a lot slower)
  2. Rerun just the affected models (dbt run --threads 1 --select hhs_hospitals+)

Support for ICEBERG type tables

AWS has Iceberg support in Athena, currently in preview. Would it be possible to extend dbt-athena to support the ICEBERG table format?

ICEBERG supports ACID transactions and can help with building type 2 and type 3 dimensions. Since dbt supports snapshots, can dbt-athena use the Iceberg table type to do snapshots?

Switching from Hive to Iceberg requires a change in the metastore. Since Athena handles the metastore, can this be leveraged by dbt-athena? Can we add the Iceberg table format to dbt-athena?

Implementing snapshots

Hey @Tomme ,

First of all, thanks a ton for dbt-athena -- it works very nicely!

I was wondering what it would take to implement snapshots with Athena. Since it doesn't really do updates, it will be quite a challenge, but from the outset it does not seem completely impossible.

Would you see it that way as well? It seems the Presto adapter does not implement it either, so I was wondering what your thoughts on something like this would be.

Thanks!

Support specifying compression type for Parquet and ORC models

It would be helpful if dbt-athena supported specifying the parquet_compression and orc_compression properties for models.

By default, Athena will use GZIP compression for Parquet and ORC tables, but supports several other compression formats (docs). Generally speaking, SNAPPY is faster to read/write, but GZIP yields better compression ratios.

It might also be worth exploring using SNAPPY as the default compression format for Parquet in dbt-athena.

Allow specifying credentials via parameters

Via the README#Credentials, this library only supports loading credentials through configured AWS CLI credentials. Would you support adding the ability to configure the credentials (access key, secret key) as parameters to the connector? Looking at the underlying PyAthena library, it already supports this (see here), so it would just be a matter of amending this library to parse them in / pass them along, I think. I'm happy to put up a PR for this, but I want to confirm it would be merged before committing to it.

unique_key is not supported

Hi,

I have a question about the implementation of unique_key support: were there some issues with this?
Is there maybe a plan for implementing this in the future?

Thanks,

Jovana

Schema getting appended

Hi @Tomme ,

Thanks for dbt-athena , it works great.

I am trying to use a different schema than what's specified in profiles.yml, but the model gets created in a completely different schema. Examples below:

profiles.yml

default:
  outputs:
    dev_test:
      type: athena
      s3_staging_dir: s3://test_bucket/
      region_name: eu-west-1
      schema: default
      database: awsdatacatalog

dbt_project.yml

models:
  first_project:
    staging:
      +materialized: table
      +schema: test

The resulting table is getting created in default_test; basically, it's concatenating the schemas from both, joined with an underscore.

19:17:53 | Concurrency: 1 threads (target='dev_test')
19:17:53 |
19:17:53 | 1 of 1 START table model default_test.test_dbt....................... [RUN]
19:17:57 | 1 of 1 OK created table model default_test.test_dbt.................. [OK -1 in 3.96s]
19:17:57 |
19:17:57 | Finished running 1 table model in 51.26s.

Is there a way to override the schema with the one under dbt_project.yml?
Any help is highly appreciated. Thanks in advance!

Seeds don't work with single quote

Problem

I'm trying to run a seed that has a single quote in it, and I'm getting some failures.

how to replicate

Just try to seed this table.

test,error
This works,None
This doesn't, mismatched "t"
escape also doesn'''t, mismatched "\"
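
In Athena SQL a single quote inside a string literal is escaped by doubling it, so whatever SQL the seed materialization inlines would need something along these lines applied to each value (a sketch, not the adapter's current code):

def quote_seed_value(value):
    # "This doesn't" -> 'This doesn''t', which Athena parses back as: This doesn't
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"

print(quote_seed_value("This doesn't"))  # 'This doesn''t'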

relax boto3 version requirements

The boto3 version requirement currently means you have to use an old version of the awscli if you install both into the same venv. That in turn means that AWS SSO does not work properly. It would be nice if the install requirement in setup.py could be changed to >= to allow installing higher versions, on the assumption that higher versions do not break the API (until proven otherwise...).

E.g.

    install_requires=[
        "dbt-core>=1.0",
        "pyathena>=2.2.0",
        "boto3>=1.18.12",
        "tenacity>=6.3.1",
    ]

It's then up to the user to ensure that the installed versions produce a working set of packages.

timestamp with time zone doesn't work

how to replicate

create a sql file

--test.sql
SELECT CAST('2022-07-20 9:15:00 US/Pacific' AS Timestamp with time zone) as dt

run

dbt run --select test

error

Invalid column type for column dt: Unsupported Hive type: timestamp with time zone

Compatibility with newer dbt

Would you mind adding the following lines to the AthenaCredentials class? This will make it possible to use dbt-athena with more recent dbt versions.

@property
def unique_field(self):
    return self.host

I know that the package is currently not marked as compatible with the latest versions, but this small modification will at least allow using it for the most common use cases without receiving an error.

external_location makes run fail because of HIVE_PATH_ALREADY_EXISTS

Hi!

I'm trying to use the "external_location" property, but I get this error when the second model runs, as it's trying to use the same path again: Target directory for table 'table' already exists.

I've set the property in dbt_project.yml at the project level.
If I set it at the model level, it works the first time, but on the second run I get the same error.

Am I doing something wrong?

From what I've seen, when using the default path, the final table path is generated using a UUID, but when overriding it, this UUID is not used.

Thanks in advance!

CTAS with unique temporary table name

We're using dbt-athena to run 3 to 6 batches per hour (e.g. 6 parallel queries for the last hour, because the data is so big that Athena has a few limitations/problems with it). The problem with running N parallel queries in dbt/dbt-athena is the "tmp" table used in the CTAS query with the insert_overwrite mode.

Any ideas on how to make the "tmp" relation unique per query with a single model? A workaround is to define "per-client" models (a table for each client), but that gets out of hand quickly.

I lost my way in the Jinja, otherwise I would've proposed a PR.
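
In case it helps the discussion, one way to make the temporary relation unique per invocation is to suffix it with a random token (or dbt's invocation id); a small sketch of that naming scheme, purely illustrative:

import uuid

def unique_tmp_relation(model_name):
    # e.g. "hourly_events__dbt_tmp_3f9c1a2b": unique per run, so parallel
    # invocations of the same model no longer collide on one tmp table.
    return f"{model_name}__dbt_tmp_{uuid.uuid4().hex[:8]}"

print(unique_tmp_relation("hourly_events"))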

S3 Throttling for Big tables

Hey,

Recently we have been facing S3 throttling issues while creating external tables that basically select data from other big tables, like:

create table
    new_table

    with (
        external_location='s3://{{ bucket_name }} / {{ invocation_id }}',
        format='parquet'
    )
  as
    
select *
from bigger_table
join other_big_table
where some_logic

Error committing write to Hive com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)

Has anyone faced this issue in dbt, and how have you solved it?

Thanks in advance

type_string should be varchar

dbt_utils.type_string() is implemented as

{%- macro type_string() -%}
  {{ return(adapter.dispatch('type_string', 'dbt_utils')()) }}
{%- endmacro -%}

but returns 'string'. When running cast('constant' as string), Athena throws:
Unknown type: string. You may need to manually clean the data at location ....

When used for typecasts, VARCHAR is a safer target type.

BUG: `dbt test` fails when column name is a keyword

dbt test fails on a not_null test when the column name is a keyword, e.g. order, with the following error message:

An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Queries of this type are not supported

Column and column test definition from .yml file:

columns:
   - name: order     
     tests:
         - not_null

Compiled column not_null test

select
    order
from dwh.table
where 
    order is null

Maybe this can be solved by quoting the column names in the compiled test SQL? E.g.:

select
    "order"
from dwh.table
where 
    "order" is null

Specifying external_location isn't cleaning up the existing data.

Problem

I'm not understanding the accepted way to make external_location dynamic from dev to production.

Unfortunately, I can't use the s3_staging_dir from profiles.yml because of a constraint from my IT department. Also, the standard {s3_staging_dir}/tables/uuid layout doesn't make it easy to know which data goes with which table when looking through S3.

Because right now if I run it twice I'm getting

HIVE_PATH_ALREADY_EXISTS: Target directory for table 'dev_test.tbl_test' already exists: s3://my_external_location/PARQUET_STORE/dev_test/tbl_test. You may need to manually clean the data at location 's3://my_s3_staging_dir/tables/4d8dcf55-5fc0-424d-8668-0b6f182fb254' before retrying. Athena will not delete data in your account.

my tbl_test is simply

{{ config(materialized='table', external_location='s3://my_external_location/PARQUET_STORE/dev_test/tbl_test/')  }}

SELECT 'Test' AS "Fun"

I'm hoping I don't have to manually drop the table

Pipe dream

I'd love it if I could have the external_location in the form of

s3://my-bucket/some_extra_path/tbl_schema/tbl_name

Where I give the external_location prefix (a different one for dev, qa, and prod) and the adapter would know which table schema it is going to and what the table name is, without me having to configure it.

Possible fix

PR 74 seems to fix the issue.

Documentation

I am relatively new to dbt and trying my hand at dbt-athena. I am trying to use it with my existing dbt repo. Can I do that?
Can I add dbt-athena to my requirements.txt? Is there any example of a repo which is using dbt-athena?

Thanks

[BUG] Boto3 get_partitions pagination is not handled in impl.py

When utilizing an insert_overwrite materialization strategy, the adapter uses the boto3 Glue client's get_partitions call. This returns a JSON object that can be paginated and requires handling of the returned NextToken value. If pagination is not handled, it appears as though partitions do not exist.
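
A sketch of handling that pagination with boto3's built-in paginator (database, table, and region names are placeholders):

import boto3

glue_client = boto3.client("glue", region_name="eu-west-1")  # placeholder region

def get_all_partitions(database, table):
    # The paginator follows NextToken automatically, so no partition pages are missed.
    paginator = glue_client.get_paginator("get_partitions")
    partitions = []
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        partitions.extend(page["Partitions"])
    return partitions

partitions = get_all_partitions("my_database", "my_table")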

The partition_by property is not working

Hi!

I'm trying to use the "partition_by" property but the partition is not created.
I've set the property on the dbt_project.yml at a project level:

models:
  project:
    base:
      materialized: table
      schema: raw
      partition_by: ["field_name"]

at the model level:

models:
  - name: my_table
    partition_by: ["field_name"]

and, at the file level:

{{ config( materialized='table', partition_by=["field_name"] ) }}

No error is produced in any situation, the table is created normally, but the partition is not created.
Am I doing something wrong?

Thanks in advance!
