tomme / dbt-athena
The Athena adapter plugin for dbt (https://getdbt.com)
License: Apache License 2.0
Hi, I am wondering whether the database property for sources is supported by dbt-athena. After setting the database property in sources and running dbt run, it tries to build tables using my target database rather than the source database. I hope you can help with this; I have attached my sources file. Thanks!
To select from the source I call:
from {{ source('fleetdog', 'raw_public_trips') }}
My source is configured like this:
version: 2
sources:
  - name: fleetdog
    database: dataalpha
    quoting:
      database: true
    tables:
      - name: raw_public_trips
Error:
07:41:39 Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
07:41:39 SYNTAX_ERROR: line 15:6: Table awsdatacatalog.fleetdog.raw_public_trips does not exist.
Hi
I'm sorry if I'm missing something; I'm new to dbt and just trying to switch the project from custom data-model management to dbt.
Does dbt-athena support recalculating a table without downtime while it's being recalculated?
The workaround I used in my own script: give every CTAS a unique name (i.e. <name>_<random>) and keep a helper view with the fixed name <name> whose query is simply select * from <name>_<random> - so once the new version of the CTAS is calculated, I switch the view to the new version and drop the old one. All usages reference only the view, and the whole thing works without downtime.
Will appreciate your thoughts and comments!
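For illustration, a minimal sketch of that swap pattern with PyAthena (the connection settings, schema, and table names are placeholders, not from the issue):
import uuid
from pyathena import connect

cursor = connect(
    s3_staging_dir="s3://my-staging-bucket/athena-results/",  # placeholder
    region_name="eu-west-1",
).cursor()

suffix = uuid.uuid4().hex[:8]
new_table = f"analytics.my_model_{suffix}"

# 1. Build the new version of the table under a unique name.
cursor.execute(
    f"CREATE TABLE {new_table} WITH (format='PARQUET') AS "
    "SELECT * FROM analytics.source_table"
)

# 2. Atomically repoint the stable view; readers never see a gap.
cursor.execute(
    f"CREATE OR REPLACE VIEW analytics.my_model AS SELECT * FROM {new_table}"
)

# 3. Drop the previous version (its suffix would be tracked from the last run).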
Is there any possibility for dbt/dbt-athena to run using an IAM role instead of the aws_profile_name setting (which requires an access/secret key)? The dbt/dbt-athena we are running is inside a Dagster @op with proper kube2iam annotations. However, invoking dbt via context.resources.dbt.run(vars=payload) from Dagster doesn't seem to take on the associated IAM role (we are sure the role works, as we've tested it directly with PyAthena queries and simple S3 actions at the Dagster @op level). It just doesn't work when we invoke dbt/dbt-athena from the @op. Ideas?
We're getting an InvalidRequestException when specifying format 'parquet' without specifying a compression algorithm, in which case the default compression algorithm should be used (gzip in the case of Parquet).
Exception:
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Compression algorithm '' is not supported for file format 'parquet'.
It seems that in create_table_as.sql no default for write_compression is defined, which results in an empty string '' (instead of not specifying write_compression at all, which would let the default, gzip, be used).
Should be fixed by this PR
Stacktrace:
Failed to execute query.
Traceback (most recent call last):
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/pyathena/common.py", line 307, in _execute
query_id = retry_api_call(
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/pyathena/util.py", line 84, in retry_api_call
return retry(func, *args, **kwargs)
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/tenacity/__init__.py", line 409, in call
do = self.iter(retry_state=retry_state)
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/tenacity/__init__.py", line 356, in iter
return fut.result()
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/tenacity/__init__.py", line 412, in call
result = fn(*args, **kwargs)
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/botocore/client.py", line 386, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/juliansteger/.local/share/virtualenvs/dbt-lib-Tz_GiOtG/lib/python3.8/site-packages/botocore/client.py", line 705, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Compression algorithm '' is not supported for file format 'parquet'.
While adopting dbt-athena we noticed that for some of our databases, dbt docs generate hangs indefinitely.
After some investigation we found that it happens when the athena__get_catalog macro is used for a database with more than 100 tables. This is not a dbt-athena issue itself, but rather an AWS Athena internal issue. We ended up talking to AWS support; they mentioned that they are working on what is most likely the related issue, but have no timeline for when it will be fixed.
In the meantime, the workaround we were advised to use is to also filter by tables. After some manual testing we found that splitting the catalog query to filter by database AND tables (in batches of 100) solves the issue. A sample of the query:
select *
from (
(
with tables as (
...
),
columns as (
...
),
catalog as (
...
)
select *
from catalog
where lower("table_schema") = lower('very_large_database')
and (
"table_name" = lower('table_1')
or "table_name" = lower('table_2')
....
or "table_name" = lower('table_100')
)
union all
select *
from catalog
where lower("table_schema") = lower('very_large_database')
and (
"table_name" = lower('table_101')
or "table_name" = lower('table_102')
....
or "table_name" = lower('table_200')
)
.....
)
These table names can be retrieved from the manifest as part of the adapter's get_catalog().
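A rough sketch of the batching, assuming the table names have already been collected from the manifest (the schema name and the 100-item batch size mirror the sample above):
def batches(items, size=100):
    # Yield successive chunks of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i : i + size]

table_names = [f"table_{n}" for n in range(1, 201)]  # stand-in for manifest names
schema = "very_large_database"

selects = []
for chunk in batches(table_names):
    predicate = " or ".join(f"\"table_name\" = lower('{t}')" for t in chunk)
    selects.append(
        "select * from catalog "
        f"where lower(\"table_schema\") = lower('{schema}') and ({predicate})"
    )
filtered_catalog_query = "\nunion all\n".join(selects)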
If an AWS profile is configured, it is passed to pyathena as expected, but all direct calls to boto3 ignore it. This issue was exacerbated by 83f7756, since boto3 is now called on every dbt run.
Hi @Tomme !
Great work on the adapter, it works like a charm and it's awesome to see you're continuing to add new features!
However, the non-standard way of installing your adapter directly from GitHub makes deterministic and verified installs a little cumbersome, and makes us hesitant to bring it into production, even though our Analytics Engineers would love to use dbt to build and maintain Athena tables.
I'm wondering, do you have plans to introduce tagged releases and publish the package to pypi?
This would make using and integrating the adapter into our Python environments much easier, since users could specify pinned sets of Python dependencies and test against them, or revert to older version sets if things no longer work.
It seems the PyPI package that @lsaletti published under the name dbt-athena has been abandoned, and its git repository has been deleted; however, the info page on PyPI links to your project with a recommendation to use it.
Have the two of you been in contact or talked about a potential transfer of the pypi project?
Even if a transfer of the PyPI project is not possible or desired, I believe most users would be totally fine with installing your adapter under a new name, if it becomes possible to install versioned releases from PyPI. For example, dbt-athena-adapter is still free as of now.
Please let me know if I can support in any way.
From https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html
Athena accepts mixed case in DDL and DML queries, but lower cases the names when it executes the query.
The macros athena__list_relations_without_caching and athena__get_columns_in_relation wrap references to information_schema.table_schema and information_schema.table_name in LOWER(), which I think is unnecessary.
In my testing, the queries executed by these macros perform significantly faster when LOWER() is removed from the information_schema references.
| Macro | Seconds to run with LOWER() | Seconds to run without LOWER() |
|---|---|---|
| list_relations_without_caching | 29.6 | 12.3 |
| get_columns_in_relation | 17.5 | 0.5 |
Hello, dbt-core 1.0 has been rolled out and I'd like to make this adapter compatible.
I'm starting this issue to document what should be done for v1 readiness. Do not hesitate to comment; I can update this issue.
profile_template.yml: see https://docs.getdbt.com/reference/commands/init#profile_templateyml
I'm wondering if we should create a v1.0 branch to ease testing?
The README specifies that this plugin supports version 0.21.0, but on_schema_change is not available. This was one of the main features of the 0.21.0 release: https://github.com/dbt-labs/dbt-core/releases/tag/v0.21.0
Can the documentation be updated to show how on_schema_change is supported, or to specify that it is not available?
Hi,
This is my model SQL:
{{
    config(
        materialized="view",
    )
}}

SELECT org_feature_name
FROM test.org_feature_v1
GROUP BY org_feature_name;
The first run works. However, the second run fails to replace the view, with the log below. Am I missing something?
19:24:08 Running with dbt=1.0.3
19:24:08 Found 1 model, 0 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
19:24:08
19:24:14 Concurrency: 1 threads (target='dev')
19:24:14
19:24:14 1 of 1 START view model ft_analytics_dev.org_feature_name....................... [RUN]
19:24:14 1 of 1 ERROR creating view model ft_analytics_dev.org_feature_name.............. [ERROR in 0.02s]
19:24:15
19:24:15 Finished running 1 view model in 6.52s.
19:24:15
19:24:15 Completed with 1 error and 0 warnings:
19:24:15
19:24:15 Compilation Error in model org_feature_name (models/exp/org_feature_name.sql)
19:24:15 When searching for a relation, dbt found an approximate match. Instead of guessing
19:24:15 which relation to use, dbt will move on. Please delete ft_analytics_dev.org_feature_name, or rename it to be less ambiguous.
19:24:15 Searched for: ft_analytics_dev.org_feature_name
19:24:15 Found: ft_analytics_dev.org_feature_name
19:24:15
19:24:15 > in macro create_or_replace_view (macros/materializations/models/view/create_or_replace_view.sql)
19:24:15 > called by macro materialization_view_athena (macros/materializations/models/view/view.sql)
19:24:15 > called by model org_feature_name (models/exp/org_feature_name.sql)
19:24:15
19:24:15 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
#68 added support for endpoint_url configuration, but included a typo: "endpoing_url" should be "endpoint_url".
When fixing, consider also updating the README to document the endpoint_url configuration option.
Hi everyone, first of all, thank you so much for developing this plugin to connect dbt with Athena.
I am facing an issue which says:
Compilation Error
DBT found more than one package with the name "my_first_project" included in this project. Package names must be unique in a project. Please rename one of these packages.
I have checked the entire project and I have only mentioned this name once.
Does anyone know why I am facing this issue, or what part of the setup I am missing?
Thanks in advance!
Hey, really cool plugin!
I noticed that you rely on pyathena a lot; in my experience that is a beta package with lots of issues. Why not try the official Python package from AWS, awswrangler: https://github.com/awslabs/aws-data-wrangler
If seeds have already been created, subsequent runs fail with an unhelpful error.
The error that dbt shows is:
14:51:05 1 of 4 START seed file dbt_test.cohorts ........................................ [RUN]
14:51:06 1 of 4 ERROR loading seed file dbt_test.cohorts ................................ [ERROR in 0.28s]
14:51:06 2 of 4 START seed file dbt_test.diagnosis ...................................... [RUN]
14:51:06 2 of 4 ERROR loading seed file dbt_test.diagnosis .............................. [ERROR in 0.15s]
14:51:06 3 of 4 START seed file dbt_test.parameters ..................................... [RUN]
14:51:06 3 of 4 ERROR loading seed file dbt_test.parameters ............................. [ERROR in 0.13s]
14:51:06 4 of 4 START seed file dbt_test.valid_ranges ................................... [RUN]
14:51:06 4 of 4 ERROR loading seed file dbt_test.valid_ranges ........................... [ERROR in 0.13s]
14:51:06
14:51:06 Finished running 4 seeds in 12.70s.
14:51:06
14:51:06 Completed with 4 errors and 0 warnings:
14:51:06
14:51:06 Runtime Error in seed cohorts (seeds/meta/cohorts.csv)
14:51:06 local variable 'table' referenced before assignment
14:51:06
14:51:06 Runtime Error in seed diagnosis (seeds/meta/diagnosis.csv)
14:51:06 local variable 'table' referenced before assignment
14:51:06
14:51:06 Runtime Error in seed parameters (seeds/meta/parameters.csv)
14:51:06 local variable 'table' referenced before assignment
14:51:06
14:51:06 Runtime Error in seed valid_ranges (seeds/meta/valid_ranges.csv)
14:51:06 local variable 'table' referenced before assignment
and if you dig into the logs you can find
13:42:45.927212 [info ] [Thread-1 ]: 1 of 4 START seed file dbt.cohorts ............................................. [RUN]
13:42:45.927659 [debug] [Thread-1 ]: Acquiring new athena connection "seed.emr.cohorts"
13:42:45.928175 [debug] [Thread-1 ]: Began compiling node seed.emr.cohorts
13:42:45.928367 [debug] [Thread-1 ]: finished collecting timing info
13:42:45.928537 [debug] [Thread-1 ]: Began executing node seed.emr.cohorts
13:42:45.959518 [debug] [Thread-1 ]: Opening a new connection, currently in state closed
13:42:46.214376 [debug] [Thread-1 ]: Athena adapter: Error running SQL: macro drop_relation
13:42:46.214740 [debug] [Thread-1 ]: finished collecting timing info
13:42:46.214926 [debug] [Thread-1 ]: On seed.emr.cohorts: Close
13:42:46.215248 [debug] [Thread-1 ]: Runtime Error in seed cohorts (seeds/meta/cohorts.csv)
local variable 'table' referenced before assignment
The issue, as far as I can tell, is this line: Athena adapter: Error running SQL: macro drop_relation
Hi!
Whenever any dbt command first runs, the following query is executed to check the existence of views and tables across the entire catalog.
When the default catalog is large, this hangs for quite a while.
Is there a way to speed this up or avoid it?
Is there a reason it has to run over the whole catalog rather than just the relevant schema?
Thanks
WITH views AS (
    select
        table_catalog as database,
        table_name as name,
        table_schema as schema
    from "awsdatacatalog".INFORMATION_SCHEMA.views
    where table_schema = LOWER('*******')
), tables AS (
    select
        table_catalog as database,
        table_name as name,
        table_schema as schema
    from "awsdatacatalog".INFORMATION_SCHEMA.tables
    where table_schema = LOWER('********')
    -- Views appear in both `tables` and `views`, so exclude them from tables
    EXCEPT
    select * from views
)
select views.*, 'view' AS table_type FROM views
UNION ALL
select tables.*, 'table' AS table_type FROM tables
With aws_profile_name in my dbt profile, all connections to AWS should use the specified SSO credentials (assuming I have the necessary permissions).
Everything works fine for seeds, tests, and materializing tables and views.
When trying to materialize an incremental, partitioned table in insert_overwrite mode, the task fails part way through with the error Unable to locate credentials. This appears to happen after the temporary table for the incremental update has been created, but before it has been inserted into the original table.
It looks like the profile is not used when making calls directly to Glue/S3 (rather than Athena). This happens in clean_up_table and clean_up_partitions, which construct boto3 clients directly and therefore expect to find static credentials in a config file or the environment.
This can probably be fixed by constructing a boto3.Session with the profile, passing it to pyathena, and using it to construct the S3/Glue clients.
I'll try to write a PR for this.
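A minimal sketch of that fix, assuming the profile name comes from the dbt target (the profile and region values are illustrative):
import boto3

# One shared session that honours the configured profile (including SSO).
session = boto3.Session(profile_name="my-sso-profile", region_name="eu-west-1")

# Build Glue/S3 handles from the session instead of the module-level
# boto3.client()/boto3.resource(), which fall back to static credentials.
glue_client = session.client("glue")
s3_resource = session.resource("s3")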
Hello all,
I have a dbt model with this config.
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partitioned_by=('dummy_partition',),
    )
}}
With this config, the expected behavior is as follows.
The problem I noticed is that there can be duplicated data if deleting the S3 objects fails (caused by a permission issue, for example).
Below are the code and the return value from boto3 S3. I think this line should check whether the returned value's Errors list is empty:
https://github.com/Tomme/dbt-athena/blob/master/dbt/adapters/athena/impl.py#L105
s3_bucket.objects.filter(Prefix=prefix).delete()
[{'ResponseMetadata': {'RequestId': 'S2M8Z423GKH25JZW', 'HostId': '+igjv+OblFmcETz1WWHkuhcaW0muPZef9bI=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': '+igjv+OblFmcETz1WWHkuhcaW0muPZef9bI=', 'x-amz-request-id': 'S2M8Z423GJZW', 'date': 'Sat, 02 Apr 2022 00:53:39 GMT', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3', 'connection': 'close'}, 'RetryAttempts': 0}, 'Errors': [{'Key': 'dbt_v1/tables/3338-4afd-b918-ec7dbc7a2225/dummy_partition=default/20220402_003612_00013_t88vg_1685448d-3936-47ac-9a58-6640103ed4f3', 'Code': 'AccessDenied', 'Message': 'Access Denied'}, {'Key': 'dbt_v1/tables/3338-4afd-b918-ec7dbc7a2225/dummy_partition=default/20220402_003445_00029_9mcfm_a79200a7-f61e-476d-90ca-529139ec48bc', 'Code': 'AccessDenied', 'Message': 'Access Denied'}]}]
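For illustration, a sketch of the suggested check (the bucket name and prefix are placeholders):
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")

# delete() returns one response dict per batch of keys; each response may
# carry an 'Errors' list even when the HTTP status code is 200.
responses = bucket.objects.filter(Prefix="dbt_v1/tables/").delete()
errors = [err for resp in responses for err in resp.get("Errors", [])]
if errors:
    raise RuntimeError(f"Failed to delete {len(errors)} S3 objects: {errors}")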
It looks like the issue was caused by this merge: #43
If we have a static external location and materialized = table (for a full overwrite), the prior data in the external location is not cleaned up.
We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).
dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the community's progress on releasing 1.2.0.
Below is a checklist of work that would enable a successful 1.2.0 release of your adapter:
BaseDocsGenerate and BaseDocsGenReferences
dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case you might also want to:
Hi there! Thanks so much for this project; it looks like it might be a perfect fit for my use case: I have some Glue catalogs mounted in Redshift as external schemas. Some of the transforms I do on these schemas are compute-intensive and are spilling to disk. I'd like to execute them with Athena (and this library!) instead.
I'm running into a few issues ...
Execution is comparatively slow: dbt run spends a lot of time executing numerous queries to get a list of views and tables from "awsdatacatalog".INFORMATION_SCHEMA.tables (i.e. it's executing this macro). It takes the list of all my Redshift schemas and executes an Athena query on each; all of them return no results because those schemas don't exist in Athena.
As part of my dbt project, I'm running an on-run-end hook to set various schema permissions. When it executes (it runs numerous SQL statements) I get an error like:
An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Only one sql statement is allowed
Athena supports an optional "endpoint_url" parameter, for cases where the connection should go through a reverse proxy or otherwise not use the default URL.
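For reference, a sketch of what this looks like on a raw boto3 client (the proxy URL is a hypothetical example):
import boto3

athena = boto3.client(
    "athena",
    region_name="eu-west-1",
    endpoint_url="https://athena-proxy.internal.example.com",  # hypothetical proxy
)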
Hi,
I'm trying to run
{{ config(partitioned_by='day') }}
select
a,
b,
c
from random_table
I can see in target/run that the query is compiled to:
create table
dev_datawarehouse.test
with (
partitioned_by=ARRAY'day',
format='parquet'
)
as
select
a,
b,
c
from random_table
This gives the following error: SYNTAX_ERROR: Unknown type: ARRAY. The compiled query should instead be:
create table
dev_datawarehouse.test
with (
partitioned_by=ARRAY['day'],
format='parquet'
)
as
select
a,
b,
c
from random_table
Which works just fine. Am I missing something obvious?
Currently dbt-athena retries all exceptions, even those that are guaranteed to keep failing, such as syntax errors in your models' SQL. I would expect operational errors to be retried, but not user errors.
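A hedged sketch of what selective retries could look like with tenacity and botocore; the set of retryable error codes here is illustrative, not the adapter's actual list:
from botocore.exceptions import ClientError
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

RETRYABLE_CODES = {"ThrottlingException", "TooManyRequestsException", "SlowDown"}

def is_operational_error(exc: BaseException) -> bool:
    # Retry throttling-style errors; let user errors (e.g. syntax) fail fast.
    return (
        isinstance(exc, ClientError)
        and exc.response.get("Error", {}).get("Code") in RETRYABLE_CODES
    )

@retry(
    retry=retry_if_exception(is_operational_error),
    wait=wait_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(5),
)
def start_query(client, sql, work_group):
    return client.start_query_execution(QueryString=sql, WorkGroup=work_group)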
Using dbt-athena from commit 192136a, we get an error running dbt seed with the following seed data:
dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:20:58 Running with dbt=1.1.0
23:20:58 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:20:58
23:21:04 Concurrency: 1 threads (target='dev')
23:21:04
23:21:04 1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:22:27 1 of 1 ERROR loading seed file taxi_data.trip_counts ........................... [ERROR in 83.08s]
23:22:27
23:22:27 Finished running 1 seed in 88.47s.
23:22:27
23:22:27 Completed with 1 error and 0 warnings:
23:22:27
23:22:27 Runtime Error in seed trip_counts (seeds/trip_counts.csv)
23:22:27 FAILED: ParseException line 2:67 cannot recognize input near 'integer' ')' 'stored' in column type
23:22:27
23:22:27 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv
date,trip_count
2020-01-01,1
2021-01-01,2
Renaming the column date to trip_date makes it work:
dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:23:39 Running with dbt=1.1.0
23:23:39 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:23:39
23:23:45 Concurrency: 1 threads (target='dev')
23:23:45
23:23:45 1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:23:49 1 of 1 OK loaded seed file taxi_data.trip_counts ............................... [INSERT 2 in 4.00s]
23:23:49
23:23:49 Finished running 1 seed in 10.17s.
23:23:49
23:23:49 Completed successfully
23:23:49
23:23:49 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv
trip_date,trip_count
2020-01-01,1
2021-01-01,2
Using different data for the second column, but still using date as the first column name, we get a new error message:
dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:26:51 Running with dbt=1.1.0
23:26:51 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:26:51
23:26:57 Concurrency: 1 threads (target='dev')
23:26:57
23:26:57 1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:28:24 1 of 1 ERROR loading seed file taxi_data.trip_counts ........................... [ERROR in 87.06s]
23:28:24
23:28:24 Finished running 1 seed in 92.63s.
23:28:24
23:28:24 Completed with 1 error and 0 warnings:
23:28:24
23:28:24 Runtime Error in seed trip_counts (seeds/trip_counts.csv)
23:28:24 FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support date. See HIVE-6384
23:28:24
23:28:24 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv
date,value
2020-01-01,hello
2021-01-01,world
Again, renaming date to trip_date fixes the error:
dsaxton:~/git-repos/dbt-scratch/scratch$ dbt seed
23:29:16 Running with dbt=1.1.0
23:29:16 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 167 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
23:29:16
23:29:22 Concurrency: 1 threads (target='dev')
23:29:22
23:29:22 1 of 1 START seed file taxi_data.trip_counts ................................... [RUN]
23:29:26 1 of 1 OK loaded seed file taxi_data.trip_counts ............................... [INSERT 2 in 4.02s]
23:29:26
23:29:26 Finished running 1 seed in 9.55s.
23:29:26
23:29:26 Completed successfully
23:29:26
23:29:26 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
dsaxton:~/git-repos/dbt-scratch/scratch$ cat seeds/trip_counts.csv
trip_date,value
2020-01-01,hello
2021-01-01,world
I'm experiencing a strange error when running a model with materialized='table' and external_location. It only happens when the table already exists in Athena. I thought #73 would fix it, but it doesn't appear to.
portfolio_account.sql:
{{ config(
    materialized = 'table',
    external_location = 's3://<my bucket>/test/' + target.name + '/portfolio_account/',
    partitioned_by = ['_company_id'],
    format = 'orc',
    write_compression = 'zlib'
) }}
...
dbt run --models portfolio_account (relevant dbt.log snippet):
14:38:54.901965 [debug] [Thread-1 (]: Began running node model.lake.portfolio_account
14:38:54.901965 [info ] [Thread-1 (]: 1 of 1 START table model lake_dev.portfolio_account ............................ [RUN]
14:38:54.901965 [debug] [Thread-1 (]: Acquiring new athena connection "model.lake.portfolio_account"
14:38:54.901965 [debug] [Thread-1 (]: Began compiling node model.lake.portfolio_account
14:38:54.917542 [debug] [Thread-1 (]: Compiling model.lake.portfolio_account
14:38:54.917542 [debug] [Thread-1 (]: Writing injected SQL for node "model.lake.portfolio_account"
14:38:54.917542 [debug] [Thread-1 (]: finished collecting timing info
14:38:54.917542 [debug] [Thread-1 (]: Began executing node model.lake.portfolio_account
14:38:54.982451 [debug] [Thread-1 (]: Opening a new connection, currently in state init
14:38:55.492034 [debug] [Thread-1 (]: Athena adapter: Error running SQL: macro drop_relation
14:38:55.507660 [debug] [Thread-1 (]: finished collecting timing info
14:38:55.507660 [debug] [Thread-1 (]: On model.lake.portfolio_account: Close
14:38:55.507660 [debug] [Thread-1 (]: Runtime Error in model portfolio_account (models\portfolio_account.sql)
local variable 'table' referenced before assignment
14:38:55.507660 [debug] [Thread-1 (]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'fedef137-2292-43ea-9d8f-27a910d27f59', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x0000027BE7F09F60>]}
14:38:55.507660 [error] [Thread-1 (]: 1 of 1 ERROR creating table model lake_dev.portfolio_account ................... [ERROR in 0.61s]
14:38:55.507660 [debug] [Thread-1 (]: Finished running node model.lake.portfolio_account
Using multiple threads to rebuild models is a huge performance win with Athena because it scales up like crazy, but when I use them with this adapter I randomly get strange, nondescript errors mentioning things like credential_provider, which sound like component or variable names in the AWS client library:
06:35:55 | 6 of 52 ERROR creating table model covid19_puerto_rico_model.hhs_hospitals [ERROR in 0.53s]
[further down]
Completed with 1 error and 0 warnings:
Runtime Error in model hhs_hospitals (models/hhs/staging/hhs_hospitals.sql)
'credential_provider'
(The error goes away when running single-threaded: dbt run --threads 1 --select hhs_hospitals+)
AWS has Iceberg support in Athena, currently in preview. Will it be possible to extend dbt-athena to support the Iceberg table format?
Iceberg supports ACID transactions and can help with building type 2 and type 3 dimensions. Since dbt supports snapshots, can dbt-athena use the Iceberg table type to do snapshots?
Switching from Hive to Iceberg requires a change in the metastore. Since Athena handles the metastore, can this be leveraged by dbt-athena? Can we add the Iceberg table format to dbt-athena?
Hey @Tomme,
First of all, thanks a ton for dbt-athena; it works very nicely!
I was wondering what it would take to implement snapshots with Athena. Since Athena doesn't really do updates, it will be quite a challenge, but from the outset it does not seem completely impossible.
Would you see it that way as well? It seems the Presto adapter doesn't implement it either, so I was wondering what your thoughts on something like this would be.
Thanks!
Currently you can only use aws_profile_name for authentication. The connection to Athena is managed using pyathena, which does support using an IAM role.
It would be helpful if dbt-athena supported specifying the parquet_compression and orc_compression properties for models.
By default, Athena uses GZIP compression for Parquet and ORC tables, but it supports several other compression formats (docs). Generally speaking, SNAPPY is faster to read/write, but GZIP yields better compression ratios.
It might also be worth exploring using SNAPPY as the default compression format for Parquet in dbt-athena.
Per the README#Credentials section, this library only supports loading credentials through configured AWS CLI credentials. Would you support adding the ability to pass credentials (access key, secret key) as parameters to the connector? Looking at the underlying PyAthena library, it already supports this (see here), so it would just mean amending this library to parse them in and pass them along, I think. Happy to put up a PR for this, but I want to confirm it would be merged before committing to it.
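For reference, a sketch of what PyAthena already accepts (the key values are obvious placeholders):
from pyathena import connect

conn = connect(
    aws_access_key_id="AKIAXXXXXXXXXXXXXXXX",          # placeholder
    aws_secret_access_key="xxxxxxxxxxxxxxxxxxxxxxxx",  # placeholder
    s3_staging_dir="s3://my-staging-bucket/athena-results/",
    region_name="eu-west-1",
)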
Hi,
I have a question about the implementation of unique_key support: were there some issues with this?
Is there perhaps a plan to implement it in the future?
Thanks,
Jovana
Hi @Tomme,
Thanks for dbt-athena, it works great.
When I try to specify a schema different from the one in profiles.yml, the model gets created in a completely different schema. Examples below.
profiles.yml
default:
  outputs:
    dev_test:
      type: athena
      s3_staging_dir: s3://test_bucket/
      region_name: eu-west-1
      schema: default
      database: awsdatacatalog
dbt_project.yml
models:
  first_project:
    staging:
      +materialized: table
      +schema: test
The resulting table gets created in default_test; basically it concatenates the schemas from both files with an underscore.
19:17:53 | Concurrency: 1 threads (target='dev_test')
19:17:53 |
19:17:53 | 1 of 1 START table model default_test.test_dbt....................... [RUN]
19:17:57 | 1 of 1 OK created table model default_test.test_dbt.................. [OK -1 in 3.96s]
19:17:57 |
19:17:57 | Finished running 1 table model in 51.26s.
Is there a way to override the schema with the one under dbt_project.yml?
Any help highly appreciated. Thanks in advance.
I'm trying to run a seed that contains single quotes, and I'm getting failures.
Just try to seed this table:
test,error
This works,None
This doesn't, mismatched "t"
escape also doesn'''t, mismatched "\"
The boto3 version requirement currently means you have to use an old version of the awscli if you install both into the same venv. That in turn means that AWS SSO does not work properly. It would be nice if the install requirements in setup.py could be changed to >= to allow installing newer versions, on the assumption that newer versions don't break the API (until proven otherwise...).
E.g.
install_requires=[
    "dbt-core>=1.0",
    "pyathena>=2.2.0",
    "boto3>=1.18.12",
    "tenacity>=6.3.1",
]
It's then up to the user to ensure that the installed versions produce a working set of packages.
Create a SQL file:
--test.sql
SELECT CAST('2022-07-20 9:15:00 US/Pacific' AS Timestamp with time zone) as dt
Run:
dbt run --select test
Invalid column type for column dt: Unsupported Hive type: timestamp with time zone
Would you mind adding the following lines to the AthenaCredentials class? This will make it possible to use dbt-athena with more recent dbt versions.
@property
def unique_field(self):
    return self.host
I know that the package is currently not marked as compatible with the latest versions, but this small modification will at least allow using it for the most common use cases without receiving an error.
Hi!
I'm trying to use the "external_location" property, but I get this error when the second model runs, as it's trying to use the same path again: Target directory for table 'table' already exists.
I've set the property in dbt_project.yml at the project level.
If I set it at the model level, it works the first time, but on the second run I get the same error.
Am I doing something wrong?
From what I've seen, when using the default path the final table path is generated using a UUID, but when overriding it, this UUID is not used.
Thanks in advance!
We're using dbt-athena to run 3 to 6 batches per hour (e.g. 6 parallel queries for the last hour, because the data is so big that Athena has a few limitations/problems with it). The problem with running N parallel queries in dbt/dbt-athena is the "tmp" table used in the CTAS query in insert_overwrite mode.
Any ideas on how to make the "tmp" relation unique per query with a single model? A workaround is to define "per-client" models (a table for each client), but that gets out of hand quickly.
I lost my way in the Jinja, else I would have proposed a PR.
Hey,
Recently we have been facing S3 throttling issues while creating external tables that basically select data from another big table, like:
create table
new_table
with (
external_location='s3://{{ bucket_name }}/{{ invocation_id }}',
format='parquet'
)
as
select *
from bigger_table
join other_big_table
where some_logic
Error committing write to Hive com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)
Has anyone faced this issue in dbt, and how have you solved it?
Thanks in advance
dbt_utils.type_string()
is implemented as
{%- macro type_string() -%}
{{ return(adapter.dispatch('type_string', 'dbt_utils')()) }}
{%- endmacro -%}
but it returns 'string'. When running cast('constant' as string), Athena throws:
Unknown type: string. You may need to manually clean the data at location ....
When used for typecasts, VARCHAR is a safer target type.
As dbt has released v0.20.1, can dbt-athena support the latest version as well?
dbt test fails on the not_null test when the column name is a keyword, e.g. order, with the following error message:
An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: Queries of this type are not supported
Column and column test definition from .yml file:
columns:
  - name: order
    tests:
      - not_null
Compiled not_null test:
select
order
from dwh.table
where
order is null
Maybe this can be solved by quoting the column names in the compiled test SQL? E.g.:
select
"order"
from dwh.table
where
"order" is null
I'm not understanding the accepted way to make external_location dynamic from dev to production.
I unfortunately can't use the s3_staging_dir from profiles.yml because of a constraint from my IT department. Also, the standard {s3_staging_dir}/tables/uuid doesn't make it easy to know which data goes with which table when looking through S3.
Because right now if I run it twice I'm getting
HIVE_PATH_ALREADY_EXISTS: Target directory for table 'dev_test.tbl_test' already exists: s3://my_external_location/PARQUET_STORE/dev_test/tbl_test. You may need to manually clean the data at location 's3://my_s3_staging_dir/tables/4d8dcf55-5fc0-424d-8668-0b6f182fb254' before retrying. Athena will not delete data in your account.
My tbl_test is simply:
{{ config(materialized='table', external_location='s3://my_external_location/PARQUET_STORE/dev_test/tbl_test/') }}
SELECT 'Test' AS "Fun"
I'm hoping I don't have to manually drop the table.
I'd love to be able to have the external_location in the form of
s3://my-bucket/some_extra_path/tbl_schema/tbl_name
where I give the external_location prefix (a different one for dev, QA, and prod) and the model works out which table schema and table name it belongs to without me having to configure it.
PR 74 seems to fix the issue.
I am relatively new to dbt and trying my hand at dbt-athena. I am trying to use it with my existing dbt repo. Can I do that?
Can I add dbt-athena to my requirements.txt? Is there an example repo that uses dbt-athena?
Thanks
When using the insert_overwrite materialization strategy, the adapter uses boto3's Glue client get_partitions call. This returns a JSON object which can be paginated and requires handling the returned NextToken value. If pagination is not handled, it appears as though partitions do not exist.
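A minimal sketch of handling that pagination with boto3's built-in paginator (database and table names are placeholders):
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# get_paginator handles NextToken internally, so no partition pages are missed.
paginator = glue.get_paginator("get_partitions")
partitions = []
for page in paginator.paginate(DatabaseName="my_db", TableName="my_table"):
    partitions.extend(page["Partitions"])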
Hi!
I'm trying to use the "partition_by" property, but the partition is not created.
I've set the property in dbt_project.yml at the project level:
models:
  project:
    base:
      materialized: table
      schema: raw
      partition_by: ["field_name"]
at the model level:
models:
  - name: my_table
    partition_by: ["field_name"]
and, at the file level:
{{ config( materialized='table', partition_by=["field_name"] ) }}
No error is produced in any of these cases; the table is created normally, but the partition is not created.
Am I doing something wrong?
Thanks in advance!