dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.

Home Page: https://github.com/dbt-labs/dbt-bigquery

License: Apache License 2.0

Python 99.11% Shell 0.25% Makefile 0.32% Dockerfile 0.32%

dbt-bigquery's Introduction


dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt is the T in ELT. Organize, cleanse, denormalize, filter, rename, and pre-aggregate the raw data in your warehouse so that it's ready for analysis.

dbt-bigquery

The dbt-bigquery package contains all of the code enabling dbt to work with Google BigQuery. For more information on using dbt with BigQuery, consult the docs.

Getting started

Join the dbt Community

Reporting bugs and contributing code

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the dbt Code of Conduct.

dbt-bigquery's People

Contributors

azhard, beckjake, chenyulinx, cmcarthur, colin-rogers-dbt, darrenhaken, db-magnus, dbeatty10, dependabot[bot], drewbanin, emmyoop, fishtownbuildbot, github-actions[bot], github-christophe-oudar, gshank, iknox-fa, jmcarp, jtcohen6, kayrnt, kconvey, leahwicz, max-sixty, mcknight-42, michelleark, mikealfare, nssalian, prratek, ran-eh, versusfacit, yu-iskw


dbt-bigquery's Issues

Support for ingestion time partition table on BigQuery as incremental materialization

Describe the feature

This is a follow-up from a discussion over the topic with @jtcohen6.

The issues with current incremental materialization implementation

Ingestion-time partitioned tables are currently supported as a table materialization, but that support is slated for deprecation.
What makes these tables special is that the partitioning field is the "pseudo column" _PARTITIONTIME or _PARTITIONDATE.
Since that column doesn't really exist as a column within the table, BigQuery doesn't allow the same kind of operations on it.
For example, if you would like to use

partition_by={
 "field": "_PARTITIONTIME",
 "data_type": "timestamp"
}

BigQuery doesn't let you create an ingestion-time partitioned table using the usual dbt approach (https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_an_ingestion-time_partitioned_table):

CREATE TABLE
  mydataset.newtable (transaction_id INT64)
PARTITION BY
  DATE_TRUNC(_PARTITIONTIME, DAY)
AS (
SELECT TIMESTAMP("2021-11-01") as _PARTITIONTIME, 1 as transaction_id
)

will fail because _PARTITIONTIME is not among the defined columns.


Working approach

So the required approach is to first run:

CREATE TABLE
  mydataset.newtable (transaction_id INT64)
PARTITION BY
  DATE_TRUNC(_PARTITIONTIME, DAY)

And then:

INSERT INTO mydataset.newtable (_PARTITIONTIME, transaction_id)
SELECT TIMESTAMP("2021-11-01"), 1 

Once we move to the merge step, it is indeed possible to insert data with:

merge into mydataset.newtable as DBT_INTERNAL_DEST
    using (
    SELECT TIMESTAMP("2021-11-03") as _PARTITIONTIME, 3 as transaction_id
    ) as DBT_INTERNAL_SOURCE
    on FALSE

when not matched by source
        and DBT_INTERNAL_DEST._PARTITIONTIME in unnest([TIMESTAMP("2021-11-03")])
    then delete

when not matched then insert
    (`_PARTITIONTIME`, `transaction_id`)
values
    (`_PARTITIONTIME`, `transaction_id`)

In practice, it won't be 100% straightforward: the temporary table has to be read with
SELECT _PARTITIONTIME, * FROM mydataset.newtable__dbt_tmp, since _PARTITIONTIME is a pseudo column and therefore not part of SELECT *; but otherwise it appears to work.

I think everything can be done within incremental.sql.
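
A minimal sketch of that merge, reusing the table names above and selecting the pseudo column explicitly (sketch only, not the final macro output):

merge into mydataset.newtable as DBT_INTERNAL_DEST
    using (
    -- _PARTITIONTIME must be selected explicitly; it is not covered by SELECT *
    select _PARTITIONTIME, * from mydataset.newtable__dbt_tmp
    ) as DBT_INTERNAL_SOURCE
    on FALSE

when not matched then insert
    (`_PARTITIONTIME`, `transaction_id`)
values
    (`_PARTITIONTIME`, `transaction_id`)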

Describe alternatives you've considered

Alternatives are:

  • Working with column-type partitioned tables, with a tradeoff on performance
  • Working with time-ingestion partitioned tables and partition decorators, which is more efficient but also trickier to implement (I'll open another issue later on that topic)

Additional context

At Teads, we use a lot of time-ingestion partitioned tables, as they were created before BigQuery's column-type partitioning feature existed. Migrating would be an option, but the overhead introduced when selecting data from column-type partitioned tables makes that move "a drag".

Who will this benefit?

It would benefit anyone using time-ingestion partitioned tables.

Are you interested in contributing this feature?

Yes

dbt deps on BigQuery requires either gcloud auth or "project" set

Describe the bug & Steps To Reproduce

Running dbt deps a) with a profiles.yml that doesn't set a project, and b) in an environment without valid gcloud auth (such as a docker build):

nimbus:
  target: user
  outputs:
    user:
      type: bigquery
      method: oauth
      dataset: my-dataset
      timeout_seconds: 3600
      threads: 24
      # project: my-project

Raises an error:

 => ERROR [dev 11/12] RUN dbt deps                                                                                                                                           5.2s
------
 > [dev 11/12] RUN dbt deps:
dbt-labs/dbt#36 1.615 Running with dbt=0.19.0
dbt-labs/dbt#36 4.925 Encountered an error:
dbt-labs/dbt#36 4.925 Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
------
executor failed running [/bin/sh -c dbt deps]: exit code: 2

These work:

  • Running with the project: uncommented
  • Running with dbt deps --target=null

Expected behavior

Does dbt deps need the database / project name at that stage? If so, this is the correct behavior. But if not, there's no need for it to break a docker build.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.19.0
   latest version: 0.19.0

Up to date!

Plugins:
  - bigquery: 0.19.0
  - snowflake: 0.19.0
  - redshift: 0.19.0
  - postgres: 0.19.0

The operating system you're using:
MacOS

The output of python --version:
Python 3.8.8

`UPDATE` operation is missing a useful AdapterResponse

Describe the bug

When calling load_result() from a statement block connecting to a bigquery db, if the query is an UPDATE operation, we do not get an AdapterResponse on par with other DML operations.

Seems like it should be included here:

elif query_job.statement_type in ['INSERT', 'DELETE', 'MERGE']:

Moreover, what is the reason that SELECT statements do not also get a filled out AdapterResponse?

Steps To Reproduce

{% macro test_response_object() %}
{% call statement('test') %}
... UPDATE STATEMENT ...
{% endcall %}
{% set result = load_result('test') %}
{{ dbt_utils.log_info(result) }}
{% endmacro %}

From the command line:
dbt run-operation test_response_object

Result:

{'response': BigQueryAdapterResponse(_message='OK', code=None, rows_affected=None, bytes_processed=None), 'data': [], 'table': <agate.table.Table object at 0x114f1cd90>}

Expected behavior

BigQueryAdapterResponse field should be populated with values reported by the BigQuery API. Here is an example of what you get when you run a DELETE:

{'response': BigQueryAdapterResponse(_message='DELETE (1.0 rows, 352.5 KB processed)', code='DELETE', rows_affected=1, bytes_processed=360929), 'data': [], 'table': <agate.table.Table object at 0x1065e82b0>}


System information

The output of dbt --version:

installed version: 0.21.0
   latest version: 0.21.0

Up to date!

Plugins:
  - bigquery: 0.21.0
  - snowflake: 0.21.0
  - redshift: 0.21.0
  - postgres: 0.21.0

The operating system you're using:
macOS v 11.6 (20G165)

The output of python --version:
Python 3.8.11

HTTPSConnectionPool(host='bigquery.googleapis.com', port=443): Max retries exceeded with url

Describe the bug

I currently have dbt in a Docker image with a BigQuery warehouse, spun up as an Airflow task deployed in Kubernetes. When I perform a backfill and there are concurrent dbt pods (up to 8) running, I see the following error from dbt:

Encountered an error:
HTTPSConnectionPool(host='bigquery.googleapis.com', port=443): Max retries exceeded with url: /bigquery/v2/projects/<DEDACTED>/datasets/<DEDACTED>/tables?maxResults=100000&prettyPrint=false (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

I believe this log is from the requests library used by dbt; however, it is not clear why the retries are occurring in the first place (i.e. what BQ limits or quotas are causing them).

Steps To Reproduce

  • Run concurrent dbt pods (10+) against a BQ warehouse.

Expected behavior

  • A clear message about what is causing the retries.

System information

Which database are you using dbt with?
bigquery

The output of dbt --version:

installed version: 0.19.1

Plugins:
  - bigquery: 0.19.1

The operating system you're using:
Docker image

The output of python --version:
python:3.8.5

Bigquery uniqueness test fails on column name matching table name

Describe the bug

The unique schema test cannot be used in BigQuery on a column whose name matches the table name. BigQuery yields the following error:

Grouping by expressions of type STRUCT is not allowed at [4:10]

Steps To Reproduce

create table <your dataset>.foo (
    foo int64,
    bar string
);

-- this mimics the existing structure of a uniqueness test on the foo column in this table, but fails with:
-- Grouping by expressions of type STRUCT is not allowed at [4:10]
with validation_errors as (
    select
        foo, count(*)
    from <your dataset>.foo
    group by foo
    having count(*) > 1
)
select *
from validation_errors;

Expected behavior

Using an alias for the table name disambiguates the reference for BigQuery and allows the test to run successfully:

with validation_errors as (
    select
        model_table.foo, count(*)
    from <your dataset>.foo as model_table
    group by model_table.foo
    having count(*) > 1
)
select *
from validation_errors;

Screenshots and log output

Database Error in test dbt_utils_source_unique_combination_of_columns_... (models/....yml)
  Grouping by expressions of type STRUCT is not allowed at [18:14]
  compiled SQL at target/run/trumid_poc/models/....yml/schema_test/dbt_utils_source_unique_combin_1cc53959c9ef4a8b014db7bc87dadc5d.sql

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.20.0-rc1
   latest version: 0.19.1

Your version of dbt is ahead of the latest release!

Plugins:
  - bigquery: 0.20.0rc1
  - postgres: 0.20.0rc1
  - redshift: 0.20.0rc1
  - snowflake: 0.20.0rc1

The operating system you're using:

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

on wsl

The output of python --version:
3.9.2

Additional context

Technically I found this issue in dbt_utils and will have to file an issue there as well, but it holds for a classic uniqueness test too.

dbt seed fails on capital letters in dataset names

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When I run dbt seed with a dataset name that includes capital letters, it fails, e.g. when using "Bananas" instead of "bananas".

Expected Behavior

That it would work fine.

Steps To Reproduce

No response

Relevant log output

10:45:53 | Concurrency: 1 threads (target='caoilte')
10:45:53 | 
10:45:53 | 1 of 1 START seed file caoilte_Bananas_raw.seed_top1300 [RUN]
Unhandled error while executing seed.fruit.seed_top1300
404 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/banana-project/jobs?uploadType=resumable: Not found: Dataset banana-project:caoilte_Bananas_raw
10:45:53 | 1 of 1 ERROR loading seed file caoilte_Bananas_raw.seed_top1300 [ERROR in 0.27s]
10:45:53 | 
10:45:53 | Finished running 1 seed in 5.72s.

Completed with 1 error and 0 warnings:

404 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/banana-project/jobs?uploadType=resumable: Not found: Dataset banana-project:caoilte_Bananas_raw


Environment

- OS: Linux
- Python: 3.8.6
- dbt: 0.21.0

What database are you using dbt with?

bigquery

Additional Context

dbt/BigQuery works fine with capitalised model names. (It was trying to make the seed name consistent with a capitalised model name that highlighted the bug.)

partition_by did not create table with corresponding fields

Describe the bug

I was trying to create a BigQuery table with a specified partition field and 'day' granularity. However, running with the config below only generates a table with the pseudo-column _PARTITIONTIME.

In similar fashion, the declared cluster key (cluster_by) does not show up in the final table.

{{ config(
    materialized = 'table',
    partition_by={
      "field": "partition_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=["partition_date"],
    partitions=dbt.partition_range(var('dates', default=yesterday())),
    verbose=True
)}}

Steps To Reproduce

  • run the following command to generate the dataset:
    dbt run --full-refresh -m page_view_2 --vars 'dates: "20210701, 20210703"'

Expected behavior

Running with dbt=0.20.2

Found 6 models, 0 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 2 sources, 0 exposures

16:39:11 | Concurrency: 1 threads (target='dev')
16:39:11 | 
16:39:11 | 1 of 1 START table model wc_data_core_MB_DEV.page_view_2............. [RUN]
16:39:12 | -> Running for day 20210701
16:39:21 | -> Running for day 20210702
16:39:30 | -> Running for day 20210703
16:39:36 | 1 of 1 OK created table model wc_data_core_MB_DEV.page_view_2........ [CREATED 3 PARTITIONS in 24.92s]
16:39:36 | 
16:39:36 | Finished running 1 table model in 26.05s.

Screenshots and log output

System information

The output of dbt --version:

installed version: 0.20.2

Plugins:
  - snowflake: 0.20.2
  - postgres: 0.20.2
  - bigquery: 0.20.2
  - redshift: 0.20.2

The operating system you're using: Mac OS Monterey

The output of python --version:
Python 3.6.15


Integration test for query-comment as BigQuery job label

Describe the feature

dbt-labs/dbt-core#3145 did something really cool. By setting the following in dbt_project.yml:

query-comment:
  job-label: true

dbt will supply the key-value pairs from a query comment dictionary (or the full value as one key-value pair, if the query comment is a non-dict string) to BigQuery as a job label for the queries representing node execution:


In order to really test that this is working, and avoid regression, we should add an integration test that:

  • Runs a model with the default query-comment and checks that the labels are appropriately registered, by querying INFORMATION_SCHEMA.JOBS_BY_USER. (We'd want that query to filter on user_email and creation_time; see the sketch after this list.)
  • Runs a model with a stringified query-comment, checks the same
  • Runs a model with a stringified and too-long (>128 char) comment, ensures an error is returned
  • Runs a model with job-label: false and ensures that only the default job label (invocation_id) is applied
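
A rough sketch of the kind of check query such a test could run; the `region-us` qualifier and the one-hour window are assumptions for illustration, and the label keys present depend on the configured query-comment:

-- list labels on recent jobs run by the current user
select
  job_id,
  label.key,
  label.value
from `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER,
  unnest(labels) as label
where user_email = session_user()
  and creation_time > timestamp_sub(current_timestamp(), interval 1 hour)
order by creation_time desc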

Describe alternatives you've considered

Not integration testing this.

dbt-labs/dbt-core#3145 added unit testing for the python dict/string handling functions, so the nuts and bolts are in place. In order for this to be a feature that dbt-bigquery users can feel comfortable relying on, we need to ensure that it will continue working in future releases, which may touch query-comment functionality in indirect ways.

Remove / reduce downtime when running full refresh

Describe the feature

Change the full refresh behaviour for BigQuery (or at least make it configurable).

Current behaviour:

  1. Delete model destination table.
  2. Run create or replace statement.

Proposed behaviour (sketched in SQL after the steps):

  1. Create the full refreshed model in a temporary table (e.g. model_table_name__tmp).
  2. Delete the model_table_name if full refresh is successful.
  3. Copy model_table_name__tmp to model_table_name
  4. Delete model_table_name__tmp
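
A minimal sketch of steps 1-4 in BigQuery SQL, using hypothetical names (mydataset.my_model built from mydataset.source_events) and assuming the CREATE TABLE ... COPY DDL:

-- 1. build the full refresh into a temporary table
create or replace table mydataset.my_model__tmp as
select * from mydataset.source_events;

-- 2. drop the live table only after the build above has succeeded
drop table if exists mydataset.my_model;

-- 3. copy the temp table into place (a copy job, so the model query is not re-run)
create table mydataset.my_model
copy mydataset.my_model__tmp;

-- 4. clean up
drop table if exists mydataset.my_model__tmp;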

This solution has two advantages:

  1. In case the create or replace query fails, the original table is still present.
  2. Downtime is minimized to just the time between the deletion and the copy, instead of the time it takes to run the query, which could be several minutes or even hours.

Describe alternatives you've considered

No alternative solution found.

Additional context

I am using BigQuery, not sure if could be extended to other.

Who will this benefit?

Anyone that needs to run very big full refreshes. This issue affects them linearly with query execution time; the solution reduces that to the copy time, which is usually much faster than query execution for complex queries.

Are you interested in contributing this feature?

With some guidance.

dbt is not able to fetch error message from BigQuery via oauth

Describe the bug

I am having issues with my dbt and BQ setup. In general everything works fine, but when BQ returns an error due to failing SQL, the dbt CLI is not able to fetch the error from BQ:

▶ dbt run --models asset.dda.global_dda_dv
Running with dbt=0.19.1
...
18:37:34 | Concurrency: 1 threads (target='dev')
18:37:34 |
18:37:34 | 1 of 1 START view model dbt_bart.global_dda_dv....................... [RUN]
18:37:35 | 1 of 1 ERROR creating view model dbt_bart.global_dda_dv.............. [ERROR in 1.27s]
18:37:35 |
18:37:35 | Finished running 1 view model in 3.85s.
Completed with 1 error and 0 warnings:
Runtime Error in model global_dda_dv (models/asset/dda/global_dda_dv.sql)
  404 GET https://bigquery.googleapis.com/bigquery/v2/projects/project-xyz/queries/8b5ace67-828a-4328-88d1-4c0ebb8b67e8?maxResults=0&location=EU&prettyPrint=false: Not found: Job project-xyz:EU.8b5ace67-828a-4328-88d1-4c0ebb8b67e8
  (job ID: 8b5ace67-828a-4328-88d1-4c0ebb8b67e8)
Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

At the same time, I am able to get this particular error with bq show -j <job_id>.
I am using OAuth via gcloud for authorisation; here's my config:

dde-dbt: # this needs to match the profile: in your dbt_project.yml file
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: project-xyz
      dataset: dbt_bart
      threads: 1
      timeout_seconds: 300
      location: EU
      priority: interactive
      retries: 1

Steps To Reproduce

  1. Configure local environment to connect with BigQuery through oauth
  2. Execute dbt run with failing SQL model
  3. dbt is not able to retrieve error message from BigQuery

Expected behavior

The dbt CLI should be able to find the BQ job and fetch the error message.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other

The output of dbt --version:

installed version: 0.19.1
   latest version: 0.19.1

Up to date!

Plugins:
  - bigquery: 0.19.1
  - snowflake: 0.19.1
  - redshift: 0.19.1
  - postgres: 0.19.1

The operating system you're using: macOS Catalina 10.15.7

The output of python --version: Python 3.7.3

Remove ingestion-time partitioned tables in BigQuery and all associated logic

Describe the feature

dbt's support for "ingestion-time" partitioned tables is a relic of BigQuery's historical lack of support for table partitioning DDL. BigQuery has supported column-level partitioning for some time now, so we should remove the largely-unused ingestion-time partitioning table logic from dbt.

Methods to remove:

As a part of this change, we should also remove the BigQuery-specific timeout_seconds config, which is only used as a part of creating ingestion-time partitioned tables via the BigQuery HTTP API. We should consider supporting a dbt-wide timeout_seconds config in the future (this would apply to any query invoked by dbt), so let's continue to support the config in profiles.yml, but render a warning at run-start indicating that the config is not used.

Warning: The configured target specifies a timeout_seconds config, which is currently unused. Support for this config may be removed in a future release of dbt.

Additional context

BQ-only... we should be able to delete a whole lot of code!

Who will this benefit?

dbt maintainers/contributors, users who (rightfully) misunderstand the nature of the timeout_seconds config

[ADAP-629] [BQ] Support new partition configs with all incremental strategies

Describe the feature

Picking up from dbt-labs/dbt-core#2928, which added support for two new configs in dbt-bigquery: require_partition_filter and partition_expiration_days.

Let's ensure that require_partition_filter works with all the permutations of incremental models on BigQuery. Anyone is welcome to pick this up as a contribution for v0.20.0!

The merge strategy

We need the merge condition to be

on DBT_INTERNAL_SOURCE.[unique_key] = DBT_INTERNAL_DEST.[unique_key]
    and DBT_INTERNAL_DEST.[partition_col] is not null

This could be accomplished by passing an additional predicate to get_merge_sql here, something like:

{% set is_partition_filter_required = config.get('require_partition_filter', false) %}
{% set predicates = [] %}
{% if is_partition_filter_required %}
  {% set partition_filter %} ({{ partition_by.field }} is not null or {{ partition_by.field }} is null) {% endset %}
  {% do predicates.append(partition_filter) %}
{% endif %}

{% set build_sql = get_merge_sql(target_relation, source_sql, unique_key, dest_columns, predicates) %}

This is a bit of a hack—filtering only in this sense—but to be honest there isn't any straightforward way dbt can know in advance the specific partitions it's merging into. For that, you should use...

The insert_overwrite strategy

The require_partition_filter config works just fine with "static" insert_overwrite strategy—when the user supplies the values in advance via the partitions config—which is also the most performant for updating very large datasets. (It would still be a good idea to add a test for this.)

For the "dynamic" insert_overwrite strategy, the current error comes in step 2:

https://github.com/fishtown-analytics/dbt/blob/1f927a374c8bd52a12a20d892fed9d59cffd04f4/plugins/bigquery/dbt/include/bigquery/macros/materializations/incremental.sql#L63-L68

We need to either:

  • Change this query to include a where {{ partition_by.field }} is not null, thereby satisfying the filter requirement (see the sketch after this list)
  • Change the temp table to not set require_partition_filter. Should this be a more general rule, that if temporary = True, the create table statement shouldn't set require_partition_filter = True? I think it would make good sense.
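
A sketch of what the first option could look like in the generated script, with illustrative names borrowed from the examples elsewhere in this document (not the current macro output):

set (dbt_partitions_for_replacement) = (
    select as struct
        -- filtering on the partition column satisfies require_partition_filter
        array_agg(distinct timestamp_trunc(event_time, day))
    from `my-project`.`my_dataset`.`my_model__dbt_tmp`
    where event_time is not null
);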

Describe alternatives you've considered

  • Not supporting require_partition_filter for all types of incremental models. I think we definitely should!

Related unsolved questions

  • Default schema tests will not work on tables with require_partition_filter set
  • Related addition in dbt-labs/dbt-core#2928: How should we advise on whether to set partition_expiration_days for incremental models? Personally, I'm not so sure

Who will this benefit?

  • BigQuery users with large-volume datasets

BigQuery API Deprecation Warning

Describe the feature

The BigQueryConnectionManager.dataset method is firing a deprecation warning due to its call to bigquery.Client.dataset:

[...]
2021-12-30 18:02:49.877082 (ThreadPoolExecutor-1_9): Client.dataset is deprecated and will be removed in a future version. Use a string like 'my_project.my_dataset' or a cloud.google.bigquery.DatasetReference object, instead.
[...]

From the current API docs:

Client.dataset
Deprecated since version 1.24.0: Construct a DatasetReference using its constructor or use a string where previously a reference object was used.

Suggested Work

Given this and other updates to the BigQuery API, I think it would be healthy to do two things:

  • Simplify BigQueryConnectionManager.dataset to directly create a bigquery.DatasetReference.
    • optionally, rename this method to dataset_ref to be clear about what's being returned
  • Simplify BigQueryConnectionManager.table_ref to directly create a bigquery.TableReference; make it a static method.

With these changes, we can quit passing a conn around just to deliver a reference, and we'll meet (my interpretation of) BQ API best practice by working with unconnected references where possible and deferring 'connected' operations to a client.

Describe alternatives you've considered

  1. do nothing / kick the can down the road on this deprecation
  2. fix only the call triggering the deprecation warning, leave the rest as-is
  3. cry

Additional context

n/a

Who will this benefit?

This will benefit maintainers by simplifying the signatures of two core methods.

Are you interested in contributing this feature?

Put me in, coach!

[CT-1936] Support Google Cloud Auth from AWS Credentials File

Describe the feature

Support connecting to BigQuery from an AWS credentials file.

Describe alternatives you've considered

None

Additional context

Google Cloud Auth AWS Docs

This is how my org does gcloud auth, so we don’t have many alternatives here

Who will this benefit?

Any user/company who runs Google cloud workloads from AWS and keeps credentials in AWS.

Are you interested in contributing this feature?

Yes

BigQuery labels on schemas/dataset

Describe the feature

We use labels on Google Cloud assets for inventory tracking purposes. For our Bigquery datasets, we need labels at the schema/dataset level.

It looks like it's possible to add other Bigquery schema-level configuration to the profile - e.g. dataset location.

Are there any plans to support labels at the dataset level?
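
For context, BigQuery's DDL can set labels at the dataset level (assuming the ALTER SCHEMA ... SET OPTIONS statement); the dataset name and label values below are illustrative:

-- hypothetical example of dataset-level labels applied via DDL
alter schema my_dataset
set options (
  labels = [("cost_center", "analytics"), ("managed_by", "dbt")]
);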

Describe alternatives you've considered

We tested table-level labels implemented as per dbt-labs/dbt-core#1942 but our inventory tracking service (Vanta for SOC2 purposes) expects the labels at the dataset level.

Who will this benefit?

Other Bigquery users who are working with labels at the dataset level.

Report bytes processed after running BigQuery SQL

Describe the feature

As BigQuery bills users based on the volume of data processed, it's fairly important for users to be aware of this while developing models and pipelines. In the BQ UI this is fairly clearly presented after each query run, and the same information is available in the jobs.query API response.

You can find the totalBytesProcessed documented in the linked API documentation.

It would be helpful to see this output in the JSON logging - example:

{"timestamp": "2020-01-30", "message": "13:25:37 | 24 of 27 OK created incremental model dbt_andy.customers............. [CREATE TABLE (450) in 2.82s]", "channel": "dbt", "level": 11, "levelname": "INFO", "extra": {"unique_id": "model.customers", "bytes_processed": 4882}}

And also at the end of a run with the default log formatter:

13:16:52 | Finished running 23 incremental models, 4 table models, 1 hook in 35.87s
13:16:52 | 750mb of data processed

Describe alternatives you've considered

If you have enabled Export To BigQuery within your Google Cloud billing project, it is possible to query how much data the service account your dbt user runs as has processed. This could then be queried and monitored outside of dbt, but it would be much more helpful to have an indication of query size in real time - this is an important optimisation when building BigQuery pipelines.

Who will this benefit?

This is very important for all users of BigQuery who have enough data stored to have a reasonable spend, but who are not so large that they have purchased commitments (flat fee / all you can eat) from Google. Even then, it's still best practice to pay attention to how much data you are processing...

Add macro/run-operation for uploading files from local filesystem to BigQuery.

Describe the feature

Prompted by brooklyn-data/dbt_artifacts#6.

Access to a macro such as:

{% do adapter.upload_file(file_path, destination) %}
would enable dbt_artifacts to become compatible with BigQuery. It looks as though the function to wrap is load_table_from_file.


Cannot Access CSV on Google Drive when using OAuth

Describe the bug

When using the OAuth method to do a federated query for a table that points to a CSV file on Drive, I get an error.

When using a service account to access the same file, it works just fine.

• I have full permissions to the file
• All GDrive APIs in Google Cloud and Google Workspace are turned on
• I've also run the following command

gcloud auth application-default login \
  --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive.readonly,https://www.googleapis.com/auth/iam.test

• I've also tried revoking auth and re-authing

Steps To Reproduce

• Place a (CSV) file in a drive and share it with yourself
• Make a table in Bigquery that uses the Drive file as the source (federated)
• SELECT * the table via dbt

Expected behavior

I expect it to be able to query the table.

Screenshots and log output

Access Denied: BigQuery BigQuery: Access Not Configured. Drive API has not been used in project 764086051850 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/drive.googleapis.com/overview?project=764086051850 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
  compiled SQL at ...

What's weird is that the project ID is not one of ours. When using a service account, the correct project is used. My profiles.yml is set up correctly, so I don't know where this project ID is coming from.

I've noticed this issue happening with a similar project ID on StackOverflow

System information

The output of dbt --version:

installed version: 0.21.0
   latest version: 1.0.0

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.21.0
  - snowflake: 0.21.0
  - redshift: 0.21.0
  - postgres: 0.21.0

The operating system you're using:
Mac OS Big Sur version 11.6
Python 3.8.5

Update BQ labels on tables as they change

Describe the feature

Currently, labels are applied when a table is created. However, for incremental models, the labels are not updated if they change later. This is a problem because the labels fall out of sync with the dbt project.

Describe alternatives you've considered

An alternative is to write a post-hook which applies the labels to the relation.
I'd prefer to contribute this feature back into the DBT project so other people can benefit from it.
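
For example, such a post-hook could run BigQuery DDL along these lines (table and label values are illustrative):

-- hypothetical post-hook statement: re-apply labels to the already-built relation
alter table my_dataset.my_incremental_model
set options (
  labels = [("team", "data-platform"), ("env", "prod")]
);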

Additional context

This feature is specific to BigQuery

Who will this benefit?

It will benefit users that access tables in BigQuery; the labels offer great metadata.
We also plan to have this visible in our metadata/catalog service.

`require_partition_filter=true` with the `insert_overwrite` strategy doesn't work

Describe the bug

When running an incremental model with the insert_overwrite strategy and the require_partition_filter=true option, it seems that the require_partition_filter option is applied to the temporary table.

  Query error: Cannot query over table 'sandbox-project.logs.event__dbt_tmp' without a filter over column(s) 'event_time' that can be used for partition elimination at [75:46]

Steps To Reproduce

  1. Create an incremental model with insert_overwrite
# test_insert_overwrite.sql
{{
  config(
    materialized="incremental",
    incremental_strategy='insert_overwrite',
    require_partition_filter=true,
    alias="test_insert_overwrite",
    partition_by={
      "field": "event_time",
      "data_type": "timestamp",
      "granularity": "day",
    },
  )
}}

SELECT
  CURRENT_TIMESTAMP() AS event_time

  2. Execute dbt run -s test_insert_overwrite twice
Completed with 1 error and 0 warnings:

Database Error in model test_insert_overwrite (models/test_insert_overwrite.sql)
  Query error: Cannot query over table 'sandbox-project.jaffle_shop.test_insert_overwrite__dbt_tmp' without a filter over column(s) 'event_time' that can be used for partition elimination at [35:46]
  compiled SQL at target/run/jaffle_shop/models/test_insert_overwrite.sql


Screenshots and log output

The query below was generated on the second execution of the incremental model. As we can see, the query that creates the temporary table test_insert_overwrite__dbt_tmp includes require_partition_filter=true, because the macro used to generate it is create_table_as. The create_table_as macro automatically applies all the same options even to the temporary table.

      -- generated script to merge partitions into `sandbox-project`.`jaffle_shop`.`test_insert_overwrite`
      declare dbt_partitions_for_replacement array<timestamp>;
      declare _dbt_max_partition timestamp default (
          select max(event_time) from `sandbox-project`.`jaffle_shop`.`test_insert_overwrite`
          where event_time is not null
      );

      
      
        -- 1. create a temp table
        

  create or replace table `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
  partition by timestamp_trunc(event_time, day)
  
  OPTIONS(
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour),
    
      require_partition_filter=True
    )
  as (
    

SELECT
  CURRENT_TIMESTAMP() AS event_time
  );
      

      -- 2. define partitions to update
      set (dbt_partitions_for_replacement) = (
          select as struct
              array_agg(distinct timestamp_trunc(event_time, day))
          from `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
      );

      
      -- 3. run the merge statement
      

    merge into `sandbox-project`.`jaffle_shop`.`test_insert_overwrite` as DBT_INTERNAL_DEST
        using (
        select * from `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
      ) as DBT_INTERNAL_SOURCE
        on FALSE

    when not matched by source
         and timestamp_trunc(DBT_INTERNAL_DEST.event_time, day) in unnest(dbt_partitions_for_replacement) 
        then delete

    when not matched then insert
        (`event_time`)
    values
        (`event_time`)

;

      -- 4. clean up the temp table
      drop table if exists `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`

System information

The output of dbt --version:

installed version: 0.21.0
   latest version: 0.21.0

Up to date!

Plugins:
  - bigquery: 0.21.0
  - snowflake: 0.21.0
  - redshift: 0.21.0
  - postgres: 0.21.0

The operating system you're using:

  • Mac OS 11.6 (20G165)

The output of python --version:

  • Python 3.8.8

Additional context

Add testing/validation checks for adapter_response

Describe the feature

Add some testing/validation checks for adapter_response object.


Additional context

Issue #68 and PR #79 brought up a lack of validation checks for the adapter_response; it would be nice to add some.


allow `_dbt_max_partition` to be disabled

Describe the feature

In dbt v0.16, dbt run will make query calls for _dbt_max_partition on all BigQuery partitioned incremental models.

In our situation, we have many very large partitioned incremental models that have no use for _dbt_max_partition. However, they still have to query for _dbt_max_partition, which is very costly once the production runs add up.

Could we disable _dbt_max_partition for models that have no use for it?

Describe alternatives you've considered

To allow the model config to disable _dbt_max_partition for a particular model.

for example,

{{ config(
    materialized='table',
    partition_by={
      "field": "created_at",
      "data_type": "timestamp",
      "enable_dbt_max_partition": false
    }
)}}

Additional context

This is specific to the BigQuery database, and it's part of the new merge feature from dbt v0.16.

Who will this benefit?

Everyone who owns very large BigQuery tables for incremental runs but does not use _dbt_max_partition.

Setting `on_schema_change` on incremental models using the `insert_overwrite` strategy with dynamic partitioning causes those models to fail

Describe the bug

If a model is materialized as incremental, its strategy is set to insert_overwrite, its on_schema_change is set to any value except for ignore, and the compiled model refers to _dbt_max_partition, the model will fail to run with the following error:

Unrecognized name: _dbt_max_partition at [X:Y]

Steps To Reproduce

Here's a minimal example model:

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'date',
      'data_type': 'date',
      'granularity': 'day'
    },
    on_schema_change='append_new_columns',
  )
}}

SELECT * FROM source_table
{% if is_incremental() %}
WHERE date > _dbt_max_partition
{% endif %}

If is_incremental() is True, the model will compile to the following:

SELECT * FROM source_table
WHERE date > _dbt_max_partition

When the BigQuery adapter builds the larger SQL script it needs to run, it will hit this run_query() command, which (if I'm understanding correctly) is mainly used to compute the updates' schema so that it can be compared against the destination table's existing schema. When it tries to run that query, it hasn't yet declared _dbt_max_partition (that declaration is done here, in bq_insert_overwrite()), which triggers the error.

Expected behavior

I'm expecting the temporary relation to be created after _dbt_max_partition has been defined, which ought to allow the rest of the incremental update to proceed as normal.


System information

The output of dbt --version:

installed version: 0.21.0
   latest version: 0.21.0

Up to date!

Plugins:
  - bigquery: 0.21.0
  - snowflake: 0.21.0
  - redshift: 0.21.0
  - postgres: 0.21.0

The operating system you're using:

macOS Mojave 10.14.6

The output of python --version:

Python 3.7.10


Project is not refreshed in partial parsing

Thanks as ever for such a brilliant app!

Describe the bug

Changing the GCP project (e.g. with gcloud config set project {project}) and running a dbt command (e.g. dbt run) will not update the project with partial parsing. Instead, it's necessary to run dbt clean as specified in parsing known limitations.

Steps To Reproduce

dbt clean
gcloud config set project project_foo
dbt compile
rg project_foo | wc -l  # returns a big number

gcloud config set project project_bar
dbt compile
rg project_foo | wc -l  # returns a big number, should return zero
rg project_bar | wc -l  # returns zero, should return a big number

Expected behavior

I can understand why dbt doesn't track the implicit GCP project in its cache invalidation logic, and so this is a hard problem to solve.

But I'm not sure it's a necessary problem to have. Why does dbt materialize the project name in the queries? If it didn't materialize the projects (arguably by respecting that abstraction), BQ would resolve the implicit project without it being specified, and this would work.
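
To illustrate the difference (dataset and model names below are hypothetical):

-- what dbt compiles today: the project is rendered into the query
select * from `project_foo`.`analytics`.`my_model`;

-- with the project left implicit, BigQuery resolves it from the client's default project
select * from `analytics`.`my_model`;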


System information

The output of dbt --version:

1.0.1

The operating system you're using:
MacOS

The output of python --version:
3.9.9

[BQ] Google service account authentication failure when running dbt tests

Describe the bug

When running dbt tests in a container, I intermittently receive authentication errors, which cause the running tasks to fail.

Steps To Reproduce

  1. Use oauth authentication in the bigquery profile, and set threads to 2
  2. In a containerized environment using a gcloud service account for authentication (as opposed to env var GOOGLE_APPLICATION_CREDENTIALS) run dbt test [my_model] (Expect ~288 tests to run)

Expected behavior

I expect all tests to succeed (verified by running locally).

Screenshots and log output

  1. log (without --debug)

  2. example stack trace

"exc_info": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 206, in open\n    handle = cls.get_bigquery_client(connection.credentials)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 187, in get_bigquery_client\n    creds = cls.get_bigquery_credentials(profile_credentials)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 157, in get_bigquery_credentials\n    credentials, project_id = google.auth.default(scopes=cls.SCOPE)\n  File \"/usr/local/lib/python3.7/site-packages/google/auth/_default.py\", line 354, in default\n    raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)\ngoogle.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/dbt/task/base.py\", line 333, in safe_run\n    result = self.compile_and_execute(manifest, ctx)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/task/base.py\", line 276, in compile_and_execute\n    result = self.run(ctx.node, manifest)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/task/base.py\", line 378, in run\n    return self.execute(compiled_node, manifest)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/task/test.py\", line 84, in execute\n    failed_rows = self.execute_schema_test(test)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/task/test.py\", line 64, in execute_schema_test\n    fetch=True,\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/base/impl.py\", line 227, in execute\n    fetch=fetch\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 275, in execute\n    query_job, iterator = self.raw_execute(sql, fetch=fetch)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 248, in raw_execute\n    client = conn.handle\n  File \"/usr/local/lib/python3.7/site-packages/dbt/contracts/connection.py\", line 71, in handle\n    self._handle.resolve(self)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/contracts/connection.py\", line 96, in resolve\n    return self.opener(connection)\n  File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 210, in open\n    gcloud.setup_default_credentials()\n  File \"/usr/local/lib/python3.7/site-packages/dbt/clients/gcloud.py\", line 24, in setup_default_credentials\n    run_cmd('.', [\"gcloud\", \"auth\", \"application-default\", \"login\"])\n  File \"/usr/local/lib/python3.7/site-packages/dbt/clients/system.py\", line 436, in run_cmd\n    out, err)\ndbt.exceptions.CommandResultError: Got a non-zero returncode running: ['gcloud', 'auth', 'application-default', 'login']"

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.1

The operating system you're using:

The output of python --version:

Python 3.7.7

Additional context

  1. Using gcloud service account, not GOOGLE_APPLICATION_CREDENTIALS
  2. Note that prior to seeing the error, more than 150 tests PASS. Eventually, they start passing again.

Default BigQuery policy tag

Describe the feature

Can an optional, default policy tag for BigQuery be supported? Ideally it could be set at the project, directory, and individual model levels.

The purpose is to prevent accidental leaking of sensitive data. I currently set policy tags in sources (outside dbt) and views automatically inherit them. However, table materializations remove the tags. Being able to set policy tags with dbt is amazing; however, a developer may forget to set one for a column in its schema.yml. Or the model does a select * from source ... and the developer doesn't realize the source contains sensitive data. By setting a default policy tag, accidental sensitive data leaks can be prevented.

It would be amazing if a list of column types could be given too. For example, booleans are rarely sensitive but strings can be. So being able to say "please set a default policy tag for string columns" would be amazing. This would be a nice-to-have, but not strictly necessary.

Describe alternatives you've considered

Manage policy tags outside dbt, which is error prone in my experience.

Who will this benefit?

BigQuery users who utilize policy tags.

Are you interested in contributing this feature?

Possibly

Tighten OAuth Scopes for BigQuery

Describe the feature

Reduce the BigQuery OAuth scopes down to the minimal set needed, in both dbt and dbt Cloud.

Additional context

Currently, the dbt BigQuery connector requests these three OAuth scopes:

The BigQuery scope is needed to access the database, but the cloud-platform and drive scopes are probably too broad. These scopes were originally added to address issue dbt-labs/dbt-core#502, primarily to allow for access for reading from Google Sheets. However, I don't immediately see a need for the cloud-platform scope, which gives access to a wide range of GCP resources, such as the following:

  • View and manage your data across all Google Cloud Platform services, such as:
  • View and manage your tables, datasets, and jobs in Google BigQuery
  • View and manage your data in Google Cloud Storage
  • View and manage your instances in Google Cloud SQL

Similarly, the drive scope has this access:

This app wants permission to access everything in your Google Drive. It will be able to do the same things you can do, including:

See your files
Upload and download your files
Delete your files
See the names and emails of people you share files with
Share and stop sharing your files with others
Remove people from your files
Organize your Drive
There may be private information in your Google Drive, like financial records, medical reports, photos or tax info.

I would think that minimally, these scopes could be reduced to the 'read-only' variants, and could probably be reduced further depending on the access needed for external tables. Maybe something like:

But I don't know yet whether these scopes are too restrictive.

Also note that dbt Cloud has the same list of OAuth scopes, so whatever is changed in dbt should also be changed in dbt Cloud.

dbt-0.21.0 test issue: Alias getting added in test column which is causing test failure

Describe the bug

Command executed: dbt test --model <model name>
Error Message:

   SELECT list expression references column <column_name> which is neither grouped nor aggregated at [12:5]
      compiled SQL <location>\unique_test_model_ID_Name.sql

Steps To Reproduce

To reproduce execute following commands:

  1. Run model: dbt run --model test_model
  2. Test model: dbt test --model test_model

test_model.sql

  Select 
    '1' as ID
    , 'abc' as Name

mart.yml

version: 2

models:
  - name: test_model
    columns:
      - name: ID || Name
        tests:
          - unique

Expected behavior

Test case should have executed successfully.

System information

The output of dbt --version:

0.21.0

Additional context

The test case was running fine with dbt version: 0.19.0

Compiled SQL with dbt 0.21.0

select
      count(*) as failures,
      count(*) != 0 as should_warn,
      count(*) != 0 as should_error
    from (
            select
                ID || Name as unique_field,
                count(*) as n_records
            from `<project_id>`.`<schema>`.`test_model`
            where ID || Name is not null
            group by ID || Name
            having count(*) > 1
    ) dbt_internal_test

Compiled SQL with dbt 0.19.0

select count(*) as validation_errors
from (

    select
        ID || Name

    from `<project_id>`.`<schema>`.`test_model`
    where ID || Name is not null
    group by ID || Name
    having count(*) > 1

) validation_errors

ISSUE:

It seems that aliasing the concatenated (||) field creates an issue for BigQuery.

Bigquery model-level unique tests on concat() columns

Hi everyone.

I did not find this issue reported anywhere yet, which seems odd to me, and I don't know if it is on my side.

Using Bigquery and dbt 0.20.2.

Having defined a test for a model like this:

  - name: base_test_model
    tests:
      - unique:
          column_name: "concat(id, field, changed_at)"

This was working before dbt 0.20.0. Now I get an error:

Database Error in test unique_base_test_id_field_changed_at (models\base\schema.yml)
SELECT list expression references column id which is neither grouped nor aggregated at [12:5] compiled SQL at

I checked in projects where I use dbt 0.19.x and the test above compiled to this:

select count(*) as validation_errors
from (

    select
        concat(id, field, changed_at)

    from base_test
    where concat(id, field, changed_at) is not null
    group by concat(id, field, changed_at)
    having count(*) > 1

) validation_errors

Now it compiles to this:

select
      count(*) as failures
    from (
      
    
select
    concat(id, field, changed_at) as unique_field,
    count(*) as n_records

from base_test
where concat(id, field, changed_at) is not null
group by concat(id, field, changed_at)
having count(*) > 1

    ) dbt_internal_test

which Bigquery seems not to like. Using the query above and removing the alias unique_field works for Bigquery.

With Postgres and MySQL both of the above queries run. Did not test Snowflake and Redshift.
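
For reference, the 0.20+ compiled query with the unique_field alias removed, which (per the report above) BigQuery accepts:

select
    count(*) as failures
from (
    select
        concat(id, field, changed_at),
        count(*) as n_records
    from base_test
    where concat(id, field, changed_at) is not null
    group by concat(id, field, changed_at)
    having count(*) > 1
) dbt_internal_test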

[CT-2051] [Bug] New insert_overwrite Bigquery partitioning with integer keys can create huge temporary array variables, exceeding BQ limits

Describe the bug

When doing an incremental update of an integer-partitioned bigquery table with the new insert_overwrite merge strategy then DBT calculates which partitions it should replace. In the process it can generate a huge BigQuery ARRAY value.

The cause is that DBT does not take the "interval" parameter of the partitioning specification into account. The generated SQL code selects "array_agg(distinct PARTITION_FIELD)" when calculating the partitions to replace. This selects ALL distinct partition field values of the incremental update, even if these values are actually in the same partition. This causes a potentially huge array to be created. If there is enough data in the table then this will even cause an error because BQ's maximum array size (100 MB) is exceeded.

Note that this bug is not triggered by time partitioned tables because for those all partition fields are dates and the partition size is always one day (i.e. there is only one valid value per partition).

Steps To Reproduce

  1. Generate a model with:

    config(
        materialized = 'incremental',
        unique_key = 'partkey',
        partition_by = {
            "field": "partkey",
            "data_type": "int64",
            "range": {
                "start": 0,
                "end": 100000,
                "interval": 1000
            }
        },
        incremental_strategy = 'insert_overwrite'
    )

  2. Run the model once with empty source data

  3. Add 1000 records in the source data with values for partkey from 0 to 999

  4. Generate the incremental update code for the model and look at the values in the dbt_partitions_for_replacement internal variable.

  5. You will see that it contains all 1000 values for partkey from 0 to 999, even though those are all inside the same single partition.

Expected behavior

DBT should ensure that no huge temporary array variable is generated.
The dbt_partitions_for_replacement array should have at most as many elements as the number of partitions being updated. In my opinion the way to go would be to store only the starting values of each partition in the array and then modify the merge clause to use a range for each partition.
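
A sketch of what an interval-aware aggregation could look like in the generated script for the example above (interval = 1000 and a range starting at 0; table names are illustrative):

set (dbt_partitions_for_replacement) = (
    select as struct
        -- collect only each partition's starting value instead of every distinct partkey
        array_agg(distinct div(partkey, 1000) * 1000)
    from `my-project`.`my_dataset`.`my_model__dbt_tmp`
);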

Screenshots and log output

N/A

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.0

Up to date!

The operating system you're using:

macOS

The output of python --version:
Python 3.7.4

Additional context

[BQ] Use new DDL, delete code, profit

Describe the feature

The fine folks at BigQuery have been releasing a lot of solid SQL-standard syntax of late, which means a lucky contributor gets to delete a bunch of code.

The most obvious candidates to me are create schema and drop schema, which no longer need to be python methods—in fact, they no longer need to be bigquery__ macros at all, since the default implementations will do just fine.

https://github.com/fishtown-analytics/dbt/blob/7ec5c122e17ab7e1a5dd1e1cab36527159564a0f/plugins/bigquery/dbt/include/bigquery/macros/adapters.sql#L72-L78

Less pressing, still promising: alter table add columns.

https://github.com/fishtown-analytics/dbt/blob/7ec5c122e17ab7e1a5dd1e1cab36527159564a0f/plugins/bigquery/dbt/adapters/bigquery/impl.py#L704
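
For reference, the BigQuery DDL statements in question (project, dataset, table, and column names are illustrative):

-- standard DDL that the default implementations can now emit
create schema if not exists `my-project.my_dataset`;

drop schema if exists `my-project.my_dataset` cascade;

alter table `my-project.my_dataset.my_table`
add column new_col string;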

I'm sure there are more such opportunities!

When has there ever been a better first issue?

[Feature] make BigQuery execution_project accessible from target object

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

In BigQuery, the execution_project might be different from the project. Sometimes we need this information for context logging, monitoring, and alerting (for example, to alert a dbt user if they are using an undesired execution_project in their development environment).

It would be great if the execution_project is accessible under target object.
such as

target.execution_project

Describe alternatives you've considered

No response

Who will this benefit?

Teams using dbt-BigQuery that want to closely monitor whether dbt users are using execution_project correctly and to provide warnings at run time.

Are you interested in contributing this feature?

yes

Anything else?

No response

[BQ] Policy tag missing after model run

Steps to reproduce

Add the following roles in IAM to the dbt service account:

  • Data Catalog Admin
  • Policy Tag Admin
  • Fine-Grained Reader
  • BigQuery Data Owner

In dbt_project.yml, add:

models:
  +persist_docs:
    relation: true
    columns: true

vars:
  policy_tag_ids:
    pt_name: 'projects/my-project/locations/<location>/taxonomies/<organization>/policyTags/<tag>'

In schema.yml, add:

models:
  - name: my_model
    columns:
      - name: name
        policy_tags:
          - '{{ var("policy_tag_ids")["pt_name"] }}'
$ dbt run -m my_model

Expected Result

Column-level policy tag on the my_model table in BigQuery’s schema tab.

Actual result

Nothing shows.

Log output

In the BigQuery cloud audit logs (both data access and activity), there isn’t any mention of policy tags.

System information

  • BigQuery
  • dbt version: 0.18.1

Additional context

The test class in the original pull request that implemented this functionality appears to follow the same steps as those taken above.

Provide jobIds behind model creation in run_results.json

In BigQuery, each time SQL is executed, a job is created and assigned a unique ID. (Example: job_9JVepH9O1bzemNfB6xZzxz_mfFkY) When dbt runs, it checks the results/status of these jobs.

We've found the run_results.json file extremely useful, and being able to tie these results (errors, compiled SQL, variables, tags, etc.) back to the specific job that executed in BigQuery for additional information and troubleshooting would make run_results.json even more useful.

Describe alternatives you've considered

You could try to look at a combination of DBT logs and BigQuery logs in Stackdriver Logging and try to approximate which exact job corresponded to the model creation by timestamp proximity or query (compiled SQL), but those methods are not deterministic.
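For example, one (non-deterministic) way to correlate is to search the BigQuery jobs metadata for the query comment that dbt injects at the top of each query. A sketch, where the region qualifier and node_id are placeholders:

    -- sketch: region-us and model.my_project.my_model are placeholders
    select job_id, creation_time, state
    from `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    where query like '%"node_id": "model.my_project.my_model"%'
    order by creation_time desc;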

Additional context

This is specific to BigQuery for now, but I am sure there are parallels in other databases (transaction ID? query ID?).

Another aspect to consider would be whether to provide multiple jobIds if there is a macro called in the model that executes additional queries against the database.

Who will this benefit?

Anyone that wants to tie run results back to specific jobs for additional debugging would benefit from this feature.

Missing import when running integration tests

    import json
    import re
    from contextlib import contextmanager
    from dataclasses import dataclass
    from functools import lru_cache
    import agate
    from requests.exceptions import ConnectionError
    from typing import Optional, Any, Dict, Tuple
    
    import google.auth
    import google.auth.exceptions
    import google.cloud.bigquery
    import google.cloud.exceptions
    from google.api_core import retry, client_info
    from google.auth import impersonated_credentials
    from google.oauth2 import (
        credentials as GoogleCredentials,
        service_account as GoogleServiceAccountCredentials
    )
    
    from dbt.utils import format_bytes, format_rows_number
>   from dbt.clients import agate_helper, gcloud
E   ImportError: cannot import name 'gcloud' from 'dbt.clients' (/Users/kwigley/workspace/dbt-bigquery/.tox/py38-bigquery/lib/python3.8/site-packages/dbt/clients/__init__.py)

Authorized Views Bug When Using Schema Property on Relationship

Describe the bug

BigQuery Authorized views cannot be dynamically granted.

Steps To Reproduce

This config should grant access to the underlying tables:

{{ config(
  grant_access_to=[
    {'project': ref('pharmacy_claims_recon_yearmonth').database , 'dataset': ref('pharmacy_claims_recon_yearmonth').schema}
  ]
  )
}}

select * 
from {{ ref('pharmacy_claims_recon_yearmonth') }}

The access grant works with ref('pharmacy_claims_recon_yearmonth').database.

The access grant DOES NOT work with ref('pharmacy_claims_recon_yearmonth').schema. The dataset does work when supplied via a variable declared with {% set variable='value' %}, but no variation on the schema property appears to work, including explicitly casting it to a string with ref('pharmacy_claims_recon_yearmonth').schema|string.

Expected behavior

Access grant works correctly off of the schema value of a relationship.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

2021-12-10T23:49:04.768410Z: Began running node model.vida.v_pharmacy_claims_recon_yearmonth
2021-12-10T23:49:04.768702Z: 23:49:04 | 1 of 1 START view model dbt_bliyanage_clinops.v_pharmacy_claims_recon_yearmonth [RUN]
2021-12-10T23:49:04.768933Z: Acquiring new bigquery connection "model.vida.v_pharmacy_claims_recon_yearmonth".
2021-12-10T23:49:04.769013Z: Compiling model.vida.v_pharmacy_claims_recon_yearmonth
2021-12-10T23:49:04.775412Z: Writing injected SQL for node "model.vida.v_pharmacy_claims_recon_yearmonth"
2021-12-10T23:49:04.790750Z: finished collecting timing info
2021-12-10T23:49:04.813317Z: Writing runtime SQL for node "model.vida.v_pharmacy_claims_recon_yearmonth"
2021-12-10T23:49:04.826940Z: Opening a new connection, currently in state closed
2021-12-10T23:49:04.827102Z: On model.vida.v_pharmacy_claims_recon_yearmonth: /* {"app": "dbt", "dbt_version": "0.20.2", "profile_name": "user", "target_name": "default", "node_id": "model.vida.v_pharmacy_claims_recon_yearmonth"} */


  create or replace view `test-dbt-vida`.`dbt_bliyanage_clinops`.`v_pharmacy_claims_recon_yearmonth`
  OPTIONS()
  as 






select * 
from `test-dbt-vida`.`dbt_bliyanage_claims`.`pharmacy_claims_recon_yearmonth`;


2021-12-10T23:49:05.944456Z: Access entry <AccessEntry: role=None, view={'projectId': 'test-dbt-vida', 'datasetId': 'dbt_bliyanage_clinops', 'tableId': 'v_pharmacy_claims_recon_yearmonth'}> already exists in dataset
2021-12-10T23:49:05.944813Z: finished collecting timing info
2021-12-10T23:49:05.945195Z: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '27b7581e-1241-47f3-ad36-bfb5e7099b84', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fc1c9436e80>]}
2021-12-10T23:49:05.945507Z: 23:49:05 | 1 of 1 OK created view model dbt_bliyanage_clinops.v_pharmacy_claims_recon_yearmonth [OK in 1.18s]
2021-12-10T23:49:05.945589Z: Finished running node model.vida.v_pharmacy_claims_recon_yearmonth

System information

The output of dbt --version:

(screenshot of dbt --version output omitted)

In build configs we use: 0.20.2

Additional context

Here are a couple other observations from troubleshooting this:

  • When this fails to authorize a view (for example, because of a bad value), the error is swallowed and never surfaced, either as a log line or an error message. This makes it very hard to tell whether it is working.
  • It would be nice if there were a log message confirming the authorization grant when it succeeds as well.
  • It appears that when there is an error, the logs still report Access entry <AccessEntry: role=None, view={'projectId': 'test-dbt-vida', 'datasetId': 'dbt_bliyanage_clinops', 'tableId': 'v_pharmacy_claims_recon_yearmonth'}> already exists in dataset, even though no authorization was granted.

In an ideal world, we would just be able to specify grant_access, and dbt would be able to resolve all ref's in a particular query, and grant access to the appropriate underlying datasets without them having to be explicitly named by a person.

Add extra guardrail for BQ insert_overwrite incremental strategy

Per @drewbanin's comment here, we might add a partition filter to the source as well as the destination in the merge statement generated by the insert_overwrite incremental strategy on BigQuery.

Status quo

....
    when not matched by source
         and DBT_INTERNAL_DEST.id in (
              1, 2, 3
          ) 
        then delete

    when not matched then insert (`id`, `ts`)
    values (`id`, `ts`)

Proposed change

....
    when not matched by source
         and DBT_INTERNAL_DEST.id in (
              1, 2, 3
          ) 
        then delete

    when not matched
      -- this is new:
      and DBT_INTERNAL_SOURCE.id in (
              1, 2, 3
          ) 

    then insert (`id`, `ts`) values (`id`, `ts`)

I'm pretty sure there's no substantive difference here, but to Drew's point,

This just has the benefit of avoiding a weird failure mode if the user should specify a list of partitions to overwrite, but they generate a SQL select statement which returns an incongruent set of partitions.

Who will this benefit?

BigQuery users of new incremental model functionality released in 0.16.0. Given that this is quite new behavior, we should seek user feedback on weird/unexpected edge cases.

[BUG] Problem create table with row access policy

Describe the bug

I created a row-level access policy in BigQuery granting access to the service account that dbt uses. When I run dbt run -m +dim_table, the table that is built from a reference to the table with the row access policy fails with a macro error. If I take the compiled SQL from the run target for that table and run it directly in BigQuery, everything works fine.
I am trying to use this process to add the dbt user to the row access policy so I can more easily roll out, across other tables, the rows I don't want to show in Metabase.

Steps To Reproduce

Create a row access policy using

CREATE OR REPLACE ROW ACCESS POLICY teste
ON `tableA`
GRANT TO ("serviceAccount:dbt_service_account@project_id.iam.gserviceaccount.com")
FILTER USING (id != "99");

Where service account is the same used previous in dbt to access.

then create a sql file with

select
    *
from
    {{ ref('tableA') }}

Without the row access policy everything works fine; with the row access policy I get an error.

Expected behavior

I expect the table to be created, and querying tableA for id 99 to return an empty result.
Instead, I get an error.

Screenshots and log output

Completed with 1 error and 0 warnings:

Compilation Error in macro statement (macros/core.sql)
  bad operand type for abs(): 'NoneType'
  
  > in macro materialization_table_bigquery (macros/materializations/table.sql)
  > called by macro statement (macros/core.sql)

Done. PASS=8 WARN=0 ERROR=1 SKIP=1 TOTAL=10

System information

I tried both

installed version: 0.19.1

Plugins:
  - redshift: 0.19.1
  - postgres: 0.19.1
  - bigquery: 0.19.1
  - snowflake: 0.19.1

and in

installed version: 0.21.0

Plugins:
  - redshift: 0.21.0
  - postgres: 0.21.0
  - bigquery: 0.21.0
  - snowflake: 0.21.0

The operating system you're using:
debian buster
The output of python --version:
python 3.8

Additional context

Add any other context about the problem here.

quoting of column names in nested fields

Describe the bug

The column type is overridden as BigQueryColumn in BigQuery dbt projects.

However, column names in nested fields are not quoted, which is fine for most columns. But in cases where reserved keywords are used as column names, the generated SQL becomes invalid.

In order to circumvent this issue, all columns should be quoted.

There is already a property called "quoted".

In my opinion, line 86 should be changed, so that column names are quoted.

    @property
    def quoted(self):
        return '`{}`'.format(self.column)

    def literal(self, value):
        return "cast({} as {})".format(value, self.dtype)

    @property
    def data_type(self) -> str:
        if self.dtype.upper() == 'RECORD':
            subcols = [
                "{} {}".format(col.name, col.data_type) for col in self.fields
            ]
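            # per this issue: formatting with col.quoted (defined above) instead of
            # col.name would backtick-escape reserved-word field names such as `from`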
            field_type = 'STRUCT<{}>'.format(", ".join(subcols))

        else:
            field_type = self.dtype

        if self.mode.upper() == 'REPEATED':
            return 'ARRAY<{}>'.format(field_type)

        else:
            return field_type

Steps To Reproduce

We encountered the issue when running the dbt_utils.union_relations macro, which joined multiple nested tables that had column names such as from and where deep in their nested field structure.

Expected behavior

All column names (including those from nested fields) should be quoted, in order to prevent issues with reserved keywords.

Screenshots and log output

System information

The output of dbt --version:

installed version: 0.19.1
   latest version: 0.21.0

Plugins:
  - redshift: 0.19.1
  - snowflake: 0.19.1
  - bigquery: 0.19.1
  - postgres: 0.19.1

The operating system you're using:
Ubuntu

The output of python --version:
Python 3.8.10

Additional context

Add any other context about the problem here.

Optimize `insert_overwrite` incremental strategy with WRITE_TRUNCATE / Partition copy

Describe the feature

This is a follow-up from a discussion over the topic with @jtcohen6.

Let's assume we have a model with incremental materialization and we're going for an incremental run.

merge into mydataset.newtable as DBT_INTERNAL_DEST
    using (
    SELECT * from mydataset.newtable__dbt_tmp
    ) as DBT_INTERNAL_SOURCE
    on FALSE

when not matched by source
        and DBT_INTERNAL_DEST._PARTITIONTIME in unnest([TIMESTAMP("2021-11-30")])
    then delete

when not matched then insert
    (`_PARTITIONTIME`, `transaction_id`)
values
    (`_PARTITIONTIME`, `transaction_id`)

However, the merge will first delete the data and then insert it.
On big partitions this can take a while, resulting in a long operation compared to other approaches such as:

  • select whole partition and insert in the destination table with a partition decorator (ie mydataset.newtable$20211130) while using WRITE_TRUNCATE setting.
  • copying the whole partition ie: bq cp mydataset.newtable__dbt_tmp$20211130 mydataset.newtable$20211130.

As far as I know, the fastest approach is to use bq cp (though it might require benchmarking), unless you don't need a temporary table at all (i.e. no column change and a single partition), in which case you would run the input query and write to the destination table with the partition decorator directly.

The main drawback of that approach is that, even though the insert_overwrite strategy isn't fully atomic (the temporary table is created and remains even if the MERGE fails), the MERGE query itself is atomic across all partitions.
So keeping the same behavior requires either a single partition, or accepting the tradeoff of breaking the atomicity of the partition-replacement step.

Therefore it could be relevant to have a copy_partitions config to activate that approach.
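A hypothetical model config for that flag might look like the following; copy_partitions is not an existing option, and the partition field name and data type are placeholders:

{# hypothetical: copy_partitions is the flag proposed in this issue #}
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'day', 'data_type': 'date'},
    copy_partitions = true
) }}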

Describe alternatives you've considered

It could also be a whole new incremental strategy if relevant.

Additional context

I had a production case, described in the thread, where, for a single partition, the MERGE version (insert_overwrite) was taking 43 minutes while the query with WRITE_TRUNCATE took 26 minutes for the same result.
At Teads, our internal BigQuery query wrapper tool uses a select & WRITE_TRUNCATE, since we don't need a temporary table as an intermediate step: we only process a single partition per query (which is a specific case compared to the dbt approach). The performance overhead makes it quite a deal breaker to use dbt as-is for those queries.
Of course, it would be much better if Google could make a server side optimisation on MERGE queries when it detects that pattern.

Who will this benefit?

It will benefit anyone using insert_overwrite incremental strategy with large partitions where delete/insert are long to process.

Are you interested in contributing this feature?

Yes

[Bug] Incremental models on partitioned string column fails

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Given a partitioned on a string column incremental model:

{{
    config(
        materialized='incremental',
        partition_by={'field': 'string_column', 'data_type' : 'string'},
        incremental_strategy='insert_overwrite',
    )
}}

SELECT '20201201' AS string_column

Running an incremental dbt run on this model produces an erroneous definition of the partitions to update. It tries to use string_trunc(string_column, day), a function that does not exist in BigQuery, as shown in the compiled code in the /target/run/ directory:

      ...
      -- 2. define partitions to update
      set (dbt_partitions_for_replacement) = (
          select as struct
              array_agg(distinct string_trunc(string_column, day))  -- HERE NON-EXISTENT FUNCTION
          from MODEL_NAME
      );

      
      -- 3. run the merge statement
    merge into 
    ...
    when not matched by source
         and string_trunc(DBT_INTERNAL_DEST.string_column, day) in unnest(dbt_partitions_for_replacement)  -- HERE NON-EXISTENT FUNCTION
        then delete
        ...

It causes dbt to fail the run with following error
Query error: Function not found: string_trunc at [51:34]

Expected Behavior

As of dbt version 0.18.2, such a model ran correctly:

      -- 2. define partitions to update
      set (dbt_partitions_for_replacement) = (
          select as struct
              array_agg(distinct string_column)
          from MODEL_NAME
      );
...
      -- 3. run the merge statement
    merge into MODEL_NAME
    ...

    when not matched by source
         and DBT_INTERNAL_DEST.string_column in unnest(dbt_partitions_for_replacement) 
        then delete

    when not matched then insert
    ...

Steps To Reproduce

No response

Relevant log output

No response

Environment

- OS: MacOS 12.0.1
- Python: 3.8.12
- dbt: official docker image version 0.21.0

What database are you using dbt with?

bigquery

Additional Context

No response

BigQuery ProjectID

Describe the bug

Ran dbt test from Airflow with BigQuery as the connection, and got an Error on the custom test. All standard dbt tests worked fine.
Error message:
[2021-11-29, 17:33:46 UTC] {subprocess.py:89} INFO - Invalid project ID 'simple_bigquery_example_dag.[correct-project-id-here-redacted]'. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash.

Seems like the dataset name from the test's SQL query got prepended to the project ID name.
SQL Query:
SELECT * FROM simple_bigquery_example_dag.forestfires

Schema:


version: 2

models:
  - name: forestfire_test
    description: "A dbt model of forestfire data."
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - not_null
          - unique
      - name: month
        tests:
          - not_null
          - accepted_values:
              values: ['aug', 'mar', 'sep']
      - name: ffmc
        tests:
          - ffmc_value_check

Where ffmc_value_check is the failing test.

Query for the test:

{% test ffmc_value_check(model, column_name) %}

SELECT {{ column_name }}
FROM simple_bigquery_example_dag.{{ model }}
HAVING NOT({{ column_name }} >= 90)

{% endtest %}
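Note that in a generic test {{ model }} already renders as the fully qualified relation (project.dataset.table on BigQuery), so a version of the test that does not hard-code the dataset should avoid the mangled project ID. A sketch, keeping the same column_name logic:

{# sketch: same logic, without the hard-coded dataset prefix #}
{% test ffmc_value_check(model, column_name) %}

SELECT {{ column_name }}
FROM {{ model }}
HAVING NOT({{ column_name }} >= 90)

{% endtest %}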

Steps To Reproduce

The DAG from this pull request can be used to replicate the error:
astronomer/airflow-data-quality-demo#9

Steps:

  • Insert forest fire data into BigQuery in a dataset and table: simple_bigquery_example_dag.forestfire_test (this can also be done by running the BigQuery example DAG here).
  • Run the dbt test command from the PR linked above

Expected behavior

Test is expected to Fail, not to Error. Project ID is not expected to have the simple_bigquery_example_dag. prefix.

Screenshots and log output

(screenshot of the error output omitted)

System information

The output of dbt --version:

dbt==0.21.0

The operating system you're using:
MacOS with Docker

The output of python --version:
python 3.9

Additional context

Add any other context about the problem here.

[Feature] dbt-Bigquery retry for DML job BQ API errors (503 errors, etc. )

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Background

Currently, dbt-bigquery profile.yml has two configurations: Timeouts and Retries

We at Semios have been using dbt extensively to run high-frequency data pipelines on BigQuery (some of them run every 10 minutes). Because of strict SLA requirements, we need more granular control over BigQuery's timeout and retry mechanisms to minimize intermittent query failures and their impact on our pipelines.

We found that the existing configs are not enough to mitigate all intermittent query failures: for example, the Deadline of 120.0s exceeded 503 errors raised by the BigQuery JobInsert API, which occur from time to time and have impacted several dbt-BigQuery users in the dbt community.

We would like to suggest a more comprehensive BigQuery retry config to minimize all these errors.

Technical Context

Before I propose the suggestion, let me explain how BigQuery query works and how we would like to retry.

How BigQuery query works in dbt

At its core, a BigQuery query is made in two steps in dbt:

    def _query_and_results(self, client, sql, conn, job_params, timeout=None):
        """Query the client and wait for results."""
        # Cannot reuse job_config if destination is set and ddl is used
        job_config = google.cloud.bigquery.QueryJobConfig(**job_params)
        query_job = client.query(sql, job_config=job_config)               # <--- Step 1
        iterator = query_job.result(timeout=timeout)                       # <--- Step 2
        return query_job, iterator

In the first step, client.query() submits the query to the BQ JobInsert API server; when it succeeds, the BQ server creates a new BigQuery query job and returns the query job id to the client as part of the query_job object. This step is normally very quick, usually under a few seconds; however, in some rare cases it can take much longer, even up to 4 minutes according to the BigQuery engineering team.

In the second step, query_job.result() waits for BigQuery to execute (run) the query and returns the results to the client as an iterator. Depending on the complexity of the query, this step can take a long time, from tens of seconds to tens of minutes.

how dbt currently retries BigQuery queries

Currently, the dbt retries config sets the number of retries of the overall _query_and_results. The dbt timeout config is used only to control step 2, query_job.result(timeout=timeout). These configs make sense on their own.

However, you might notice that there is no control over the timeout of query() in step 1, other than relying on its default value (https://github.com/googleapis/python-bigquery/blob/1762e5bd98ab089abf65a0d146234c77217b8dbe/google/cloud/bigquery/client.py#L3193). When the BQ JobInsert API sharding cluster is unstable, it can take up to 4 minutes to create the query job id in some rare cases. Because query() had a client-side default deadline of 120s (DEFAULT_RETRY = retry.Retry(predicate=_should_retry)), the client quits at 120s while the server side is still waiting for the job creation. That's why several dbt community members have experienced the Deadline of 120.0s exceeded errors (https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400).

The BQ team has since patched query() to use a client-side default deadline of 10 minutes (DEFAULT_RETRY = retry.Retry(predicate=_should_retry, deadline=600.0)). That means that, by default, query() can take up to 4 minutes to make one attempt to create a new job id, and if that fails the client has up to 10 minutes to retry. This works fine for users who don't have stringent timing requirements: they won't mind waiting 10 minutes to create a new BQ job ID and then another 10 minutes to get the result.

Unfortunately, it doesn't work for us. In situations like this, we would rather fail fast and try again: it should not take more than 30 seconds to create a new job id, and if it takes too long we would rather let query() time out and fail so that we can retry.

the Proposed Solution

In order to gain fine control of the timeout mechanism of step 1 and step 2, we would like to propose the following 4 dbt configs

job_creation_timeout_seconds     # specific for initiate BQ job, to control the timeout of step 1, query()
job_execution_timeout_seconds    # specific for awaiting job result, to control the timeout of step 2, result()

job_retry_deadline_seconds       # to control the overal query, retry_deadline of _query_and_results()
job_retries                      # to control the overall query, retries of _query_and_results()

For example, we could set the configs below to fail faster on the step of BQ job creation, while allowing queries with long-running results.

job_creation_timeout_seconds=30
job_execution_timeout_seconds=1200
job_retry_deadline_seconds=1500
job_retries=3

These settings would allow us to control the timeout behaviors of step 1 and step 2 on their own, hence maximizing our chances to mitigate different kinds of intermittent errors.

NOTE:
job_execution_timeout_seconds is the renaming of the current timeout config.
job_retries is the renaming of the current retries config.

Describe alternatives you've considered

Currently, we have to override the dbt-core code to allow the query() to fail faster and retry, and to mitigate the 503 error.

Who will this benefit?

the dbt-bigquery users who would need more fine control of BigQuery query behaviors

Are you interested in contributing this feature?

yes

Anything else?

this issue has been discussed by several dbt community members in slack
https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400

Implement CI

  • add GHA workflows for:
    • code quality checks
    • unit tests
    • integration tests
  • test against supported python versions and OS
  • use tox for configuration if needed

DBT + Bigquery : 404 Not found: Dataset hello-data-pipeline:staging_benjamin was not found in location EU

Describe the bug

When doing "DBT run" with this script

{{ config(materialized='table') }}

SELECT customer_id FROM `hello-data-pipeline.adwords.google_ads_campaign_stats` 

I get the following error

15:41:51 | 2 of 3 START table model staging_benjamin.yo......................... [RUN]
15:41:51 | 2 of 3 ERROR creating table model staging_benjamin.yo................ [ERROR in 0.32s]
Runtime Error in model yo (models/yo.sql)
404 Not found: Dataset hello-data-pipeline:staging_benjamin was not found in location EU
(job ID: 4020ce42-ec44-438e-8ea0-9f4fe4adcc8c)

NB 1: BigQuery does not show any error when running this query in the BigQuery editor.
NB 2: dbt does not show any error when running the SQL directly in the script editor ("run sql" command).
NB 3: No error when doing dbt run with the example models or with a public BigQuery dataset like this:

{{ config(materialized='table') }}
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`
ORDER BY view_count DESC
LIMIT 1


Steps To Reproduce

  • Create a bigquery project
  • Go to a bigquery table and click on "query table"
  • Take the Select and From Statement automatically generated
  • Create a DBT model and paste your code in your DBT model
  • DBT RUN

Expected behavior

dbt run should complete without error.

System information

Which database are you using dbt with?
BIG QUERY

The operating system you're using:
DBT CLOUD

Others
Data imported from FIVETRAN

Support BigQuery Materialized Views

Describe the feature

Support https://cloud.google.com/bigquery/docs/materialized-views-intro

Describe alternatives you've considered

  1. add a BQ-specific configuration:
{{ config(
    materialized='view',
    materialized-view='true' # default 'false'
)}}
  2. create a new materialization, initially for BQ only, and over time perhaps implement it for additional databases:
{{ config(
    materialized='materialized-view',
)}}

Additional context

Some powerful stuff here https://cloud.google.com/bigquery/docs/materialized-views-intro
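For reference, a minimal sketch of the native DDL (dataset, table, and column names are placeholders):

    -- sketch: my-project, my_dataset, my_mv, orders are placeholders
    create materialized view `my-project`.`my_dataset`.`my_mv`
    as
    select customer_id, count(*) as order_count
    from `my-project`.`my_dataset`.`orders`
    group by customer_id;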

Who will this benefit?

BQ users
