duckdb / dbt-duckdb Goto Github PK

View Code? Open in Web Editor NEW

806.0 20.0 69.0 854 KB

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)

License: Apache License 2.0

Python 99.13% Dockerfile 0.72% Shell 0.15%

dbt duckdb

dbt-duckdb's Introduction

DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs, maps), and several extensions designed to make SQL easier to use.

DuckDB is available as a standalone CLI application and has clients for Python, R, Java, Wasm, etc., with deep integrations with packages such as pandas and dplyr.

For more information on using DuckDB, please refer to the DuckDB documentation.

Installation

If you want to install DuckDB, please see our installation page for instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The documentation contains a SQL introduction and reference.

Development

For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. You should run make unit and make allunit to verify that your version works properly after making changes. To test performance, you can run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then perform several standard benchmarks from the root directory by executing ./build/release/benchmark/benchmark_runner. The details of benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.

dbt-duckdb's People

Contributors

Stargazers

Watchers

Forkers

forestlzj franciscomagioli columbiaviz nathanhowell dbeatty10 exemaitch kj-9 nicku33 henmachado edgarrmondragon tomsej jpmmcneill markroddy alexandervr zkan cabeda keiththompson godatadriven nclaeys hubayirp danthelion rohankumardubey tailwind-vn felippecaso matinnuhamunada amiller-gh danhphan vlieven dwreeves armsepehr n-clerkx turbo1912 jtcohen6 arpitjain799 cmu-course ayuryshev bl3f motherduckdb gmontanola dforsber jcotton1123 thfmn l1xnan jmhsieh milicevica23 droher shrey0806 amaralvieira ahuachen nintorac mehd-io danieldiamond fork-arhive szarnyasg sujanadh firewall413 ayushsubedi williamdee karimfaiz johalnes b-per obronco hadousoftware hrl20 lorran guenp jaceksan

dbt-duckdb's Issues

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the communities progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

migrate necessary cross-db macros into adapter and ensure they're tested accordingly
remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
add new basic tests BaseDocsGenerate and BaseDocsGenReferences
consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case, you'll also might want to:

implement method and tests for connection retry logic

S3 Parquet Support

Hi There!

DuckDB has the support to load Parquet files from S3. Are there any plans on supporting the ability to query from S3 Parquet files from dbt-duckdb? Would love to help in any way I can!

External Location: f-string value needs more than one variable

I am interested in using external_location f-string with several variable but only name, indicator and schema are known in the f-string.

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/{folder2}/{meta.name}.{meta.suffix}"
    tables:
      - name: source1
         meta:
             suffix: parquet
             folder: folder1
      - name: source2
        meta:
              suffix: csv
              folder: folder2

Additionally, I don't want to overwrite the external_location for each table name. In this regard, I suggest to improve the f-string in real format name such as below:

- name: external_source
    meta:
      external_location: "s3://my-bucket/{source.meta['folder2']}/{source.meta['name']}.{source.meta['suffix'}"
    tables:
      - name: source1
         meta:
             suffix: parquet
             folder: folder1
      - name: source2
        meta:
              suffix: csv
              folder: folder2

In this case, the indicator should be addressed the source.indicator. I can do it myself, but please clarify that do you have any suggestion to improve this? I have identified the problem in duckdb/relation.py and send the PR to you, but as far as I'm not sure that you're agree with change I didn't modify the README. Additionally, the test should be improved in this regard, so I can do it myself after your suggestion.

update macro statement error handling to provide helpful messaging to user

Howdy partner,

I ran into an error creating a minimal dbt-duckdb example. I wish I could give more information but I'm a dbt newbie. Below is a repository that (hopefully) recreates this issue w/ a couple of essentially trivial model files:

https://github.com/hamilton/dbt-duckdb-test

dbt run gives this output:

00:08:05  Running with dbt=1.0.1
00:08:05  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
00:08:05  
00:08:05  Concurrency: 1 threads (target='dev')
00:08:05  
00:08:05  1 of 2 START table model main.ingest............................................ [RUN]
00:08:05  1 of 2 ERROR creating table model main.ingest................................... [ERROR in 0.08s]
00:08:05  2 of 2 SKIP relation main.second................................................ [SKIP]
00:08:05  
00:08:05  Finished running 1 table model, 1 view model in 0.34s.
00:08:05  
00:08:05  Completed with 1 error and 0 warnings:
00:08:05  
00:08:05  Compilation Error in macro statement (macros/etc/statement.sql)
00:08:05    cannot unpack non-iterable NoneType object
00:08:05    
00:08:05    > in macro materialization_table_default (macros/materializations/models/table/table.sql)
00:08:05    > called by macro statement (macros/etc/statement.sql)
00:08:05  
00:08:05  Done. PASS=0 WARN=0 ERROR=1 SKIP=1 TOTAL=2

My profiles.yml:

duckdb_test:
  outputs:
    dev:
      path: /Users/hamiltonulmer/Code/tmp/duckdb_test/db.duckdb
      type: duckdb
  target: dev

Most recent releases missing from Github

The most recent release on Github is 1.3.1, but pypi is up to 1.3.4: https://pypi.org/project/dbt-duckdb/
Not sure if intentional but wanted to flag in case it wasn't. Thanks so much for this package, it's incredibly useful!

Python model fails with import statement outside of function

Python models aren't working as I expect. This basic example fails:

import pandas as pd

def model(dbt, session):

    return pd.Series(range(10)).to_frame()

15:32:53  Runtime Error in model 03 (days/03.py)
15:32:53    Python model failed:
15:32:53    name 'pd' is not defined

However, simply moving the import statement into the function allows it to work.

def model(dbt, session):
    import pandas as pd
    return pd.Series(range(10)).to_frame()

Versions:

dbt-duckdb 1.3.2
dbt 1.3.1
Python 3.10.7

Bug in register_upstream_external_models

I have an issue when running a dbt run --select statement where external tables are referenced. An invalid upstream location is generated, resulting in the error e.g. IO Error: No files found that match the pattern "external/root/.parquet" when running for the config external_root: "external/root".

I've identified the problem in macros/utils/upstream.sql, and I'll put up a PR shortly. I'm not sure how to write a test that reproduces this issue though, so I'd appreciate it if you could maybe give me some pointers there. Thanks!

S3 reading error "No files found that match the pattern"

Hi!

I am trying to read a parquet file from S3 by following the documentation, but for some reason I cannot seem to get it to work..

I am using dbt-duckdb==1.3.4.

When trying to read a parquet file from S3 running dbt build I get:

20:28:45    IO Error: No files found that match the pattern "s3://my-bucket/my-project/data.parquet" 
20:28:45  
20:28:45  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=

The s3://my-bucket/my-project/data.parquet is just a dummy string obviously, but I do have a Bucket with a parquet file in there which I replace with the actual string for testing.

What I did:

First I set the S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY environment variables, which are correct and working.

Then I have this in my profiles.yml:

my_dbt_project:
  outputs:
    dev:
      type: duckdb
      path: ":memory:"
    extensions:
      - httpfs
      - parquet
    settings:
      s3_region: eu-west-1
      s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
      s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev

And this in sources.yml:

 version: 2

sources:
  - name: external_source
    meta:
      external_location: "s3://my-bucket/my-project/data.parquet"
    tables:
      - name: my_table

and trying to run this

select * from {{ source("external_source", "my_table") }}

But it keeps saying it cannot find the file. I tried to specify the external_root in profiles.yml but no luck..

Any ideas what I am missing.

Dbt models that are materialized as views result in catalog error (table cannot be found)

Dbt models that are materialized as views are unable to be found:

Catalog Error: Table with name <table_name> does not exist

The sql view model seems to get created without issues:

Concurrency: 1 threads (target='duckdb') 1 of 4 OK created sql view model

Running a select statement on the model returns the missing table error. Wondering if this is a bug/missing feature or if I am missing something in the setup?

I see that there is a duckdb specific create_view_as macro. Does a corresponding view.sql need to be created under the materializations folder?

https://github.com/dbt-labs/dbt-core/blob/dev/louisa-may-alcott/core/dbt/include/global_project/macros/materializations/view/view.sql

Please add a license

I'd like to kick the tires on this project but it's not clear what the license is. Can you add one?

fwiw DuckDB is MIT licensed and dbt-core is Apache. A license compatible with both of these would be ideal.

Out of sync with minor version of dbt-core

I'll open a PR shortly with a proposed fix to the following minor (😆) issue.

TLDR

Installing dbt-duckdb 1.1.1 yields dbt-core 1.0.8 (instead of 1.1.1).

Install commands

python3 -m venv env
source env/bin/activate
python -m pip install dbt-duckdb
source env/bin/activate
dbt --version

Expected output

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - duckdb: 1.1.1 - Up to date!

Actual output

installed version: 1.0.8
   latest version: 1.1.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - duckdb: 1.1.1 - Up to date!

The fix is simple, and I'll submit a PR.

Cannot get glue integration to work

I am getting fond of this library as it makes it easy to interface dbt with our existing pipelines which persist data in s3.

I can read/write data to s3 without issues, but I was curious about the glue integration, however I cannot get it to work.
My pip dependencies are the following:

attrs                    22.2.0
Babel                    2.11.0
boto3                    1.26.57
botocore                 1.29.57
certifi                  2022.12.7
cffi                     1.15.1
chardet                  5.1.0
charset-normalizer       3.0.1
click                    8.1.3
colorama                 0.4.5
dbt-core                 1.3.2
dbt-duckdb               1.3.4
dbt-extractor            0.4.1
diff-cover               7.4.0
duckdb                   0.6.1
exceptiongroup           1.1.0
future                   0.18.3
hologram                 0.0.15
idna                     3.4
iniconfig                2.0.0
isodate                  0.6.1
Jinja2                   3.1.2
jmespath                 1.0.1
jsonschema               3.2.0
leather                  0.3.4
Logbook                  1.5.3
MarkupSafe               2.1.2
mashumaro                3.0.4
minimal-snowplow-tracker 0.0.2
msgpack                  1.0.4
mypy-boto3-glue          1.26.53
networkx                 2.8.8
numpy                    1.24.1
packaging                21.3
parsedatetime            2.4
pathspec                 0.9.0
pip                      22.0.2
pluggy                   1.0.0
pycparser                2.21
Pygments                 2.14.0
pyparsing                3.0.9
pyrsistent               0.19.3
pytest                   7.2.1
python-dateutil          2.8.2
python-slugify           7.0.0
pytimeparse              1.1.8
pytz                     2022.7.1
PyYAML                   6.0
regex                    2022.10.31
requests                 2.28.2
s3transfer               0.6.0
setuptools               59.6.0
six                      1.16.0
sqlfluff                 1.2.1
sqlparse                 0.4.3
tblib                    1.7.0
text-unidecode           1.3
toml                     0.10.2
tomli                    2.0.1
tqdm                     4.64.1
typing_extensions        4.4.0
urllib3                  1.26.14
Werkzeug                 2.2.2

I want to try to create a glue table based on some seed data. Similar to the jaffle_shop customers.csv for example.
My model looks as follows:

{{ config(materialized='external', glue_register=true, glue_database='my_glue_db') }}
with source as (
    select * from {{ ref('raw_customers') }}
)

select * from source

In my profiles.yml I did not change anyhting compared to what I used for writing to s3:

    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      schema: dbt
      threads: 1
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: 'eu-west-1'
        s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
        s3_session_token: "{{ env_var('AWS_SESSION_TOKEN') }}"
      external_root: "s3://<somepath>

The result is that dbt compile finishes successfully, it writes my customers.parquet file to s3 but it does not create the table in glue.
I am not sure how to debug this, in the dbt.logs I did not find anything useful.
Can you point me in the right direction? Am I missing something?

Python models

It would be great to have a support for dbt Python Models. My use case is primarily to read/write into Delta tables (hope it will be obsolete with duckdb-deltatable-extension), but there are many other possibilities.

I have already created a prototype, but it is an ugly duckling that supports only single thread (which is okish for my use-case). But I do not expect PR to be merged anytime soon.

P.S. I have created this issue because it is always nice to have both PR and issue :)

Changing seeds columns fails

If I have seeds that look like this

id,value
1,'hello'
2,'world'

and then change to look like this

id,status
1,'hello'
2,'world'

then I get

00:40:48  Runtime Error in seed so2_status_codes (seeds/status_codes.csv)
00:40:48    Binder Error: Table "so2_status_codes" does not have a column with name "status"

And the table must be manually deleted in order to work around.

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the communities progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

migrate necessary cross-db macros into adapter and ensure they're tested accordingly
remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
add new basic tests BaseDocsGenerate and BaseDocsGenReferences
consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case, you'll also might want to:

implement method and tests for connection retry logic

upgrade to support dbt-core v1.3.0

Background

The latest release cut for 1.3.0, dbt-core==1.3.0rc2 was published on October 3, 2022 (PyPI | Github). We are targeting releasing the official cut of 1.3.0 in time for the week of October 16 (in time for Coalesce conference).

We're trying to establish a following precedent w.r.t. minor versions:
Partner adapter maintainers release their adapter's minor version within four weeks of the initial RC being released. Given the delay on our side in notifying you, we'd like to set a target date of November 7 (four weeks from today) for maintainers to release their minor version

Timeframe	Date (intended)	Date (Actual)	Event
D - 3 weeks	Sep 21	Oct 10	dbt Labs informs maintainers of upcoming minor release
D - 2 weeks	Sep 28	Sep 28	core 1.3 RC is released
Day D	October 12	Oct 12	core 1.3 official is published
D + 2 weeks	October 26	Nov 7	dbt-adapter 1.3 is published

How to upgrade

dbt-labs/dbt-core#6011 is an open discussion with more detailed information, and dbt-labs/dbt-core#6040 is for keeping track of the community's progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

Python Models (if applicable)
Incremental Materialization: cleanup and standardization
More functional adapter tests to inherit

Give a duck some ❤️

Ducks are fantastic animals but need some care, just like a code. Here is a repository for scaffolding dbt adapter. Unfortunately, this awesome project does not comply to that, which makes contribution more challenging than it has to be.

Would it be okay if I brought the structure closer to the template?

Exception in WAL playback: Catalog Error: Table does not exist

Hey @jwills

I got the following error when hacking on the demo published by the fal.ai folks and using the dbt-duckdb 1.2.1 multi-threading:

Exception in WAL playback: Catalog Error: Table with name customers__dbt_tmp does not exist!
Did you mean "customers"?

This demo is using a forked and customized beta version of dbt-core + the demo fal adapter + dbt-duckdb multi-threading, so who knows what is actually causing the issues -- may or may not be fixable within dbt-duckdb 🤷

Instructions to reproduce

Grab the code:

git clone https://github.com/dbeatty10/jaffle_shop_with_fal.git
cd jaffle_shop_with_fal
git checkout dbeatty/reproduce-bug

Create a virtual environment:

python3 -m venv env
source env/bin/activate
pip install --upgrade pip

Install the requirements and build:

pip install -r requirements.txt
source env/bin/activate
rm jaffle_shop.duckdb
dbt build

Deleting the jaffle_shop.duckdb file was crucial to reproduce the error for me -- if all the output tables exist already, everything works fine.

Work-around

Editing my requirements.txt to this and re-running the install+build steps worked again:

# Works with 1.2.0
dbt-duckdb==1.2.0

# Doesn't work with 1.2.1
# dbt-duckdb==1.2.1

# Use a local editable version of dbt-duckdb instead:
# -e /Users/dbeatty/projects/dbt-duckdb

Support for duckdb 0.6.*

Hi,

duckdb 0.6.0 has been released recently - it would be great if dbt-duckdb could support the new version.

Thanks!

Pip install not working

Super excited about this project. I love duckdb and tying it in with dbt is a great idea. I tried pip installing this package per the readme but a version on Pypi couldn't be found. Has it been published to Pypi yet?

Looks like duckdb only works on single-threaded operations

dbt-core version: dbt-core 1.0.8
adapter version: dbt-duckdb 1.1.1

# my config on a single thread that works
jaffle_shop:

  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'jaffle_shop.duckdb'

 ~/De/jaffle_shop_duckdb │ on duckdb !1  dbt build     ✔ │ took 5s │ jaffle_shop_duckdb Py │ system Node 
02:25:56  Running with dbt=1.0.8
02:25:56  Unable to do partial parsing because profile has changed
02:25:57  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
02:25:57  
02:25:57  Concurrency: 1 threads (target='dev')
02:25:57  
02:25:57  1 of 28 START seed file main.raw_customers...................................... [RUN]
02:25:57  1 of 28 OK loaded seed file main.raw_customers.................................. [INSERT 100 in 0.13s]
02:25:57  2 of 28 START seed file main.raw_orders......................................... [RUN]
02:25:57  2 of 28 OK loaded seed file main.raw_orders..................................... [INSERT 99 in 0.08s]
02:25:57  3 of 28 START seed file main.raw_payments....................................... [RUN]
02:25:57  3 of 28 OK loaded seed file main.raw_payments................................... [INSERT 113 in 0.08s]
02:25:57  4 of 28 START view model main.stg_customers..................................... [RUN]
02:25:57  4 of 28 OK created view model main.stg_customers................................ [OK in 0.14s]
02:25:57  5 of 28 START view model main.stg_orders........................................ [RUN]
02:25:57  5 of 28 OK created view model main.stg_orders................................... [OK in 0.07s]
02:25:57  6 of 28 START view model main.stg_payments...................................... [RUN]
02:25:58  6 of 28 OK created view model main.stg_payments................................. [OK in 0.07s]
02:25:58  7 of 28 START test not_null_stg_customers_customer_id........................... [RUN]
02:25:58  7 of 28 PASS not_null_stg_customers_customer_id................................. [PASS in 0.09s]
02:25:58  8 of 28 START test unique_stg_customers_customer_id............................. [RUN]
02:25:58  8 of 28 PASS unique_stg_customers_customer_id................................... [PASS in 0.07s]
02:25:58  9 of 28 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [RUN]
02:25:58  9 of 28 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [PASS in 0.07s]
02:25:58  10 of 28 START test not_null_stg_orders_order_id................................ [RUN]
02:25:58  10 of 28 PASS not_null_stg_orders_order_id...................................... [PASS in 0.07s]
02:25:58  11 of 28 START test unique_stg_orders_order_id.................................. [RUN]
02:25:58  11 of 28 PASS unique_stg_orders_order_id........................................ [PASS in 0.07s]
02:25:58  12 of 28 START test accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [RUN]
02:25:58  12 of 28 PASS accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [PASS in 0.07s]
02:25:58  13 of 28 START test not_null_stg_payments_payment_id............................ [RUN]
02:25:58  13 of 28 PASS not_null_stg_payments_payment_id.................................. [PASS in 0.07s]
02:25:58  14 of 28 START test unique_stg_payments_payment_id.............................. [RUN]
02:25:58  14 of 28 PASS unique_stg_payments_payment_id.................................... [PASS in 0.07s]
02:25:58  15 of 28 START table model main.customers....................................... [RUN]
02:25:58  15 of 28 OK created table model main.customers.................................. [OK in 0.09s]
02:25:58  16 of 28 START table model main.orders.......................................... [RUN]
02:25:58  16 of 28 OK created table model main.orders..................................... [OK in 0.08s]
02:25:58  17 of 28 START test not_null_customers_customer_id.............................. [RUN]
02:25:58  17 of 28 PASS not_null_customers_customer_id.................................... [PASS in 0.07s]
02:25:58  18 of 28 START test unique_customers_customer_id................................ [RUN]
02:25:58  18 of 28 PASS unique_customers_customer_id...................................... [PASS in 0.07s]
02:25:58  19 of 28 START test accepted_values_orders_status__placed__shipped__completed__return_pending__returned [RUN]
02:25:59  19 of 28 PASS accepted_values_orders_status__placed__shipped__completed__return_pending__returned [PASS in 0.07s]
02:25:59  20 of 28 START test not_null_orders_amount...................................... [RUN]
02:25:59  20 of 28 PASS not_null_orders_amount............................................ [PASS in 0.07s]
02:25:59  21 of 28 START test not_null_orders_bank_transfer_amount........................ [RUN]
02:25:59  21 of 28 PASS not_null_orders_bank_transfer_amount.............................. [PASS in 0.07s]
02:25:59  22 of 28 START test not_null_orders_coupon_amount............................... [RUN]
02:25:59  22 of 28 PASS not_null_orders_coupon_amount..................................... [PASS in 0.07s]
02:25:59  23 of 28 START test not_null_orders_credit_card_amount.......................... [RUN]
02:25:59  23 of 28 PASS not_null_orders_credit_card_amount................................ [PASS in 0.07s]
02:25:59  24 of 28 START test not_null_orders_customer_id................................. [RUN]
02:25:59  24 of 28 PASS not_null_orders_customer_id....................................... [PASS in 0.07s]
02:25:59  25 of 28 START test not_null_orders_gift_card_amount............................ [RUN]
02:25:59  25 of 28 PASS not_null_orders_gift_card_amount.................................. [PASS in 0.07s]
02:25:59  26 of 28 START test not_null_orders_order_id.................................... [RUN]
02:25:59  26 of 28 PASS not_null_orders_order_id.......................................... [PASS in 0.07s]
02:25:59  27 of 28 START test relationships_orders_customer_id__customer_id__ref_customers_ [RUN]
02:25:59  27 of 28 PASS relationships_orders_customer_id__customer_id__ref_customers_..... [PASS in 0.07s]
02:25:59  28 of 28 START test unique_orders_order_id...................................... [RUN]
02:25:59  28 of 28 PASS unique_orders_order_id............................................ [PASS in 0.07s]
02:25:59  
02:25:59  Finished running 3 seeds, 3 view models, 20 tests, 2 table models in 2.49s.
02:25:59  
02:25:59  Completed successfully
02:25:59  
02:25:59  Done. PASS=28 WARN=0 ERROR=0 SKIP=0 TOTAL=28

My gut tells me there are weird race conditions going on there across the threads. I haven't dug in deep yet, but wanted to point this out. This command sometimes works if I'm lucky on multiple threads.

# my config on 16 threads that does NOT work
jaffle_shop:

  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'jaffle_shop.duckdb'
      threads: 16

 ~/De/jaffle_shop_duckdb │ on duckdb !1  dbt build --full-refresh                                        ✔ │ jaffle_shop_duckdb Py │ system Node 
02:30:25  Running with dbt=1.0.8
02:30:25  Found 5 models, 20 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
02:30:25  
02:30:26  Concurrency: 16 threads (target='dev')
02:30:26  
02:30:26  1 of 28 START seed file main.raw_customers...................................... [RUN]
02:30:26  2 of 28 START seed file main.raw_orders......................................... [RUN]
02:30:26  3 of 28 START seed file main.raw_payments....................................... [RUN]
02:30:26  3 of 28 OK loaded seed file main.raw_payments................................... [CREATE 113 in 0.20s]
02:30:26  1 of 28 OK loaded seed file main.raw_customers.................................. [CREATE 100 in 0.25s]
02:30:26  2 of 28 OK loaded seed file main.raw_orders..................................... [CREATE 99 in 0.30s]
02:30:26  4 of 28 START view model main.stg_payments...................................... [RUN]
02:30:26  5 of 28 START view model main.stg_customers..................................... [RUN]
02:30:26  6 of 28 START view model main.stg_orders........................................ [RUN]
02:30:26  5 of 28 ERROR creating view model main.stg_customers............................ [ERROR in 0.17s]
02:30:26  6 of 28 ERROR creating view model main.stg_orders............................... [ERROR in 0.17s]
02:30:26  7 of 28 SKIP test not_null_stg_customers_customer_id............................ [SKIP]
02:30:26  8 of 28 SKIP test unique_stg_customers_customer_id.............................. [SKIP]
02:30:26  9 of 28 SKIP test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [SKIP]
02:30:26  10 of 28 SKIP test not_null_stg_orders_order_id................................. [SKIP]
02:30:26  11 of 28 SKIP test unique_stg_orders_order_id................................... [SKIP]
02:30:26  4 of 28 OK created view model main.stg_payments................................. [OK in 0.22s]
02:30:26  12 of 28 START test accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [RUN]
02:30:26  13 of 28 START test not_null_stg_payments_payment_id............................ [RUN]
02:30:26  14 of 28 START test unique_stg_payments_payment_id.............................. [RUN]
02:30:26  13 of 28 PASS not_null_stg_payments_payment_id.................................. [PASS in 0.16s]
02:30:26  12 of 28 PASS accepted_values_stg_payments_payment_method__credit_card__coupon__bank_transfer__gift_card [PASS in 0.22s]
02:30:26  14 of 28 PASS unique_stg_payments_payment_id.................................... [PASS in 0.27s]
02:30:26  15 of 28 SKIP relation main.customers........................................... [SKIP]
02:30:26  16 of 28 SKIP relation main.orders.............................................. [SKIP]
02:30:26  17 of 28 SKIP test not_null_customers_customer_id............................... [SKIP]
02:30:26  18 of 28 SKIP test unique_customers_customer_id................................. [SKIP]
02:30:26  22 of 28 SKIP test not_null_orders_coupon_amount................................ [SKIP]
02:30:26  19 of 28 SKIP test accepted_values_orders_status__placed__shipped__completed__return_pending__returned [SKIP]
02:30:26  20 of 28 SKIP test not_null_orders_amount....................................... [SKIP]
02:30:26  21 of 28 SKIP test not_null_orders_bank_transfer_amount......................... [SKIP]
02:30:26  23 of 28 SKIP test not_null_orders_credit_card_amount........................... [SKIP]
02:30:26  24 of 28 SKIP test not_null_orders_customer_id.................................. [SKIP]
02:30:26  26 of 28 SKIP test not_null_orders_order_id..................................... [SKIP]
02:30:26  27 of 28 SKIP test relationships_orders_customer_id__customer_id__ref_customers_ [SKIP]
02:30:26  28 of 28 SKIP test unique_orders_order_id....................................... [SKIP]
02:30:26  25 of 28 SKIP test not_null_orders_gift_card_amount............................. [SKIP]
02:30:27  
02:30:27  Finished running 3 seeds, 3 view models, 20 tests, 2 table models in 1.08s.
02:30:27  
02:30:27  Completed with 2 errors and 0 warnings:
02:30:27  
02:30:27  Runtime Error in model stg_customers (models/staging/stg_customers.sql)
02:30:27    Catalog Error: Table with name raw_customers does not exist!
02:30:27    Did you mean "stg_customers"?
02:30:27  
02:30:27  Runtime Error in model stg_orders (models/staging/stg_orders.sql)
02:30:27    Catalog Error: Table with name raw_orders does not exist!
02:30:27    Did you mean "stg_orders"?
02:30:27  
02:30:27  Done. PASS=7 WARN=0 ERROR=2 SKIP=19 TOTAL=28

add adapter to dbt docs's "Available Adapters" page

howdy @jwills

The Available Adapters page is one of the dbt community's most-visited docs pages. It would be of great benefit for first-time visitors to the dbt docs to see:

that this adapter is a possible option for using dbt-core, and
how many large the dbt ecosystem of support databases is.

dbt-labs/docs.getdbt.com#1489 exists to address this with all as-of-yet undocumented adapters.

We just released Documenting a new adapter, a new guide on how to add an adapter to the Available Adapters page. I'd love to see this adapter on that page, so feel free to reach out with any questions/blockers by either replying to this issue, or posting in the #adapter-ecosystem channel of the dbt Community Slack.

Looking forward to the contribution!

upgrade to support dbt-core v1.5.0

Background

The latest version of dbt Core,dbt-core==1.5.0rc1, was published on April 13, 2023 (PyPI | Github).

How to upgrade

dbt-labs/dbt-core#7213 is an open discussion with more detailed information. If you have questions, please put them there!

The above linked guide has more information, but below is a high-level checklist of work that would enable a successful 1.5.0 release of your adapter.

Add support Python 3.11 (if you haven't already)
Add support for relevant tests (there's a lot of new ones!)
Add support model contracts
~~Add support for materialized views~~ (this likely will be bumped to 1.6.0)

the next minor release: `1.6.0`

FYI, dbt-core==1.6.0 is expected to be released at the end of July, with a release cut at least two weeks prior.

dbt models cannot see each other when using the :memory: database.

Describe the bug

dbt models cannot see each other when using the :memory: database.

To Reproduce

Steps to reproduce the behavior:

create two models. table_one.sql & table_two.sql
table_one.sql should be "SELECT 'test' AS value".
table_two.sql should be "SELECT * FROM {{ ref( 'table_one' ) }}
table_two will fail to build with following error: "Catalog Error: Table with name table_one does not exist"

Expected behavior

dbt build should succeed.

Your environment

Version of dbt-duckdb: ~=1.2.0

upgrade to support dbt-core v1.3.0

Background

Timeframe	Date (intended)	Date (Actual)	Event
D - 3 weeks	Sep 21	Oct 10	dbt Labs informs maintainers of upcoming minor release
D - 2 weeks	Sep 28	Sep 28	core 1.3 RC is released
Day D	October 12	Oct 12	core 1.3 official is published
D + 2 weeks	October 26	Nov 7	dbt-adapter 1.3 is published

How to upgrade

dbt-labs/dbt-core#6011 is an open discussion with more detailed information, and dbt-labs/dbt-core#6040 is for keeping track of the community's progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

Python Models (if applicable)
Incremental Materialization: cleanup and standardization
More functional adapter tests to inherit

Support PyArrow Dataset as a valid return type for Python models

From the docs:

The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory, and multi-file datasets.

Since pyarrow.table type won't work with larger than memory datasets, this might be a good alternative. I was also thinking about Polars but wanted to check first if this was easier to implement.

Binder Error: Catalog "main" does not exist!

Core:
  - installed: 1.3.2
  - latest:    1.4.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - duckdb: 1.3.4 - Up to date!

Running the initial model dbt provides I'm getting this error

23:32:59  Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
23:32:59    Binder Error: Catalog "main" does not exist!

this is my profiles.yml

default:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb

Support `DuckDBPyRelation` as a valid return type for Python models

Right now, the dbt-duckdb python models only support Pandas DataFrames and PyArrow Tables as valid return types for dbt Python models; we should extend this support to include the native DuckDBPyRelation data type as well.

Catalog Error: Could not rename

Hi,

really impressed with how fast you implemented attach database! Looking forward to start working more with dbt + duckdb 😄

I'm trying it out now by mimicking databricks medallion with a separate database for bronze, silver and gold. First run always works like a charm, but with subsequent runs I get "Catalog Error: Could not rename". Could it be that there is some kind of restrictions with renaming across attached databases?

Changing back to only use the "main" database, the one added without attach, it works every time.

I'll try to force the same error from jaffle shop example or something similar, but probably won't have time until the weekend.

Ability to conditionally set/nullify `external_location`?

I'm working on a PR where we may want to have the option to either import raw data or load from seed files via the external_location annotation.

Is it possible to return a value in external_location that nullifies the expression?

Something like the psuedocode:

external_location: "{{ if {some value set} then {a real external location} else null }}"

Where in this case null would be intepreted as no instruction.

The alternative is to ask users to comment and uncomment the value, e.g.:

    # Uncomment these lines if you want to use the pregenerated seed files:
    # meta:
    #   external_location: "read_csv_auto('./jaffle-data/{name}.csv', header=1)"

https://github.com/MeltanoLabs/jaffle-shop-template/pull/1/files#diff-6c181d0ccc3c5e37e2feaacdd8872e09073b138835156160dcb5ee7c159f8c63L3-R9

Can you say if null today would cause an error, or just a no-op relative to the external_location operator?

External materialization file paths do not support jinja references

Consider the following config block:

{{ 
    config(
        materialized='external', 
        location='/path/to/folder/{{ this.name }}.parquet'
) }}

This should write the model name to the folder specified.

However, what happens is the name is written literally as {{ this.name }}.parquet as seen in the screenshot below.

I would expect that the external materialization works with the same syntax as this (but it does not):

COPY (SELECT * FROM {{ this }} ) TO '/tmp/data_catalog/conformed/{{ this.name }}.parquet' (FORMAT 'parquet')

Database increases by initial size after each dbt run is executed

I have a project running on Windows 10 that contains only views (no tables materialized) that select from sources using meta_external-location read_parquet statements. I have found if I delete the database and then successively execute dbt run commands the database file grows in size by approximately its initial run size.

The size of the database file for each successive run is:

29,964 kb
58,380 kb
86,796 kb
115,212 kb

Thanks for all your work on this, I'd love to contribute but I'm not sure where to start looking

External materialization of csv requires a specific `format=csv` config, whereas parquet is inferred

We should make these both inferred based on the extension for a more intuitive DX.

Current code

{{ 
    config(
        materialized='external', 
        location='reports/models/customers.csv',
        format='csv'
    )
    
}}

Would ideally be

{{
    config(
        materialized='external', 
        location='reports/models/customers.csv',
    )
}}

External materializations memory alloc issue

So not exactly sure what is going on here, the error coming back from dbt is not super clear. It looks duckdb is running out of memory.

The error is:

/usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

More context
Before external materializations, I was running my dbt project and then copying the data out to specific folders with an on-run-end macro. This allowed for me to run somewhere in the neighborhood of 100,000 simulations in my project (found over here) on a VM with 8GB of RAM without seeing this issue. The default number of simulations in the project is typically 10k to keep it fast.

However, when using external materializations, I have to reduce the run size to 1k in order for it to run successfully. This leads me to believe that there is a memory “leak” inside of DuckDB. If I were to speculate, DuckDB is holding both the external tables and the duckdb tables in memory, instead of dropping the duckdb tables once the file has been exported to its external storage location.

I am having issues to get register_upstream_external_models() to work

I stripped down my usecase to the following basic example:

stg layer: 1 model that outputs intermediary data as an external csv file (stg_orders)
marts layer: 1 model that uses the staging layer to do some small calculation (orders)

Instead of running the whole project as one (which works) I wanted to split it up into 2 steps and use the register_upstream_external_models() to make sure that this would work.

When I run only the marts package, it fails because it says it cannot find the table.

�[0m09:54:37.537998 [error] [MainThread]: �[33mRuntime Error in model orders (models/marts/orders.sql)�[0m
�[0m09:54:37.539425 [error] [MainThread]:   Catalog Error: Table with name stg_orders does not exist!
�[0m09:54:37.540547 [error] [MainThread]:   Did you mean "temp.pg_catalog.pg_views"?
�[0m09:54:37.541653 [error] [MainThread]:   LINE 11: select * from "coffee_shop_test"."main"."stg_orders"

When I look in the logs I see the following:

�[0m09:54:37.225208 [debug] [MainThread]: Writing runtime sql for node "operation.coffee_shop.coffee_shop-on-run-start-0"
�[0m09:54:37.227409 [debug] [MainThread]: Using duckdb connection "master"
�[0m09:54:37.228500 [debug] [MainThread]: On master: BEGIN
�[0m09:54:37.229899 [debug] [MainThread]: SQL status: OK in 0 seconds
�[0m09:54:37.230767 [debug] [MainThread]: Using duckdb connection "master"
�[0m09:54:37.231775 [debug] [MainThread]: On master: /* {"app": "dbt", "dbt_version": "1.4.0", "profile_name": "duckdb", "target_name": "dev", "connection_name": "master"} */
create schema if not exists main
�[0m09:54:37.233732 [debug] [MainThread]: SQL status: OK in 0 seconds
�[0m09:54:37.237245 [debug] [MainThread]: Writing runtime sql for node "operation.coffee_shop.coffee_shop-on-run-start-0"
�[0m09:54:37.238110 [debug] [MainThread]: Using duckdb connection "master"
�[0m09:54:37.238579 [debug] [MainThread]: On master: /* {"app": "dbt", "dbt_version": "1.4.0", "profile_name": "duckdb", "target_name": "dev", "connection_name": "master"} */
create or replace view "coffee_shop_test"."main"."stg_orders" as (
            select * from 'coffee-data/stg_orders.csv'
          );
�[0m09:54:37.240719 [debug] [MainThread]: SQL status: OK in 0 seconds
�[0m09:54:37.243547 [debug] [MainThread]: Writing injected SQL for node "operation.coffee_shop.coffee_shop-on-run-start-0"
�[0m09:54:37.247490 [info ] [MainThread]: 1 of 1 START hook: coffee_shop.on-run-start.0 .................................. [RUN]
�[0m09:54:37.249385 [info ] [MainThread]: 1 of 1 OK hook: coffee_shop.on-run-start.0 ..................................... [�[32mOK�[0m in 0.00s]
�[0m09:54:37.251099 [info ] [MainThread]: 
�[0m09:54:37.252567 [debug] [MainThread]: On master: ROLLBACK
�[0m09:54:37.254094 [debug] [MainThread]: Failed to rollback 'master'
�[0m09:54:37.254808 [debug] [MainThread]: On master: Close

�[0m09:54:37.259425 [info ] [MainThread]: Concurrency: 1 threads (target='dev')
�[0m09:54:37.260726 [info ] [MainThread]: 
�[0m09:54:37.269843 [debug] [Thread-1 (]: Began running node model.coffee_shop.orders
�[0m09:54:37.271039 [info ] [Thread-1 (]: 1 of 1 START sql external model main.orders .................................... [RUN]
�[0m09:54:37.273265 [debug] [Thread-1 (]: Acquiring new duckdb connection 'model.coffee_shop.orders'
�[0m09:54:37.274562 [debug] [Thread-1 (]: Began compiling node model.coffee_shop.orders
�[0m09:54:37.289179 [debug] [Thread-1 (]: Writing injected SQL for node "model.coffee_shop.orders"
�[0m09:54:37.290757 [debug] [Thread-1 (]: Timing info for model.coffee_shop.orders (compile): 2023-02-24 09:54:37.275617 => 2023-02-24 09:54:37.290464
�[0m09:54:37.291590 [debug] [Thread-1 (]: Began executing node model.coffee_shop.orders
�[0m09:54:37.392108 [debug] [Thread-1 (]: Using duckdb connection "model.coffee_shop.orders"
�[0m09:54:37.393062 [debug] [Thread-1 (]: On model.coffee_shop.orders: BEGIN
�[0m09:54:37.393669 [debug] [Thread-1 (]: Opening a new connection, currently in state init
�[0m09:54:37.403457 [debug] [Thread-1 (]: SQL status: OK in 0 seconds
�[0m09:54:37.404068 [debug] [Thread-1 (]: Using duckdb connection "model.coffee_shop.orders"
�[0m09:54:37.404537 [debug] [Thread-1 (]: On model.coffee_shop.orders: /* {"app": "dbt", "dbt_version": "1.4.0", "profile_name": "duckdb", "target_name": "dev", "node_id": "model.coffee_shop.orders"} */

  
    

    create  table
      "coffee_shop_test"."main"."orders__dbt_tmp"
    as (
      

select * from "coffee_shop_test"."main"."stg_orders"
    );
  
�[0m09:54:37.440405 [debug] [Thread-1 (]: Timing info for model.coffee_shop.orders (execute): 2023-02-24 09:54:37.292091 => 2023-02-24 09:54:37.440173
�[0m09:54:37.440917 [debug] [Thread-1 (]: On model.coffee_shop.orders: ROLLBACK
�[0m09:54:37.443434 [debug] [Thread-1 (]: Failed to rollback 'model.coffee_shop.orders'
�[0m09:54:37.443952 [debug] [Thread-1 (]: On model.coffee_shop.orders: Close
�[0m09:54:37.448232 [debug] [Thread-1 (]: Runtime Error in model orders (models/marts/orders.sql)
  Catalog Error: Table with name stg_orders does not exist!
  Did you mean "temp.pg_catalog.pg_views"?
  LINE 11: select * from "coffee_shop_test"."main"."stg_orders"

The generated sql seems ok, but for some reason it seems to want to rollback, which is probably why it subsequently dbt model does not find the stg_orders view. It is strange that I do not see the actual error that caused the rollback. When I execute the create view statement manually in duckdb it is correct.

stg_orders.sql:

{{
    config(
        materialized = 'external',
        location='coffee-data/stg_orders.csv',
        format='csv'
    )
}}

select range from range(3)

orders.sql:

{{
    config(
        materialized = 'external',
        location='coffee-data/orders.csv',
        format='csv'
    )
}}

select * from {{ ref('stg_orders') }}

If I change the reference in the orders.sql to a source reference it works of course but is not what I want.
Am I missing something or is there something wrong with the register_upstream_external_models function?
Is the actual error swalled somehow, any idea how I could debug this further?

I am using dbt-duckdb 1.4.0 as well as dbt 1.4.0 and duckdb 0.6.1

The database file was created with an newer version of DuckDB.

Thanks for developing both dbt-duckdb and target-duckdb!

Being new to the whole meltano, dbt, duckdb stack I tried to modify the official meltanto tutorial by replacing postgres with duckdb

cat meltano.yml 
version: 1
default_environment: dev
project_id: b1cb5b90-6956-4199-b7ed-e10092dfb5da
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-github
    variant: singer-io
    pip_url: tap-github
    config:
      repository: palaimon/ipyannotator palaimon/fastfm2
      start_date: ' 2016-11-28'
    select:
    - commits.url
    - commits.sha
    - commits.*
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb~=0.4
    config:
      add_metadata_columns: true
      filepath: dump.duckdb
      default_target_schema: tap_github
  transformers:
  - name: dbt-duckdb
    variant: jwills
    pip_url: dbt-core~=1.2.0 dbt-duckdb~=1.2.0
    config:
      path: <my local path>/dump.duckdb

and run into the following error message.

dbt.exceptions.RuntimeException: Runtime Error
  IO Error: Trying to read a database file with version number 39, but we can only read version 38.
  The database file was created with an newer version of DuckDB.
  
  The storage of DuckDB is not yet stable; newer versions of DuckDB cannot read old database files and vice versa.
  The storage will be stabilized when version 1.0 releases.
  
  For now, we recommend that you load the database file in a supported version of DuckDB, and use the EXPORT DATABASE command followed by IMPORT DATABASE on the current version of DuckDB.

The error Trying to read a database file with version number 39, but we can only read version 38. seems pretty clear but I wasn't able to find a way to pin the dbt-duckdb and target-duckdb version in such a way that they are compatible.

A pointer on how to make dbt-duckdb and target-duckdb work together in meltano would be great.

dbt fails to run on duckdb==0.4.0

Any dbt command (I personally tried dbt seed and dbt run) fails with the following error:

Runtime Error
  TransactionContext Error: cannot commit - no transaction is active

Downgrading to duckdb v0.3.2 resolves the issue.

Support for `s3_endpoint`

S3-compatible services require

SET s3_endpoint='<domain>.<tld>:<port>';

https://duckdb.org/docs/extensions/httpfs#configuration

Temp table issues with duckdb >= 0.2.6

I tried bumping the duckdb version up to 0.2.8 (latest as of right now), and ran into a few issues. A couple are minor and I think easily addressed (I can open a PR if desired), but I hit a deeper snag running the pytest suite.

Whenever the test does a CTAS to create a temp table, a subsequent query doesn't see the temp table and fails.

I bisected the git history of duckdb and found duckdb/duckdb#1485 introduced the change, which was released with 0.2.6. But that change looks correct, and the older behavior was in fact a bug in duckdb that was masking another problem (now exposed).

I believe what's happening is an issue in duckdb's handling of multiple cursors on a single connection, but posting here first for advice and in case anyone else is looking to use dbt + duckdb with a recent version.

In duckdb, each call to conn.cursor() creates a new connection. So if a client does:

conn.cursor().execute('create temp table foo as (select 42)')
conn.cursor().execute('select * from foo')

The select will hit an error, because it is running on a separate connection/session and doesn't see table foo. This incorrectly worked previously because the CTAS was not actually creating a temp table.

This code fragment looks a bit silly here, but in the context of dbt this is effectively what is happening as the generic connection logic opens a new cursor prior to each query (see SQLConnectionManager.add_query in dbt).

Other databases (well, Postgres/psycopg2 at least) will share the connection across these cursors.

One potential solution is to open an issue with duckdb to see about connection-sharing across cursors (it may be the same as duckdb/duckdb#1848, in fact). Another is to override the add_query method in dbt's SQLConnectionManager to cache a single cursor for reuse across queries.

Any suggestions?

Glue registration not working "Format is not supported in Glue registrar."

When running a single model aimed to upload a seed csv to a parquet file on S3 and registering it with AWS Glue the following error appears:
"Format is not supported in Glue registrar."

Looking into the code it seems the file format is not passed properly into the "register_glue_table" function in impl.py.

example model:

{{
  config(
    materialized = 'external',
    format = 'parquet',
    location = 's3://example-bucket/example/data.parquet',
    glue_register = true
    )
}}
select 'dbt-duckdb rocks'

with profile:

example:
  outputs:
    dev:
      type: duckdb
      path: ":memory:"
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: 'eu-central-1'
        s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY') }}"
      format: 
  target: dev

I'd love to help out with submitting a fix, only not sure if this needs to be done in the macro or the python code.
I'll try to create a PR to fix it

Incremental external models

Do you think it's feasible to add support for incremental materialisation to external location? Ie. extending the external materialisation to support incremental semantics.

Namely I was thinking of the simplest case of an "append" strategy implemented as just writing a new parquet file. And the model overall would be defined with a glob location.

Do you see any drawbacks of such an approach? I'm happy to submit a PR if you point me in the right direction in terms of what API you'd expect this to have? Make this part of the external materialisation, or of the incremental one, or a completely new one?

Thanks
Nuno

dbt models that are parquet files

I haven't thought through this deeply. It might not make sense, it might require changes to dbt, or it might already work? But I wanted to raise it just in case, because it would help me out with something I'm building.

Could we enable configuring dbt-duckdb such that

select
    bla
from {{ ref("orders") }}

compiles to

select
    bla
from 's3://bucket/orders.parquet'

Support aws webidentity token credentials when interacting with s3 on duckdb

When looking through the code, I noticed that duckdb itself does not support the default aws credentials provider chain. Probably because they do not want to have external dependencies.

I am using dbt-duckdb and for the glue integration you are correctly using the credential provider chain if you do not pass the credentials explicitly. I was wondering if you also want to provide this for the s3 integration?

Next to providing the static credentials through settings properties (as exists at the moment), you could use the boto3 client to get and update credentials when they are expired. These credentials can then be used in the DuckDBConnectionWrapper such that they are always valid and any authentication mechanism can be used.

Example usecase for us:
we use aws irsa for our pods on kubernetes and get the following credentials. Basically a roleArn and a webIdentityToken file, which gets updated every hour. Before I can run dbt commands, I now need to convert these aws credentials to the correct format as is accepted by duckdb as follows:

import boto3

session = boto3.Session()
AWS_ACCESS_KEY_ID = session.get_credentials().access_key
AWS_SECRET_ACCESS_KEY = session.get_credentials().secret_key
AWS_SESSION_TOKEN = session.get_credentials().token

Only then I can execute dbt compile ...
Apart from annoying entrypoint, this has as disadvantage that if my job would take longer than an hour, the credentials would expire...

I think it would be great it the dbt-duckdb project would do this for me. Since you already depend on boto3 the impact is minimal and in line with what you already implemented for the glue integration. What do you think?

Create a simple plugin system for writing data to external destinations

The same as in #141, though, now for writing data to external destinations.

Motivation

The reasoning is similar to reading. Sometimes the data has to be presented in a different system. For example, I use jdbc tables in Spark to write data to a SQL server. The SQL server might be connected to a dashboard tool, which can be preferred over connecting the Spark cluster.

The general approach for the plugins is fantastic! Users could implement anything; maybe they want to send an email with a report summarizing the data quality of a data model.

External sources: more options & docs

I am interested in using external sources. What I am missing in either features or the documentation (if the feature already exists):

How do I set the delimiter for my CSV file?
How do I specify a local path?
What do the list of tables mean and how can I document that? The CSV/parquet would be a single table, why are there multiple in the docs?

Handle column names with spaces

I know it's a bad practice, but you can't always help it that is how you get your source data :)

If column names have spaces and you define tests for them, then you get errors.

E.g.

          - name: "van tijdstip"
            type: time
            description: Start time of the quarter
            tests:
              - not_null

20:24:34    Parser Error: syntax error at or near "tijdstip"
20:24:34    LINE 15: where van tijdstip is null

Probably need to surround all column names with double quotes.

Models that depend on external models do not automatically register their dependencies as views

Suppose we specify +materialization: external in our dbt_project.yml.

Initially we run all dbt computations using an in-memory db, materializing all the models as parquet files.

Subsequently we change one dbt model A, and want to recompute model A (and possibly downstream dependencies as well).

Now, if A depends on upstream models B and C, trying to run just A will result in a failure. Because, B and C are not registered as views yet (in-memory db!), despite all tables being specified as external. This forces the user to always recompute upstream dependencies just to trigger the registration logic https://github.com/jwills/dbt-duckdb/blob/master/dbt/include/duckdb/macros/materializations/external.sql#L51-L55

I'd like to propose putting in a check in the materialization macros to

loop through the parent nodes
for each parent_node, if load_cached_relation(parent_node) is None, call the registration logic.

Does this sound reasonable? Happy to send a PR.

upgrade to support dbt-core v1.4.0

Background

The latest version of dbt Core,dbt-core==1.4.0, was published on January 25, 2023 (PyPI | Github). In fact, a patch, dbt-core==1.4.1 (PyPI | Github), was also released on the same day.

How to upgrade

dbt-labs/dbt-core#6624 is an open discussion with more detailed information. If you have questions, please put them there! dbt-labs/dbt-core#6849 is for keeping track of the community's progress on releasing 1.4.0

The above linked guide has more information, but below is a high-level checklist of work that would enable a successful 1.4.0 release of your adapter.

support Python 3.11 (only if your adapter's dependencies allow)
Consolidate timestamp functions & macros
Replace deprecated exception functions
Add support for more tests

the next minor release: 1.5.0

FYI, dbt-core==1.5.0 is expected to be released at the end of April. Please plan on allocating a more effort to upgrade support compared to previous minor versions. Expect to hear more in the middle of April.

At a high-level expect much greater adapter test coverage (a very good thing!), and some likely heaving renaming and restructuring as the API-ification of dbt-core is now well underway. See https://github.com/dbt-labs/dbt-core/milestone/82 for more information.

dbt-duckdb should know about Excel

Essentially, any external file format or storage system that can read/write a pandas.DataFrame (or an arrow record reader, or a polars data frame, or a JSON/CSV file) and that has Python library support should be usable as either a) a source or b) an external materialization target in dbt-duckdb.

Reference: dbt-excel.com

Parquet materialization

👋

Hey!

Something I've been thinking about is using parquet as a materialization. To be specific, one where dbt-duckdb would use the underlying duckdb connection as a shim to edit parquet rather than adding some tables into the .duckdb file.

I'm not sure if it's possible to override what ref does in an adapter, but this would roughly need to do two things:

Make a new materialization that just took the query for some model and executes
copy (select ...) to 'model_name.parquet' (format parquet) rather thancreate table model_name as (select ...) for parquet materialized models.
Update ref to use read_parquet(ref'd model name) rather than schema.table_name if the model being referenced was materialized as parquet.

Would love to hear what other people think / if this would be useful!

This is loosely related to: #15

duckdb / dbt-duckdb Goto Github PK

dbt-duckdb's Introduction

DuckDB

Installation

Data Import

SQL Reference

Development

Support

dbt-duckdb's People

Contributors

Stargazers

Watchers

Forkers

dbt-duckdb's Issues

TLDR

Install commands

Expected output

Actual output

Background

How to upgrade

Instructions to reproduce

Work-around

Background

How to upgrade

the next minor release: 1.6.0

Describe the bug

To Reproduce

Expected behavior

Your environment

Background

How to upgrade

Motivation

Background

How to upgrade

the next minor release: 1.5.0

Recommend Projects

Recommend Topics

Recommend Org

the next minor release: `1.6.0`