brooklyn-data / dbt_artifacts Goto Github PK
View Code? Open in Web Editor NEWA dbt package for modelling dbt metadata. https://brooklyn-data.github.io/dbt_artifacts
License: Apache License 2.0
A dbt package for modelling dbt metadata. https://brooklyn-data.github.io/dbt_artifacts
License: Apache License 2.0
Issue Description
Model fct_dbt__run_results
has a wildcard-enhanced description of all dynamically built env_*
columns.
However, if the dbt_project.yml
is set to
models:
+persist_docs:
columns: true
this results in an error in dbt run
when dbt tries to attach the description to the target table/view, as the column_name env_*
is not a valid column name because of the *
. Furthermore, the operation would fail anyway, since the column simply does not exist.
Summary:
15:05:40 | 41 of 62 START incremental model core_dbt_artifacts.fct_dbt__run_results [RUN]
15:05:46 | 41 of 62 ERROR creating incremental model core_dbt_artifacts.fct_dbt__run_results [ERROR� in 6.16s]
Database Error in model fct_dbt__run_results (models/incremental/fct_dbt__run_results.sql)
001003 (42000): SQL compilation error:
syntax error line 19 at position 12 unexpected '*'.
compiled SQL at target/run/dbt_artifacts/models/incremental/fct_dbt__run_results.sql
Detailed log:
2021-07-09T15:05:46.042098Z: On model.dbt_artifacts.fct_dbt__run_results: /* run by philipp.***** in dbt */
alter table eh_account_dev_philipp_leufke.core_dbt_artifacts.fct_dbt__run_results alter
command_invocation_id COMMENT $$The id of the command which resulted in the source artifact's generation.$$ ,
artifact_generated_at COMMENT $$Timestamp of when the source artifact was generated.$$ ,
dbt_version COMMENT $$The version of dbt used to generate the source artifact.$$ ,
elapsed_time COMMENT $$The total run time of the command.$$ ,
execution_command COMMENT $$The actual command used.$$ ,
selected_models COMMENT $$A list of model selectors used in the command.$$ ,
target COMMENT $$The configured target for the command.$$ ,
was_full_refresh COMMENT $$Was the run executed with a --full-refresh flag?$$ ,
env_* COMMENT $$Columns for the environment variables set when the command was executed.$$ ;
2021-07-09T15:05:46.232293Z: Snowflake query id: 019d7929-0200-982d-0000-649500806272
2021-07-09T15:05:46.232396Z: Snowflake error: 001003 (42000): SQL compilation error:
syntax error line 19 at position 12 unexpected '*'.
2021-07-09T15:05:46.232539Z: finished collecting timing info
Discussion
I am unsure if the dbt is the culprit here, as I am not aware of how common it is to document a set of columns using wildcards in the column names.
Deactivating the persist_docs option for columns does fix the issue, but changes the result of the dbt run, of course.
We still would love to persist the descriptions in the column comments on the target DB.
Right now none of the models gives the information about the number of failures for a given test. If we write a test that gives a number of faulty records, then having this information in model will open the possibility to present it on some kind of dashboard
in stg_dbt__node_executions.sql:
`surrogate_key as (
select
{{ dbt_utils.surrogate_key(['fields.command_invocation_id', 'fields.node_id']) }} as node_execution_id,
fields.command_invocation_id,
fields.dbt_cloud_run_id,
fields.artifact_run_id,
fields.artifact_generated_at,
fields.was_full_refresh,
fields.node_id,
base_nodes.resource_type,
split(fields.result_json:thread_id::string, '-')[1]::integer as thread_id,
fields.status,
fields.result_json:message::string as message,
fields.compile_started_at,
fields.query_completed_at,
fields.total_node_runtime,
fields.result_json:adapter_response:rows_affected::int as rows_affected,
fields.result_json:failures::int as failures,
fields.result_json
from fields
-- Inner join so that we only represent results for nodes which definitely have a manifest
-- and visa versa.
inner join base_nodes on (
fields.artifact_run_id = base_nodes.artifact_run_id
and fields.node_id = base_nodes.node_id)
)in fct_dbt__test_executions.sql:
fields as (
select
node_execution_id as test_execution_id,
command_invocation_id,
dbt_cloud_run_id,
artifact_run_id,
artifact_generated_at,
was_full_refresh,
node_id,
thread_id,
status,
compile_started_at,
query_completed_at,
total_node_runtime,
rows_affected,
failures
from test_executions_incremental
)`
If you could add this, that would be awesome :)
I am getting compilation Error in the model. Below is message it generates while running all models under dbt_atifacts module it failed with model fct_dbt__critical_path.
2022-03-24T09:44:58.342+05:30Copy Processing aborted due to error 300010:391167117; incident 7185034. | Processing aborted due to error 300010:391167117; incident 7185034.
I would like to be able to leverage the package to present data source information in downstream tools, including object descriptions, tags, and metadata.
Proposal would be to add the following columns to dim_dbt__sources
, dim_dbt__models
and dim_dbt__exposures
:
Has this been considered before? I may have some time to contribute this if it's desired.
Right now users upload artifacts by specifying which artifacts to upload (manifest, catalog, etc.) That can be a pain to remember which artifacts to upload after which dbt command.
I'd be interested in seeing a wrapper macro so I could upload artifacts via:
dbt run-operation upload_dbt_artifacts_by_command --args '{command: run}'
This would just be a convenience feature that would map the command to the appropriate artifacts and run the existing macro to upload them. You could still upload specific artifacts using the existing macro.
I'm open to contributing if you're interested in adding this and don't want to spend the time on your end.
It appears that nodes where resource_type = 'seed'
and resource_type = 'snapshot'
are excluded from this package, although these are important operations to track. I would propose adding these as additional dim_
and fct_
tables to complement the existing test_executions
and model_executions
models.
The reason these are currently being exclude is due to this line in stg_dbt__model_executions
:
dbt_run as (
select *
from run_results
where data:args:which = 'run'
),
Filtering on the run
argument will filter out dbt snapshot
and dbt seed
calls.
It would be epic to have this package support Redshift. One of the things we will need to discuss is how we will recommend loading into Redshift (or S3 would be better for Redshift)
Hi! I hope this format is ok for a new issue, I didn't see any explicit guidelines in the README.
Currently, the README for 0.7.0 references ..._v2
functionality concerning artifact loads, but does not contain the code (which is in the 0.8.0 prerelease branches.)
This is somewhat confusing for people currently integrating the package into their projects and should probably not already be present in the current 0.7.0 release documentation.
Thank you for all the hard work on this, very neatly solves observability issues with dbt/dbt-Cloud!!!
Hi there!
I've been playing around with this package for a while, and really like all the info it provides! I see that there are currently a few issues out there to support other warehouses, and it looks like the main thing slowing that support down is the upload of the artifacts.
I'm curious if you have considered swapping to the context variables that dbt provides as the source of data. I haven't done an exhaustive review of the data between the two, but it looks like the results variable may have everything you need from run results, and the graph variable may have the things you need from the manifest. Since you can pull these directly from dbt without the need to import files, it may solve some problems on that side of things.
I'm in the very early stages of thinking about a stand-alone package to parse those two variables and load them to a database into audit style tables, and thought it may be helpful here as well.
Edit: A good example of this for the graph object is the recently released project evaluator package. These macros from that package show how you can parse the graph object to get info on the project.
Hello, recently I was tracking down why many of my models were empty and I tracked it down to the command I was using to invoke the new upload macro:
dbt run-operation upload_dbt_artifacts_v2
I found that my manifest.json was being replaced by the macro's execution and that was causing the STG_DBT__NODE_EXECUTIONS
model to have zero rows because it has an inner join between stg_dbt__artifacts
and stg_dbt__nodes
. In my case the issue was resolved by using this command to upload instead:
dbt --no-write-json run-operation upload_dbt_artifacts_v2
I recommend the README.md be updated to include the --no-write-json
parameter.
I also recommend committing the fix for issue #64 because that was also one of my root causes.
I think a really interesting use case for Meltano users would be to pair dbt_artifacts
with https://github.com/prratek/tap-dbt-artifacts
Just from eyeballing the staging models in dbt_artifacts
and comparing them with the the schemas generated by tap-dbt-artifacts
, it looks like it should be pretty easy to get these two talking to each other; I can imagine a few potential approaches but would might be to create staging tables that shape the tap-dbt-artifacts
generated tables into the form expected by the sources defined at https://github.com/brooklyn-data/dbt_artifacts/blob/main/models/staging/sources.yml
I'm interested in taking a stab at this but would be curious for feedback on best approach from the perspective of this package's maintainers.
hi team:
I am scratching my head on this one, I am running DBT from DBT cloud
first step is: dbt snapshot ( I have also tried dbt snapshot -s dbt_artifacts)
2nd step I run dbt run --exclude dbt_artifacts
3nd step is : dbt run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}'
4th step is : dbt run -m dbt_artifacts
but after all processed, I have data in FCT_DBT__RUN_RESULTS and FCT_DBT__MODEL_EXECUTIONS
plus the DIM_DBT__SNAPSHOTS dimension is populated, I think from manifest..
I just dont know why the FCT_DBT__SNAPSHOT_EXECUTION doesnt?
thanks
We very rarely refresh our whole dbt project in one go - much more likely is that individual marts are refreshed separately.
This makes the critical path model a little wierd, because it currently only picks up the most recent critical path (for whichever mart ran most recently).
I have a suggestion to make this better:
<target>|<selector>
. That results in default|*
for most simple projects (if no explicit selectors is rendered as *
), or e.g. my_target|models/mart_a|my_package
for more complicated setups (if multiple selectors are also delimited by |
). I'm going to call this a job_selector
.job_selector
, the critical path makes sense again, so adding this as a column to that model, would then mean users could see the critical path for each job_selector
. For users with a basic setup, this wouldn't add any additional granularity to that model, because they would only have one.job_selector
ids are allowed to materialise in this model (maybe not something to build in round 1 though).latest_full_model_executions
model and the new current_models
model - in that the latter just has one row per node, but the former has one row per nod per job selector.Does this sound like a viable solution to the problem? @NiallRees
I recently implemented this package and it has been working great; however, we recently switched away from dbt run to dbt build. In doing so, I noticed that artifacts were no longer being parsed out into the fact and dimension models during a dbt_artifacts run despite being in the dbt_artifacts table.
Upon investigation, the model execution and run results models have a filter to limit the artifacts to those which are from run commands only, and not builds. I propose the filter be adjusted to data:args:which in ('run', 'build')
for these models (or at least do the reverse and filter out non-run commands for future scalability):
Unless there is a reason in particular while builds are not included - please let me know. Thanks!
The unique test for the field dim_dbt__exposures.manifest_model_id
is failing when depends_on_nodes
has more than one node. It seems like for this model, the depends_on shouldn't get expanded, and there should be a separate model to cross-reference exposures and their depends-on models?
Alternately, the test should get switched to a dbt_utils.unique_combination_of_columns
with manifest_model_id
and output_feeds
.
I'm glad to contribute to this if you let me know which way you want it to go - split out output_feeds
or keep it in and switch the test.
Hey there!
We've been collecting dbt artifacts for a while now and already back when we started I noticed that the fct_dbt__critical_path
table looked very suspicious (i.e. too good to be true) in our case, and it still does:
I tried to investigate this issue by looking at the structure of the fct_dbt__critical_path
model and it seems that everything until search_path (node_ids, total_time)
produces the expected results (in particular, in our case). I was just confused by the fact that total_time
column seems to be a string instead of float. However, starting with the next CTE called longest_path_node_ids
we are getting the wrong node sequence. There must be something that this case when
does that I don't understand.
Any help would be appreciated!
Hey there - was working on installing this package and was running into some issues with the fct_dbt__critical_path
table being empty. I took out part of the CTE in fct_dbt_latestfull_model_executions
that filters for full runs as this is not super helpful to us i.e.
latest_full as (
select *
from run_results
where
--selected_models is null and
was_full_refresh = false
order by artifact_generated_at desc
limit 1
),
but still no dice. I thought it was because the model joins on command_invocation_id
which seems to be different for each artifact created> As in the run_result json
has a different id than the manifest.json
even though we are running them on the same command line?
that join happens in the same model here-
joined as (
select
model_executions.*
from latest_full
left join model_executions on model_executions.command_invocation_id = latest_full.command_invocation_id
Any help is greatly appreciated.
This is a question raised by #75. Incidentally, my suggestion here would be an alternate solution, which would fix the bug in that issue.
The documented intent of the fct_dbt__latest_full_model_executions
model is:
A list of all models and executions times from the most recent, incremental run.
Our setup as a segmented approach to refreshing different marts, and so the most recent run overall, will almost never contain all models. I'd like to raise the possibility of changing how this model works - or at least clarifying it's purpose. In my mind there are two options:
The second interpretation would be much more useful in our context, but would be a change in the intent of what's going on. For any users who refresh the whole project monolithically - the main difference would be that any models which had been removed from the project would still appear in this model, just with a much older "last run" timestamp.
@NiallRees - do you think this change makes sense, or should we make a new model for this purpose?
We're intermittently experiencing this failure:
Failure in test not_null_fct_dbt__latest_full_model_executions_node_id (models/schemas.yml)
Got 1 result, configured to fail if != 0
compiled SQL at target/compiled/dbt_artifacts/models/schemas.yml/schema_test/not_null_fct_dbt__latest_full_model_executions_node_id.sql
On inspection this happens when a new upload coincides with an incremental run of the package. Due to (what I'm pretty sure is...) a race condition means that while the most recent run is present in fct_dbt__run_results
, it is not present in fct_dbt__model_executions
- and so the most recent run appears to have no model executions. The current workaround is to run a full rebuild, which take significant time.
A better solution would be to only allow runs which have model executions in this model. It adds complexity but would explicitly eliminate this race condition.
thread_id in run_results.json can have a different format from the default, which causes the mapping to fail
https://docs.python.org/3/library/threading.html#threading.Thread
name is the thread name. By default, a unique name is constructed of the form “Thread-N” where N is a small decimal number, or “Thread-N (target)” where “target” is target.name if the target argument is specified.
Caused error:
Database Error in model fct_dbt__test_executions (models/incremental/fct_dbt__test_executions.sql)
100071 (22000): Failed to cast variant value "6 (worker)" to FIXED
Database Error in model int_dbt__model_executions (models/incremental/int_dbt__model_executions.sql)
100071 (22000): Failed to cast variant value "1 (worker)" to FIXED
Database Error in model fct_dbt__snapshot_executions (models/incremental/fct_dbt__snapshot_executions.sql)
100071 (22000): Failed to cast variant value "4 (worker)" to FIXED
With ~1 year of historical manifest.json
and run_results.json
data, we have started experiencing timeouts running --full-refresh
of dbt_artifacts
.
1832 | 2021-04-15 15:43:49: 2021-04-15 15:43:49,243 - root - INFO - Database Error in model dim_dbt__sources (models/incremental/dim_dbt__sources.sql)
-- | --
2021-04-15 15:43:49: 2021-04-15 15:43:49,243 - root - INFO - 000630 (57014): Statement reached its statement or warehouse timeout of 1,200 second(s) and was canceled.
Possible solutions:
#81 migrated several OR statements into a consistent artifact_run_id
. This was a good improvement but results in a few tricky issues about granularity.
Situation:
seed
, run
and test
(I don't believe the problem I'm highlighting would occur in the case of running build
).upload_artifacts
macros.run_results
for each step, with each of the models/tests/seeds executed at each step (which is good) and also a duplicate manifest for each, with a different command_invocation_id
but the same artifact_run_id
.MERGE
statements in #99 don't solve this situation currently because we're merging on artifact_run_id
AND atrifact_generated_at
.artifact_run_id
in models we get a cartesian join and more rows than we need. Most specifically in stg_dbt__node_executions
, but also in fct_dbt__snapshot_executions
, fct_dbt__seed_executions
& fct_dbt__model_executions
.I think the solution to this is to change the join in these cases to join only on command_invocation_id
which is unique to a given artifact - and avoids the cartesian join.
It would be epic to have this package support Databricks. One of the things we will need to discuss is how we will recommend loading into Databricks
A few questions have come though on the dbt
slack workspace about deployment on dbt Cloud relating to artefact uploads. It would be useful to have a definitive method for deployment on dbt Cloud, in addition to the manual upload and CLI run-operation
options.
As I understand it, the key issue revolves around the invocation_id
generated by dbt and stored in run-results.json
and manifest.json
. This is the same for a given command, however run-operation
(used to call the macro to upload artefacts) creates its own manifest.json
replacing the one from any previous operation 🤦♂️ This results in dissimilar invocation_id
s in run-results.json
and manifest.json
, breaking functionality that relies on joining the two artefacts.
For those calling run-operation
via the CLI, the --no-write-json
global flag is reported to solve this particular issue. It would also be possible to configure an env_var prefixed with DBT_ENV_CUSTOM_ENV_
that is injected into each artefact as per the docs. These vars would appear in stg_dbt__run_results_env_keys
and could be used in other artefacts to correlate artefacts from the same 'run' (this is not implemented in dbt_artifacts
but could be if it was of use).
However, dbt Cloud supports neither --no-write-json
nor custom env_vars 🤔 dbt Cloud does however provide env_vars with job and run id's that could potentially be used for this purpose. I cannot determine if these make it into the artefacts or not. If they don't, an alternative approach might be to inject them directly via the upload_artifacts
copy query. Something like:
begin;
copy into {{ src_dbt_artifacts }} from
(
select
$1 as data,
$1:metadata:generated_at::timestamp_ntz as generated_at,
metadata$filename as path,
regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type,
-- dbt cloud env_vars
{{ env_var('DBT_CLOUD_PROJECT_ID', 'not on cloud') }} as dbt_cloud_project_id,
{{ env_var('DBT_CLOUD_JOB_ID', 'not on cloud') }} as dbt_cloud_job_id,
{{ env_var('DBT_CLOUD_RUN_ID', 'not on cloud') }} as dbt_cloud_run_id,
{{ env_var('DBT_CLOUD_RUN_REASON_CATEGORY', 'not on cloud') }} as dbt_cloud_run_reason_category,
{{ env_var('DBT_CLOUD_RUN_REASON', 'not on cloud') }} as dbt_cloud_run_reason,
{{ env_var('DBT_CLOUD_PR_ID', 'not on cloud') }} as dbt_cloud_pr_id,
{{ env_var('DBT_CLOUD_GIT_SHA', 'not on cloud') }} as dbt_cloud_git_sha
from @{{ src_dbt_artifacts }}
)
file_format=(type='JSON')
on_error='skip_file';
commit;
Help, testing and comments on this very welcome 🙏
I ran dbt build followed by dbt run-operation upload_dbt_artifacts_v2 - four times on empty tables
the results from this query:
select distinct
artifact_run_id, command_invocation_id, 'dbt_run_results_nodes' as _source
from dbt_artifacts.dbt_run_results_nodes
union all select distinct
artifact_run_id, command_invocation_id, 'dbt_run_results' as _source
from dbt_artifacts.dbt_run_results
union all select distinct
artifact_run_id, command_invocation_id, 'dbt_manifest_nodes' as _source
from dbt_artifacts.dbt_manifest_nodes
order by _source, command_invocation_id
show that the ids from the manifest do not match those with the run_result nodes.
The result is empty tables after running the models from dbt_artifacts
Hi there - this package fits our needs for monitoring model run times perfectly, and we were able to run all aspects of it locally using the dbt CLI and in our development environments on dbt Cloud using the dbt Cloud IDE; however, we're unable to get the dbt-run-operation upload_dbt_artifacts --args '{filenames: [manifest, run_results]}'
to run as a step in a dbt Cloud job due to this recurring error:
Running with dbt=0.19.1
Clearing existing files from Stage:
remove @ANALYTICS.dbt_artifacts.artifacts pattern='.*.json.gz';
Encountered an error while running operation: Database Error
002003 (02000): SQL compilation error:
Stage 'ANALYTICS.DBT_ARTIFACTS.ARTIFACTS' does not exist or not authorized.
Some troubleshooting steps we've taken so far:
dbt_project.yml
file and reinstalling the packagedev
and prod
targets that we use internally as well as the default
targetAll of those steps result in the same failure message shown above only when running the operation from a dbt Cloud job.
I've also tried using our dbt Cloud user's credentials, which we typically only use for dbt Cloud jobs, and was able to get the artifact upload to work properly when running locally.
Do you have any ideas on why we might be seeing this failure or other steps I could try to troubleshoot? I've tried doing most of the troubleshooting on uploads following some sort of dbt run
operation (those run operations have succeeded). I am also able to run the artifact models from a dbt Cloud job successfully, it only seems to be the uploads that are causing issues.
Any advice would be appreciated - thanks for the help!
When I try to run all models by running dbt build -s dbt_artifacts
, I get the following error:
Database Error in model stg_dbt__node_executions (models/staging/stg_dbt__node_executions.sql)
07:36:46 001789 (42601): SQL compilation error:
07:36:46 invalid number of result columns for set operator input branches, expected 12, got 15 in branch 2
07:36:46 compiled SQL at target/run/dbt_artifacts/models/staging/stg_dbt__node_executions.sql
When checking the model compiled SQL, I noticed that indeed those two things produce different number of columns, thus UNION can't work:
-- V1 uploads
{{ flatten_results("run_results") }}
union all
-- V2 uploads
-- NB: We can safely select * because we know the schemas are the same
-- as they're made by the same macro.
select * from base_v2
This is my first attempt of trying to use your package, so I'm not sure whether I'm missing something. I'm also not sure what' V1 uploads sql comment means, since I've only ever used only V2 macro. I'm on 0.8.0 version.
Steps to reproduce:
dbt run-operation create_artifact_resources
dbt run -m +model_name
dbt --no-write-json run-operation upload_dbt_artifacts_v2
dbt build -s dbt_artifacts
Hey dbt_artifacts Team!
So I haven't yet migrated to dbt v1.0.0 due to the fact that we use the SQLFluff linter (Thank you Tails.com for that one too!) in our pre-commit checks. With that, I suspect the dbt_artifacts doesn't support v1.0.0 right now. Is there a timeline for when this package will support v1.0.0?
If not, I'm curious to try and work on this if it seems like a relatively low lift from y'all's perspective to implement.
Hey thanks so much for doing this it has really helped me track our dbt run times.
In the run_results.json
artefact, I see that the env property is null:
This means that stg_dbt__run_results_env_keys is not being populated because the sql is looking for a env.key field. Is there anyway we can populate this field so I can use all the models. I am running dbt on the cloud. CTE below which references this field:
Is the env variable a variable that I should be setting or an argument I should be passing in?
stg_dbt__run_results
show models args as selected_models column, but since moving to dbt1, the models flag has been deprecated and the select flag is being used instead. It would be great to have the information from the select flag available in the model.
An example implementation could be replacing this line:
https://github.com/brooklyn-data/dbt_artifacts/blob/main/models/staging/stg_dbt__run_results.sql#L36
by something like the following code:
coalesce(data:args:models, data:args:select) as selected_models,
to account for projects transitioning from dbt prior to dbt1 to dbt1.
The Unique test on this model is failing when I run on dbt Cloud:
Failure in test unique_fct_dbt__model_executions_model_execution_id (models/schemas.yml)
I suspect this is a result of two things, but it might still need more digging.
dim_dbt__models
.dbt_cloud_run_id
environment variable for this join in fct_dbt__model_executions
:
model_executions_with_materialization as (
select
model_executions_incremental.*,
models.model_materialization,
models.model_schema,
models.name
from model_executions_incremental
left join models on
(
model_executions_incremental.command_invocation_id = models.command_invocation_id
or model_executions_incremental.dbt_cloud_run_id = models.dbt_cloud_run_id
)
and model_executions_incremental.node_id = models.node_id
),
This doesn't immediately get caught if you use dbt Cloud for ci because the first dbt run
doesn't have the artifacts from itself, so only the manifest from dbt seed
exists the first time dbt_artifacts runs. Any subsequent run fails.
A simple workaround is not to upload manifest after dbt run
(or seed, but I figure it's good to upload it at the earliest possible step), but I'm not sure if that breaks anything later on in this package? If not, it might be a simple documentation fix to take manifest out of that step. I tested this and it resolves the failing test, I just don't know if this loses any detail I should have.
Another possible solution would be to coalesce(dbt_cloud_run_id, command_invocation_id)
into a single field so when the cloud_run_id is present it gets used, otherwise it falls back on the command_invocation_id. This might be a bigger task throughout the package, but I think would have the side effect of solving this issue.
we are seeing below test failures on dbt_artifacts models.
Failure in test unique_dim_dbt__current_models_manifest_model_id (models/schemas.yml)
Failure in test unique_dim_dbt__models_manifest_model_id (models/schemas.yml)
Failure in test unique_dim_dbt__sources_manifest_source_id (models/schemas.yml)
Failure in test unique_fct_dbt__critical_path_node_id (models/schemas.yml)
Failure in test unique_fct_dbt__latest_full_model_executions_node_id (models/schemas.yml)
Failure in test unique_fct_dbt__model_executions_model_execution_id (models/schemas.yml)
on investigation, we found that the below inner join fix for issue 75 is doing a 1 to many join and causing entire row duplicates.
"-- #75 inner join run_results on dbt_sources.artifact_run_id = run_results.artifact_run_id ) "
This is because we upload the run results after every dbt run where we generate multiple run results and single manifest. If we can change the join to where exists this duplication will be resolved as the idea is to only enforce and not pull any other fields from inne r join view.
We fixed it for time being by using 0.6.0 version.
Note: We are running this on a single tenancy dbt cloud
Thanks for creating this package - we have found it very useful in our dbt projects!
One thing we found missing was capturing information about exposures defined within our dbt projects. Especially, we were interested in exposing to our end users latest update times of different output models feeding the different exposures. We can extract the latest run of each model from the FCT_DBT__MODEL_EXECUTIONS model but exposures are not captured in any of the models.
Based on your work, we have created three models to capture this information and would be very keen to explore if you'd be interested in incorporating it into your package. Similar to your models capturing data on MODELS we built:
Let me know if that's something you'd be interested in!
Re: #62
Hi!
Just gave the V2 uploader a spin, and I can't seem to get around the 16MB limit still. Am I doing something wrong?
Debugging the sql tells me that the raw_data
cte is where the 100069 (22P02): Error parsing JSON: document is too large, max size 16777216 bytes
error comes from. So I imagine that at that point, the sql is still trying to parse the manifest as one single variant..?
I've run dbt run-operation create_artifact_resources
beforehand.
Log:
dbt --no-write-json run-operation upload_dbt_artifacts_v2
13:52:38 Running with dbt=1.0.3
13:52:38 Clearing existing files from Stage:
remove @analytics_dev.dbt_larspetter.dbt_artifacts_stage pattern='.*\/c39ee8dc-197a-454e-9814-b4dddd921eea\/.*\.json.gz';
13:52:39 Uploading manifest.json to Stage:
put file://target/manifest.json @analytics_dev.dbt_larspetter.dbt_artifacts_stage/c39ee8dc-197a-454e-9814-b4dddd921eea auto_compress=true;
13:52:42 Persisting flattened manifest nodes manifest.json from Stage:
-- Merge to avoid duplicates
merge into analytics_dev.dbt_larspetter.dbt_manifest_nodes as old_data using (
with raw_data as (
select
manifests.$1:metadata as metadata,
metadata:invocation_id::string as command_invocation_id,
-- NOTE: DBT_CLOUD_RUN_ID is case sensitive here
metadata:env:DBT_CLOUD_RUN_ID::int as dbt_cloud_run_id,
sha2_hex(coalesce(dbt_cloud_run_id::string, command_invocation_id::string), 256)
as artifact_run_id,
metadata:generated_at::timestamp_tz as generated_at,
manifests.$1 as data
from @analytics_dev.dbt_larspetter.dbt_artifacts_stage/c39ee8dc-197a-454e-9814-b4dddd921eea as manifests
)
select
manifests.command_invocation_id,
manifests.dbt_cloud_run_id,
manifests.artifact_run_id,
manifests.generated_at::timestamp_tz as artifact_generated_at,
node.key as node_id,
node.value:resource_type::string as resource_type,
node.value:database::string as node_database,
node.value:schema::string as node_schema,
node.value:name::string as name,
-- Include the raw JSON to unpack other values.
node.value as node_json
from raw_data as manifests,
lateral flatten(input => manifests.data:nodes) as node
union all
select
manifests.command_invocation_id,
manifests.dbt_cloud_run_id,
manifests.artifact_run_id,
manifests.generated_at::timestamp_tz as artifact_generated_at,
exposure.key as node_id,
'exposure' as resource_type,
null as node_database,
null as node_schema,
exposure.value:name::string as name,
-- Include the raw JSON to unpack other values.
exposure.value as node_json
from raw_data as manifests,
lateral flatten(input => manifests.data:exposures) as exposure
union all
select
manifests.command_invocation_id,
manifests.dbt_cloud_run_id,
manifests.artifact_run_id,
manifests.generated_at::timestamp_tz as artifact_generated_at,
source.key as node_id,
'source' as resource_type,
source.value:database::string as node_database,
source.value:schema::string as node_schema,
source.value:name::string::string as name,
-- Include the raw JSON to unpack other values.
source.value as node_json
from raw_data as manifests,
lateral flatten(input => manifests.data:sources) as source
) as new_data
-- NB: We dedupe on artifact_run_id rather than command_invocation_id for manifest nodes
-- to avoid holding duplicate data.
on old_data.artifact_run_id = new_data.artifact_run_id and old_data.node_id = new_data.node_id
-- NB: No clause for "when matched" - as matching rows should be skipped.
when not matched then insert (
command_invocation_id,
dbt_cloud_run_id,
artifact_run_id,
artifact_generated_at,
node_id,
resource_type,
node_database,
node_schema,
name,
node_json
) values (
new_data.command_invocation_id,
new_data.dbt_cloud_run_id,
new_data.artifact_run_id,
new_data.artifact_generated_at,
new_data.node_id,
new_data.resource_type,
new_data.node_database,
new_data.node_schema,
new_data.name,
new_data.node_json
)
13:52:44 Encountered an error while running operation: Database Error
100069 (22P02): Error parsing JSON: document is too large, max size 16777216 bytes
File 'c39ee8dc-197a-454e-9814-b4dddd921eea/manifest.json.gz', line 1, character 16777216
Row 0, column $1
Hi folks,
I'm attempting to get artifacts running, but I am getting the following error when I try to dbt run --models dbt_artifacts
:
Database Error in model fct_dbt__critical_path (models/fct_dbt__critical_path.sql)
000940 (22023): SQL compilation error: error line 46 at position 12
missing required argument [INPUT] for function [flatten]
compiled SQL at target/run/dbt_artifacts/models/fct_dbt__critical_path.sql
At a guess this is because we don't have any manifests loaded (ours are too big to fit into a single VARIANT). We were hoping to still get some value out of the package by checking model creation times, test statuses (when those land), etc. Can you take a look?
Hey there!
After analysing dbt models from this package I came to the conclusion that all artifacts that are produced by running dbt test
(or, simply representing tests in manifest.json
with resource_type = 'test'
) are not being used at all.
In all stg models either data:args:which = 'run'
for run_results.json
or node.value:resource_type = 'model'
for manifest.json
.
Hence, if my assumption above is correct, running
$ dbt test
$ dbt run-operation upload_dbt_artifacts --args '{filenames: [run_results]}'
will load run_results.json
into Snowflake but it won't be picked up by any of the consequent models (except for stg_dbt__artifacts
, but it stops there actually).
Long story short, I suggest adding the following to the package:
data:args:which
which should allow distinguishing between artifacts produced - this can also be taken from the data
column, so it's a nice to have.dbt test
- initially, there can be a model called stg_dbt__test_executions
and then some other models built on top of this one.Ideally, none of us would want any tests failing in production. However, in real-world they sometimes do and it would be nice to know which ones, since when, some SLA-related metrics, etc.
Let me know what you think of this :)
Currently it is not self-explaining how the fact and dimension tables are to be joined.
It would be helpful to have an ERD available in the documentation of this project.
Or there should be relationship
tests on the foreign-key columns, so the users can identify the suitable join conditions.
Executing the below steps with latest dbt version and latest pre-release of dbt_artifacts, but data fails to load into nearly all of the tables when running incrementally.
Have tried tracing all of it backward through stage models and incremental loads, but can't seem to narrow down an area where it's breaking down.
Some users with very large projects have encountered issues in the https://github.com/brooklyn-data/dbt_artifacts/blob/main/macros/upload_artifacts.sql#L25 copy into step, where the manifest.json is too large to load into a variant column. We currently use a single table for all artifact types, and each artifact becomes a single row.
We can solve this issue by flattening the data before loading into a Snowflake table. The schema for each artifact differs at the first level of keys however, and so we will need a separate table for each artifact.
Transformations are heavily restricted in the existing copy into
command (Snowflake docs). If we instead use an insert into
command to load from stage to table, we can run any select statement needed without any transformation restrictions. For example, an illustrative command similar to https://docs.snowflake.com/en/user-guide/data-load-transform.html#flatten-semi-structured-data:
insert into {{ src_dbt_artifacts_manifest }}(node, generated_at, path, artifact_type)
select
nodes.name as name,
nodes.unique_id as unique_id,
nodes.package_name as package_name,
nodes.path as path,
nodes.value as value,
artifacts.metadata:generated_at::timestamp_ntz as generated_at,
metadata$filename as stage_filename,
regexp_substr(metadata$filename, '([a-z_]+.json)') as artifact_type
from @{{ src_dbt_artifacts }} as artifacts,
lateral flatten(input => artifacts.nodes) as nodes
where regexp_substr(metadata$filename, '([a-z_]+.json)') = 'manifest.json'
I am not able to configure a custom schema. Can somebody please help with this?
I am getting still the Public schema used when I run dbt run -m dbt_artifacts.
I have set dbt_artifacts schema name and materialization but still, I am getting models been built under Public Schema.
My Configuration below in my root dbt project (dbt_project.yml)
# DBT_ARTIFACTS ----------------------------------------
dbt_artifacts:
+schema: dbt_artifacts
+materialized: table
incremental:
+schema: dbt_artifacts
+materialized: incremental
staging:
+schema: dbt_artifacts
+materliazed: view
When you run dbt deps to get packages from the dbt Package hub, the wrong version of dbt_artifacts is seemingly pulled.
You can easily verify this by going to https://hub.getdbt.com/brooklyn-data/dbt_artifacts/latest/ and click View on Github under the README file. Navigate to the dbt_project.yml file and see that the version is 0.6.0 instead of the latest, 0.8.0.
I'm noticing some undesirable behavior for the incremental population of the transient tables.
Let's assume fct_dbt__test_executions doesn't get populated on a dbt run dbt_artifacts
run, but gets created as a table with 0 records. Any subsequent run with how the incremental filter is setup like below is blocked from populating the table (which can easily happen if we only do model runs, without tests):
where artifact_generated_at > (select max(artifact_generated_at) from {{ this }})
The above will always be false since a null will be returned from the transient table. A fix may be as simple as adding another condition to is_incremental like below:
{% if is_incremental() and var_row_count > 0 %}
-- this filter will only be applied on an incremental run
where artifact_generated_at > (select max(artifact_generated_at) from {{ this }})
{% endif %}
Where var_row_count
is just a count of {{ this }}
. Might be overthinking this.. but it's strange when the incrementally populated table is empty while the view counterpart returns the current test executions.
Any suggestions or insights?
The dbt package is awesome. I would like to do the same thing for BigQuery. However, I am not sure we can realize the same with only run_query
, because for instance we have to upload artifacts files to GCS in the case of BigQuery. As far as I know, there is no statement type to put it to GCS or BigQuery directly.
Unfortunately, there is nothing like DDL to load data from local. It would be worthwhile asking that to Google Cloud.
The latest README says to update to 0.8 which is still in pre-realease.
All references throughout commits and documentation reference macros upload_dbt_artifacts
and upload_dbt_artifacts_v2
but the macros do not have _dbt
in the name (eg: it's upload_artifacts_v2
.
as such, users are unable to use the documentation without digging through source code to find the issues - the front page docs for 0.7 reflect macros that are still in pre-release as well as have name differences.
This repo sets a custom schema here:
https://github.com/tailsdotcom/dbt_artifacts/blob/main/dbt_project.yml#L8
models:
dbt_artifacts:
+schema: default_dbt_artifacts_schema
I think in most cases I'd recommend that the custom schema not be set (and therefore the default schema gets used).
Is there a particular reason for setting this value?
Occasionally I get one of these, when running the v2 upload on the new release branch (basically on master):
04:38:00 Encountered an error while running operation: Database Error
100112 (22000): Remote file 'http://.../stages/5ef57c80-83d2-4383-94ed-7c3738a0a04b/manifest.json.gz' was not found. There are several potential causes. The file might not exist. The required credentials may be missing or invalid. If you are running a copy command, please make sure files are not deleted when they are being loaded or files are not being loaded into two different tables concurrently with auto purge option.
This is a snowflake error, and I suspect it's to do with collisions in the stage. Either because the purge of the file is too aggressive (we're removing anything which matches pattern='.*.json.gz'
or because we need to wait for snowflake to load the file properly before removing.
Given I think each dbt cloud run is running in an isolated environment, I think it's most likely that the issue is happening in the stage rather than locally. Suggested mitigating measures:
remove
command.remove
and use the PURGE
option on the stage to try and get snowflake to remove files itself automatically post-load.command_invocation_id
.Gut feel on my part is that the last option alone might be the most elegant solution, because that totally isolates each run.
When running:
dbt run-operation upload_dbt_artifacts ...
This causes manifest.json
to get a different invocation_id
. This is intended dbt behavior and it makes sense, however it does have the side-effect of causing manifests and run-results that would usually be join-able on invocation_id
to not join.
I re-named the manifest file I wanted to retain the invocation_id
for prior to running the upload operation and uploaded the renamed version, using some updates in Snowflake to change the PATH and ARTIFACT_TYPE back to manifest.json
and manifest.json.gz
accordingly.
I'm not aware of any ways around this if uploading locally using the dbt cli. I will probably put this in a pipeline and upload it via AWS S3 and Snowflake (as per Option 2 on the README)
I think that this should at least be documented or noted somewhere because this took me a few days of headache and I think it will be worth saving anyone else the trouble in future.
Thanks!
I was linting a project which uses this package, and got a curious error:
RuntimeError: File C:\Users\...\data_warehouse\dbt_packages\dbt_artifacts\integration_test_project\models\incremental.sql was not found in dbt project
Now this is evidently a sqlfluff problem, but it's failing to find a file, which is a test file within dbt artifacts. I don't think this file should even be present or rendering within the host project.
Sure enough when I check the dbt_packages
folder I see the test project included in the installed package, which I suspect is why dbt is trying to compile it along with the rest of the project.
Is there something in the dbt package config which we should update so that the test project isn't included in the eventual parse DAG of a host project?
Only dbt_utils 0.7.0+ is supported for dbt 0.20.0. Is it possible to upgrade this dependency to support that?
https://github.com/tailsdotcom/dbt_artifacts/blob/main/packages.yml#L3
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.