snowplow / dbt-snowplow-utils
Snowplow utility functions to be used in conjunction with the snowplow-web dbt package.
License: Other
Time to drop support for dbt versions below 1.0.0 and support all dbt 1.x versions
The release of snowplow-utils v0.2.0 introduces integration tests for many of the macros within the package.
Some of these tests are native dbt-based tests, while others require specific scripts.
This is typically because they either:
It would be good to have a script, integrations_tests.sh, that runs all available tests across all warehouses.
The utils package contains a test suite. The testing workflow so far has been manual, executing scripts locally. This is cumbersome and error-prone when testing across many warehouses and dbt versions.
Create a Github Action to run all tests across supported warehouses and dbt versions. The limitation here is that we cannot currently test against Redshift via an Action: to do so we would need to whitelist the Action's IP addresses, which rotate every week. Ideally we need a self-hosted runner.
The structure of the snowplow-utils package could be improved by splitting out each macro into its own file. This would make it easier to find macros.
To bring Postgres support to the snowplow-web package we need to:
Port the snowplow_incremental materialization.
When you run dbt compile on a fresh installation of the dbt-snowplow-* packages, you get a compilation error since the get_incremental_manifest_status macro tries to reference tables that do not exist. A user has kindly offered the following macro (for Snowflake) that should solve this problem. Perhaps we can explore this macro and modify it to be compatible with all databases, such that the dbt compile command runs without errors on initial use.
{% macro get_incremental_manifest_status(incremental_manifest_table, models_in_run) -%}

  {% if not execute %}
    {{ return(['', '', '', '']) }}
  {% endif %}

  {# Check the manifest table exists before querying it, so dbt compile
     succeeds on a fresh installation. Note: '||' is used for string
     concatenation since '+' is not valid for strings on Snowflake. #}
  {% set table_exists_query %}
    select exists(
      select 1
      from information_schema.tables t
      where '"' || current_database() || '"."' || t.table_schema || '"."' || t.table_name || '"'
        = '{{ incremental_manifest_table }}'
    ) as has_tbl
  {% endset %}

  {% set table_exists_query_result = run_query(table_exists_query) %}

  {% if table_exists_query_result.columns[0].values()[0] == True %}

    {% set last_success_query %}
      select min(last_success) as min_last_success,
             max(last_success) as max_last_success,
             coalesce(count(*), 0) as models
      from {{ incremental_manifest_table }}
      where model in ({{ snowplow_utils.print_list(models_in_run) }})
    {% endset %}

    {% set results = run_query(last_success_query) %}

    {% set min_last_success = results.columns[0].values()[0] %}
    {% set max_last_success = results.columns[1].values()[0] %}
    {% set models_matched_from_manifest = results.columns[2].values()[0] %}
    {% set has_matched_all_models = true if models_matched_from_manifest == models_in_run|length else false %}

    {{ return([min_last_success, max_last_success, models_matched_from_manifest, has_matched_all_models]) }}

  {% else %}
    {{ log('Snowplow: Manifest table does not exist.') }}
    {{ return(['', '', '', '']) }}
  {% endif %}

{%- endmacro %}
{# Prints the run limits for the run to the console #}
{% macro print_run_limits(run_limits_relation) -%}

  {% set run_limits_query %}
    select lower_limit, upper_limit from {{ run_limits_relation }}
  {% endset %}

  {# TODO: Derive the limits from the manifest instead of selecting from the limits table,
     since run_query executes during the 2nd parse, when the limits table is yet to be updated. #}
  {% set results = run_query(run_limits_query) %}

  {% if execute %}
    {% set lower_limit = snowplow_utils.tstamp_to_str(results.columns[0].values()[0]) %}
    {% set upper_limit = snowplow_utils.tstamp_to_str(results.columns[1].values()[0]) %}
    {% set run_limits_message = "Snowplow: Processing data between " + lower_limit + " and " + upper_limit %}
    {% do snowplow_utils.log_message(run_limits_message) %}
  {% endif %}

{%- endmacro %}
The current version of the macro has several limitations: for example, if a field is REPEATED, i.e. an array, only the first element will be taken.
Refactor to solve for the above limitations.
Currently snowplow_delete_from_manifest requires the package_name arg. This allows the macro to work across both the web and mobile packages.
This creates a slightly clunky user experience when calling the macro from either package.
Instead we can create a package-specific version, which calls the generalised macro.
get_successful_snowplow_models could be more generalised. This would:
Actions:
Replace get_successful_snowplow_models with a generalised get_successful_models macro.
In order to tag Snowplow models as being run by Snowplow, we need to add a query tag in Snowflake. Can we have some functionality that allows this to be generated when running the dbt models in Snowflake but not in the other databases?
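A minimal sketch of what this could look like, called from a hook (the macro name and tag value are illustrative, not part of the package):

{% macro set_snowplow_query_tag(tag='snowplow_dbt') %}
  {# Query tags are Snowflake-only, so this is a no-op on other warehouses #}
  {% if target.type == 'snowflake' %}
    {% do run_query("alter session set query_tag = '" ~ tag ~ "'") %}
  {% endif %}
{% endmacro %}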
A common workflow might entail having a different start date for the snowplow-web package depending on the environment it is being run in, i.e. dev vs. prod. For instance, while in dev you might only want to process the last 7 days of data to minimise cost and focus on the latest data, while in prod you would want a fixed date.
Provide a macro that dynamically sets the start date depending on an env var.
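A rough sketch of such a macro (the env var name and the 7-day dev window are illustrative assumptions):

{% macro get_start_date() %}
  {% if env_var('SNOWPLOW_ENV', 'prod') == 'dev' %}
    {# In dev, only process the last 7 days of data #}
    {% set start_date = (run_started_at - modules.datetime.timedelta(days=7)).strftime('%Y-%m-%d') %}
  {% else %}
    {# In prod, use the fixed start date from the project vars #}
    {% set start_date = var('snowplow__start_date') %}
  {% endif %}
  {{ return(start_date) }}
{% endmacro %}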
README formatting is incorrect when imported to dbt's package hub.
The run_results object schema changed between 0.18.0 and 0.19.0. As part of this, the status field's output has changed.
For the get_successful_models macro, we use the status field to determine successfully executed models.
0.19.0: very clean to pull out successful models: "status": "success".
0.18.0: pretty messy to determine success. The status message for successful models varies between warehouses and materialisations, i.e. view vs. table. There is no clean "status": "success".
Potentially easier to infer success by removing failed or skipped models from the results:
Failed: "status": "ERROR"
Skipped: "status": null, "skip": true
Will need to update the get_successful_models tests as well. The dummy run_results dataset is based on the 0.19.0 schema.
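A rough sketch of that success inference against the 0.18.0-style results object (attribute names such as res.skip are assumptions based on the schema above):

{% macro get_successful_models(results=results) %}
  {% set successful_models = [] %}
  {% for res in results %}
    {# Keep any model that neither errored nor was skipped #}
    {% if res.status != 'ERROR' and not res.skip %}
      {% do successful_models.append(res.node.name) %}
    {% endif %}
  {% endfor %}
  {{ return(successful_models) }}
{% endmacro %}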
The snowplow_incremental materialization currently only supports BigQuery & Redshift.
Add support for Snowflake.
There are some functions in the cross_db macros folder that do not have any documentation. In order for users to feel more comfortable taking advantage of these macros, we should add documentation to make it easier to get started with them.
The rationale for this change: snowplow/dbt-snowplow-web#41
Create a macro to be used as a post-hook that inserts problematic long sessions into a blacklist.
The expected dataset provided for the snowplow_incremental_w_lookback_disabled test on the snowplow_incremental materialization is incorrect.
It contains duplicate rows for id = 3, when only 1 row should be present. The test still passed since the dbt_utils.equality test effectively takes a DISTINCT of both datasets when checking equality.
Issue outlined here: snowplow/dbt-snowplow-web#47
Pass the snowplow__session_lookback_days var into return_base_new_event_limits to calculate the session_lookback_limit (session_start_limit), the lower limit for all session start tstamps for a given run of the model.
App id filter isn't correctly pulling through supplied app ids.
When developing custom modules for the snowplow-web package it would be sensible to:
Set snowplow__start_date to a recent date so you are testing on recent data.
Set snowplow__backfill_limit_days to 1 so you process as little data as possible.
Add functionality to dynamically set these values based on whether the user is running in their production or development environment.
Use target.name to denote the environment. Typically, a user would have a target named 'dev' for development, with their production target named something else.
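A minimal sketch of such a macro (the name is illustrative):

{% macro get_value_by_target(dev_value, default_value, dev_target_name='dev') %}
  {# Return the dev value when running against the dev target, otherwise the default #}
  {% if target.name == dev_target_name %}
    {{ return(dev_value) }}
  {% else %}
    {{ return(default_value) }}
  {% endif %}
{% endmacro %}

snowplow__start_date and snowplow__backfill_limit_days could then both be set via this macro.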
Previous README very sparsely populated.
Actions:
The snowplow-utils package is dependent on dbt-utils. With Fishtown Analytics recently rebranding to dbt-labs, their repo names have changed. This includes dbt-utils, which should now point to their new URL as outlined here.
The snowplow_delete_from_manifest macro has a couple of issues: among them, it issues a DELETE statement, which needs a COMMIT statement for Redshift but not for BigQuery.
Actions:
dbt v1 recently launched. Currently the package only supports up to dbt v0.19. The main breaking change affecting this package is the introduction of the macro_namespace arg. Auto commit has also been disabled for Snowflake.
Actions:
Support the macro_namespace arg.
Add the snowplow_incremental materialization to the docs.
The snowplow_web_base_sessions_lifecycle_manifest model's schema is defined by the snowplow__manifest_custom_schema variable. Since this model is a 'proper' dbt model, dbt will create this schema before doing anything else in the run if it doesn't exist already.
The snowplow_web_incremental_manifest table's schema is also defined by the snowplow__manifest_custom_schema variable. This however is a manually created table via DDL. Thus by the time we create the snowplow_web_incremental_manifest table, the manifest schema exists.
Consider the edge case where a user changes the snowplow_web_base_sessions_lifecycle_manifest model's schema, but not by changing the snowplow__manifest_custom_schema variable accordingly (this can be achieved by amending the dbt_project.yml). The snowplow_web_incremental_manifest table's schema is still defined by the snowplow__manifest_custom_schema variable. This schema will no longer be generated automatically by dbt, since the snowplow_web_base_sessions_lifecycle_manifest model is not using the same schema. This would cause the CREATE TABLE DDL to fail as the schema doesn't exist.
The create_snowplow_manifest_schema step is included as a fail-safe. The adapter.create_schema method I use here to manually create the schema calls CREATE SCHEMA IF NOT EXISTS under the hood.
The snowplow_incremental materialisation currently only supports Redshift.
To support BigQuery:
Leverage the partition_by config in BigQuery dbt models to restrict the table scan on the destination table.
When querying the sessions lifecycle manifest, end_tstamp is used to find the latest event processed. While this is technically correct, it is inefficient.
Since the sessions manifest is partitioned on start_tstamp on BigQuery and sorted on start_tstamp on Redshift, it would be more performant to query on this column. The results should be very similar, and the worst case scenario is that we end up reprocessing some events when we shouldn't. Given this query executes on every run of the package, the improved performance outweighs the remote chance of reprocessing events.
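Illustratively, the change amounts to something like the following (relation name as per the web package):

-- current: end_tstamp is neither the partition key nor the sort key, so this scans the full table
select max(end_tstamp) as last_processed from {{ ref('snowplow_web_base_sessions_lifecycle_manifest') }}
-- proposed: start_tstamp benefits from partition pruning on BigQuery and the sort key on Redshift
select max(start_tstamp) as last_processed from {{ ref('snowplow_web_base_sessions_lifecycle_manifest') }}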
The rationale for this change is outlined here: snowplow/dbt-snowplow-web#39
This mainly involves:
Removing snowplow_teardown_all from the utils package.
Moving snowplow_incremental_pre_hook into a dbt model.
With the snowplow_incremental logic, users aren't able to specify which column should be used to see which data has been processed. The is_run_with_new_events macro is currently hardcoded to look for start_tstamp; however, technically this could be any timestamp column. We should allow users to specify which column to check to see which events have been processed already.
In macros/utils/is_run_with_new_events.sql we should replace start_tstamp with a parameter such as {{ start_timestamp }}, which should take on a default value of start_tstamp but could be specified by the user in the is_run_with_new_events macro call. We should also create a test to ensure that this is working as expected with custom column names.
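The resulting call might look like this (the start_timestamp parameter name follows the proposal above; the package_name argument is an assumption about the existing signature):

{# default behaviour, unchanged: #}
{{ snowplow_utils.is_run_with_new_events('snowplow_web') }}
{# custom timestamp column: #}
{{ snowplow_utils.is_run_with_new_events('snowplow_web', start_timestamp='derived_tstamp') }}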
We already have the snowplow_web_delete_from_manifest macro in snowplow_delete_from_manifest. We should make the equivalent macro for the snowplow-mobile package to make the API less cumbersome for the user.
We can alternatively call the more generic snowplow_delete_from_manifest macro with two arguments as follows:
{{ snowplow_utils.snowplow_delete_from_manifest(var('models_to_remove', []), ref('snowplow_mobile_incremental_manifest')) }}
However, this is more cumbersome.
This will benefit all future users of the snowplow-mobile package, which is currently in development.
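A minimal sketch of the wrapper, which would live in the snowplow-mobile package:

{% macro snowplow_mobile_delete_from_manifest(models) %}
  {# Delegate to the generic macro, hardcoding the mobile manifest relation #}
  {{ snowplow_utils.snowplow_delete_from_manifest(models, ref('snowplow_mobile_incremental_manifest')) }}
{% endmacro %}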
To add support for BigQuery in the snowplow-web package, we will need various bits of extra functionality. These include:
combine_column_versions. This is a BigQuery-specific problem: as the schema versions evolve, new columns are created in the events table. It is handy to coalesce fields nested within these columns across all versions of the column for a holistic view over time.
The combine_column_versions macro is designed to take fields from a column in BQ and extract them across multiple different schema versions of that column. It is supposed to accept a list of dicts as input for the required_fields parameter, which allows you to specify fields, their preferred names (if applicable) and their data types. An example is the following:
[
{'field':'latitude', 'renamed_field':'device_latitude', 'dtype':'float64'},
{'field':'longitude', 'renamed_field':'device_longitude', 'dtype':'float64'},
{'field':'latitude_longitude_accuracy', 'renamed_field':'device_latitude_longitude_accuracy', 'dtype':'float64'},
{'field':'altitude', 'renamed_field':'device_altitude', 'dtype':'float64'},
{'field':'altitude_accuracy', 'renamed_field':'device_altitude_accuracy', 'dtype':'float64'},
{'field':'bearing', 'renamed_field':'device_bearing', 'dtype':'float64'},
{'field':'speed', 'renamed_field':'device_speed', 'dtype':'float64'}
]
However, upon inspection the renamed_field is not used correctly, which results in errors when enabling all contexts: all of the different id fields, which should be saved as session_id, device_user_id, etc., are not renamed and remain as id. BQ cannot handle this, so the creation of the table fails and the model stops executing.
Run the dbt-snowplow-mobile model using the feature/bq-snowflake-support branch with dbt-snowplow-utils v0.6.0 to see this problem in action.
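The expected compiled SQL, with the fields renamed per renamed_field: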
-- screen context
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].id) as screen_id,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].name) as screen_name,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].activity) as screen_activity,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].fragment) as screen_fragment,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].top_view_controller) as screen_top_view_controller,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].type) as screen_type,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].view_controller) as screen_view_controller,
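The actual compiled SQL, where renamed_field is ignored and the original field names are kept, producing the duplicate id columns described above: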
-- screen context
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].id) as id,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].name) as name,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].activity) as activity,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].fragment) as fragment,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].top_view_controller) as top_view_controller,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].type) as type,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].view_controller) as view_controller,
The contents of your packages.yml file:
Which database are you using dbt with?
The output of dbt --version:
installed version: 1.0.0
latest version: 1.0.0
Up to date!
Plugins:
- bigquery: 1.0.0
- snowflake: 0.21.1
- redshift: 0.21.1
- postgres: 0.21.1
The operating system you're using:
macOS Monterey 12.1
The output of python --version:
Python 3.9.9
The documentation for get_columns_in_relation_by_column_prefix implies that the macro returns the names of matching columns. The macro actually returns column objects, from which the name property can be extracted.
Update the documentation to reflect this.
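For example (the relation and prefix are illustrative):

{% set columns = snowplow_utils.get_columns_in_relation_by_column_prefix(ref('snowplow_web_base_events_this_run'), 'contexts_') %}
{# The macro returns column objects; pull the names out via the name property #}
{% set column_names = columns | map(attribute='name') | list %}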
get_run_limits currently performs two functions:
This should be split into two functions and tests added.
Based on the discussion here: https://getdbt.slack.com/archives/C01CYP05NF7/p1643739939515269
It would be useful to users if this package supported 0.8.0. I imagine even better if we can support >=0.7.0, <0.9.0, but compatibility of dbt packages <1 seems challenging.
get_enabled_snowplow_models currently isn't easy to test as:
The models_to_run var is nested inside the function.
Actions:
Move models_to_run to an input.
There are currently 2 versions of the update_current_incremental_tstamp_table macro, with the only difference being the DDL used to 'create a table as' across dbs. By stripping the DDL out into its own macro, we can consolidate this macro's code.
Once this is done, this macro will essentially just be calling other macros. Remove it entirely and move its functionality into snowplow_incremental_pre_hook.
The snowplow_teardown_all macro has two issues:
Where necessary, add Snowflake support to macros within the snowplow-utils package.
app_id_filter is currently tricky to test as the snowplow__app_id var is nested inside the macro rather than passed as an input.
Actions:
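A rough sketch of the refactor, passing the app ids in as an input so the macro can be tested in isolation (the signature is an assumption):

{% macro app_id_filter(app_ids) %}
  {# Returns a where-clause snippet; an empty list means no filtering #}
  {% if app_ids | length %}
    app_id in ({{ snowplow_utils.print_list(app_ids) }})
  {% else %}
    true
  {% endif %}
{% endmacro %}

Callers would then pass var('snowplow__app_id', []) explicitly, while tests can supply a fixed list.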
The macro snowplow_merge will throw a compilation error when running against Redshift. This is because there is no default prefix specified for this macro.
This was not initially included as Redshift will never use this macro, since it doesn't support a MERGE command. In dbt 0.19.2 however, it seems a default must be provided for all dispatch-based macros.
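A minimal sketch of the fix, assuming illustrative argument names: add a default__ implementation that fails loudly if it is ever dispatched to.

{% macro default__snowplow_merge(tmp_relation, target_relation, unique_key) %}
  {# Redshift has no MERGE command; this default exists purely so dispatch can resolve #}
  {% do exceptions.raise_compiler_error('snowplow_merge is not supported on ' ~ target.type) %}
{% endmacro %}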
update_incremental_manifest_table currently only supports Redshift, and its current design isn't easily testable.
create_incremental_manifest_table currently only supports Redshift.
The cross-warehouse differences within this macro are due to the DDL: create table and alter table.
Actions:
Update create_incremental_manifest_table accordingly.
In preparation for adding a Github Action to run all tests in the package we need to:
dbt v0.21 added a new config option for incremental models, on_schema_change. In short, this parameter allows you to define how to handle changes to the incremental table's schema. An example of this might be the addition of a column and whether you should mutate the table to accommodate this change. Since snowplow_incremental is a custom materialization, work is required within the snowplow-utils package to facilitate this config parameter. More details on the parameter can be found here.
Update the snowplow_incremental materialization logic to include this functionality. This will need to be done across all supported warehouses.
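Once supported, a model using the materialization might opt in like so (the config values are illustrative):

{{
  config(
    materialized='snowplow_incremental',
    unique_key='session_id',
    upsert_date_key='start_tstamp',
    on_schema_change='append_new_columns'
  )
}}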
Update README to reflect all changes in the release, including the new file directory structure.
The dbt_utils.type_string() macro returns a varchar for supported warehouses, but the behaviour of that varies across databases. As a result, we need to implement our own functionality to create a varchar data type for each of the different warehouses, such that the varchar is of maximum length.
For Snowflake we can continue to use the same functionality, as using varchar will generate a varchar of maximum length.
For Postgres, an excerpt from the character datatype page:
There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
We should therefore consider using text or character varying.
For Redshift, we can use varchar(max).
TBD
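A rough sketch of a dispatched macro along these lines (the macro name is illustrative; warehouses still marked TBD are omitted):

{% macro type_max_string() %}
  {{ return(adapter.dispatch('type_max_string', 'snowplow_utils')()) }}
{% endmacro %}

{# Snowflake & default: a bare varchar is already of maximum length #}
{% macro default__type_max_string() %}
  varchar
{% endmacro %}

{% macro redshift__type_max_string() %}
  varchar(max)
{% endmacro %}

{% macro postgres__type_max_string() %}
  text
{% endmacro %}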