dbt-snowplow-utils's Issues

Add support for dbt 1.0.0+

Describe the feature

It is time to drop support for dbt versions below 1.0.0 and support all dbt 1.x versions.

Add integrations_tests script

The release of snowplow-utils v0.2.0 introduces integration tests for many of the macros within the package.

Some of these tests are native dbt based tests, while others require specific scripts.

This is typically because they either:

  • Need to be executed several times in order to be tested correctly, e.g. an incremental materialization.
  • Need to be checked for throwing the correct compiler error, with the error code interpreted in bash.

It would be good to have a script, integrations_tests.sh, that runs all available tests across all warehouses.

Add CI Action

Issue

The utils package contains a test suite. The testing workflow so far has been manual, executing scripts locally. This is cumbersome and error-prone when testing across many warehouses and dbt versions.

Solution

Create a Github Action to run all tests across supported warehouses and dbt versions. The limitation here is that we cannot currently test against Redshift via an Action: to do so we would need to whitelist the Action's IP address, and these addresses rotate every week. Ideally we need a self-hosted runner.

Restructure snowplow-utils project

The structure of the snowplow-utils package could be improved by splitting out each macro into its own file. This would make it easier to find macros.

Add support for Postgres

To bring Postgres support to the snowplow-web package we need to:

  • Create a Postgres-specific snowplow-incremental materialization.
  • Add a corresponding Postgres version for any dispatched macros (see the sketch below).
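
For context, the dispatch pattern would look roughly like the sketch below. The macro name and body are illustrative rather than the package's actual code, and the dispatch syntax shown is the current dbt form.

{% macro cast_to_tstamp(field) %}
  {{ return(adapter.dispatch('cast_to_tstamp', 'snowplow_utils')(field)) }}
{% endmacro %}

{# Default implementation, used unless an adapter-specific version exists #}
{% macro default__cast_to_tstamp(field) %}
  cast({{ field }} as timestamp)
{% endmacro %}

{# Postgres-specific implementation, picked up automatically by dispatch #}
{% macro postgres__cast_to_tstamp(field) %}
  cast({{ field }} as timestamptz)
{% endmacro %}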

Adapt `get_incremental_manifest_status` macro to be compatible with `dbt compile`

Describe the feature

When you run dbt compile on a fresh installation of the dbt-snowplow-* packages you get a compilation error, since the get_incremental_manifest_status macro tries to reference tables that do not exist. A user has kindly offered the following macro (for Snowflake) that should solve this problem. Perhaps we can explore this macro and modify it to be compatible with all databases, so that the dbt compile command runs without errors on initial use.

{% macro get_incremental_manifest_status(incremental_manifest_table, models_in_run) -%}
  {% if not execute %}
    {{ return(['', '', '', '']) }}
  {% endif %}
  {% set table_exists_query %}
    select exists(select 1 as x from information_schema.tables t where '"'+current_database()+'"'+'.'+'"'+table_schema+'"'+'.'+'"'+t.table_name+'"' = '{{ incremental_manifest_table }}') as has_tbl
  {% endset %}
  {% set table_exists_query_result = run_query(table_exists_query) %}
  {% if table_exists_query_result.columns[0].values()[0] == True %}
    {% set last_success_query %}
      select min(last_success) as min_last_success,
            max(last_success) as max_last_success,
            coalesce(count(*), 0) as models
      from {{ incremental_manifest_table }}
      where model in ({{ snowplow_utils.print_list(models_in_run) }})
    {% endset %}
    {% set results = run_query(last_success_query) %}
    {% if execute %}
      {% set min_last_success = results.columns[0].values()[0] %}
      {% set max_last_success = results.columns[1].values()[0] %}
      {% set models_matched_from_manifest = results.columns[2].values()[0] %}
      {% set has_matched_all_models = true if models_matched_from_manifest == models_in_run|length else false %}
    {% endif %}
    {{ return([min_last_success, max_last_success, models_matched_from_manifest, has_matched_all_models]) }}
  {% else %}
    {{ log('Table does not exist.') }}
    {{ return(['', '', '', '']) }}
  {% endif %}
{%- endmacro %}
{# Prints the run limits for the run to the console #}
{% macro print_run_limits(run_limits_relation) -%}
  {% set run_limits_query %}
    select lower_limit, upper_limit from {{ run_limits_relation }}
  {% endset %}
  {# Derive limits from the manifest instead of selecting from the limits table: since run_query executes during the 2nd parse, the limits table is yet to be updated. #}
  {% set results = run_query(run_limits_query) %}
  {% if execute %}
    {% set lower_limit = snowplow_utils.tstamp_to_str(results.columns[0].values()[0]) %}
    {% set upper_limit = snowplow_utils.tstamp_to_str(results.columns[1].values()[0]) %}
    {% set run_limits_message = "Snowplow: Processing data between " + lower_limit + " and " + upper_limit %}
    {% do snowplow_utils.log_message(run_limits_message) %}
  {% endif %}
{%- endmacro %}

Refactor combine_column_versions

Issue

The current version of the macro has several limitations:

  • Poor API to select & rename fields
  • Limited to flattening only 1st level fields. Any heavily nested fields will not be flattened.
  • If the column is of type REPEATED i.e. an array, only the first element will be taken.

Solution

Refactor to solve for the above limitations.

Refactor snowplow_delete_from_manifest

Currently snowplow_delete_from_manifest requires the package_name arg. This allows the macro to work across both the web and mobile packages.

This makes for a slightly clunky user experience in either package when calling the macro.

Instead we can create a package specific version, which calls the generalised macro.
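
A minimal sketch of what such a wrapper could look like for the web package (the argument name is illustrative):

{# Package-specific wrapper that fills in the web manifest and calls the generalised macro #}
{% macro snowplow_web_delete_from_manifest(models) %}
  {{ snowplow_utils.snowplow_delete_from_manifest(models, ref('snowplow_web_incremental_manifest')) }}
{% endmacro %}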

Replace get_successful_snowplow_models

get_successful_snowplow_models could be more generalised. This would:

  • Increase usability
  • Improve testing

Actions:

  • Replace get_successful_snowplow_models with generalised get_successful_models
  • Add tests

Add a `set_query_tag` function

Describe the feature

In order to tag Snowplow models as being run by Snowplow, we need to add a query tag in Snowflake. Can we have some functionality that allows this to be generated when running the dbt models in Snowflake but not in the other databases?
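
One possible shape for this, sketched below under the assumption of a macro named set_query_tag with a default tag value (both assumptions, not an agreed design): the statement is only issued when the target is Snowflake, so other databases are unaffected.

{% macro set_query_tag(query_tag='snowplow_dbt') %}
  {# Query tags are a Snowflake session setting; do nothing on other warehouses #}
  {% if target.type == 'snowflake' and execute %}
    {% do run_query("alter session set query_tag = '" ~ query_tag ~ "'") %}
  {% endif %}
{% endmacro %}

The macro could then be invoked from a hook at the start of the Snowplow models' run.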

Add start_date_by_env macro

A common workflow might entail having a different start date for the snowplow-web package depending on the environment it is being run in, i.e. dev vs. prod. For instance, while in dev you might only want to process the last 7 days of data to minimise cost and focus on the latest data, while in prod you would want a fixed date.

Provide a macro that dynamically sets the start date depending on an env var.
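
A rough sketch, assuming the environment is exposed via an env var named SNOWPLOW_ENVIRONMENT (the env var name, macro name and defaults are all assumptions):

{% macro start_date_by_env(dev_lookback_days=7, prod_start_date='2020-01-01') %}
  {% if env_var('SNOWPLOW_ENVIRONMENT', 'prod') == 'dev' %}
    {# In dev, only process the most recent N days of data #}
    {% set dev_start = modules.datetime.date.today() - modules.datetime.timedelta(days=dev_lookback_days) %}
    {{ return(dev_start.isoformat()) }}
  {% else %}
    {# In prod, use the fixed start date #}
    {{ return(prod_start_date) }}
  {% endif %}
{% endmacro %}

The returned value could then be assigned to snowplow__start_date in dbt_project.yml.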

Fix get_successful_models for dbt 0.18.0

The run_results object schema changed between 0.18.0 and 0.19.0. As part of this, the status field's output has changed.

For the get_successful_models macro, we use the status field to determine successfully executed models.

0.19.0

Very clean to pull out successful models: "status": "success"

0.18.0

Much messier to determine success: the status message for successful models varies between warehouses and materialisations, i.e. view vs. table. There is no clean "status": "success".

Potentially easier to infer success by removing failed or skipped models from results.

Failed: "status": "ERROR"
Skipped: "status": null, "skip": true

Will need to update get_successful_models tests as well. Dummy run_results dataset is based on 0.19.0 schema.
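
A hedged sketch of that inference approach for the 0.18.0 schema, based on the status/skip fields described above (the macro name and exact attribute access are assumptions):

{% macro get_successful_models_legacy(run_results) %}
  {% set successful_models = [] %}
  {% if execute %}
    {% for res in run_results %}
      {# Infer success by excluding failed ("status": "ERROR") and skipped ("skip": true) models #}
      {% if res.status != 'ERROR' and not res.skip %}
        {% do successful_models.append(res.node.name) %}
      {% endif %}
    {% endfor %}
  {% endif %}
  {{ return(successful_models) }}
{% endmacro %}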

Add cross_db macro documentation

Describe the feature

There are some functions in the cross_db macros folder that do not have any documentation. In order for users to feel more comfortable taking advantage of these macros, we should add documentation around them to make it easier to get started with them.

Fix BigQuery snowplow_incremental expected dataset

The expected dataset provided for the snowplow_incremental_w_lookback_disabled test on the snowplow_incremental materialization is incorrect.

It contains duplicate rows for id = 3, when only 1 row should be present. The test still passed since the dbt_utils.equality test effectively takes a DISTINCT of both the datasets when checking equality.

Solution

  • Remove the duplicate row.
  • Add a row count test to check for 'true' equality across tables, i.e. taking duplicates into account.

Fix for unexpired sessions

Issue

Issue outlined here: snowplow/dbt-snowplow-web#47

Solution

  • Reintroduce the snowplow__session_lookback_days var into return_base_new_event_limits to calculate session_lookback_limit.
  • Calculate session_start_limit, the lower limit for all session start tstamps for a given run of the model

Add dynamic variables based on target name

When developing custom modules for the snowplow-web package it would be sensible to:

  • Set snowplow__start_date to a recent date so you are testing on recent data.
  • Set snowplow__backfill_limit_days to 1 so you process as little data as possible.

Add functionality to dynamically set these values based on whether the user is running in their production or development environment.

Use target.name to denote environment. Typically, a user would have a target named 'dev' for development, with their production target named something else.
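
A possible shape for this, sketched below; the macro name and the assumption that the development target is called 'dev' are illustrative:

{% macro get_value_by_target(dev_value, default_value, dev_target_name='dev') %}
  {# Return the dev value when running against the development target, otherwise the default #}
  {% if target.name == dev_target_name %}
    {{ return(dev_value) }}
  {% else %}
    {{ return(default_value) }}
  {% endif %}
{% endmacro %}

Variables such as snowplow__start_date and snowplow__backfill_limit_days could then be set to the output of this macro in dbt_project.yml.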

Update READMEs

The previous README was very sparsely populated.

Actions:

  • Add documentation for public-facing macros
  • Add a README for the integration tests

Update dbt_utils URL to new dbt-labs repo

The snowplow-utils package is dependent on dbt-utils. With Fishtown Analytics recently rebranding to dbt-labs, their repo names have changed. This includes dbt-utils, which should now point to their new URL as outlined here.

Refactor & add tests snowplow_delete_from_manifest

snowplow_delete_from_manifest macro has a couple of issues:

  • Not easily testable since the manifest table name is hardcoded into the function.
  • It only supports Redshift. This is due to the DELETE statement, which needs a COMMIT statement for Redshift but not for BigQuery.

Actions:

  • Refactor to allow manifest table name to be overridden.
  • Make BigQuery compatible.
  • Add tests

Add dbt v1 compatibility

Issue

dbt v1 recently launched. Currently the package only supports up to dbt v0.19. The main breaking change affecting this package is the introduction of the macro_namespace arg. Auto commit has also been disabled for Snowflake.

Solution

  • Update all dispatched macros to use the macro_namespace arg (see the illustration below).
  • Wrap the Snowflake delete/insert statement in an explicit transaction.
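
For illustration, the dispatch call changes roughly as follows (the macro and argument names are hypothetical):

{# dbt 0.19 style: target package(s) passed via the packages arg #}
{% macro some_macro(relation) %}
  {{ return(adapter.dispatch('some_macro', packages=['snowplow_utils'])(relation)) }}
{% endmacro %}

{# dbt 1.x style: package passed via the macro_namespace arg #}
{% macro some_macro(relation) %}
  {{ return(adapter.dispatch('some_macro', macro_namespace='snowplow_utils')(relation)) }}
{% endmacro %}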

Update README

  • Clean up formatting
  • Add table of contents
  • Add new macros to docs
  • Add Snowflake snowplow_incremental materialization to docs

Remove create_snowplow_manifest_schema step

Currently

  • The snowplow_web_base_sessions_lifecycle_manifest model's schema is defined by the snowplow__manifest_custom_schema variable. Since this model is a 'proper' dbt model, dbt will create this schema before doing anything else in the run if it doesn't exist already.
  • The snowplow_web_incremental_manifest table's schema is also defined by the snowplow__manifest_custom_schema variable. This however is a manually created table via DDL.
  • Since the manifest schema is created by dbt, when we go to create the snowplow_web_incremental_manifest table, the manifest schema exists.

Potential implementation

  • A user overrides the snowplow_web_base_sessions_lifecycle_manifest model's schema without changing the snowplow__manifest_custom_schema variable accordingly (this can be done by amending the dbt_project.yml).
  • The snowplow_web_incremental_manifest table's schema is still defined by the snowplow__manifest_custom_schema variable. This schema will no longer be generated automatically by dbt, since the snowplow_web_base_sessions_lifecycle_manifest model is not using the same schema. This would cause the CREATE TABLE DDL to fail as the schema doesn't exist.
  • This is why the create_snowplow_manifest_schema step is included as a fail safe.

Issue

  • The adapter.create_schema method used here to manually create the schema calls CREATE SCHEMA IF NOT EXISTS under the hood.
  • If the user does not have schema creation permissions this will fail.
  • Despite dbt recommending such permissions for all dbt users, one of our customers using the package does not allow their dbt users to have them. Rather, a super user manually creates all the schemas they need for dbt.
  • This meant that every time they ran the package, it failed at this step despite the required schema already existing.

Fix

  • We should remove the schema creation step to be sure the package will run irrespective of the user's permissions.

Add BigQuery support to snowplow_incremental

The snowplow_incremental materialisation currently only supports Redshift.

To support BigQuery:

  • Use a 'merge' strategy rather than delete/insert as is used by Redshift.
    • Merge is meant to be more performant.
  • Make use of the partition_by config in BigQuery dbt models to restrict the table scan on the destination table (see the config sketch below).
  • Consider adding tests (BigQuery and Redshift).
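
As a sketch only, a BigQuery model using the materialization might then be configured along these lines; the unique_key, column names and upstream ref are illustrative, and the exact config options the custom materialization accepts may differ:

{{
  config(
    materialized='snowplow_incremental',
    unique_key='page_view_id',
    partition_by={
      "field": "start_tstamp",
      "data_type": "timestamp"
    }
  )
}}

select *
from {{ ref('snowplow_web_base_events_this_run') }}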

Improve is_run_with_new_events performance

When querying the sessions lifecycle manifest, end_tstamp is used to find the latest event processed. While this is technically correct, it is inefficient.

Since the sessions manifest is partitioned on start_tstamp on BigQuery and sorted on start_tstamp on Redshift, it would be more performant to query on this column. The results should be very similar and the worst case scenario is we end up reprocessing some events when we shouldn't. Given this query executes on every run of the package, the improved performance outweighs the remote chance of reprocessing events.

Allow the user to specify the column name for `snowplow_incremental` logic

Describe the feature

With the snowplow_incremental logic, users aren't able to specify which column should be used to determine which data has been processed. The is_run_with_new_events macro is currently hardcoded to look for start_tstamp; however, technically this could be any timestamp column. We should allow users to specify which column they want to point to in order to see which events have been processed already.

Proposed Solution

In macros/utils/is_run_with_new_events.sql we should replace start_tstamp with a parameter such as {{ start_timestamp }} which should take on a default value of start_tstamp but could be specified by the user in the is_run_with_new_events macro call. We should also create a test to ensure that this is working as expected with custom column names.
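
A heavily simplified sketch of the parameterisation; this is not the macro's actual logic, only an illustration of the column name being passed through with a default:

{% macro is_run_with_new_events(package_name, start_timestamp='start_tstamp') %}
  {# The previously hardcoded start_tstamp becomes whatever column the caller passes in #}
  {% set new_events_query %}
    select max({{ start_timestamp }}) as latest_tstamp
    from {{ ref(package_name ~ '_base_sessions_lifecycle_manifest') }}
  {% endset %}
  {% set results = run_query(new_events_query) %}
  {% if execute %}
    {{ return(results.columns[0].values()[0] is not none) }}
  {% endif %}
{% endmacro %}

A call such as snowplow_utils.is_run_with_new_events('snowplow_web', start_timestamp='derived_tstamp') would then exercise the custom column, which is what the proposed test should cover.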

Add snowplow_mobile_delete_from_manifest macro

Describe the feature

We already have the snowplow_web_delete_from_manifest macro in snowplow_delete_from_manifest. We should make the equivalent macro for the snowplow-mobile package to make the API less cumbersome for the user.

Describe alternatives you've considered

We can alternatively call the more generic snowplow_delete_from_manifest function with two arguments as follows:
{{ snowplow_utils.snowplow_delete_from_manifest(var('models_to_remove', []), ref('snowplow_mobile_incremental_manifest')) }}
however, this is more cumbersome.

Who will this benefit?

This will benefit all future users of the snowplow-mobile package, which is currently in development.

Add functionality to support BigQuery

To add support for BigQuery in the snowplow-web package, we will need various bits of extra functionality. These include:

  • A macro to combine versions of context or unstructured event columns, combine_column_versions. This is a BigQuery-specific problem where, as the schema versions evolve, new columns are created in the events table. It is handy to coalesce fields nested within these columns across all versions of the column for a holistic view over time.
  • Macros to return the various timestamp-based limits as literals. These are used to minimise table scans throughout the snowplow-web package. BigQuery can only partition prune if these are given as literals rather than subqueries (as we use with Redshift).
  • Cross-db timestamp functions. While dbt_utils has dateadd/date_diff which can operate on timestamps, its native behaviour for BigQuery is to cast timestamps to datetimes and then perform the operation. This returns a datetime, which causes BigQuery to throw a datatype mismatch error when comparing against a timestamp column (see the sketch after this list).
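
A sketch of one such cross-db function, using current dbt dispatch syntax; the macro name is illustrative and the default implementation simply defers to dbt_utils:

{% macro timestamp_add(datepart, interval, tstamp) %}
  {{ return(adapter.dispatch('timestamp_add', 'snowplow_utils')(datepart, interval, tstamp)) }}
{% endmacro %}

{% macro default__timestamp_add(datepart, interval, tstamp) %}
  {{ dbt_utils.dateadd(datepart, interval, tstamp) }}
{% endmacro %}

{# BigQuery: operate on timestamps directly so the result stays a timestamp rather than a datetime #}
{% macro bigquery__timestamp_add(datepart, interval, tstamp) %}
  timestamp_add({{ tstamp }}, interval {{ interval }} {{ datepart }})
{% endmacro %}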

Combine_column_versions macro does not properly take into account field aliases.

Describe the bug

The combine_column_versions macro is designed to take fields from a column in BQ and extract them across multiple different schema versions of a column. It is supposed to accept a list of dicts as input for the required_fields parameter, which allows you to specify fields, their preferred names (if applicable) and their data types. An example is the following:

[
    {'field':'latitude', 'renamed_field':'device_latitude', 'dtype':'float64'},
    {'field':'longitude', 'renamed_field':'device_longitude', 'dtype':'float64'},
    {'field':'latitude_longitude_accuracy', 'renamed_field':'device_latitude_longitude_accuracy', 'dtype':'float64'},
    {'field':'altitude', 'renamed_field':'device_altitude', 'dtype':'float64'},
    {'field':'altitude_accuracy', 'renamed_field':'device_altitude_accuracy', 'dtype':'float64'},
    {'field':'bearing', 'renamed_field':'device_bearing', 'dtype':'float64'},
    {'field':'speed', 'renamed_field':'device_speed', 'dtype':'float64'}
]

However, upon inspection the renamed_field is not used correctly, which results in errors when enabling all contexts: the various id fields that should be saved as session_id, device_user_id, etc. are not renamed and remain as id. BQ cannot handle this, so the creation of the table fails and the model stops executing.

Steps to reproduce

Run the dbt-snowplow-mobile model using the feature/bq-snowflake-support branch with dbt-snowplow-utils v0.6.0 to see this problem in action.

Expected results

-- screen context
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].id) as screen_id,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].name) as screen_name,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].activity) as screen_activity,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].fragment) as screen_fragment,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].top_view_controller) as screen_top_view_controller,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].type) as screen_type,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].view_controller) as screen_view_controller,

Actual results

-- screen context
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].id) as id,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].name) as name,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].activity) as activity,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].fragment) as fragment,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].top_view_controller) as top_view_controller,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].type) as type,
coalesce(a.contexts_com_snowplowanalytics_mobile_screen_1_0_0[safe_offset(0)].view_controller) as view_controller,

System information

The contents of your packages.yml file:

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 1.0.0
   latest version: 1.0.0

Up to date!

Plugins:
  - bigquery: 1.0.0
  - snowflake: 0.21.1
  - redshift: 0.21.1
  - postgres: 0.21.1

The operating system you're using:
macOS Monterey 12.1
The output of python --version:
Python 3.9.9

Fix get_columns_in_relation_by_column_prefix docs

The documentation for get_columns_in_relation_by_column_prefix implies that the macro returns the names of matching columns. The macro actually returns column objects, from which the name property can be extracted.

Update to reflect this.
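
For example, the docs could show that getting the names requires mapping over the returned column objects (the relation and prefix here are illustrative):

{% set matched_columns = snowplow_utils.get_columns_in_relation_by_column_prefix(ref('events'), 'contexts_com_snowplowanalytics_') %}
{# The macro returns column objects; pull out the name property explicitly #}
{% set column_names = matched_columns | map(attribute='name') | list %}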

Refactor & add tests get_run_limits

get_run_limits currently performs two functions:

  • Getting the current status from the incremental manifest table
  • Calculating the run limits of the current run based on the current status.

This should be split into two functions and tests added.

Refactor & add tests get_enabled_snowplow_models

get_enabled_snowplow_models currently isn't easy to test as:

  • the models_to_run var is nested inside the function
  • the graph object is fixed to dbt's graph object

Actions:

  • Change models_to_run to input
  • Add functionality to override native dbt graph object
  • Add tests

Remove update_current_incremental_tstamp_table

There are currently 2 versions of the update_current_incremental_tstamp_table macro, with the only difference being the 'create table as' DDL across databases. By stripping the DDL out into its own macro, we can consolidate this macro's code.

Once this is done, the macro will essentially just be calling other macros. Remove it entirely and move its functionality into snowplow_incremental_pre_hook.

Fix snowplow_teardown_all macro

snowplow_teardown_all macro has two issues:

  • It will fail when the models it is trying to drop do not exist.
  • It attempts to drop the users manifest table, which was removed in snowplow-web v0.2.0.

Refactor & add tests app_id_filter

app_id_filter is currently tricky to test, as the snowplow__app_id var is nested inside the macro rather than passed as an input.

Actions:

  • Move the supplied app ids to an input arg (see the sketch below)
  • Simplify the logic that prints the input app ids
  • Add tests
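
A rough sketch of the refactored macro, with the app ids passed in rather than read from the var inside the macro (the exact SQL returned is illustrative):

{% macro app_id_filter(app_ids) %}
  {% if app_ids | length > 0 %}
    app_id in ({{ snowplow_utils.print_list(app_ids) }})
  {% else %}
    true
  {% endif %}
{% endmacro %}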

Fix snowplow_merge for dbt 0.19.2

The macro snowplow_merge will throw a compilation error when running against Redshift. This is because there is no default version (default__ prefix) specified for this macro.

This was not initially included since Redshift will never use this macro, as Redshift doesn't support a MERGE command. In dbt 0.19.2, however, it seems a default must be provided for all dispatch-based macros.

Fix

  • Rename the BigQuery merge macro to default.

Refactor create_incremental_manifest_table

create_incremental_manifest_table currently only supports Redshift.

The cross-warehouse differences within this macro are due to the DDL: create table and alter table.

Actions:

  • Split DDL into their own macros
  • Refactor create_incremental_manifest_table accordingly.

Update integration tests for Github Actions

In preparation for adding a Github Action to run all tests in the package we need to:

  • Rename sample.profiles.yml file to profiles.yml
  • Update the schema name in the profiles.yml

Add 'on_schema_change' config to snowplow_incremental materialization

Issue

dbt v0.21 added a new config option for incremental models, on_schema_change. In short, this parameter allows you to define how to handle changes to the incremental table's schema, for example the addition of a column and whether the table should be mutated to accommodate this change. Since snowplow_incremental is a custom materialization, work is required within the snowplow-utils package to support this config parameter. More details on the parameter can be found here.

Solution

Update the snowplow_incremental materialization logic to include this functionality. This will need to be done across all supported warehouses.
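
For reference, the aim is for a model using the custom materialization to accept the same config as a standard incremental model, along these lines (the unique_key and upstream ref are illustrative):

{{
  config(
    materialized='snowplow_incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns'
  )
}}

select *
from {{ ref('snowplow_web_base_events_this_run') }}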

Update README

Update README to reflect all changes in the release, including the new file directory structure.

Add a type_string macro to generate a varchar type

Describe the feature

The dbt_utils.type_string() macro returns a varchar for supported warehouses, but its behaviour varies across databases. As a result, we need to implement our own functionality to create a varchar data type for each of the different warehouses such that the varchar is of maximum length.

Snowflake

For Snowflake we can continue to use the same functionality, as using varchar will generate a varchar of maximum length.

Postgres

For Postgres, an excerpt from the character datatype page:

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

We should therefore consider using text or character varying.

Redshift

For Redshift, we can use varchar(max).

Bigquery

TBD
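
Below is a sketch of the dispatched macro this could become, based on the per-warehouse choices above; the macro name is an assumption and BigQuery is left to fall through to the default since it is still TBD:

{% macro type_max_string() %}
  {{ return(adapter.dispatch('type_max_string', 'snowplow_utils')()) }}
{% endmacro %}

{# Snowflake and others: a bare varchar is already of maximum length #}
{% macro default__type_max_string() %}
  varchar
{% endmacro %}

{# Postgres: text is preferred over length-constrained character types #}
{% macro postgres__type_max_string() %}
  text
{% endmacro %}

{# Redshift: varchar(max) gives the maximum length #}
{% macro redshift__type_max_string() %}
  varchar(max)
{% endmacro %}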
