dbt-labs / snowplow
Data models for snowplow analytics.
Home Page: https://hub.getdbt.com/dbt-labs/snowplow/latest/
License: Apache License 2.0
I could have a WIP PR for this very quickly. The changes to the default implementations are actually very simple: use cast(column as datatype) instead of column::datatype, and add spark__convert_timezone and spark__dateadd implementations.
Thoughts: should we define a snowplow.dateadd macro, which then calls dbt_utils.dateadd or spark__dateadd as appropriate, or pull spark__dateadd from (e.g.) spark-utils? I wouldn't want that dependency, however, in the 99% of cases when people aren't running this package on Spark. This will all be easier once we reimplement adapter macros.
What about (insert_overwrite) incrementals? Do we need to add a truncated date column to these models, to pass to partition_by? I don't think we can partition on a column expression.
If I'm not mistaken, the deduplication implemented here
https://github.com/fishtown-analytics/snowplow/blob/3795d06f365213ca4930d2447bd1580cb7031557/models/page_views/default/snowplow_web_page_context.sql#L43
drops all events that have a duplicated event_id (named root_id there), instead of keeping only the first one. That seems strange to me; is there a reason?
Also, for BigQuery it would be quite easy to implement an incremental model for this since all the timestamps are there, right? Should I try to submit a PR?
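The shim-macro idea above can be sketched as follows. This is a hypothetical sketch, not the package's actual code: it uses dbt's adapter.dispatch API (which postdates this discussion), and the Spark body assumes day-level intervals only, since Spark's date_add works in days.

```sql
{# Hypothetical snowplow.dateadd shim; dispatch picks the adapter-specific body. #}
{% macro dateadd(datepart, interval, from_date) %}
    {{ return(adapter.dispatch('dateadd', 'snowplow')(datepart, interval, from_date)) }}
{% endmacro %}

{# Non-Spark warehouses delegate to dbt_utils. #}
{% macro default__dateadd(datepart, interval, from_date) %}
    {{ return(dbt_utils.dateadd(datepart, interval, from_date)) }}
{% endmacro %}

{# Spark implementation; assumes datepart is 'day'. #}
{% macro spark__dateadd(datepart, interval, from_date) %}
    date_add({{ from_date }}, {{ interval }})
{% endmacro %}
```

Model SQL would then call {{ snowplow.dateadd('day', 30, 'collector_tstamp') }} and compile correctly on every supported warehouse.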
I believe v0.7.0 introduced code that dramatically slows down building the snowplow_page_views model. Merge: 91e42ef
Lines 45-57: https://github.com/fishtown-analytics/snowplow/blob/91e42efaef97a5f52ba9965155b3d50ac6d45f6b/macros/adapters/default/page_views/snowplow_page_views.sql#L45
It is commented as
-- we need to grab all events for any session that has appeared
-- in order to correctly process the session index below
We started collecting Snowplow data about 8 months ago, and snowplow_web_events currently holds about 73M records. A join to that table as-is will be painful for any mature project. I suppose this part of the code could be optimized (e.g. taking only events from the last n days). As far as I know, Snowplow sessions last 30 minutes by default, so in 99% of cases checking events from the last several days will be OK. Am I correctly following the idea behind that code?
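A hedged sketch of the optimization suggested above: restrict the session-completion join to a recent window instead of scanning all of snowplow_web_events. The 3-day lookback and the ref/column names are assumptions, not the package's actual code.

```sql
-- Only sessions that produced an event recently can still be "open",
-- since Snowplow sessions time out after 30 minutes by default.
with recent_sessions as (

    select distinct domain_sessionid
    from {{ ref('snowplow_web_events') }}
    where collector_tstamp > dateadd(day, -3, current_timestamp)  -- assumed lookback

)

select events.*
from {{ ref('snowplow_web_events') }} as events
inner join recent_sessions using (domain_sessionid)
```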
The final output tables for BigQuery are missing useragent and pref_timing columns. See the view for snowplow_page_views.sql
The contents of your packages.yml file:
packages:
- package: dbt-labs/dbt_utils
version: ["1.0.0"]
- package: calogica/dbt_expectations
version: [">=0.5.0", "<0.9.0"]
- package: dbt-labs/snowplow
version: ["0.15.0"]
Which database are you using dbt with?
The output of dbt --version:
Core:
- installed: 1.3.2
- latest: 1.4.0 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- bigquery: 1.3.0 - Up to date!
The operating system you're using:
The output of python --version:
I'm getting a cryptic error message after removing a module: TypeError: object of type 'NoneType' has no len()
dbt clean
dbt run
(I'm sure the problem is something to do with my project.yml, but I'm not sure what exactly is throwing the error... will report back when I can.)
An error message that will help me fix my configs/project.yml
2021-01-15 11:19:42.774812 (MainThread): Traceback (most recent call last):
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/main.py", line 124, in main
results, succeeded = handle_and_check(args)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/main.py", line 202, in handle_and_check
task, res = run_from_args(parsed)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/main.py", line 255, in run_from_args
results = task.run()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 399, in run
self._runtime_initialize()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 118, in _runtime_initialize
super()._runtime_initialize()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 78, in _runtime_initialize
self.load_manifest()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 65, in load_manifest
self.manifest = get_full_manifest(self.config)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/perf_utils.py", line 31, in get_full_manifest
adapter.connections.set_query_header,
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 678, in load_manifest
return ManifestLoader.load_all(config, macro_manifest, macro_hook)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 348, in load_all
loader.load(macro_manifest=macro_manifest)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 204, in load
self.parse_project(project, macro_manifest, old_results)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 177, in parse_project
self.parse_with_cache(path, parser, old_results)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 133, in parse_with_cache
parser.parse_file(block)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 429, in parse_file
self.parse_node(file_block)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 402, in parse_node
self.render_update(node, config)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 377, in render_update
self.render_with_context(node, config)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 289, in render_with_context
parsed_node.raw_sql, context, parsed_node, capture_macros=True
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/clients/jinja.py", line 577, in get_rendered
return render_template(template, ctx, node)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/clients/jinja.py", line 528, in render_template
return template.render(ctx)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/jinja2/environment.py", line 1090, in render
self.environment.handle_exception()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/jinja2/environment.py", line 832, in handle_exception
reraise(*rewrite_traceback_stack(source=source))
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/jinja2/_compat.py", line 28, in reraise
raise value.with_traceback(tb)
File "", line 111, in top-level template code
TypeError: object of type 'NoneType' has no len()
Which database are you using dbt with?
The output of dbt --version:
0.18.0
The operating system you're using:
MAC OS
The output of python --version:
3.7.3
I've recently removed the snowplow module and copied some of the models into my own models folder. Reversing these steps gets rid of the problem.
I can't see anything referencing the model snowplow_web_events_internal_fixed
I am following the code based on the model flow graph in the README.
The graph explains that snowplow_page_views references it, but I see nothing in the macro. I also see no entry for it in the schema.yml. I realize it is ephemeral, but it should still be referenced in order to inline it as a CTE.
Have I missed something? Or can it be removed and the graph updated?
snowplow_page_views (https://github.com/fishtown-analytics/snowplow/blob/master/models/page_views/snowplow_page_views.sql) currently filters bot traffic on both br_family and user_agent, but I would like to also add br_type not in ('Bot/Crawler', 'Robot').
session_id fails the unique test in snowplow_sessions after a full-refresh: https://github.com/fishtown-analytics/snowplow/blob/master/macros/adapters/default/sessions/snowplow_sessions.sql
I have encountered a case where page hits from Pingdom are not being excluded.
In this particular case the useragent from Pingdom looks like this:
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PingdomTMS/0.8.5 Safari/534.34
The regex is only set up to pick up pingdom and PingdomBot, so the PingdomTMS variant slips through.
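A hedged sketch of a broader filter that would also catch the PingdomTMS variant (the package's actual regex and column naming may differ):

```sql
-- case-insensitive match on any Pingdom user agent
-- (pingdom, PingdomBot, PingdomTMS, ...)
where useragent not ilike '%pingdom%'
```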
Currently these open source models rely on functionality only provided in the js tracker -- domainsessionidx and domainsessionid. We need to reimplement sessionization to allow other trackers that don't contain that functionality to also participate as first-class citizens in these models (currently they get filtered out).
I have run into an issue with NULL values in br_family on Redshift. There is a condition in snowplow_page_views.sql at line 307: where a.br_family != 'Robot/Spider'. When br_family is NULL, that condition filters the row out, because NULL compares as neither equal nor unequal to anything. As a result, a lot of page views with a NULL br_family are totally lost from the result tables of the dbt model.
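A hedged sketch of the fix: make the comparison NULL-safe so that page views without a parsed browser family are kept.

```sql
-- a NULL br_family is not a robot; coalesce to '' keeps those rows
where coalesce(a.br_family, '') != 'Robot/Spider'
```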
When trying to install snowplow using the following packages.yml file:
packages:
this error is generated:
Installing fishtown-analytics/[email protected]
Installed from version 0.12.0
Installing fishtown-analytics/[email protected]
Installed from version 0.6.6
Compilation Error in model snowplow_web_ua_parser_context (models/page_views/optional/snowplow_web_ua_parser_context.sql)
at path ['enabled']: None is not of type 'boolean' Code: 10004
When the variable snowplow:pass_through_columns is empty, the Jinja whitespace strippers at lines 170-174 in /macros/adapters/bigquery/pageviews/snowplow_page_views.sql remove too much whitespace, resulting in the following compiled code:
struct(
br_renderengine as browser_engine,
dvce_type as type,
dvce_ismobile as is_mobile
) as devicefrom events
where event = 'page_view'
Removing one of the strippers will yield the correct syntax:
struct(
br_renderengine as browser_engine,
dvce_type as type,
dvce_ismobile as is_mobile
) as device
from events
where event = 'page_view'
I'm not sure, but maybe this is not the correct way to filter duplicates:
https://github.com/fishtown-analytics/snowplow/blob/274c7969d61a552cdc996d00801d7f43cc9b6c03/models/page_views/optional/snowplow_web_ua_parser_context.sql#L43
It first searches for rows sharing the same id:
duplicated as (
select
id
from prep
group by 1
having count(*) > 1
)
The next step excludes them all:
where page_view_id not in (select id from duplicated)
Let's assume we have 3 rows with the same id: all of them get rejected. Wouldn't it be more correct to filter out 2 of them and keep 1 in the result set?
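A hedged sketch of keeping exactly one row per id instead of rejecting every duplicate; the tiebreaker in the window ordering is an assumption, since this source may not carry a usable timestamp.

```sql
deduped as (

    select *
    from (
        select
            *,
            row_number() over (
                partition by id
                order by id  -- assumed tiebreaker; prefer a timestamp column if available
            ) as row_num
        from prep
    )
    where row_num = 1

)
```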
It looks like Snowplow has switched to a new way of processing raw data (https://github.com/snowplow/data-models) and calls the new data model V1. The current implementation of the dbt module does not suit it anymore; the new data model should be supported/adopted by the dbt snowplow module.
A possible workaround: have the dbt module create all Snowplow tables by parsing JSON fields in the EVENTS table, e.g. create the table com_snowplowanalytics_snowplow_web_page_1 by parsing the field EVENTS.contexts_com_snowplowanalytics_snowplow_web_page_1.
This affects all users who are using dbt and Snowplow at the same time on their projects.
Copied from dbt-labs/segment#42: add a require-dbt-version parameter in the dbt_project.yml file.
I have run into a problem with the snowplow model in dbt:
Database Error in model snowplow_web_timing_context (models/page_views/optional/snowplow_web_timing_context.sql)
column "event_id" specified in USING clause does not exist in left table
compiled SQL at target/compiled/snowplow/page_views/optional/snowplow_web_timing_context.sql
For the timing context I'm using just the standard context: https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#performanceTiming with this schema: https://github.com/snowplow/iglu-central/blob/master/sql/org.w3/performance_timing_1.sql
I suspect there is a bug in using event_id as the join key, because in the schema root_id is used as the reference to the event. Here is an example of processing timing data: https://discourse.snowplowanalytics.com/t/measuring-page-load-times-with-the-performance-timing-context-tutorial/100 and it is done with root_id.
There is one more issue: a non-existent field, page_view_id. I suspect the non-optional views were amended according to the new field name: https://github.com/fishtown-analytics/snowplow/blob/274c7969d61a552cdc996d00801d7f43cc9b6c03/macros/adapters/default/page_views/snowplow_web_page_context.sql#L35 In the optional views it was not changed, so they still use page_view_id, not just id.
To Do:
Right now, Snowplow models are encoded as macros in order to implement cross-database support. Instead, we should experiment with making models specific to each supported adapter. Broadly, I think this falls into:
The models directory might look like:
models:
- common/
- default/
- bigquery/
Every model would need to have a config like:
{{ config(enabled=(var('adapter') == 'default')) }}
or
{{ config(enabled=(var('adapter') == 'bigquery')) }}
I don't think we can configure this at a folder level in dbt_project.yml, but it might be worth exploring whether there's some clever way to implement that.
This experimentation might result in us wanting to keep the adapter-macro pattern, but I'm interested in exploring alternative ways of supporting cross-db model implementations in packages!
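For what it's worth, dbt exposes the current adapter as target.type, so the per-model guard could avoid a user-supplied var entirely. A minimal sketch (the model body is a placeholder):

```sql
-- enable this model only when running against BigQuery
{{ config(enabled=(target.type == 'bigquery')) }}

select 1 as placeholder_column
```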
Due to this code regarding timing: https://github.com/fishtown-analytics/snowplow/blob/692938e572e3eca5cb019d2adf877f0d1b7f0147/macros/adapters/default/page_views/snowplow_page_views.sql#L272
there is a possible issue. The value is cast to integer; however, in the source table
https://github.com/fishtown-analytics/snowplow/blob/692938e572e3eca5cb019d2adf877f0d1b7f0147/models/page_views/optional/snowplow_web_timing_context.sql#L110
the calculation involves bigint values, so the resulting value is also bigint. I get this error:
Value out of range for 4 bytes.
DETAIL:
-----------------------------------------------
error: Value out of range for 4 bytes.
code: 8001
context: Input:2311553464.
query: 1121021
location: funcs_int.hpp:97
process: query0_106_1121021 [pid=12067]
-----------------------------------------------
compiled SQL at target/compiled/snowplow/page_views/snowplow_page_views.sql
I figured out that it is the field dom_loading_to_interactive_time_in_ms, but it can happen with any field.
I'm going to prepare a PR to fix the cast to int. The problem is that the table is incremental, so where should I place a migration script from the int-based table to a bigint one?
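One hedged option for the already-built incremental table, assuming Redshift (where a numeric column's type cannot be altered in place) and using the column-swap pattern; the schema and table names are illustrative. Alternatively, a dbt run --full-refresh of the model simply rebuilds it with the new type.

```sql
-- One-off migration of an int column to bigint via column swap.
alter table analytics.snowplow_page_views
    add column dom_loading_to_interactive_time_in_ms_big bigint;

update analytics.snowplow_page_views
    set dom_loading_to_interactive_time_in_ms_big = dom_loading_to_interactive_time_in_ms;

alter table analytics.snowplow_page_views
    drop column dom_loading_to_interactive_time_in_ms;

alter table analytics.snowplow_page_views
    rename column dom_loading_to_interactive_time_in_ms_big
    to dom_loading_to_interactive_time_in_ms;
```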
I think with incremental loading, after each load the page_view_index resets to 1, even if the user keeps the page open across days.
The default models assume that the web_page_context data will land in a separate table from the raw events. On Snowflake that is not the case; the data is inside a variant field. Could some models for Snowflake be skipped or made unnecessary by just flattening out the field in the source events, something like CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1[0]['id'] as PAGE_VIEW_ID? Has that been considered?
The snowplow_web_page_context model contains this comment:
-- This one is a little tougher to make incremental
-- because there's no timestamp field here. We could
-- relocate the event collector_tstamp (by root_id)
-- onto snowplow_base_web_page_context, but that would
-- likely mitigate any performance gains!
I have found it necessary to dedupe the data in my raw events table anyway, because if I don't, the snowplow_sessions_tmp model fails on merge.
To achieve that deduping in my raw events, this was recommended (by helpful people on the Slack channel):
QUALIFY
row_number() over (partition by EVENT_ID order by COLLECTOR_TSTAMP, DERIVED_TSTAMP) = 1;
That works but is incredibly slow so I have had to implement it as incremental, using the collector_tstamp as a watermark.
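A hedged sketch of the incremental dedupe model described above, using collector_tstamp as the watermark; the source and key names are assumptions.

```sql
{{ config(materialized='incremental', unique_key='event_id') }}

select *
from {{ source('atomic', 'events') }}

{% if is_incremental() %}
-- watermark: only scan events newer than what we've already processed
where collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}

-- keep the first row per event_id (Snowflake QUALIFY)
qualify row_number() over (
    partition by event_id
    order by collector_tstamp, derived_tstamp
) = 1
```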
My concern is with the web_page_context model being separated from the events and not incremental, only to be later recombined into a number of other models.
If you think there is merit to this, I can look at putting in Snowflake-specific changes to optimise the model flow for that system. Otherwise, if this has already been considered and ruled out, could you let me know why it might be a bad idea?
This is Snowflake specific.
Those with huge amounts of ever-growing events from Snowplow pipelines.
From dbt Slack:
A user had the following config in dbt_project.yml:
models:
  ...
  snowplow:
    enabled: true
Setting enabled: true in the root project overrides the dynamic, target-dependent enabled values set from within the snowplow package. Removing that line made this magically work again. I don't think there's anything we could or should do to "lock" those enabled values within a package, since it's expected (and good) behavior that the user can override locally. What I will do is document that setting a top-level enabled config can have surprising results.
Reminder to self to add a note to the README :)
Include in package integration tests:
- an event update with the same session_id as an earlier event
- an event update with the same page_view_id as an earlier event (?)
- an event update with the same event_id as an earlier event (?)
Our approach to these edge cases has been in flux over the past few releases. I think we have a good grasp of what the desired behavior is as of 0.7.1. Let's include these cases in our integration tests to ensure the modeling operates as expected.
The Snowplow package has two first-class outputs: snowplow_sessions and snowplow_page_views. Both should include the stitched user identity, inferred_user_id, which is the product of snowplow_id_map.
We frequently find ourselves building reports on top of both tables, and we should be able to perform counts of distinct visitors that agree between them.
Hi, I'm using the GCP Snowplow setup. This puts events into a single BigQuery table, with this schema: https://gist.github.com/wub/9d4b9d27f15f8e7c87906e55cb913f68
When I run with this configuration in dbt 0.14.0:
models:
snowplow:
vars:
'snowplow:timezone': 'Pacific/Auckland'
'snowplow:page_ping_frequency': 10
'snowplow:events': "dataset_name.events"
'snowplow:context:web_page': "dataset_name.events"
'snowplow:context:performance_timing': false
'snowplow:context:useragent': false
'snowplow:pass_through_columns': []
base:
optional:
enabled: false
page_views:
optional:
enabled: false
I get this error:
Found 7 models, 22 tests, 0 snapshots, 0 analyses, 163 macros, 0 operations, 1 seed files, 0 sources
21:34:12 | Concurrency: 2 threads (target='dev')
21:34:12 |
21:34:13 | 1 of 5 START table model dataset_name.snowplow_web_page_context [RUN]
21:34:13 | 2 of 5 START incremental model dataset_name.snowplow_id_map.... [RUN]
21:34:17 | 2 of 5 ERROR creating incremental model dataset_name.snowplow_id_map [ERROR in 2.85s]
21:34:19 | 1 of 5 ERROR creating table model dataset_name.snowplow_web_page_context [ERROR in 3.23s]
21:34:19 | 3 of 5 SKIP relation dataset_name.snowplow_page_views.......... [SKIP]
21:34:19 | 4 of 5 SKIP relation dataset_name.snowplow_sessions_tmp........ [SKIP]
21:34:19 | 5 of 5 SKIP relation dataset_name.snowplow_sessions............ [SKIP]
21:34:19 |
21:34:19 | Finished running 3 incremental models, 2 table models in 10.23s.
Completed with 2 errors:
Database Error in model snowplow_id_map (models\identification\snowplow_id_map.sql)
No matching signature for function TIMESTAMP for argument types: TIMESTAMP. Supported signatures: TIMESTAMP(STRING, [STRING]); TIMESTAMP(DATE, [STRING]); TIMESTAMP(DATETIME, [STRING]) at [101:13]
compiled SQL at target\compiled\snowplow\identification\snowplow_id_map.sql
Database Error in model snowplow_web_page_context (models\page_views\snowplow_web_page_context.sql)
Unrecognized name: root_id at [35:9]
compiled SQL at target\compiled\snowplow\page_views\snowplow_web_page_context.sql
Done. PASS=0 ERROR=2 SKIP=3 TOTAL=5
Any info/direction would be greatly appreciated - thanks!