dbt-labs / snowplow
Data models for snowplow analytics.
Home Page: https://hub.getdbt.com/dbt-labs/snowplow/latest/
License: Apache License 2.0
I could have a WIP PR for this very quickly. The changes to the default implementations are actually very simple: use cast(column as datatype) instead of column::datatype, and add spark__convert_timezone and spark__dateadd implementations.
Thoughts: should we define a snowplow.dateadd macro, which then calls dbt_utils.dateadd or spark__dateadd as appropriate, or pull spark__dateadd from (e.g.) spark-utils? I wouldn't want that dependency, however, in the 99% of cases when people aren't running this package on Spark. This will all be easier once we reimplement adapter macros.
What about (insert_overwrite) incrementals? Do we need to add a truncated date column to these models, to pass to partition_by? I don't think we can partition on a column expression.
If I'm not mistaken, the deduplication implemented here
https://github.com/fishtown-analytics/snowplow/blob/3795d06f365213ca4930d2447bd1580cb7031557/models/page_views/default/snowplow_web_page_context.sql#L43
drops all events that have a duplicated event_id (named root_id there), instead of keeping only the first one. That seems strange to me; is there a reason?
Also, for BigQuery it would be quite easy to implement an incremental model for this since all the timestamps are there, right? Should I try to submit a PR?
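The shim-macro idea above can be sketched as follows. This is a hypothetical sketch, not the package's actual code: it uses dbt's adapter.dispatch API (which postdates this discussion), and the Spark body assumes day-level intervals only, since Spark's date_add works in days.

```sql
{# Hypothetical snowplow.dateadd shim; dispatch picks the adapter-specific body. #}
{% macro dateadd(datepart, interval, from_date) %}
    {{ return(adapter.dispatch('dateadd', 'snowplow')(datepart, interval, from_date)) }}
{% endmacro %}

{# Non-Spark warehouses delegate to dbt_utils. #}
{% macro default__dateadd(datepart, interval, from_date) %}
    {{ return(dbt_utils.dateadd(datepart, interval, from_date)) }}
{% endmacro %}

{# Spark implementation; assumes datepart is 'day'. #}
{% macro spark__dateadd(datepart, interval, from_date) %}
    date_add({{ from_date }}, {{ interval }})
{% endmacro %}
```

Model SQL would then call {{ snowplow.dateadd('day', 30, 'collector_tstamp') }} and compile correctly on every supported warehouse.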
I believe v0.7.0 introduced code that dramatically slows down building the snowplow_page_views model. Merge: 91e42ef
Lines 45-57: https://github.com/fishtown-analytics/snowplow/blob/91e42efaef97a5f52ba9965155b3d50ac6d45f6b/macros/adapters/default/page_views/snowplow_page_views.sql#L45
It is commented as
-- we need to grab all events for any session that has appeared
-- in order to correctly process the session index below
We started collecting Snowplow data about 8 months ago, and snowplow_web_events currently holds about 73M records. A join to that table as-is will be painful for any mature project. I suppose this part of the code could be optimized (e.g. taking only events from the last n days). As far as I know, Snowplow sessions last 30 minutes by default, so in 99% of cases checking events from the last several days will be OK. Am I correctly following the idea behind that code?
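A hedged sketch of the optimization suggested above: restrict the session-completion join to a recent window instead of scanning all of snowplow_web_events. The 3-day lookback and the ref/column names are assumptions, not the package's actual code.

```sql
-- Only sessions that produced an event recently can still be "open",
-- since Snowplow sessions time out after 30 minutes by default.
with recent_sessions as (

    select distinct domain_sessionid
    from {{ ref('snowplow_web_events') }}
    where collector_tstamp > dateadd(day, -3, current_timestamp)  -- assumed lookback

)

select events.*
from {{ ref('snowplow_web_events') }} as events
inner join recent_sessions using (domain_sessionid)
```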
The final output tables for BigQuery are missing useragent and pref_timing columns. See the view for snowplow_page_views.sql
The contents of your packages.yml file:
packages:
- package: dbt-labs/dbt_utils
version: ["1.0.0"]
- package: calogica/dbt_expectations
version: [">=0.5.0", "<0.9.0"]
- package: dbt-labs/snowplow
version: ["0.15.0"]
Which database are you using dbt with?
The output of dbt --version:
Core:
- installed: 1.3.2
- latest: 1.4.0 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- bigquery: 1.3.0 - Up to date!
The operating system you're using:
The output of python --version:
I'm getting a cryptic error message after removing a module: TypeError: object of type 'NoneType' has no len()
dbt clean
dbt run
(I'm sure the problem is something to do with my project.yml, but I'm not sure what exactly is throwing the error... will report back when I can.)
An error message that will help me fix my configs/project.yml
2021-01-15 11:19:42.774812 (MainThread): Traceback (most recent call last):
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/main.py", line 124, in main
results, succeeded = handle_and_check(args)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/main.py", line 202, in handle_and_check
task, res = run_from_args(parsed)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/main.py", line 255, in run_from_args
results = task.run()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 399, in run
self._runtime_initialize()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 118, in _runtime_initialize
super()._runtime_initialize()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 78, in _runtime_initialize
self.load_manifest()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/task/runnable.py", line 65, in load_manifest
self.manifest = get_full_manifest(self.config)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/perf_utils.py", line 31, in get_full_manifest
adapter.connections.set_query_header,
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 678, in load_manifest
return ManifestLoader.load_all(config, macro_manifest, macro_hook)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 348, in load_all
loader.load(macro_manifest=macro_manifest)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 204, in load
self.parse_project(project, macro_manifest, old_results)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 177, in parse_project
self.parse_with_cache(path, parser, old_results)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/manifest.py", line 133, in parse_with_cache
parser.parse_file(block)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 429, in parse_file
self.parse_node(file_block)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 402, in parse_node
self.render_update(node, config)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 377, in render_update
self.render_with_context(node, config)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/parser/base.py", line 289, in render_with_context
parsed_node.raw_sql, context, parsed_node, capture_macros=True
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/clients/jinja.py", line 577, in get_rendered
return render_template(template, ctx, node)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/dbt/clients/jinja.py", line 528, in render_template
return template.render(ctx)
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/jinja2/environment.py", line 1090, in render
self.environment.handle_exception()
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/jinja2/environment.py", line 832, in handle_exception
reraise(*rewrite_traceback_stack(source=source))
File "/Users/JoshuaLayton-Wood/venvs/dbt/lib/python3.7/site-packages/jinja2/_compat.py", line 28, in reraise
raise value.with_traceback(tb)
File "", line 111, in top-level template code
TypeError: object of type 'NoneType' has no len()
Which database are you using dbt with?
The output of dbt --version:
0.18.0
The operating system you're using:
MAC OS
The output of python --version:
3.7.3
I've recently removed the snowplow module and copied some of the models into my own models folder. Reversing these steps gets rid of the problem.
I can't see anything referencing the model snowplow_web_events_internal_fixed
I am following the code based on the model flow graph in the README.
The graph explains that snowplow_page_views references it, but I see nothing in the macro. I also see no entry for it in the schema.yml. I realize it is ephemeral, but it should still be referenced in order to inline it as a CTE.
Have I missed something? Or can it be removed and the graph updated?
snowplow_page_views (https://github.com/fishtown-analytics/snowplow/blob/master/models/page_views/snowplow_page_views.sql) currently filters bot traffic on both br_family and user_agent, but I would like to also add br_type not in ('Bot/Crawler', 'Robot').
session_id fails the unique test in snowplow_sessions after a full-refresh: https://github.com/fishtown-analytics/snowplow/blob/master/macros/adapters/default/sessions/snowplow_sessions.sql
I have encountered a case where page hits from Pingdom are not being excluded.
In this particular case the useragent from Pingdom looks like this:
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) PingdomTMS/0.8.5 Safari/534.34
The regex is only set up to pick up pingdom and PingdomBot, so the PingdomTMS variant slips through.
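A hedged sketch of a broader filter that would also catch the PingdomTMS variant (the package's actual regex and column naming may differ):

```sql
-- case-insensitive match on any Pingdom user agent
-- (pingdom, PingdomBot, PingdomTMS, ...)
where useragent not ilike '%pingdom%'
```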
Currently these open source models rely on functionality only provided in the js tracker -- domainsessionidx and domainsessionid. We need to reimplement sessionization to allow other trackers that don't contain that functionality to also participate as first-class citizens in these models (currently they get filtered out).
I have run into an issue with NULL values in br_family on Redshift. There is a condition in snowplow_page_views.sql at line 307: where a.br_family != 'Robot/Spider'. When br_family is NULL, that condition filters the row out, because NULL compares as neither equal nor unequal to anything. As a result, a lot of page views with a NULL br_family are totally lost from the result tables of the dbt model.
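A hedged sketch of the fix: make the comparison NULL-safe so that page views without a parsed browser family are kept.

```sql
-- a NULL br_family is not a robot; coalesce to '' keeps those rows
where coalesce(a.br_family, '') != 'Robot/Spider'
```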
When trying to install snowplow using the following packages.yml file:
packages:
this error is generated:
Installing fishtown-analytics/[email protected]
Installed from version 0.12.0
Installing fishtown-analytics/[email protected]
Installed from version 0.6.6
Compilation Error in model snowplow_web_ua_parser_context (models/page_views/optional/snowplow_web_ua_parser_context.sql)
at path ['enabled']: None is not of type 'boolean' Code: 10004
When the variable snowplow:pass_through_columns is empty, the Jinja whitespace strippers at lines 170-174 in /macros/adapters/bigquery/pageviews/snowplow_page_views.sql remove too much whitespace, resulting in the following compiled code:
struct(
br_renderengine as browser_engine,
dvce_type as type,
dvce_ismobile as is_mobile
) as devicefrom events
where event = 'page_view'
Removing one of the strippers will yield the correct syntax:
struct(
br_renderengine as browser_engine,
dvce_type as type,
dvce_ismobile as is_mobile
) as device
from events
where event = 'page_view'
I'm not sure, but maybe this is not the correct way to filter duplicates:
https://github.com/fishtown-analytics/snowplow/blob/274c7969d61a552cdc996d00801d7f43cc9b6c03/models/page_views/optional/snowplow_web_ua_parser_context.sql#L43
It first searches for rows sharing the same id:
duplicated as (
select
id
from prep
group by 1
having count(*) > 1
)
The next step excludes them all:
where page_view_id not in (select id from duplicated)
Let's assume we have 3 rows with the same id: all of them get rejected. Wouldn't it be more correct to filter out 2 of them and keep 1 in the result set?
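A hedged sketch of keeping exactly one row per id instead of rejecting every duplicate; the tiebreaker in the window ordering is an assumption, since this source may not carry a usable timestamp.

```sql
deduped as (

    select *
    from (
        select
            *,
            row_number() over (
                partition by id
                order by id  -- assumed tiebreaker; prefer a timestamp column if available
            ) as row_num
        from prep
    )
    where row_num = 1

)
```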
It looks like Snowplow has switched to a new way of processing raw data (https://github.com/snowplow/data-models) and calls the new data model V1. The current implementation of the dbt module does not suit it anymore; the new data model should be supported/adopted by the dbt snowplow module.
A possible workaround: have the dbt module create all Snowplow tables by parsing JSON fields in the EVENTS table, e.g. create the table com_snowplowanalytics_snowplow_web_page_1 by parsing the field EVENTS.contexts_com_snowplowanalytics_snowplow_web_page_1.
This affects all users who are using dbt and Snowplow at the same time on their projects.
Copied from dbt-labs/segment#42: add a require-dbt-version parameter in the dbt_project.yml file.
I have run into a problem with the snowplow model in dbt:
Database Error in model snowplow_web_timing_context (models/page_views/optional/snowplow_web_timing_context.sql)
column "event_id" specified in USING clause does not exist in left table
compiled SQL at target/compiled/snowplow/page_views/optional/snowplow_web_timing_context.sql
For the timing context I'm using just the standard context: https://github.com/snowplow/snowplow/wiki/1-General-parameters-for-the-Javascript-tracker#performanceTiming with this schema: https://github.com/snowplow/iglu-central/blob/master/sql/org.w3/performance_timing_1.sql
I suspect there is a bug in using event_id as the join key, because in the schema root_id is used as the reference to the event. Here is an example of processing timing data: https://discourse.snowplowanalytics.com/t/measuring-page-load-times-with-the-performance-timing-context-tutorial/100 and it is done with root_id.
There is one more issue: a non-existent field, page_view_id. I suspect the non-optional views were amended according to the new field name: https://github.com/fishtown-analytics/snowplow/blob/274c7969d61a552cdc996d00801d7f43cc9b6c03/macros/adapters/default/page_views/snowplow_web_page_context.sql#L35 In the optional views it was not changed, so they still use page_view_id, not just id.
To Do:
Right now, Snowplow models are encoded as macros in order to implement cross-database support. Instead, we should experiment with making models specific to each supported adapter. Broadly, I think this falls into:
The models directory might look like:
models:
- common/
- default/
- bigquery/
Every model would need to have a config like:
{{ config(enabled=(var('adapter') == 'default')) }}
or
{{ config(enabled=(var('adapter') == 'bigquery')) }}
I don't think we can configure this at a folder level in dbt_project.yml, but it might be worth exploring whether there's some clever way to implement that.
This experimentation might result in us wanting to keep the adapter-macro pattern, but I'm interested in exploring alternative ways of supporting cross-db model implementations in packages!
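For what it's worth, dbt exposes the current adapter as target.type, so the per-model guard could avoid a user-supplied var entirely. A minimal sketch (the model body is a placeholder):

```sql
-- enable this model only when running against BigQuery
{{ config(enabled=(target.type == 'bigquery')) }}

select 1 as placeholder_column
```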
Due to this code regarding timing: https://github.com/fishtown-analytics/snowplow/blob/692938e572e3eca5cb019d2adf877f0d1b7f0147/macros/adapters/default/page_views/snowplow_page_views.sql#L272
there is a possible issue. The value is cast to integer; however, in the source table
https://github.com/fishtown-analytics/snowplow/blob/692938e572e3eca5cb019d2adf877f0d1b7f0147/models/page_views/optional/snowplow_web_timing_context.sql#L110
the calculation involves bigint values, so the resulting value is also bigint. I get this error:
Value out of range for 4 bytes.
DETAIL:
-----------------------------------------------
error: Value out of range for 4 bytes.
code: 8001
context: Input:2311553464.
query: 1121021
location: funcs_int.hpp:97
process: query0_106_1121021 [pid=12067]
-----------------------------------------------
compiled SQL at target/compiled/snowplow/page_views/snowplow_page_views.sql
I figured out that it is the field dom_loading_to_interactive_time_in_ms, but it can happen with any field.
I'm going to prepare a PR to fix the cast to int. The problem is that the table is incremental, so where should I place a migration script from the int-based table to a bigint one?
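One hedged option for the already-built incremental table, assuming Redshift (where a numeric column's type cannot be altered in place) and using the column-swap pattern; the schema and table names are illustrative. Alternatively, a dbt run --full-refresh of the model simply rebuilds it with the new type.

```sql
-- One-off migration of an int column to bigint via column swap.
alter table analytics.snowplow_page_views
    add column dom_loading_to_interactive_time_in_ms_big bigint;

update analytics.snowplow_page_views
    set dom_loading_to_interactive_time_in_ms_big = dom_loading_to_interactive_time_in_ms;

alter table analytics.snowplow_page_views
    drop column dom_loading_to_interactive_time_in_ms;

alter table analytics.snowplow_page_views
    rename column dom_loading_to_interactive_time_in_ms_big
    to dom_loading_to_interactive_time_in_ms;
```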
I think with incremental loading, after each load the page_view_index resets to 1, even if the user keeps the page open across days.
The default models assume that the web_page_context data will land in a separate table from the raw events. On Snowflake that is not the case; the data is inside a variant field. Could some models for Snowflake be skipped or made unnecessary by just flattening out the field in the source events, something like CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1[0]['id'] as PAGE_VIEW_ID? Has that been considered?
The snowplow_web_page_context model contains this comment:
-- This one is a little tougher to make incremental
-- because there's no timestamp field here. We could
-- relocate the event collector_tstamp (by root_id)
-- onto snowplow_base_web_page_context, but that would
-- likely mitigate any performance gains!
I have found it necessary to dedupe the data in my raw events table anyway, because if I don't, the snowplow_sessions_tmp model fails on merge.
To achieve that deduping in my raw events, this was recommended (by helpful people on the Slack channel):
QUALIFY
row_number() over (partition by EVENT_ID order by COLLECTOR_TSTAMP, DERIVED_TSTAMP) = 1;
That works but is incredibly slow so I have had to implement it as incremental, using the collector_tstamp as a watermark.
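A hedged sketch of the incremental dedupe model described above, using collector_tstamp as the watermark; the source and key names are assumptions.

```sql
{{ config(materialized='incremental', unique_key='event_id') }}

select *
from {{ source('atomic', 'events') }}

{% if is_incremental() %}
-- watermark: only scan events newer than what we've already processed
where collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}

-- keep the first row per event_id (Snowflake QUALIFY)
qualify row_number() over (
    partition by event_id
    order by collector_tstamp, derived_tstamp
) = 1
```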
My concern is with the web_page_context model being separated from the events and not incremental, only to be later recombined into a number of other models.
If you think there is merit to this, I can look at putting in Snowflake-specific changes to optimise the model flow for that system. Otherwise, if this has already been considered and ruled out, could you let me know why it might be a bad idea?
This is Snowflake specific.
Those with huge amounts of ever-growing events from Snowplow pipelines.
From dbt Slack:
A user had the following config in dbt_project.yml:
models:
  ...
  snowplow:
    enabled: true
Setting enabled: true in the root project overrides the dynamic, target-dependent enabled values set from within the snowplow package. Removing that line made this magically work again. I don't think there's anything we could or should do to "lock" those enabled values within a package, since it's expected (and good) behavior that the user can override locally. What I will do is document that setting a top-level enabled config can have surprising results.
Reminder to self to add a note to the README :)
Include in package integration tests:
- an event update with the same session_id as an earlier event
- an event update with the same page_view_id as an earlier event (?)
- an event update with the same event_id as an earlier event (?)
Our approach to these edge cases has been in flux over the past few releases. I think we have a good grasp of what the desired behavior is as of 0.7.1. Let's include these cases in our integration tests to ensure the modeling operates as expected.
The Snowplow package has two first-class outputs: snowplow_sessions and snowplow_page_views. Both should include the stitched user identity, inferred_user_id, which is the product of snowplow_id_map.
We frequently find ourselves building reports on top of both tables, and we should be able to perform counts of distinct visitors that agree between them.
Hi, I'm using the GCP Snowplow setup. This puts events into a single BigQuery table, with this schema: https://gist.github.com/wub/9d4b9d27f15f8e7c87906e55cb913f68
When I run with this configuration in dbt 0.14.0:
models:
snowplow:
vars:
'snowplow:timezone': 'Pacific/Auckland'
'snowplow:page_ping_frequency': 10
'snowplow:events': "dataset_name.events"
'snowplow:context:web_page': "dataset_name.events"
'snowplow:context:performance_timing': false
'snowplow:context:useragent': false
'snowplow:pass_through_columns': []
base:
optional:
enabled: false
page_views:
optional:
enabled: false
I get this error:
Found 7 models, 22 tests, 0 snapshots, 0 analyses, 163 macros, 0 operations, 1 seed files, 0 sources
21:34:12 | Concurrency: 2 threads (target='dev')
21:34:12 |
21:34:13 | 1 of 5 START table model dataset_name.snowplow_web_page_context [RUN]
21:34:13 | 2 of 5 START incremental model dataset_name.snowplow_id_map.... [RUN]
21:34:17 | 2 of 5 ERROR creating incremental model dataset_name.snowplow_id_map [ERROR in 2.85s]
21:34:19 | 1 of 5 ERROR creating table model dataset_name.snowplow_web_page_context [ERROR in 3.23s]
21:34:19 | 3 of 5 SKIP relation dataset_name.snowplow_page_views.......... [SKIP]
21:34:19 | 4 of 5 SKIP relation dataset_name.snowplow_sessions_tmp........ [SKIP]
21:34:19 | 5 of 5 SKIP relation dataset_name.snowplow_sessions............ [SKIP]
21:34:19 |
21:34:19 | Finished running 3 incremental models, 2 table models in 10.23s.
Completed with 2 errors:
Database Error in model snowplow_id_map (models\identification\snowplow_id_map.sql)
No matching signature for function TIMESTAMP for argument types: TIMESTAMP. Supported signatures: TIMESTAMP(STRING, [STRING]); TIMESTAMP(DATE, [STRING]); TIMESTAMP(DATETIME, [STRING]) at [101:13]
compiled SQL at target\compiled\snowplow\identification\snowplow_id_map.sql
Database Error in model snowplow_web_page_context (models\page_views\snowplow_web_page_context.sql)
Unrecognized name: root_id at [35:9]
compiled SQL at target\compiled\snowplow\page_views\snowplow_web_page_context.sql
Done. PASS=0 ERROR=2 SKIP=3 TOTAL=5
Any info/direction would be greatly appreciated - thanks!