dbt-ga4's Introduction

GA4 DBT Package

This dbt package connects to an exported GA4 dataset and provides useful transformations as well as report-ready dimensional models.

Features include:

  • Flattened models to access common events and event parameters such as page_view, session_start, and purchase
  • Conversion of sharded event tables into a single partitioned table
  • Incremental loading of GA4 data into your staging tables
  • Page, session and user dimensional models with conversion counts
  • Last non-direct session attribution
  • Simple methods for accessing query parameters (like UTM params) or filtering query parameters (like click IDs)
  • Support for custom event parameters & user properties
  • Mapping from source/medium to default channel grouping

Models

stg_ga4__events: Contains cleaned event data enhanced with useful event and session keys.
stg_ga4__event_*: One model per event (ex: page_view, purchase) that flattens the event parameters specific to that event.
stg_ga4__event_items: Contains item data associated with e-commerce events (purchase, add_to_cart, etc.).
stg_ga4__event_to_query_string_params: Mapping between each event and any query parameters and values contained in the event's page_location field.
stg_ga4__user_properties: Finds the most recent occurrence of specified user_properties for each user.
stg_ga4__derived_user_properties: Finds the most recent occurrence of specific event_params values and assigns them to a client_key. Derived user properties are specified as variables (see documentation below).
stg_ga4__derived_session_properties: Finds the most recent occurrence of specific event_params or user_properties values and assigns them to a session's session_key. Derived session properties are specified as variables (see documentation below).
stg_ga4__session_conversions_daily: Produces daily counts of conversions per session. The list of conversion events to include is configurable (see documentation below).
stg_ga4__sessions_traffic_sources: Finds the first source, medium, campaign, content, paid search term (from UTM tracking), and default channel grouping for each session.
stg_ga4__sessions_traffic_sources_daily: Same data as stg_ga4__sessions_traffic_sources, but partitioned by day to allow for efficient loading and querying.
stg_ga4__sessions_traffic_sources_last_non_direct_daily: Finds the last non-direct source attributed to each session within a 30-day lookback window. Assumes each session is contained within a day.
dim_ga4__client_keys: Dimension table for user devices as indicated by client_keys. Contains attributes such as first and last page viewed.
dim_ga4__sessions: Dimension table for sessions containing useful attributes such as geography, device information, and acquisition data. Can be expensive to run on large installs (see dim_ga4__sessions_daily).
dim_ga4__sessions_daily: Query-optimized session dimension table that is incremental and partitioned on date. Assumes that each session is contained within a single day.
fct_ga4__pages: Fact table for pages that aggregates common page metrics by date, stream_id, and page_location.
fct_ga4__sessions_daily: Fact table for session metrics, partitioned by date. A single session may span multiple rows given that sessions can span multiple days.
fct_ga4__sessions: Fact table that aggregates session metrics across days. This table is not partitioned, so be mindful of performance/cost when querying.

Seeds

ga4_source_categories.csv: Google's mapping between source and source_category. Downloaded from https://support.google.com/analytics/answer/9756891?hl=en

Be sure to run dbt seed before you run dbt run.
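
For example, a first run typically looks like this (a minimal sketch; add --target or other flags as your project requires):

dbt seed
dbt run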

Installation & Configuration

Install from DBT Package Hub

To pull the latest stable release along with minor updates, add the following to your packages.yml file:

packages:
  - package: Velir/ga4
    version: [">=6.0.0", "<6.1.0"]

Install From main branch on GitHub

To install the latest code (may be unstable), add the following to your packages.yml file:

packages:
  - git: "https://github.com/Velir/dbt-ga4.git"

Install From Local Directory

  1. Clone this repository to a folder in the same parent directory as your DBT project
  2. Update your project's packages.yml to include a reference to this package:
packages:
  - local: ../dbt-ga4

Required Variables

This package assumes that you have an existing DBT project with a BigQuery profile and a GCP project with GA4 event data loaded into BigQuery. Source data is defined using the source_project and property_ids variables below. The static_incremental_days variable defines how many days' worth of data to reprocess during incremental runs.

vars:
  ga4:
    source_project: "my_source_gcp_project" # Project that contains raw GA4 data
    property_ids: [11111111] # Array of properties to process
    start_date: "YYYYMMDD" # Earliest date to load
    static_incremental_days: 3 # Number of days to scan and reprocess on each run

Required Variables (Multi-Project Instance)

When processing multiple properties at a time, the required variables change slightly. See the Multi-Property Support section for details on configuring multiple GA4 properties as a source.

vars:
  ga4:
    source_project: "my_source_gcp_project" # Project that contains raw GA4 data
    combined_dataset: "my_combined_data" # Dataset where multi-property data is cloned
    property_ids: [11111111, 22222222] # Array of properties to process
    start_date: "YYYYMMDD" # Earliest date to load
    static_incremental_days: 3 # Number of days to scan and reprocess on each run

Optional Variables

Query Parameter Exclusions

Setting query_parameter_exclusions will remove query string parameters from the page_location and page_referrer fields for all downstream processing. Original parameters are captured in the original_page_location and original_page_referrer fields. Ex:

vars:
  ga4: 
    query_parameter_exclusions: ["gclid","fbclid","_ga"] 

You can remove all query parameters by setting query_parameter_exclusions to *all*.

vars:
  ga4:
    query_parameter_exclusions: ["*all*"]

By default, parameters are removed from URL fragments (elements after the hash # symbol). If you wish to exclude query parameters but keep URL fragments with the same key, you will need to override the remove_query_parameters macro.

Query Parameter Extraction

Setting query_parameter_extraction will extract query string parameters from the page_location field into new columns. This can be used to extract advertising click IDs into columns that can be joined with advertising data sets. Ex:

vars:
  ga4: 
    query_parameter_extraction: ["gclid","fbclid","keyword"] 
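
As an illustration of how an extracted click ID might be used downstream, the sketch below joins GA4 events to a hypothetical Google Ads export model on gclid. The google_ads_clicks model and its columns are assumptions for the example (not part of this package), and the extracted column is assumed to be named gclid and available on stg_ga4__events.

select
    e.event_date_dt,
    e.gclid,
    ads.campaign_name,
    ads.cost
from {{ ref('stg_ga4__events') }} e
join {{ ref('google_ads_clicks') }} ads on e.gclid = ads.gclid
where e.gclid is not null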

Custom Parameters

Within GA4, you can add custom parameters to any event. These custom parameters will be picked up by this package if they are defined as variables within your dbt_project.yml file using the following syntax:

[event name]_custom_parameters:
  - name: "[name of custom parameter]"
    value_type: "[string_value|int_value|float_value|double_value]"

For example:

vars:
  ga4:
    page_view_custom_parameters:
      - name: "clean_event"
        value_type: "string_value"
      - name: "country_code"
        value_type: "int_value"

You can optionally rename the output column:

vars:
  ga4:
    page_view_custom_parameters:
      - name: "country_code"
        value_type: "int_value"
        rename_to: "country"

If there are custom parameters you need on all events, you can define defaults using default_custom_parameters, for example:

vars:
  ga4:
    default_custom_parameters:
      - name: "country_code"
        value_type: "int_value"

User Properties

User properties are provided by GA4 in the user_properties repeated field. The most recent user property for each user will be extracted and included in the dim_ga4__users model by configuring the user_properties variable in your project as follows:

vars:
  ga4:
    user_properties:
      - user_property_name: "membership_level"
        value_type: "int_value"
      - user_property_name: "account_status"
        value_type: "string_value"

Derived User Properties

Derived user properties are different from "User Properties" in that they are derived from event parameters. This provides additional flexibility in allowing users to turn any event parameter into a user property.

Derived User Properties are included in the dim_ga4__users model and contain the latest event parameter value per user.

derived_user_properties:
  - event_parameter: "[your event parameter]"
    user_property_name: "[a unique name for the derived user property]"
    value_type: "[string_value|int_value|float_value|double_value]"

For example:

vars:
  ga4:
    derived_user_properties:
      - event_parameter: "page_location"
        user_property_name: "most_recent_page_location"
        value_type: "string_value"
      - event_parameter: "another_event_param"
        user_property_name: "most_recent_param"
        value_type: "string_value"

Derived Session Properties

Derived session properties are similar to derived user properties, but on a per-session basis, for properties that change slowly over time. This provides additional flexibility in allowing users to turn any event parameter into a session property.

Derived Session Properties are included in the dim_ga4__sessions and dim_ga4__sessions_daily models and contain the latest event parameter or user property value per session.

derived_session_properties:
  - event_parameter: "[your event parameter]"
    session_property_name: "[a unique name for the derived session property]"
    value_type: "[string_value|int_value|float_value|double_value]"
  - user_property: "[your user property key]"
    session_property_name: "[a unique name for the derived session property]"
    value_type: "[string_value|int_value|float_value|double_value]"

For example:

vars:
  ga4:
    derived_session_properties:
      - event_parameter: "page_location"
        session_property_name: "most_recent_page_location"
        value_type: "string_value"
      - event_parameter: "another_event_param"
        session_property_name: "most_recent_param"
        value_type: "string_value"
      - user_property: "first_open_time"
        session_property_name: "first_open_time"
        value_type: "int_value"

GA4 Recommended Events

See the README file at /dbt_packages/models/staging/recommended_events for instructions on enabling Google's recommended events.
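
As a rough illustration (the folder path may differ by package version), enabling a recommended event model typically means flipping its enabled flag in your dbt_project.yml:

models:
  ga4:
    staging:
      ga4:
        recommended_events:
          stg_ga4__event_sign_up:
            +enabled: true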

Conversion Events

Specific event names can be specified as conversions by setting the conversion_events variable in your dbt_project.yml file. These events will be counted against each session and included in the fct_ga4__sessions dimensional model. Ex:

vars:
  ga4:
    conversion_events: ['purchase','download']

Session Attribution Lookback Window

The stg_ga4__sessions_traffic_sources_last_non_direct_daily model provides last non-direct session attribution within a configurable lookback window. The default is 30 days, but this can be overridden with the session_attribution_lookback_window_days variable.

vars:
  ga4:
    session_attribution_lookback_window_days: 90

Custom Events

Custom events can be generated in your project using the create_custom_event macro. Simply create a new model in your project and enter the following:

{{ ga4.create_custom_event('my_custom_event') }}

Note, however, that any event-specific custom parameters or default custom parameters must be defined in the global variable space as shown below:

vars:
    default_custom_parameters:
      - name: "some_parameter"
        value_type: "string_value"
    my_custom_event_custom_parameters:
      - name: "some_other_parameter"
        value_type: "string_value"

Connecting to BigQuery

This package assumes that BigQuery is the source of your GA4 data. Full instructions for connecting DBT to BigQuery are here: https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile

The easiest option is using OAuth with your Google Account. Summarized instructions are as follows:

  1. Download and initialize gcloud SDK with your Google Account (https://cloud.google.com/sdk/docs/install)
  2. Run the following command to provide default application OAuth access to BigQuery:
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/iam.test
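
For reference, a minimal BigQuery OAuth target in profiles.yml might look like the sketch below; the profile name, project, and dataset are placeholders, and the full set of options is covered in the dbt documentation linked above.

my_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project # GCP project that queries are billed to
      dataset: dbt_ga4 # Default BigQuery dataset for dbt models
      threads: 4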

Unit Testing

This package uses pytest as a method of unit testing individual models. More details can be found in unit_tests/README.md.

Overriding Default Channel Groupings

By default, this package maps traffic sources to channel groupings using the macros/default_channel_grouping.sql macro. This macro closely adheres to Google's recommended channel groupings documented here: https://support.google.com/analytics/answer/9756891?hl=en .

Package users can override this macro and implement their own channel groupings by following these steps:

  • Create a macro in your project named default__default_channel_grouping that accepts the same 3 arguments: source, medium, source_category
  • Implement your custom logic within that macro. It may be easiest to first copy the code from the package macro and modify from there.

Overriding the package's default channel mapping makes use of dbt's dispatch override capability documented here: https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch#overriding-package-macros
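
For example, a minimal override might look like the sketch below. The case logic shown is illustrative only and ignores source_category; copying the package's macro and editing it is the more practical starting point.

-- macros/default__default_channel_grouping.sql
{% macro default__default_channel_grouping(source, medium, source_category) %}
    case
        when {{ source }} = '(direct)' and {{ medium }} in ('(none)', '(not set)') then 'Direct'
        when {{ medium }} = 'organic' then 'Organic Search'
        -- add or adjust rules to match your own channel definitions
        else 'Unassigned'
    end
{% endmacro %}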

Multi-Property Support

Multiple GA4 properties are supported by listing the property IDs in the property_ids variable. In this scenario, the static_incremental_days variable is required and the combined_dataset variable defines the dataset (in your profile's target project) where source data will be copied.

vars:
  ga4:
    property_ids: [11111111, 22222222, 33333333]
    static_incremental_days: 3
    combined_dataset: "my_combined_dataset"

With these variables set, the combine_property_data macro will run as a pre-hook to base_ga4_events and clone shards to the target dataset. The number of days' worth of data to clone during incremental runs will be based on the static_incremental_days variable.

dbt Style Guide

This package attempts to adhere to the Brooklyn Data style guide. This work is in progress.

dbt-ga4's People

Contributors

3v-dgudaitis, adamribaudo, adamribaudo-velir, cfrye2, clemens7haar, dgitis, eliesg, elyobo, erikverheij, flodevelops, florianaschroeder, gblazq, ivan-toriya, ivan-toriya-precis, jerome-laurent-pro, ken6377, rbarreca, sgiostra-bitbang, shreveasaurus, stephen986, tannerhopkins, vibhorj, willbryant, xhulianpapa, yamotech

dbt-ga4's Issues

Integration tests failing

Am I the only one getting this?

ERROR    configured_file:functions.py:232 09:48:46.739587 [error] [Thread-21 ]: 3 of 3 ERROR creating view model test16662593049287124637_test_stg_ga4__events.actual  [ERROR in 2.91s]
ERROR    configured_std_out:functions.py:232 09:48:46  3 of 3 ERROR creating view model test16662593049287124637_test_stg_ga4__events.actual  [ERROR in 2.91s]
DEBUG    configured_file:functions.py:226 09:48:46.740104 [debug] [Thread-21 ]: Finished running node model.test.actual
DEBUG    configured_file:functions.py:226 09:48:46.741311 [debug] [MainThread]: Acquiring new bigquery connection "master"
INFO     configured_file:functions.py:228 09:48:46.741676 [info ] [MainThread]: 
INFO     configured_std_out:functions.py:228 09:48:46  
INFO     configured_file:functions.py:228 09:48:46.742012 [info ] [MainThread]: Finished running 2 seeds, 1 view model in 20.45s.
INFO     configured_std_out:functions.py:228 09:48:46  Finished running 2 seeds, 1 view model in 20.45s.
DEBUG    configured_file:functions.py:226 09:48:46.742314 [debug] [MainThread]: Connection 'master' was properly closed.
DEBUG    configured_file:functions.py:226 09:48:46.742480 [debug] [MainThread]: Connection 'model.test.actual' was properly closed.
INFO     configured_file:functions.py:228 09:48:46.747644 [info ] [MainThread]: 
INFO     configured_std_out:functions.py:228 09:48:46  
INFO     configured_file:functions.py:228 09:48:46.747975 [info ] [MainThread]: Completed with 1 error and 0 warnings:
INFO     configured_std_out:functions.py:228 09:48:46  Completed with 1 error and 0 warnings:
INFO     configured_file:functions.py:228 09:48:46.748269 [info ] [MainThread]: 
INFO     configured_std_out:functions.py:228 09:48:46  
ERROR    configured_file:functions.py:232 09:48:46.748532 [error] [MainThread]: Database Error in model actual (models/actual.sql)
ERROR    configured_std_out:functions.py:232 09:48:46  Database Error in model actual (models/actual.sql)
ERROR    configured_file:functions.py:232 09:48:46.748800 [error] [MainThread]:   Column medium in SELECT * EXCEPT list does not exist at [43:19]
ERROR    configured_std_out:functions.py:232 09:48:46    Column medium in SELECT * EXCEPT list does not exist at [43:19]
ERROR    configured_file:functions.py:232 09:48:46.749048 [error] [MainThread]:   compiled SQL at target/run/test/models/actual.sql
ERROR    configured_std_out:functions.py:232 09:48:46    compiled SQL at target/run/test/models/actual.sql
INFO     configured_file:functions.py:228 09:48:46.749304 [info ] [MainThread]: 
INFO     configured_std_out:functions.py:228 09:48:46  
INFO     configured_file:functions.py:228 09:48:46.749557 [info ] [MainThread]: Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3
INFO     configured_std_out:functions.py:228 09:48:46  Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3

I think this is coming from the detect_gclid block?

Problem to add custom parameters

I am a total beginner with DBT and have a problem with the custom parameter configuration.
The custom parameters only show up in the nested column of the custom events so far.

I have tried to add some custom parameter columns for some custom events to the event tables but they don't show up.

To achieve this, I have added the parameters to dbt_project.yml in this way:

customeventname:
  - name: "custom_parameter_1"
    value_type: "string_value"
  - name: "custom_parameter_2"
    value_type: "int_value"

Any advice?

BigQuery resources exceeded error when running `dim_ga4__users` model with `include_intraday_events: true`

Running the ga4 package with the following configuration:

  ga4:
      project: <project_id>
      dataset: <dataset_id>
      start_date: "20220701" 
      include_intraday_events: true

Yields the following error in dbt:

18:41:06  Database Error in model dim_ga4__users (models/marts/core/dim_ga4__users.sql)
18:41:06    Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
18:41:06    compiled SQL at target/run/ga4/models/marts/core/dim_ga4__users.sql

After setting include_intraday_events: false, the dim_ga4__users model (and all other ga4 models) run successfully.

Is there a way to override the model's config?

Hi,

I am looking for a way to override the config used in the models, in order to adapt them to my data warehouse. Currently, we use a tag in the config macro to store crontab strings, which is used in Airflow to schedule the dbt run jobs. For instance, I would like to run the model base_ga4__events daily at 8 AM, but I could not find a way to 'inject' the tag into the package's models.

Does anyone know if this is possible?

Evaluate ingestion-time partitioning and partition copy available in dbt 1.4.0

https://medium.com/teads-engineering/bigquery-ingestion-time-partitioning-and-partition-copy-with-dbt-cc8a00f373e3

dbt 1.4.0 now supports ingestion-time partitions (as opposed to column-type partitions). It seems there are some cost savings to be had when partitions are very large (> 1M rows per partition). It's not immediately clear to me if this benefits our package, but it seems worth considering.

Another feature they introduced is the BQ copy feature for handling insert_overwrite incremental merges, which seems relevant.

handling of user scope dimensions / user properties

Hello,

We're using a custom user-scoped dimension in GA4 and I tried to make it available in dim_ga4__users, but it didn't work.

The reason was that the script was looking for the value in "event_params" and not in "user_properties". Now I'm wondering if it's a bug or working as planned. It could be useful to have the last value of an event parameter available as user property, even when it was not what I expected first.

The user-scoped dimension was available after replacing the parameter in that script.

I don't see the custom parameters in the view

Hi @dgitis,
I've a problem with custom parameters.

I want the details of the purchase event, so I added:

    purchase_custom_parameters:
      - name: "value"
        value_type: "float_value"
      - name: "currency"
        value_type: "string_value"
      - name: "transaction_id"
        value_type: "string_value"

under the vars: ga4: section of dbt_project.yml.

I also added the file staging/stg_ga4__event_purchase.sql in models, made in this way:

 with purchase_with_params as (
   select *,
      {% if var("purchase_custom_parameters", "none") != "none" %}
        {{ ga4.stage_custom_parameters( var("purchase_custom_parameters") )}}
      {% endif %}
 from {{ref('stg_ga4__events')}}    
 where event_name = 'purchase'
)

select * from purchase_with_params

In the end I have the view stg_ga4__event_purchase, but in the view I don't see the columns that I defined in purchase_custom_parameters.

I don't understand the cause of the issue.

Thanks

Originally posted by @Sgiostra-BitBang in #108 (comment)

dbt stg_ga4__events ERROR (0.1.4 version)

Hello Adam,

First of all, we would like to thank you for your work. We really appreciate it and it's very useful for us.

Using the latest version (0.1.4) we have this problem:

2022-07-11T08:42:37.978749Z: 08:42:37  102 of 121 ERROR creating view model z_dbt_dev_lg.stg_ga4__events............... [ERROR� in 2.10s]
2022-07-11T08:42:37.978863Z: 08:42:37  Finished running node model.ga4.stg_ga4__events
Database Error in model stg_ga4__events (models/staging/ga4/stg_ga4__events.sql)
  Queries in UNION ALL have mismatched column count; query 1 has 30 columns, query 2 has 32 columns at [12:5]
  compiled SQL at target/run/ga4/models/staging/ga4/stg_ga4__events.sql

This stops happening when we use version 0.1.3.
We have checked our events__ table, intraday__ table and everything, but we don't know what might be going on.

Could you help us or shed some light on this new version?

For now we will continue to use version 0.1.3.

Kind Regards,
Data-Crafts team

Support require partition filter

BigQuery has a require partition filter option that can only be set when creating a partitioned table (unlike table expiration which can be edited in the BigQuery console). As a result, the easiest way to set this config is via dbt when it creates or full-refreshes the models.

I expect that anyone wanting to configure this will want to configure it for all models. The multi-site branch related to #107 has a partition header macro. It would be best to implement this configuration there after it gets merged into main.

Remove models/staging/ga4 folder and move staging models under /models/staging?

When I started this package, I put the models under /models/staging/ga4 because I typically put staging models under a folder named after the data source. This, however, creates confusing situations where users need to select package models using selection criteria like ga4.staging.ga4.MODEL and a project config like this:

models:
  ga4:
    staging:
      ga4:
        recommended_events:
          stg_ga4__event_sign_up:
            +enabled: true
          stg_ga4__event_share:
            +enabled: true

It also seems unnecessary because ALL data in this package is sourced from ga4. It would be much cleaner to remove the /models/staging/ga4 folder. But this would be a breaking change for some folks who reference those paths.

Thoughts?

Google Paid Misattributed as Organic

We've discussed this possibility before. Because Google Ads traffic is attributed through a server-side integration, the BigQuery export doesn't properly assign source, medium, campaign, and default_channel_grouping, attributing them to organic sources instead.

This can be verified using the following query.

SELECT  
  session_key,
  ga_session_number,
  landing_page,
  source,
  medium,
  campaign,
  default_channel_grouping
FROM `*.dim_ga4__sessions` 
WHERE source = 'google'
and landing_page like '%gclid%'

Prevent sessions from splitting when spanning partitions

This is harder to do than it seems. Here's a possible way that we can fix this.

We add a session_timeout parameter, defaulting to 30 minutes, and then check whether a session started within the session_timeout window and only reprocess sessions within that window.

-- change a single session value
case 
 when session_start_timestamp is null then <new value>
 when timestamp_diff( max_partition_timestamp, max(event_timestamp)) < session_timeout then <new value>
 else <current value>
end 

-- only update sessions within the timeout window
select * from x
where session_start_timestamp is null or  timestamp_diff( max_partition_timestamp, max(event_timestamp)) < session_timeout

The idea is that you process a full day's (or days') worth of data plus a portion of today's data. We check the max partition timestamp against the session timeout and reprocess the sessions within that window (which would have been partially processed in yesterday's run), but don't touch sessions from yesterday's run that fall outside the timeout period because they should be complete.

The main consequence of this approach is that users running daily dbt runs with just the "batch" option (versus "batch+streaming" or just "streaming") will break their sessions, because they will only ever have one day's worth of data in each overnight run and will need to reprocess, weekly for example, to fix sessions.

session_key changes when crossing domains

I have noticed recently that my project has been failing the unique session_key test.

Looking into our data, it seems to be a result of cross-domain tracking. It seems GA4 fires a new "session_start" event when customers go across domains. Can anyone else validate this on their GA4?

Time on Page

I frequently get customization requests to add a time on page metric. I always push back with performance objections, but, because the clients who ask for time on page all value it very highly, they have so far always insisted.

Right now I have two requests open: one is for a small non-profit that isn't going to exceed the free tier with the request and the other is for a high-volume publisher who doesn't care how much it costs.

I'm thinking of two general design patterns for implementing this both of which involve adding a variable to the dbt_project.yml.

The first is to create an "enable_non_performant_metrics" variable with a list value consisting of metrics that require window functions ['time_on_page', 'exits', etc...] and then enable those metrics as configured.

The second is to create an "enable_non_performant_utilities" variable with a true/false value that, if set to true, adds session_first_event_id, session_last_event_id, and next_page_view_event_id (possibly last_page_view_event_id; I'm not sure if we need it) to events, possibly using one of the methods below.

The session_first_event_id would select the first event that is not session_start or first_visit, as only those two events fire before the config tag loads page_referrer and other similar data. I've had to debug some session issues recently and I'm confident that this works as intended.

@willbryant and I talked about doing these sorts of lookups between the events_* tables and base_ga4__events so that the window functions don't go wild looking across partitions. This goes against dbt best practices, so I understand any resistance, but we at least have a good reason to do so.

Another possibility is to create a stg_ga4__window_functions table that runs parallel to (or just after) stg_ga4__events with a hard-set window. I'm not sure that this would work, but the idea is to maintain stg_ga4__events' dynamic querying of partitions while also allowing the use of window functions (when building, but not when running dynamic queries) that don't search all partitions.

Unfortunately, the first, clearly easier method doesn't really work as a first step towards realizing the second, much more comprehensive method.

I'd like to hear what others think.

BUG?! base_ga4__events.sql --> Dynamic vs Static partitions IF clause

Context: base_ga4__events.sql -> Line #1 & Line #58

Issues: the conditional IF statement is not executing as expected
{% if var('static_incremental_days', false ) == true %}

Impact: even though I have set the variable static_incremental_days: 3 in the dbt_project.yml file, the compiled code is still rendering statements meant for dynamic partitions rather than static partitions as expected

@adamribaudo I can fix this and create a PR to merge, if you agree?

stg_ga4__event_purchase - fields value and tax

In the definition of the view stg_ga4__event_purchase, found in the file "dbt_packages/ga4/models/staging/ga4/recommended_events/stg_ga4__event_purchase.sql", the fields value and tax are defined as float_value, while in the GA4 event tables in BigQuery they are double_value.
To check, you can use this query:

select ep.* from `<bq_project>.<analytics_ana_id>.events_*`
 cross join unnest(event_params) ep
 where event_name = 'purchase'
 and key in ('tax', 'value')
limit 100

As a workaround, I put the code of the view in the root models/staging directory of my DBT project and replaced the type of these two fields (value, tax) with double_value.

user_key and user_pseudo_id

I think it would be useful to add user_key to fct_ga4__sessions alongside user_pseudo_id.
E.g. I want to join dim_ga4__users to get some info about the user at the session level.

What do you think about that? I can make a PR.

Renaming conflicting custom parameters

If a custom parameter conflicts with one of the standard column names exported by the staging views, it's currently not possible to access it. unnest_key supports renaming so we should ideally support this in stage_custom_parameters too.

Derived session properties

derived_user_properties is excellent and exactly what I needed for some of my properties that once set don't really change.

But we also have some event params that will most likely be different for each session, and putting them in derived_user_properties means we overwrite the old values for old sessions, which is not right for our data. For example, our shipping time estimates are per-session.

So I'd like to add derived_session_properties.

Do you think it's worth waiting for #46 to go in before working on this? That looks pretty major, so there's some risk of a merge conflict, but it also looks a bit stalled?

Evaluate conforming to the Fivetran GA4 Export Schema

Fivetran recently released their connector for exported GA4 BQ data here: https://fivetran.com/docs/applications/google-analytics-4-export

In their schema, they simply break out the nested values (items, event params, user properties) into separate tables.

There would be value in our package mirroring this technique in that end-users would be able to leverage any dbt packages that Fivetran produces that utilize this schema. In other words, end users could use a combination of 1) BQ free export 2) our package 3) Fivetran's GA4 dbt package rather than paying for Fivetran's connector and then using their dbt package.

Changes to introduce:

  • Create new item table containing product items
  • Create new event_param table containing event param keys and values
  • Create new user_property table containing user property keys and values and set timestamps

Engagement time calculations may be badly off

GA4 average engagement time metrics aren't even close to the values I'm seeing in sum_engaged_time_msec in fct_ga4__sessions.

To get a better handle on time metrics in GA4, I exported the GA4 daily average values for engagement time and session duration and then ran this query against the base data to test different calculations.

My results for two large sites are in this sheet.

The total engagement time as calculated in this package is consistently 1/4 that of the GA4 interface.

There are two likely explanations:

  1. Engagement time is processed data that isn't fully calculated in the streaming export even though there are some engagement_time_msec parameters in the GA4 streaming export.
  2. Our assumptions about engagement time are wrong.

Looking at a GA4-BigQuery integration using the batch export (that I don't have permission to share), I'm actually seeing the reverse where our package is returning engagement times that are 4x the engagement time reported in GA4. However, the data varies enough day-by-day that I'd like to see more data from other sites.

Create user_pseudo_id to user_id mapping table. Replaces current user_key implementation.

As @willbryant found, user_ids can change mid-session, which then causes user_keys to change mid-session. The intention of the user_key is to have a stable identifier for users as they move from one to more than one device. The current implementation is too brittle.

A better implementation would use a mapping table to keep track of user_pseudo_ids and user_ids. The mapping table would then be used in the marts when users need to be related to sessions. user_pseudo_id will be a sufficient field to use for session and page-level metrics. Only user level metrics will need this mapping table.

My suggestion would be to only support mapping a pseudo_user_id to 0 or 1 user_id. In this implementation, we would track the 'last seen' user_id for each pseudo_user_id, rather than attempting to merge multiple user_ids into a single entity.
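
A rough sketch of what such a mapping model could select (illustrative only; it assumes the staging model exposes user_pseudo_id, user_id, and event_timestamp):

select
    user_pseudo_id,
    array_agg(user_id order by event_timestamp desc limit 1)[offset(0)] as last_seen_user_id
from {{ ref('stg_ga4__events') }}
where user_id is not null
group by user_pseudo_id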

Use cluster on event_name field of base_ga4__events in order to reduce costs

Hello!

I am beginning to try this package and it is awesome! I see that the base_ga4__events table is already partitioned by time (the event_date_dt field), but there are many views that depend on this table and filter it only by event_name (for instance stg_ga4__event_video_start). In order to reduce BigQuery costs, a simple solution is to use the cluster_by option of dbt's config statement:

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'insert_overwrite',
        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
        },
        partitions = partitions_to_replace,
        cluster_by=['event_name']
    )
}}

Maybe there could be an option to allow other clustering fields as well. Currently, in my company's data warehouse, we also cluster by user ID and stream ID.

Events not associated with session correctly when user_id changes

We're noticing that some events are associated with session_key values that do not appear in dim_ga4__sessions.

If I run this query:

SELECT event_date_dt, event_name, user_pseudo_id, user_id, user_key, ga_session_id, session_key
from `delivereasy-data-prod`.`bi_dbt_ga4`.`stg_ga4__events`
where user_pseudo_id = '1hzXd+Z2R1....' and event_date_dt = '2022-09-01'
order by event_timestamp

We can see why:

GA4 has not managed to copy the user_id back to the start of the session (which I thought it was meant to do TBH, but there you go).

Therefore, the user_key has been reset when the user logs in and their user_id goes from NULL to set. This, in turn, causes session_key to change.

Finally, because this new session does not have a session_start event, because it is not a new session from GA4's perspective, the second session is generated in the fct_ga4__sessions table but not in the dim_ga4__sessions table.

As a result, queries that join across to dim_ga4__sessions drop the event rows (or miss data, if we use a left join).

Event parameters can be recorded as multiple data types, our unnest_keys() macro doesn't account for this

Flagged by a user in Measure Slack here: https://measure.slack.com/archives/CHC9RDUSG/p1676045661507979

From Simo Ahava: "If the value is numeric it ends up in the int (or double) value columns, if it’s something else then in the string column"

Currently, the unnest_keys macro only looks for values in 1 field (meaning 1 data type).

Other BQ users have decided to use COALESCE to find the first non-null value. We should think through the ramifications.
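
For illustration, a coalesce along these lines (casting everything to string so mixed types land in one column) is one way it could look. This is a sketch of the idea against the raw export tables, not the package's current behavior:

select
    ep.key,
    coalesce(
        ep.value.string_value,
        cast(ep.value.int_value as string),
        cast(ep.value.double_value as string),
        cast(ep.value.float_value as string)
    ) as param_value
from `<bq_project>.<dataset>.events_*`
cross join unnest(event_params) as ep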

Example using Snowflake

Is it possible to use this package with Snowflake as the data warehouse? If so, what would the settings be?

Thanks in advance!

Add 'New User' dimension to session/user dimension models

Google provided some sample queries that demystify how they make certain counts. We can use the following to determine whether a user is considered a new user as part of the session/user dim models:

https://developers.google.com/analytics/bigquery/basic-queries?authuser=0

WITH
  UserInfo AS (
    SELECT
      user_pseudo_id,
      MAX(IF(event_name IN ('first_visit', 'first_open'), 1, 0)) AS is_new_user
    -- Replace table name.
    FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
    -- Replace date range.
    WHERE _TABLE_SUFFIX BETWEEN '20201101' AND '20201130'
    GROUP BY 1
  )
SELECT
  COUNT(*) AS user_count,
  SUM(is_new_user) AS new_user_count
FROM UserInfo;

Counting distinct and new users within fct_ga4__pages is incorrect outside of hour-grain

We have the following select statement in fct_ga4__pages:

        extract( hour from (select  timestamp_micros(event_timestamp))) as hour,
        page_key,
        ...
        count(distinct user_pseudo_id ) as distinct_user_pseudo_ids,
        sum( if(ga_session_number = 1,1,0)) as new_user_pseudo_ids,

Won't distinct_user_pseudo_ids be a count of distinct users within the hour? Summing across hours will no longer yield a distinct count. There's a similar issue with new_user_pseudo_ids, as summing across hours will double-count these users.

I'd love to open a wider discussion about how to handle distinct counts across multiple grains. What are the best practices? If you pre-aggregate, you can't change the grain.
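
One option worth discussing (a sketch, not something the package implements) is to store approximate-count sketches at the finest grain and merge them when rolling up, for example with BigQuery's HLL functions. The table names below are placeholders.

-- At the hour grain, store a sketch instead of a finished count.
select
    event_date_dt,
    hour,
    page_key,
    hll_count.init(user_pseudo_id) as user_sketch
from hourly_page_events
group by 1, 2, 3;

-- At the day grain, merge the hourly sketches without double-counting users.
select
    event_date_dt,
    page_key,
    hll_count.merge(user_sketch) as approx_distinct_users
from hourly_page_sketches
group by 1, 2;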

event_date and event_timestamp are in UTC. Is that expected?

As pointed out here, the event_timestamp field is in UTC: https://support.google.com/analytics/answer/7029846?hl=en#zippy=%2Cevent

I assume the event_date field is as well. My preference would be to convert these to the relevant reporting time zone as early in the pipeline as possible. Otherwise, users will need to handle this conversion in the BI layer which I'm certain will be overlooked frequently.

I propose a 'timezone' var that can be used to convert the event_timestamp and generate a new event_date_dt field in the base layer. (bigquery uses time zone names located here: http://www.iana.org/time-zones )
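
For illustration, the conversion could look roughly like this in the base layer, where 'America/New_York' stands in for the proposed timezone var:

select
    timestamp_micros(event_timestamp) as event_timestamp_utc,
    datetime(timestamp_micros(event_timestamp), 'America/New_York') as event_datetime_local,
    date(timestamp_micros(event_timestamp), 'America/New_York') as event_date_dt
from `<bq_project>.<dataset>.events_*`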

Curious to get input from others.

Not enough resources for query planning

In the default configuration, building the fct_ga4__pages model fails with this message:

Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex

It's a really big database and I'm using the daily+streaming setting. After configuring base_ga4__events_intraday to materialize as a table, it works again, but I'm not sure if that's the best way to solve it. What would you recommend?

Multi-site support

I'm currently deploying a customized multi-site installation and want to integrate it into the main package so that the site can update.

The architecture I currently have set up is as follows: base_ga4__events is a view that unions (UNION ALL) each site's data, and each site gets its own base_ga4__events_* incremental table.

I'll also need to integrate intraday tables and configure the sessions and users tables to flag sessions and users from each site. That shouldn't be necessary with pages and other tables as the page_location column should identify the separate sites.

unit tests with REPEATED field (i.e. array)

We are adding additional tests under the "unit_tests" directory but are failing to make a test work in the case where one of the result fields is an array. We can see in the following file that someone started this kind of test and left some hints:

unit_tests/test_stg_ga4__events.example

Has anyone pulled this off? I can definitely contribute, but I'm not sure what the best design would be here. It also seems that the function "check_relations_equal" does not work with arrays.

Last non-direct attribution

I have this as a client requirement so I'll be working on this soon. Either I'm missing something, or it's a surprisingly complicated problem so I wanted to give others a chance to weigh in.

I see three basic approaches to this problem:

  1. We do last non-direct in a view that sits on top of the session marts and only look up the last non-direct session within the partitions being queried. This is efficient but has some obvious data quality flaws.
  2. We look up the last non-direct session while building the session marts. The data quality in this case will be much better, but it will be horribly inefficient on large sites as it requires a lookup that we may not be able to constrain by partitions.
  3. We offload last non-direct to the client and add last_non_direct_source, last_non_direct_medium, and last_non_direct_campaign parameters, use those to define last_non_direct_channel_grouping, and filter all of those parameters up to the session models. This suffers from some data quality issues in that cookies can be deleted, and it only works at the client level (not the user level) without using lookups that compromise any efficiency gains.

The second option is the one that I find superficially most attractive, but the more I think about the details of that implementation, the less practical I think it is.

If we were to look up the sessions table on user_key to find the last non-direct session, we would create a circular dependency, with the sessions table referencing the sessions table.

We could store the last non-direct session attribution values in the user table.

At first I thought this would be enough, but then I realized that all of that user's sessions would change to the most recent non-direct session when the user has a new non-direct session.

So, we could store the last non-direct session values in the user table, and then lookup that value when building the session tables.

This would mostly work when running daily builds. But what happens when a user has two different non-direct sessions in a day (with direct sessions in between)? And also what happens when we do a full-refresh on the data warehouse?

Given the problems with these two approaches, the only solution I can see is to create a new table for storing non-direct sessions (timestamp, user_key, last_non_direct_source, last_non_direct_medium, last_non_direct_campaign, and last_non_direct_channel_grouping). This will still be a pretty beefy table on large sites, but it will work.

I feel like there's a more elegant solution that I'm missing.

If anyone can think of a better approach, I'd love to hear of it. Otherwise, what approach do you prefer? Did I miss any issues or advantages? @adamribaudo-velir @willbryant

Incremental + streaming strategy behaves oddly when tables are rotated

@adamribaudo-velir @dgitis since we are discussing the event loading model, I will open this issue for another headache that my business users noticed last week.

At somewhere around ~10-11am my local time, GA4 creates the new events_ table for the previous day, and some time after that drops the events_intraday_ table for the same date.

What they noticed is that the events for today are present throughout the day, and up to that magic time the next morning, but then disappear!

The cause is of course that I haven't re-run the DBT job, and whereas the streaming/intraday table is a direct view onto events_intraday_*, the non-streaming base table isn't a view onto events_*; the events have actually been copied out.

I don't have a suggestion on how to solve this yet. The best I could do right now is set up the glue necessary to rerun the DBT pipeline when the new events_* table is created, making sure we are de-duping with the intraday table successfully (which may be the other fatal flaw in my idea of de-duping on the way into the base events table).

The other approach is to give up on direct access to the intraday tables, and expect a DBT run to pull in that data too. Obviously that is in many respects a step backwards, but to be honest, it's what I actually expected before using this project - most of my downstream models have to be materialized, so I had assumed I would only be getting updates when I ran the pipeline.

Rows processed is abnormally high when using static_incremental_days

Steps to reproduce:

  1. Run dbt run --full-refresh and record how many MB were processed for the base_ga4__events model
  2. Set static_incremental_days: 3 in your dbt_project.yml
  3. Run dbt run -m base_ga4__events and record how many MB were processed

I'm seeing the same amount of MB processed in steps 1 and 3. When I look at the compiled SQL, though, it looks correct:

and parse_date('%Y%m%d', event_date) in (date_sub(current_date, interval 1 day),date_sub(current_date, interval 2 day),date_sub(current_date, interval 3 day))

Not sure what's going on, but when I use dynamic partitions (by removing the static_incremental_days var) I see far fewer MB processed, which is expected. Even if I delete 3 days of data and run dbt run, I get 15MB processed rather than 350MB processed.

Can't do per-target variables easily

We have separate GA4 instances for our test & prod sites, which therefore have different property IDs, resulting in different BQ dataset names (in different GCP projects in our case).

The docs say to configure "using the following variables which must be set in your dbt_project.yml file."

vars:
    ga4:
        project: "your_gcp_project"
        dataset: "your_ga4_dataset"
        start_date: "YYYYMMDD" # Earliest date to load

But dbt_project.yml doesn't support variables itself, and so the normal pattern is to override these on a per-environment basis using --vars arguments.

However, --vars defines the variables in the global namespace, not ga4. For example, if we use a command like:

dbt build --target $CI_ENVIRONMENT_NAME --vars "ga4: {project: $DBT_GCP_PROJECT, dataset: $DBT_GA4_DATASET, start_date: $DBT_GA4_START_DATE}"

The variables don't get used, as they're nested under "ga4", rather than being un-nested and presented only to the ga4 module:

03:31:40  Running with dbt=1.1.1
03:31:41  Unable to do partial parsing because config vars, config profile, or config target have changed
03:31:46  Encountered an error:
Compilation Error
  Could not render {{var('project')}}: Required var 'project' not found in config:
  Vars supplied to <Configuration> = {
      "ga4": {
          "dataset": "... value here ...",
          "project": "... value here ...",
          "start_date": ... value here ...
      }
  }

We can pass them without the ga4 nesting:

dbt build --target $CI_ENVIRONMENT_NAME --vars "{project: $DBT_GCP_PROJECT, dataset: $DBT_GA4_DATASET, start_date: $DBT_GA4_START_DATE}"

But then these variables are fully global and potentially conflict with other variables used in the project (eg. start_date in particular).
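
One possible workaround (a sketch only) is to keep the ga4 nesting in dbt_project.yml and read per-environment values with env_var(), so nothing needs to be passed on the command line:

vars:
  ga4:
    project: "{{ env_var('DBT_GCP_PROJECT') }}"
    dataset: "{{ env_var('DBT_GA4_DATASET') }}"
    start_date: "{{ env_var('DBT_GA4_START_DATE', '20200101') }}"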

Event deduplication

It is possible for Google to send duplicate events that are actual duplicates that should be removed. However, because of how Google batches and sends events, it is also possible for Google to send identical event payloads for legitimate events.

We need a way to remove the bad duplicates without removing the good duplicates.

OWAS GA4 data modelling

Hello,

We are starting to use this package and were wondering how it compares with the data modelling described here:

https://support.owox.com/hc/en-us/articles/10477064761236-1-Sessions-stg-data-preparation-temp-sessions-

Are we right to assume that the facts and dims created in this dbt package sessionize through the following fact tables:

  • fct_ga4__sessions
  • fct_ga4__sessions_daily

The following tables are present on the website above and, in our view, would be worth adding to this package:

  • Events --> fact table
  • Transactions --> fact table

Thank you for your feedback

Keeping base_ga4__events_intraday and base_ga4__events in sync is not DRY

We have duplicated select statements in base_ga4__events_intraday and base_ga4__events. This is because they need to be unioned together and base_ga4__events_intraday needs to remain a view while base_ga4__events needs to remain a table.

This has caused a few issues at this point when the statements aren't kept in sync. Generally speaking, we want to keep code DRY (don't repeat yourself) which can avoid issues.

Any ideas for how to keep them aligned?

One idea: Use a macro that spits out the select statement. No input parameters necessary, just have it output text.
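
A bare-bones sketch of that idea (column list truncated; both base models would call the same macro):

-- macros/base_select_source.sql
{% macro base_select_source(source_table) %}
    select
        parse_date('%Y%m%d', event_date) as event_date_dt,
        event_timestamp,
        event_name
        -- ...remaining shared columns...
    from {{ source_table }}
{% endmacro %}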

Uniqueness test failures on main

Since updating to fdd19c2, the following tests are failing:

10:45:43  Failure in test unique_stg_ga4__users_first_last_events_user_key (models/staging/ga4/stg_ga4.yml)
10:45:43    Got 452 results, configured to fail if != 0
10:45:43  
10:45:43    compiled SQL at target/compiled/ga4/models/staging/ga4/stg_ga4.yml/unique_stg_ga4__users_first_last_events_user_key.sql
10:45:43  
10:45:43  Failure in test unique_dim_ga4__sessions_session_key (models/marts/core/core.yml)
10:45:43    Got 11 results, configured to fail if != 0
10:45:43  
10:45:43    compiled SQL at target/compiled/ga4/models/marts/core/core.yml/unique_dim_ga4__sessions_session_key.sql
10:45:43  
10:45:43  Failure in test unique_stg_ga4__users_first_last_pageviews_user_key (models/staging/ga4/stg_ga4.yml)
10:45:43    Got 15 results, configured to fail if != 0
