Giter Site home page Giter Site logo

rudderlabs / dbt-sessionization Goto Github PK

View Code? Open in Web Editor NEW
18.0 3.0 7.0 20 KB

Using DBT for Creating Session Abstractions on RudderStack - an open-source, warehouse-first customer data pipeline and Segment alternative.

Home Page: https://rudderstack.com/

License: MIT License

dbt rudderstack sessionization

dbt-sessionization's Introduction

Sessionization using DBT and RudderStack

This repository contains a sample DBT project for RudderStack. It can be applied on the RudderStack data residing in BigQuery. This DBT project creates "session" abstractions on top of RudderStack's track event data. The materialized DBT tables and views are used for the same.

Overview

This project builds on top data from the tracks table which is created by default in all the RudderStack warehouse destinations. Developers having access to a RudderStack BigQuery destination can start using this code by simply changing the dataset and schema names in the following files:

  • models/rudder/dbt_aliases_mapping.sql
  • models/rudder/dbt_mapped_tracks.sql

For example, a statement like ... from big-query-integration-poc.RudderAutoTrack.tracks will need to be changed to ... from MyBigQueryDataSet.MyRudderSchema.tracks

How to Use This Repository

This project was created on DBT Cloud (https://cloud.getdbt.com). Hence there is no profiles.yml file with the connection information. Developers who want to execute the models on the Command Line Interface (CLI) mode will need to create additional configuration files by following the directions provided here

Note: While this code has been tested for Google BigQuery, it should also be usable for other RudderStack-supported data warehouses like Amazon Redshift and Snowflake. The only differences that might arise are with regards to functions related to timestamp handling and analytics. Even then, we believe the code should be usable by just replacing the BigQuery functions with their counterparts from Redshift or Snowflake as required.

Sequence of Commands

The sequence in which the DBT models should be executed for a fresh run is as follows:

  • dbt_aliases_mapping

This model/table has two attributes/columns - alias and dbt_visitor_id. This table captures the linkages between one or more anonymous_id values (alias) and a user_id (dbt_visitor_id).

  • dbt_mapped_tracks

This table has columns - event_id, anonymous_id, dbt_visitor_id, timestamp, event, idle_time_minutes. event represents the actual event name. timestamp corresponds to the instant when the event was actually generated. idle_time_minutes captures the time gap between the event and the immediate preceeding one.

  • dbt_session_tracks

This table contains columns - session_id, dbt_visitor_id, session_start_at, session_sequence_number, next_session_start_at.

The data in the dbt_mapped_tracks table is partitioned first by dbt_visitor_id. It is then partitioned farther into groups of events where within one group the time-gap i.e. idle_time_minutes is not more than 30. In other words - if idle_time_minutesfor an event is more than 30, a new group is created.

These groups of sequential events are essentially the sessions.The value of 30 can be modified in the model definition. The session_sequence_number represent the order of the session for a particular user. The session_id is of the form session_sequence_number - dbt_visitor_id.

  • dbt_track_facts

This table has columns - anonymous_id, timestamp, event_id, event, session_id, dbt_visitor_id, track_sequence_number.

In this table, the information from dbt_session_tracks is tied back to the records in the dbt_mapped_tracks table. Each event is now tied to a session_id and within the session also, the event is assigned a track_sequence_number.

  • dbt_session_track_facts

The columns in this table are - session_id, ended_at, num_pvs, cnt_viewed_product, cnt_signup. num_pvs captures the number of distinct events in that session. cnt_viewed_product captures the total number of times 'view_product' events were triggered. This is only for illustrative purposes; developers might want to monitor statistics for a different event type.

  • dbt_user_session_facts

The columns in this table are dbt_visitor_id, first_date, last_date, number_of_sessions. This table captures the time period for which the user has been active and the number of sessions they have created in that time.

  • dbt_session_duration

This table captures the duration for each session of a user; the associated columns are dbt_visitor_id, session_id and session_duration.

  • dbt_tracks_flow

The columns in this table are - event_id, session_id, track_sequence_number, event, dbt_visitor_id, timestamp, event_2, event_3, event_4, event_5. This is essentially a table where each event and 4 subsequent events are represented in each record.

Note: A sample analysis query is provided at analysis/dbt_top_users_by_avg_session_duration.sql

What is RudderStack?

RudderStack is a customer data pipeline tool for collecting, routing and processing data from your websites, apps, cloud tools, and data warehouse.

More information on RudderStack can be found here.

Contact Us

If you come across any issues while configuring or using this project, please feel free to start a conversation on our Slack channel. We will be happy to help you.

dbt-sessionization's People

Contributors

ameypv-rudder avatar dipanjanatrudderlabs avatar dipanjanb avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.