
Activity Schema

The Activity Schema is a data modeling standard built for modern data warehouses. It aims to make data modeling and analysis substantially simpler, faster, and more reliable than existing methodologies (e.g. a star schema).

The core approach is to separate modeling from querying. Models should be fully independent: asking new data questions should not require creating new models. Instead, queries can combine any number of existing models to build anything they need.

This is done by modeling all data in the warehouse as a single time series table.

For more information, see the home page or read the spec.


How it Works

At its core, an activity schema transforms raw tables into a single time-series table called an activity stream. All downstream plots, tables, materialized views, etc. used for BI are built directly from that single table, with no other dependencies.
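
As a rough illustration (not a substitute for the spec), an activity stream table might look like the following. Column names beyond feature_json, activity_occurrence, and activity_repeated_at, which appear in the issues below, should be treated as assumptions for the example:

```sql
-- Illustrative sketch only; see the spec for the authoritative column list.
-- The JSON column type varies by warehouse (e.g. VARIANT, SUPER, or VARCHAR).
create table activity_stream (
    activity_id          varchar,    -- unique identifier for the activity record
    ts                   timestamp,  -- when the activity occurred
    customer             varchar,    -- the entity doing the activity
    activity             varchar,    -- e.g. 'completed_order', 'submitted_support_ticket'
    feature_json         varchar,    -- activity-specific attributes
    activity_occurrence  integer,    -- how many times this entity has done this activity, as of this row
    activity_repeated_at timestamp   -- when this entity next did the same activity (NULL if it hasn't recurred)
);
```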


Traditional Data Modeling - many custom-structured tables




Activity Schema - one standard table


Business concepts are represented as an entity doing an activity ('a customer completed an order') rather than as facts or nouns (orders, products). Activities are built directly from source tables, store only their own data, and are the single source of truth for each concept.

All queries run against an activity stream table to assemble data for analysis, BI, and reporting. Instead of traditional foreign key joins, queries combine activities using relationships in time (e.g. all customers who completed an order and submitted a support ticket before their next completed order).
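
As a sketch of what such a time-based query can look like (activity names and the exact relationship logic here are illustrative, not prescribed by the spec):

```sql
-- Customers who completed an order and then submitted a support ticket
-- before their next completed order, joined purely on time.
select
    orders.customer,
    orders.ts        as completed_order_at,
    min(tickets.ts)  as first_support_ticket_at
from activity_stream as orders
join activity_stream as tickets
  on tickets.customer = orders.customer
 and tickets.activity = 'submitted_support_ticket'
 and tickets.ts > orders.ts
 and (orders.activity_repeated_at is null or tickets.ts < orders.activity_repeated_at)
where orders.activity = 'completed_order'
group by orders.customer, orders.ts;
```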



Advantages

Maintainable data models

  • fewer models — one (and only one) business concept per activity means fewer models to manage, understand, and maintain
  • easier to build — no joins between models means no need to tie disparate source systems together
  • easily accommodate changes to source data — only a single activity needs to be updated
  • simple data lineage — a single data layer makes tracing data provenance and debugging far easier
  • faster updates — time-series modeling means incremental updates (rather than full rebuilds) by default; see the sketch after this list
  • lower data latency — no dependencies means data is available upon insertion, rather than waiting for cascading rebuilds
  • no data dictionaries — fewer models, each covering a single concept, are vastly easier to document
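
A minimal sketch of that incremental pattern, assuming a hypothetical stg_completed_orders staging model; how activity_occurrence and activity_repeated_at get recomputed is left to the implementation:

```sql
-- Append only new completed orders; no downstream models need rebuilding.
insert into activity_stream
select
    order_id          as activity_id,
    completed_at      as ts,
    customer_email    as customer,
    'completed_order' as activity,
    feature_json,
    null              as activity_occurrence,   -- recomputed after insert
    null              as activity_repeated_at   -- recomputed after insert
from stg_completed_orders
where completed_at > (
    select coalesce(max(ts), timestamp '1900-01-01')
    from activity_stream
    where activity = 'completed_order'
);
```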

Faster analysis and querying

  • single source of truth — because each activity represents a single concept (like a 'page view' or 'completed order'), it's always clear which activity to use
  • query across many source systems — time-based joins mean any activity can be queried and combined with any other without defining foreign keys
  • reusable analyses — a standard data model means any analysis can be reused across companies: a customer acquisition cost calculation built for one company can be shared with another
  • autogenerated queries — a standard data model means queries don't have to be written by hand
  • true ad-hoc querying — because all activities are related in time, swapping one activity for another requires no structural changes to queries
  • high performance — queries run substantially faster against an activity stream table, which has fewer columns, requires fewer joins, and can easily be partitioned or indexed by time


Spec

Learn more by reading the full activity schema specification.


Resources

The activity schema home page has more information on the approach and its benefits.


Building an activity schema

The implementation page covers how to build an activity schema in production.

A new activity schema dbt package has some helpful macros for building activity stream tables.


Known Implementations

Narrator provides a full implementation of the activity schema as a service. The Narrator team directly supports the growth of the activity schema, and is always happy to provide suggestions on how to build one yourself.


Community

The dbt Slack has a #modeling-activity-schema channel for discussion about the activity schema.


Issues

Proposal: rename `feature_json` to `attributes`

This may be pedantic, but are you open to renaming references to features as attributes? Feature feels very much like borrowing from ML terminology, and while this framework can be used to power model training/serving pipelines, it also serves other use cases where the term feature is more foreign. I think adopting a term like attribute would make the framework a bit more intuitive and accessible to data/analytics engineers. Open to other terminology suggestions as well. Would love to hear the motivation for why the term feature was picked as well, as I'm sure other options were considered, and every option has tradeoffs.

How to model v2 JSON in Redshift

How can you effectively design a table in Redshift to accommodate the JSON column type proposed in Schema 2.0? Redshift does not directly support a JSON data type, so what alternative approaches do you suggest, considering that storing it as VARCHAR may negatively impact performance?

See https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html

We recommend using JSON sparingly. JSON isn't a good choice for storing larger datasets because, by storing disparate data in a single column, JSON doesn't use the Amazon Redshift column store architecture. Though Amazon Redshift supports JSON functions over CHAR and VARCHAR columns, we recommend using SUPER for processing data in JSON serialization format. SUPER uses a post-parse schemaless representation that can efficiently query hierarchical data. For more information about the SUPER data type, see Ingesting and querying semistructured data in Amazon Redshift.
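
Given that AWS guidance, one possible approach (an assumption, not something the spec prescribes) is to declare the v2 feature_json column as SUPER and read it with Redshift's PartiQL navigation; the attribute name below is illustrative:

```sql
-- feature_json stored as SUPER instead of VARCHAR.
create table activity_stream (
    activity_id  varchar,
    ts           timestamp,
    customer     varchar,
    activity     varchar,
    feature_json super
);

-- Reading an attribute out of the SUPER column.
select customer, feature_json.total_items
from activity_stream
where activity = 'completed_order';
```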

Clarification to the activity_occurrence column

In the specification you state that:

The column activity_occurrence represents the number of times a given entity has done that activity, at the time that activity occurred (starting with 1). activity_repeated_at is the timestamp of the next time that same activity occurred for that entity. Both should be NULL if another activity has not yet occurred.

I think there is a discrepancy between the above statement and the proposed usage of the activity_occurrence column in the examples section:

They can be used to easily get the first time each customer did an activity (activity_occurrence = 1) and the last time (activity_repeated_at = null). Since these two expressions each return only one row per customer, they're also a very efficient way to get every unique customer that has done an activity.

Wouldn't you have to query activity_occurrence = 1 OR activity_occurrence IS NULL in order to get the first activity, whether the customer has one or several activities of the same kind?

I would appreciate a clarification. Is this a bug in the spec, or something I'm not understanding?
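
For concreteness, these are the two query patterns being compared (illustrative names; this restates the question rather than answering it):

```sql
-- First completed order per customer, as written in the spec's examples section:
select customer, ts
from activity_stream
where activity = 'completed_order'
  and activity_occurrence = 1;

-- The variant this issue suggests would be needed if activity_occurrence can be
-- NULL for an entity's only occurrence of the activity:
select customer, ts
from activity_stream
where activity = 'completed_order'
  and (activity_occurrence = 1 or activity_occurrence is null);
```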

Proposal: Add Support for `between_before` Temporal Join

Description

Stemming from this Slack thread, I'm proposing a change to the Activity Schema spec to make it more straightforward to derive transformations for temporal joins between the previous and current occurrences of a primary activity.

Example

I work for a recruiting software company and I’m attempting to analyze trends in interview volume in relation to hiring. I have two activities (the entity is jobs) - completed interview and hired applicant - and I’d like to analyze the additional number of interviews completed before each incremental hire on the job. So if, for a given job, the sequence of activities went:

completed interview --> completed interview --> hired applicant --> completed interview --> hired applicant

then I want to generate the following dataset from the sample sequence:

hired_applicant_occurrence    completed_interviews_before
1                             2
2                             1

While this output can be achieved in Narrator by appending an activity and filtering it in relation to another activity, that interface feels cumbersome for a common transformation use case when building a dataset. As a user, intuitively, if I can look ahead to the next activity occurrence, I would also expect to be able to look back to the previous occurrence.

Proposal

I'm proposing the following changes:

  • add a between_before temporal join (for first, last, and aggregate) which filters the joined activity to records that occurred between the previous and current occurrences of the primary activity
  • rename the between temporal join to between_after for naming convention consistency
  • add a field called activity_last_occurred_at to the spec, which is null when activity_occurrence = 1, and is the timestamp of the previous activity occurrence for all other rows

This should be straightforward to implement and easy to maintain in production data pipelines: the value is immutable (i.e. a lagged value is always known at runtime), so no changes will ever need to be upserted/merged into this column for incrementally built activity tables.
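
A hedged sketch of how the proposed column could be derived with a window function (table and column names as in the examples above):

```sql
-- activity_last_occurred_at is NULL on an entity's first occurrence of an activity,
-- and the timestamp of the previous occurrence otherwise.
select
    *,
    lag(ts) over (
        partition by customer, activity
        order by ts
    ) as activity_last_occurred_at
from activity_stream;
```

A between_before join would then keep joined records whose timestamps fall between activity_last_occurred_at and ts of the primary activity.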

Open Questions

  • What are other pros and cons to this proposal?
  • If this is done, would it make sense to update the guidance on Snowflake cluster keys from activity_occurrence in (1, NULL) to activity_last_occurred_at is NULL?
  • If cluster keys are updated, would it make sense to tweak relevant queries under the hood (e.g. first_ever) to filter/join on activity_last_occurred_at is NULL to take advantage of updated cluster keys?

Combining Entities

Hi there.
How do you combine multiple entities/dimensions?

For example, suppose I have a supermarket chain and I want to know:

  • The average money spent by customers on grocery items in California

Should grocery items (product family) and state be added to the feature attributes of the customer_stream activity? What if I want to zoom in through these dimensions (product name and city)?
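
To make the question concrete, here is one possible shape of the data (not an authoritative answer; the activity name, attribute names, revenue_impact column, and Snowflake-style JSON syntax are all assumptions):

```sql
-- Average spend on grocery orders by customers in California, assuming
-- product_family and state live in feature_json on a 'completed_order' activity
-- and revenue_impact holds the order amount.
select avg(revenue_impact) as avg_spend
from activity_stream
where activity = 'completed_order'
  and feature_json:product_family::string = 'grocery'
  and feature_json:state::string = 'California';
```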

How to deal with changes in Customer identifier (email)

Since you recommend that the customer identifier be an email address and not an ID, what approach would you suggest for a system where a user's email can change over time? I think this is a pretty common pattern.

Would you go back to the warehouse and update the old records? This could be an expensive operation, yet maybe acceptable for such a rare case. What do you suggest?
