
Activity Schema

The Activity Schema is a data modeling standard built for modern data warehouses. It aims to make data modeling and analysis substantially simpler, faster, and more reliable than existing methodologies (e.g. a star schema).

The core approach is to separate modeling from querying. Models should be fully independent: asking new data questions should not require creating new models. Instead, queries can combine any number of existing models to build anything they need.

This is done by modeling all data in the warehouse as a single time series table.

For more information, see the home page or read the spec.


How it Works

At its core, an activity schema transforms raw tables into a single time-series table called an activity stream. All downstream plots, tables, materialized views, etc. used for BI are built directly from that single table, with no other dependencies.
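
As a rough illustration (not a substitute for the spec), an activity stream table might look like the following. Column names beyond feature_json, activity_occurrence, and activity_repeated_at, which appear in the issues below, should be treated as assumptions for the example:

```sql
-- Illustrative sketch only; see the spec for the authoritative column list.
-- The JSON column type varies by warehouse (e.g. VARIANT, SUPER, or VARCHAR).
create table activity_stream (
    activity_id          varchar,    -- unique identifier for the activity record
    ts                   timestamp,  -- when the activity occurred
    customer             varchar,    -- the entity doing the activity
    activity             varchar,    -- e.g. 'completed_order', 'submitted_support_ticket'
    feature_json         varchar,    -- activity-specific attributes
    activity_occurrence  integer,    -- how many times this entity has done this activity, as of this row
    activity_repeated_at timestamp   -- when this entity next did the same activity (NULL if it hasn't recurred)
);
```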


Traditional Data Modeling - many custom-structured tables




Activity Schema - one standard table


Business concepts are represented as an entity doing an activity ('a customer completed an order') rather than as facts or nouns (orders, products). Activities are built directly from source tables, store only their own data, and are the single source of truth for each concept.

All queries run against an activity stream table to assemble data for analysis, BI, and reporting. Instead of traditional foreign key joins, queries combine activities using relationships in time (e.g. all customers who completed an order and submitted a support ticket before their next completed order).
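
As a sketch of what such a time-based query can look like (activity names and the exact relationship logic here are illustrative, not prescribed by the spec):

```sql
-- Customers who completed an order and then submitted a support ticket
-- before their next completed order, joined purely on time.
select
    orders.customer,
    orders.ts        as completed_order_at,
    min(tickets.ts)  as first_support_ticket_at
from activity_stream as orders
join activity_stream as tickets
  on tickets.customer = orders.customer
 and tickets.activity = 'submitted_support_ticket'
 and tickets.ts > orders.ts
 and (orders.activity_repeated_at is null or tickets.ts < orders.activity_repeated_at)
where orders.activity = 'completed_order'
group by orders.customer, orders.ts;
```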



Advantages

Maintainable data models

  • fewer models — one (and only one) business concept per activity means fewer models to manage, understand, and maintain
  • easier to build — no joins between models means no need to tie disparate source systems together
  • easily accommodate changes to source data — only a single activity needs to be updated
  • simple data lineage — a single data layer makes tracing data provenance and debugging far easier
  • faster updates — time-series modeling means incremental updates (rather than full rebuilds) by default; see the sketch after this list
  • lower data latency — no dependencies means data is available upon insertion, rather than waiting for cascading rebuilds
  • no data dictionaries — fewer models, each covering a single concept, are vastly easier to document
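
A minimal sketch of that incremental pattern, assuming a hypothetical stg_completed_orders staging model; how activity_occurrence and activity_repeated_at get recomputed is left to the implementation:

```sql
-- Append only new completed orders; no downstream models need rebuilding.
insert into activity_stream
select
    order_id          as activity_id,
    completed_at      as ts,
    customer_email    as customer,
    'completed_order' as activity,
    feature_json,
    null              as activity_occurrence,   -- recomputed after insert
    null              as activity_repeated_at   -- recomputed after insert
from stg_completed_orders
where completed_at > (
    select coalesce(max(ts), timestamp '1900-01-01')
    from activity_stream
    where activity = 'completed_order'
);
```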

Faster analysis and querying

  • single source of truth — because each activity represents a single concept (like a 'page view' or 'completed order'), it's always clear which activity to use
  • query across many source systems — time-based joins mean any activity can be queried and combined with any other without defining foreign keys
  • reusable analyses — a standard data model means any analysis can be reused across companies: a customer acquisition cost calculation built for one company can be shared with another
  • autogenerated queries — a standard data model means queries don't have to be written by hand
  • true ad-hoc querying — because all activities are related in time, swapping one activity for another requires no structural changes to queries
  • high performance — queries run substantially faster against an activity stream table, which has fewer columns, requires fewer joins, and can easily be partitioned or indexed by time


Spec

Learn more by reading the full activity schema specification.


Resources

The activity schema home page has more information on the approach and its benefits.


Building an activity schema

The implementation page covers how to build an activity schema in production.

A new activity schema dbt package has some helpful macros for building activity stream tables.


Known Implementations

Narrator provides a full implementation of the activity schema as a service. The Narrator team directly supports the growth of the activity schema, and is always happy to provide suggestions on how to build one yourself.


Community

The dbt Slack has a #modeling-activity-schema channel for discussion about the activity schema.


Issues

Proposal: rename `feature_json` to `attributes`

This may be pedantic, but are you open to renaming references to features as attributes? Feature feels very much like borrowing from ML terminology, and while this framework can be used to power model training/serving pipelines, it also serves other use cases where the term feature is more foreign. I think adopting a term like attribute would make the framework a bit more intuitive and accessible to data/analytics engineers. Open to other terminology suggestions as well. Would love to hear the motivation for why the term feature was picked as well, as I'm sure other options were considered, and every option has tradeoffs.

How to model v2 JSON in Redshift

How can you effectively design a table in Redshift to accommodate the JSON column type proposed in Schema 2.0? Redshift does not directly support a JSON data type, so what alternative approaches do you suggest, considering that storing it as VARCHAR may negatively impact performance?

See https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html

We recommend using JSON sparingly. JSON isn't a good choice for storing larger datasets because, by storing disparate data in a single column, JSON doesn't use the Amazon Redshift column store architecture. Though Amazon Redshift supports JSON functions over CHAR and VARCHAR columns, we recommend using SUPER for processing data in JSON serialization format. SUPER uses a post-parse schemaless representation that can efficiently query hierarchical data. For more information about the SUPER data type, see Ingesting and querying semistructured data in Amazon Redshift.
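
Given that AWS guidance, one possible approach (an assumption, not something the spec prescribes) is to declare the v2 feature_json column as SUPER and read it with Redshift's PartiQL navigation; the attribute name below is illustrative:

```sql
-- feature_json stored as SUPER instead of VARCHAR.
create table activity_stream (
    activity_id  varchar,
    ts           timestamp,
    customer     varchar,
    activity     varchar,
    feature_json super
);

-- Reading an attribute out of the SUPER column.
select customer, feature_json.total_items
from activity_stream
where activity = 'completed_order';
```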

Clarification to the activity_occurrence column

In the specification you state that:

The column activity_occurrence represents the number of times a given entity has done that activity, at the time that activity occurred (starting with 1). activity_repeated_at is the timestamp of the next time that same activity occurred for that entity. Both should be NULL if another activity has not yet occurred.

I think there is a discrepancy between the above statement and the proposed usage of the activity_occurrence column in the examples section:

They can be used to easily get the first time each customer did an activity (activity_occurrence = 1) and the last time (activity_repeated_at = null). Since these two expressions each return only one row per customer, they're also a very efficient way to get every unique customer that has done an activity.

Wouldn't you have to query activity_occurrence = 1 OR activity_occurrence IS NULL in order to get the first activity, whether the customer has one or several activities of the same kind?

I would appreciate a clarification. Is this a bug in the spec, or something I'm not understanding?
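
For concreteness, these are the two query patterns being compared (illustrative names; this restates the question rather than answering it):

```sql
-- First completed order per customer, as written in the spec's examples section:
select customer, ts
from activity_stream
where activity = 'completed_order'
  and activity_occurrence = 1;

-- The variant this issue suggests would be needed if activity_occurrence can be
-- NULL for an entity's only occurrence of the activity:
select customer, ts
from activity_stream
where activity = 'completed_order'
  and (activity_occurrence = 1 or activity_occurrence is null);
```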

Proposal: Add Support for `between_before` Temporal Join

Description

Stemming from this Slack thread, I'm proposing a change to the Activity Schema spec to make it more straightforward to derive transformations for temporal joins between the previous and current occurrences of a primary activity.

Example

I work for a recruiting software company and I’m attempting to analyze trends in interview volume in relation to hiring. I have two activities (the entity is jobs) - completed interview and hired applicant - and I’d like to analyze the additional number of interviews completed before each incremental hire on the job. So if, for a given job, the sequence of activities went:

completed interview --> completed interview --> hired applicant --> completed interview --> hired applicant

then I want to generate the following dataset from the sample sequence:

hired_applicant_occurrence    completed_interviews_before
1                             2
2                             1

While this output can be achieved in Narrator by appending an activity and filtering it in relation to another activity, that interface feels cumbersome for a common transformation use case when building a dataset. As a user, intuitively, if I can look ahead to the next activity occurrence, I would also expect to be able to look back to the previous occurrence.

Proposal

I'm proposing the following changes:

  • add a between_before temporal join (for first, last, and aggregate) which filters the joined activity to records that occurred between the previous and current occurrences of the primary activity
  • rename the between temporal join to between_after for naming convention consistency
  • add a field called activity_last_occurred_at to the spec, which is null when activity_occurrence = 1, and is the timestamp of the previous activity occurrence for all other rows

This should be straightforward to implement and easy to maintain in production data pipelines: the value is immutable (i.e. a lagged value is always known at runtime), so no changes will ever need to be upserted/merged into this column for incrementally built activity tables.
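
A hedged sketch of how the proposed column could be derived with a window function (table and column names as in the examples above):

```sql
-- activity_last_occurred_at is NULL on an entity's first occurrence of an activity,
-- and the timestamp of the previous occurrence otherwise.
select
    *,
    lag(ts) over (
        partition by customer, activity
        order by ts
    ) as activity_last_occurred_at
from activity_stream;
```

A between_before join would then keep joined records whose timestamps fall between activity_last_occurred_at and ts of the primary activity.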

Open Questions

  • What are other pros and cons to this proposal?
  • If this is done, would it make sense to update the guidance on Snowflake cluster keys from activity_occurrence in (1, NULL) to activity_last_occurred_at is NULL?
  • If cluster keys are updated, would it make sense to tweak relevant queries under the hood (e.g. first_ever) to filter/join on activity_last_occurred_at is NULL to take advantage of updated cluster keys?

Combining Entities

Hi there.
How do you combine multiple entities/dimensions?

For example, suppose I have a supermarket chain and I want to know:

  • The average money spent by customers on grocery items in California

Should grocery items (product family) and state be added to the feature attributes of the customer_stream activity? What if I want to zoom in through these dimensions (product name and city)?
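
To make the question concrete, here is one possible shape of the data (not an authoritative answer; the activity name, attribute names, revenue_impact column, and Snowflake-style JSON syntax are all assumptions):

```sql
-- Average spend on grocery orders by customers in California, assuming
-- product_family and state live in feature_json on a 'completed_order' activity
-- and revenue_impact holds the order amount.
select avg(revenue_impact) as avg_spend
from activity_stream
where activity = 'completed_order'
  and feature_json:product_family::string = 'grocery'
  and feature_json:state::string = 'California';
```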

How to deal with changes in Customer identifier (email)

Since you recommend that the customer identifier be an email address and not an ID, what approach would you suggest for a system where a user's email can change over time? I think this is a pretty common pattern.

Would you go back to the warehouse and update the old records? This could be an expensive operation, yet maybe acceptable for such a rare case. What do you suggest?
