Giter Site home page Giter Site logo

Comments (11)

relud avatar relud commented on August 28, 2024

not sure how much space an array of bools takes up, but we could also treat an INT64 as an array of 64 bools. That would probably make this even more magic to understand, but if we are hiding access behind UDFs that would be fine.

example:

CREATE TEMPORARY FUNCTION udf_days_since(seen INT64) AS ((
  SELECT
    MIN(days_since)
  FROM
    UNNEST(GENERATE_ARRAY(0,27)) AS days_since
  WHERE
    (seen >> days_since) & 1 = 1));
WITH
  raw AS (
  SELECT
    0x0534 AS seen)
SELECT
  BIT_COUNT(seen & 0x7F) AS days_seen_last7,
  udf_days_since(seen) AS days_since_seen
FROM
  raw

from bigquery-etl.

relud avatar relud commented on August 28, 2024

that said, I would be in favor of adding this as an additional column, and not replacing days_since_*

from bigquery-etl.

jmccrosky avatar jmccrosky commented on August 28, 2024

So days_seen would have a length of 28? Looks like it should work :)

My only suggestion is that we find a way to get away from needing a schema change every time we add a new usage criteria. I expect this to happen regularly. Perhaps a "reshape" and have a single usage_criteria column and then we multiply the rows by the number of usage criteria that we have.

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

find a way to get away from needing a schema change every time we add a new usage criteria

It's indeed unfortunate that adding new usage criteria is fairly heavy-weight. I expect we can massage this in later-stage ETL so that it's more convenient for end users doing analysis to adapt to integrating new usage criteria, but at the earlier stages of ETL, establishing a reasonable process for backfilling a new criterion seems like the critical piece. Establishing and documenting that process will be powerful for making this pipeline productive for evolving analyses.

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

that said, I would be in favor of adding this as an additional column, and not replacing days_since_*

That sounds reasonable.

from bigquery-etl.

jmccrosky avatar jmccrosky commented on August 28, 2024

find a way to get away from needing a schema change every time we add a new usage criteria

It's indeed unfortunate that adding new usage criteria is fairly heavy-weight. I expect we can massage this in later-stage ETL so that it's more convenient for end users doing analysis to adapt to integrating new usage criteria, but at the earlier stages of ETL, establishing a reasonable process for backfilling a new criterion seems like the critical piece. Establishing and documenting that process will be powerful for making this pipeline productive for evolving analyses.

Ease of setting up a new usage criteria is a factor too, as we hope encourage, for example, a very small team that builds a particular browser feature, to start caring about growth, which will depend on the barrier to entry for enabling growth tracking of that feature.

I don't understand why it doesn't make sense to just have a feature_id column and, instead of writing to a new set of columns for each usage criteria, write to a new set of rows, while setting feature_id appropriately. That lack of understanding may just be because I'm not an engineer though ;)

from bigquery-etl.

relud avatar relud commented on August 28, 2024

establishing a reasonable process for backfilling a new criterion seems like the critical piece

So I learned this weekend that this kind of backfill is difficult. Backfills can't be done in a single query, because they take so long. Just calculating a new column from existing rows costs $1185, because the table is 237TiB.

from bigquery-etl.

relud avatar relud commented on August 28, 2024

In terms of adding criteria, maybe what we need is to have all pings in the last 28 days, so that derived tables can efficiently apply criteria post-facto

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

In terms of adding criteria, maybe what we need is to have all pings in the last 28 days, so that derived tables can efficiently apply criteria post-facto

@relud and I discussed this over video, and I'm very interested to see us pursue this path. It will be somewhat more expensive in terms of data storage and scan costs, but cheaper to maintain in terms of backfills and delaying definition of usage criteria to the exact_mau28_by_dimensions stage for flexibility in adding new criteria.

from bigquery-etl.

relud avatar relud commented on August 28, 2024

I said all pings in the last 28 days, but I meant all clients_daily_v6 rows in the last 28 days

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

We've done this!

from bigquery-etl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.