Giter Site home page Giter Site logo

Comments (8)

jklukas avatar jklukas commented on August 28, 2024 2

Sounds like we have a way forward here. Instead of creating views, we will populate _stable tables from the Parquet datasets. Steps to proceed:

  1. Create a reasonably performant JS UDF for crc32 and sample_id (likely we should store the crcTable as a blob on GCS so we don't have to generate it each time).

  2. Develop a query on top of telemetry_core_parquet that gets data into the right shape for telemetry_stable.core_v10, including using TIMESTAMP_MICROS(DIV(metadata.timestamp, 1000)) AS submission_timestamp.

  3. Run incremental queries to backfill all available days into core

  4. Generalize to other tables

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

I'm attempting to write this view for core pings right now and realizing some issues:

  1. We don't have submission_timestamp in the Parquet data, only submission_date; we have a timestamp field that I believe is client-provided and thus not guaranteed to fall on submission_date; would we manufacture some value like <submission_date> 12:00:00?
  2. We don't have sample_id in the parquet data; would we want to just use NULL for the Parquet data? We don't currently have a way of calculating sample_id in BigQuery due to lack of crc32 support, but we could probably make a JS UDF
  3. We are documenting in DTMO that _stable tables are guaranteed to have no duplicate document_ids within a partition, but we don't have that guarantee for the Parquet data

It's possible to address (2) and (3) if we abandon the idea of making a view and instead copy data into the telemetry_stable.core_v* tables. This would give us a materialized sample_id and allow us to perform deduplication.

Fully solving (1) might require falling back to Heka data (does that exist for core pings? do the Heka messages even contain an ingestion timestamp?). I don't really like the idea of manufacturing a timestamp, as this might be very confusing to users. Not sure what to do there.

from bigquery-etl.

relud avatar relud commented on August 28, 2024

i'm not 100% sure, but I think metadata.timestamp is populated by nginx at receipt time, not provided by the client, making it the correct source for submission_timestamp

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

i'm not 100% sure, but I think metadata.timestamp is populated by nginx at receipt time, not provided by the client, making it the correct source for submission_timestamp

@whd Can you confirm how metadata.timestamp is populated for d2p datasets? Is it a reasonable analogue for submission_timestamp in the GCP pipeline?

from bigquery-etl.

whd avatar whd commented on August 28, 2024

i'm not 100% sure, but I think metadata.timestamp is populated by nginx at receipt time, not provided by the client, making it the correct source for submission_timestamp

This is correct. See https://github.com/mozilla-services/nginx_moz_ingest/blob/master/src/ngx_http_moz_ingest_module.c#L613-L625.

@whd Can you confirm how metadata.timestamp is populated for d2p datasets? Is it a reasonable analogue for submission_timestamp in the GCP pipeline?

It is. See https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/parquet/sandboxes/heka/output/s3_parquet.lua#L59-L63 and e.g. https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/core/core.9.parquetmr.txt#L4.

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

It would appear that metadata.timestamp always falls on the submission_date, which bodes well.

WITH
  times AS (
  SELECT
    submission_date,
    TIMESTAMP_MICROS(DIV(MIN(metadata.timestamp), 1000)) mintime,
    TIMESTAMP_MICROS(DIV(MAX(metadata.timestamp), 1000)) maxtime
  FROM
    `moz-fx-data-derived-datasets.telemetry_derived.telemetry_core_parquet_v3`
  WHERE
    submission_date >= "2019-01-01"
  GROUP BY
    1
  ORDER BY
    1),
  inbounds AS (
  SELECT
    *,
    DATE(mintime) = submission_date AS goodmin,
    DATE(maxtime) = submission_date AS goodmax
  FROM
    times )
SELECT
  *
FROM
  inbounds
WHERE
  NOT (goodmin
    AND goodmax)

Produces no rows showing a timestamp outside the submission_date.

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

Verifying that a CRC32 calculation in JS UDF is feasible:

-- Based on https://stackoverflow.com/a/18639999/1260237

create temp function crc32(data STRING) returns int64 language js AS """
var makeCRCTable = function(){
    var c;
    var crcTable = [];
    for(var n =0; n < 256; n++){
        c = n;
        for(var k =0; k < 8; k++){
            c = ((c&1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1));
        }
        crcTable[n] = c;
    }
    return crcTable;
}

var crc32 = function(str) {
    var crcTable = makeCRCTable();
    var crc = 0 ^ (-1);

    for (var i = 0; i < str.length; i++ ) {
        crc = (crc >>> 8) ^ crcTable[(crc ^ str.charCodeAt(i)) & 0xFF];
    }

    return (crc ^ (-1)) >>> 0;
};

return crc32(data);
""";

--select MOD(crc32("51baf8b4-75d1-3648-b96d-809569b89a12"), 100)
select MOD(crc32("6612376a-ac10-4cf8-a341-22340fb1b48e"), 100)

Both of these give correct sample_id.

from bigquery-etl.

jklukas avatar jklukas commented on August 28, 2024

Done

from bigquery-etl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.