mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform

Home Page: https://mozilla.github.io/gcp-ingestion/

License: Mozilla Public License 2.0

Languages: Dockerfile 0.27%, Python 12.33%, Shell 2.21%, Java 85.16%, Lua 0.02%, HTML 0.01%, Mermaid 0.01%
Topics: telemetry-ingestion, gcp, mozilla-telemetry

gcp-ingestion's Introduction

Telemetry Ingestion on Google Cloud Platform

A monorepo for documentation and implementation of the Mozilla telemetry ingestion system deployed to Google Cloud Platform (GCP).

For more information, see the documentation.

gcp-ingestion's People

Contributors

abdelrahman-ik, acmiyaguchi, akkomar, anich, badboy, benwu, cbguder, chelseatroy, curtismorales, dependabot-preview[bot], dependabot-support, dependabot[bot], edugfilho, fbertsch, jasonthomas, jklukas, kik-kik, lelilia, marlene-m-hirose, mdboom, mikaeld, mkaply, mreid-moz, quiiver, relud, scholtzan, sean-rose, standard8, whd, wlach

gcp-ingestion's Issues

Enable dependency update scanning

Per https://github.com/mozilla-services/foxsec/blob/master/README.mediawiki#Security_Checklist

  • enable security scanning of 3rd-party libraries and dependencies
    • ...
    • For Python, enable pyup security updates:
      • Add a pyup config to your repo (example config: https://github.com/mozilla-services/antenna/blob/master/.pyup.yml)
      • Enable branch protection for master and other development branches. Make sure the approved-mozilla-pyup-configuration team CANNOT push to those branches.
      • From the "add a team" dropdown for your repo /settings page
        • Add the "Approved Mozilla PyUp Configuration" team for your github org (e.g. for mozilla and mozilla-services)
        • Grant it write permission so it can make pull requests
      • notify [email protected] to enable the integration in pyup

Add a test helper for running `main` and validating output on the filesystem

As the project grows and the number of test cases increases, we're going to want to handle the following issues:

  1. Escaped JSON strings in Java source code are difficult to read and modify
  2. We will want to be able to test the option parsing and pipeline building logic of the entry classes' main methods
  3. We will want to test interaction with external datastores like GCS and Pubsub

This issue doesn't address (3), but we can address (1) and (2) by adding a test helper class that can read in newline-delimited JSON documents from Java resource files representing expected input and output, and compare to the results of an invocation of main. This helper would only support running pipelines with input and output types of file.

We'd likely want to use the FileAssert utilities from AssertJ or implement a Hamcrest matcher.

We'd also likely need to add a pipeline option for setting the number of shards (so that tests can set nShards=1) or make our matcher capable of handling output lines being spread across many output files.
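A minimal sketch of the comparison half of such a helper, assuming output is written as newline-delimited files under a temporary directory; the class and method names here are hypothetical, not an existing API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import static org.junit.Assert.assertEquals;

public class MainOutputHelper {

  /**
   * Compare the newline-delimited JSON files under outputDir to the expected lines,
   * ignoring ordering and sharding. Hypothetical helper, for illustration only.
   */
  public static void assertOutputLines(Path outputDir, List<String> expected) throws IOException {
    List<String> actual;
    try (Stream<Path> shards = Files.list(outputDir)) {
      // Concatenate all shards so the assertion works whether nShards is 1 or many.
      actual = shards.flatMap(MainOutputHelper::lines).sorted().collect(Collectors.toList());
    }
    assertEquals(expected.stream().sorted().collect(Collectors.toList()), actual);
  }

  private static Stream<String> lines(Path path) {
    try {
      return Files.readAllLines(path).stream();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test would read the expected lines from a resource file, invoke the entry class's main with file input and output pointed at a temporary directory, and then call assertOutputLines on that directory.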

Add metrics to Sink and Validate

I'm not sure if Beam includes default metrics, but there are a few things we should at least be counting to understand failure modes.

Beam provides an example pipeline that uses Beam Metrics, which we can follow as a guide.

Some things we may want to count:

  • How far we get in extracting location info in Validate; we could increment one counter on every event, another counter after getting a hostname from IP, another after extracting city, etc.
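For reference, a minimal sketch of Beam's Metrics API inside a DoFn; the GeoCityFn class and the counter names are hypothetical, not existing code:

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class GeoCityFn extends DoFn<PubsubMessage, PubsubMessage> {
  // Counters are registered by namespace and name and aggregated by the runner.
  private final Counter messagesSeen = Metrics.counter(GeoCityFn.class, "messages_seen");
  private final Counter citiesExtracted = Metrics.counter(GeoCityFn.class, "geo_city_extracted");

  @ProcessElement
  public void processElement(@Element PubsubMessage message, OutputReceiver<PubsubMessage> out) {
    messagesSeen.inc();
    // ... hypothetical geo lookup; call citiesExtracted.inc() when a city is found ...
    out.output(message);
  }
}
```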

Preserve timezone for BigQuery timestamp inserts

@sunahsuh reports:

I just checked on BigQuery and it appears it'll allow any two-digit offset (technically, anything that fits (+|-)H[H][:M[M]]);
however, it normalizes to UTC and then throws away the timezone info.
As far as I can tell, there's no way to store timezone info in a native time column type.

We should make sure the format in which we send timestamps allows us to understand the original timezone.
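One possible approach, sketched below with java.time: write the UTC-normalized instant into the TIMESTAMP column and carry the original offset in a separate field. The column split and the example value are assumptions, not decisions from this issue.

```java
import java.time.OffsetDateTime;

public class TimestampOffsetExample {
  public static void main(String[] args) {
    // Hypothetical client-supplied timestamp with a non-UTC offset.
    OffsetDateTime parsed = OffsetDateTime.parse("2018-03-12T21:02:18+05:00");
    String utcInstant = parsed.toInstant().toString();   // "2018-03-12T16:02:18Z" for the TIMESTAMP column
    String originalOffset = parsed.getOffset().getId();  // "+05:00", stored in a separate column or attribute
    System.out.println(utcInstant + " " + originalOffset);
  }
}
```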

Fix stdout regression

Since #42, using stdout as the output destination leads to exceptions when actually running the pipeline locally:

java.lang.IllegalArgumentException: unable to serialize DoFnAndMainOutput{doFn=com.mozilla.telemetry.transforms.Foreach$Fn@2f639a92, mainOutputTag=Tag<output>}
...
Caused by: java.io.NotSerializableException: org.fusesource.jansi.AnsiConsole$2

Apparently, Maven loads the jansi library, which replaces System.out. We need to avoid serializing System.out itself and instead serialize a flag, delaying the actual call to System.out.println to the innermost function so that the stream is never serialized.
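A minimal sketch of that shape, using a hypothetical PrintFn: the DoFn serializes only a boolean flag, and the PrintStream is looked up inside processElement.

```java
import org.apache.beam.sdk.transforms.DoFn;

class PrintFn extends DoFn<String, Void> {
  // Serialize a boolean flag rather than a PrintStream field,
  // so the non-serializable stream never becomes part of the DoFn's state.
  private final boolean useStderr;

  PrintFn(boolean useStderr) {
    this.useStderr = useStderr;
  }

  @ProcessElement
  public void processElement(@Element String element) {
    // Resolve the stream at call time; it is never captured during serialization.
    (useStderr ? System.err : System.out).println(element);
  }
}
```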

Add fileformat=base64

Do we have production use cases in mind for text format, or do we expect to always use pubsub JSON? If we do have use cases for text format in production, I'm a bit concerned about text's lossiness (PubsubMessage payloads can contain newlines, which become multiple messages in text format).

Should we add input and output file formats for base64-encoded payloads without the PubsubMessage JSON wrapper?
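A minimal sketch of what base64 framing could look like, assuming one encoded payload per line and no attributes; the class and method names are illustrative:

```java
import java.util.Base64;
import java.util.HashMap;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;

class Base64FileFormat {

  // One base64 line per message; safe even when the raw payload contains newlines.
  static String encode(PubsubMessage message) {
    return Base64.getEncoder().encodeToString(message.getPayload());
  }

  // Rebuild a PubsubMessage from a line; attributes are dropped in this framing.
  static PubsubMessage decode(String line) {
    return new PubsubMessage(Base64.getDecoder().decode(line), new HashMap<>());
  }
}
```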

cc @relud

Add a mechanism to stage BigQuery and Cloud Storage partitions until complete

Problem: backfilling data after GCP ingestion downtime requires updating any derived data sets that may have read the partition before the backfill, including scheduled queries from re:dash.

Solution: for non-streaming tables (in both BigQuery and Cloud Storage), require that partitions be staged and then promoted atomically (or as close to atomic as possible), and document how to suspend promotion.

Implementation: deliver ingestion data to a staging location and use Airflow to schedule operations that promote completed partitions. Before promoting data, the operation should validate that the data is complete within some threshold. Make sure that both data ops and data platform engineers are trained to suspend promotion when data may be incomplete, such as after downtime.
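On the BigQuery side, promotion could be a copy job against a partition decorator, roughly as sketched below; the dataset and table names are hypothetical and the Airflow scheduling around it is omitted.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CopyJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class PromotePartition {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Partition decorators ($YYYYMMDD) let a single day be copied in one job.
    TableId staging = TableId.of("staging_dataset", "main_v4$20190101");
    TableId live = TableId.of("live_dataset", "main_v4$20190101");
    Job copy = bigquery.create(JobInfo.of(
        CopyJobConfiguration.newBuilder(live, staging)
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build()));
    copy.waitFor(); // promotion is complete (or has failed) once the copy job finishes
  }
}
```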

Test BigQuery output in ingestion-beam

BigQuery doesn't have an emulator, so I think this will have two parts.

  1. Use a fake HTTP service and hit localhost; this may not work.
  2. Use Cloud Build on restricted branches and test actual BigQuery output; this needs to generate unique table names and reliably clean up all resources after running (see the sketch below).
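A sketch of the table lifecycle for the second approach, assuming JUnit and the google-cloud-bigquery client; the dataset name is hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.TableId;
import java.util.UUID;
import org.junit.After;
import org.junit.Test;

public class BigQueryOutputTest {
  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  // Unique table name per run so concurrent builds don't collide.
  private final TableId table = TableId.of("test_dataset",
      "sink_output_" + UUID.randomUUID().toString().replace("-", "_"));

  @Test
  public void canWriteToBigQuery() {
    // ... run the pipeline with its output pointed at `table` and assert on query results ...
  }

  @After
  public void cleanUp() {
    // Always delete the table, even when the test fails.
    bigquery.delete(table);
  }
}
```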

Ensure that the edge spec covers what to do with "bad" submissions

I just closed Bug 1353076 since we won't do it on the AWS infra, but we should make sure the new edge spec covers:

  • What (if any) submissions are explicitly rejected by the edge
  • What (if any) submissions are accepted but published to topics without any consumers

We are not currently planning to drop anything at the edge, but we should document that. We may need to do so in the future if we see malicious submissions or activity.

Support Legacy Systems

In particular, sslreports and DSMO.

sslreports doesn't follow the URI spec; all requests hit /submit/sslreports. DSMO, in addition to not following the URI spec, also uses GET requests.

We'll probably implement these last as special cases, but the required changes should be minimal.

Fork google_check.xml

google_check.xml is pretty lax about which checks are warnings versus errors, and it's not configured to allow suppressions, which I think we will come to want.

Fix DecoderOptions.parse

An exception occured while executing the Java class. Method [parse] has multiple definitions [public static com.mozilla.telemetry.decoder.DecoderOptions$Parsed com.mozilla.telemetry.decoder.DecoderOptions.parse(com.mozilla.telemetry.decoder.DecoderOptions), public static com.mozilla.telemetry.options.SinkOptions$Parsed com.mozilla.telemetry.options.SinkOptions.parse(com.mozilla.telemetry.options.SinkOptions)] with different return types for [com.mozilla.telemetry.decoder.DecoderOptions]. -> [Help 1]

Consolidate error handling logic

We have several transforms now that catch exceptions and duplicate toError logic to convert the exception and payload into a PubsubMessage. We should factor that out to ensure consistent formatting of error outputs.
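A minimal sketch of a shared helper, with assumed attribute names rather than whatever we eventually settle on:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;

public class FailureMessage {

  /** Wrap the original payload and exception details into a single error PubsubMessage. */
  public static PubsubMessage of(String transformName, byte[] payload, Throwable error) {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("error_type", transformName);
    attributes.put("exception_class", error.getClass().getName());
    attributes.put("stack_trace", stackTrace(error));
    return new PubsubMessage(payload, attributes);
  }

  private static String stackTrace(Throwable error) {
    StringWriter sw = new StringWriter();
    error.printStackTrace(new PrintWriter(sw));
    return sw.toString();
  }
}
```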

Handle permanent failures when flushing ingestion-edge queue

If a transient error occurs before a permanent error, a message that fails permanently may be queued and then retried indefinitely.

We should:

  1. throw out permanently failing messages from the queue
    • or upload them to s3
  2. check that we surface permanent failures before queuing where reasonable

Periodically update schemas from mozilla-pipeline-schemas

Currently, we fetch mozilla-pipeline-schemas from GitHub at ingestion-beam build time and include the content in the jar. This means we can only get schema updates by rebuilding, draining the existing Dataflow job, and instantiating a new job with the new code.

We could try to spin up a periodic task (every 5 minutes, perhaps?) to get the latest content from GitHub and update the collection of schemas. Perhaps this could be expressed nicely as a side input?
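A minimal sketch of the slowly updating global-window side input pattern from the Beam documentation, with a hypothetical fetchSchemas() stub standing in for the GitHub fetch:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Latest;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class SchemaSideInput {

  /** Hypothetical stub; the real version would fetch mozilla-pipeline-schemas from GitHub. */
  static Map<String, String> fetchSchemas() {
    return new HashMap<>();
  }

  /** Re-fetch schemas every 5 minutes and expose the latest copy as a singleton side input. */
  public static PCollectionView<Map<String, String>> schemasView(Pipeline pipeline) {
    return pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
          @ProcessElement
          public void processElement(@Element Long tick, OutputReceiver<Map<String, String>> out) {
            out.output(fetchSchemas());
          }
        }))
        .apply(Window.<Map<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(Latest.globally())
        .apply(View.asSingleton());
  }
}
```

Transforms that validate messages would then take schemasView as a side input and always see the most recently fetched copy.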

Support optional doc id and auto-uuid generation for non-legacy doctypes

We're already planning on supporting this for some legacy systems. We might want to also support it for newer systems, since it would probably be low effort, and at least one person whose opinion on data is well respected thought that it was in fact optional.

This would be an update to the edge spec and possibly downstream systems, depending on where the document ID is assigned. It would make for non-idempotent reprocessing of such systems, but that may be acceptable.
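If we go this way, the assignment itself could be as simple as the sketch below; the attribute key is an assumption for illustration, not a confirmed schema field:

```java
import java.util.Map;
import java.util.UUID;

public class DocumentIdDefaults {

  /** Assign a server-side document ID when the submission URI did not include one. */
  public static String ensureDocumentId(Map<String, String> attributes) {
    // "document_id" is an assumed attribute key for illustration.
    return attributes.computeIfAbsent("document_id", key -> UUID.randomUUID().toString());
  }
}
```

Note that generating the ID server-side is exactly what makes reprocessing non-idempotent, as mentioned above.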
