mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform

Home Page: https://mozilla.github.io/gcp-ingestion/

License: Mozilla Public License 2.0

Languages: Dockerfile 0.27%, Python 12.33%, Shell 2.21%, Java 85.16%, Lua 0.02%, HTML 0.01%, Mermaid 0.01%
Topics: telemetry-ingestion, gcp, mozilla-telemetry

gcp-ingestion's Introduction

Telemetry Ingestion on Google Cloud Platform

A monorepo for documentation and implementation of the Mozilla telemetry ingestion system deployed to Google Cloud Platform (GCP).

For more information, see the documentation.

gcp-ingestion's People

Contributors

abdelrahman-ik, acmiyaguchi, akkomar, anich, badboy, benwu, cbguder, chelseatroy, curtismorales, dependabot-preview[bot], dependabot-support, dependabot[bot], edugfilho, fbertsch, jasonthomas, jklukas, kik-kik, lelilia, marlene-m-hirose, mdboom, mikaeld, mkaply, mreid-moz, quiiver, relud, scholtzan, sean-rose, standard8, whd, wlach

gcp-ingestion's Issues

Enable dependency update scanning

Per https://github.com/mozilla-services/foxsec/blob/master/README.mediawiki#Security_Checklist

  • enable security scanning of 3rd-party libraries and dependencies
    • ...
    • For Python, enable pyup security updates:
      • Add a pyup config to your repo (example config: https://github.com/mozilla-services/antenna/blob/master/.pyup.yml)
      • Enable branch protection for master and other development branches. Make sure the approved-mozilla-pyup-configuration team CANNOT push to those branches.
      • From the "add a team" dropdown for your repo /settings page
        • Add the "Approved Mozilla PyUp Configuration" team for your github org (e.g. for mozilla and mozilla-services)
        • Grant it write permission so it can make pull requests
      • notify [email protected] to enable the integration in pyup

Add a test helper for running `main` and validating output on the filesystem

As the project grows and the number of test cases increases, we're going to want to handle the following issues:

  1. Escaped JSON strings in Java source code are difficult to read and modify
  2. We will want to be able to test the option parsing and pipeline building logic of the entry classes' main methods
  3. We will want to test interaction with external datastores like GCS and Pubsub

This issue doesn't address (3), but we can address (1) and (2) by adding a test helper class that can read in newline-delimited JSON documents from Java resource files representing expected input and output, and compare to the results of an invocation of main. This helper would only support running pipelines with input and output types of file.

We'd likely want to use the FileAssert utilities from AssertJ or implement a Hamcrest matcher.

We'd also likely need to add a pipeline option for setting the number of shards (so that tests can set nShards=1) or make our matcher capable of handling output lines being spread across many output files.
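A minimal sketch of the comparison half of such a helper, assuming output is written as newline-delimited files under a temporary directory; the class and method names here are hypothetical, not an existing API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import static org.junit.Assert.assertEquals;

public class MainOutputHelper {

  /**
   * Compare the newline-delimited JSON files under outputDir to the expected lines,
   * ignoring ordering and sharding. Hypothetical helper, for illustration only.
   */
  public static void assertOutputLines(Path outputDir, List<String> expected) throws IOException {
    List<String> actual;
    try (Stream<Path> shards = Files.list(outputDir)) {
      // Concatenate all shards so the assertion works whether nShards is 1 or many.
      actual = shards.flatMap(MainOutputHelper::lines).sorted().collect(Collectors.toList());
    }
    assertEquals(expected.stream().sorted().collect(Collectors.toList()), actual);
  }

  private static Stream<String> lines(Path path) {
    try {
      return Files.readAllLines(path).stream();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test would read the expected lines from a resource file, invoke the entry class's main with file input and output pointed at a temporary directory, and then call assertOutputLines on that directory.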

Add metrics to Sink and Validate

I'm not sure if Beam includes default metrics, but there are a few things we should at least be counting to understand failure modes.

Beam provides an example pipeline that uses Beam Metrics, which we can follow as a guide.

Some things we may want to count:

  • How far we get in extracting location info in Validate; we could increment one counter on every event, another counter after getting a hostname from IP, another after extracting city, etc.
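For reference, a minimal sketch of Beam's Metrics API inside a DoFn; the GeoCityFn class and the counter names are hypothetical, not existing code:

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class GeoCityFn extends DoFn<PubsubMessage, PubsubMessage> {
  // Counters are registered by namespace and name and aggregated by the runner.
  private final Counter messagesSeen = Metrics.counter(GeoCityFn.class, "messages_seen");
  private final Counter citiesExtracted = Metrics.counter(GeoCityFn.class, "geo_city_extracted");

  @ProcessElement
  public void processElement(@Element PubsubMessage message, OutputReceiver<PubsubMessage> out) {
    messagesSeen.inc();
    // ... hypothetical geo lookup; call citiesExtracted.inc() when a city is found ...
    out.output(message);
  }
}
```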

Preserve timezone for BigQuery timestamp inserts

@sunahsuh reports:

I just checked on BigQuery and it appears it'll allow any two-digit offset (technically, anything that fits (+|-)H[H][:M[M]]);
however, it normalizes to UTC and then throws away the timezone info.
As far as I can tell, there's no way to store timezone info in a native time column type.

We should make sure the format in which we send timestamps allows us to understand the original timezone.
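One possible approach, sketched below with java.time: write the UTC-normalized instant into the TIMESTAMP column and carry the original offset in a separate field. The column split and the example value are assumptions, not decisions from this issue.

```java
import java.time.OffsetDateTime;

public class TimestampOffsetExample {
  public static void main(String[] args) {
    // Hypothetical client-supplied timestamp with a non-UTC offset.
    OffsetDateTime parsed = OffsetDateTime.parse("2018-03-12T21:02:18+05:00");
    String utcInstant = parsed.toInstant().toString();   // "2018-03-12T16:02:18Z" for the TIMESTAMP column
    String originalOffset = parsed.getOffset().getId();  // "+05:00", stored in a separate column or attribute
    System.out.println(utcInstant + " " + originalOffset);
  }
}
```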

Fix stdout regression

Since #42, using stdout as the output destination leads to exceptions when actually running the pipeline locally:

java.lang.IllegalArgumentException: unable to serialize DoFnAndMainOutput{doFn=com.mozilla.telemetry.transforms.Foreach$Fn@2f639a92, mainOutputTag=Tag<output>}
...
Caused by: java.io.NotSerializableException: org.fusesource.jansi.AnsiConsole$2

Apparently, Maven loads the jansi library, which replaces System.out. We need to avoid serializing System.out itself and instead serialize a flag, delaying the actual call to System.out.println to the innermost function so that the stream is never serialized.
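A minimal sketch of that shape, using a hypothetical PrintFn: the DoFn serializes only a boolean flag, and the PrintStream is looked up inside processElement.

```java
import org.apache.beam.sdk.transforms.DoFn;

class PrintFn extends DoFn<String, Void> {
  // Serialize a boolean flag rather than a PrintStream field,
  // so the non-serializable stream never becomes part of the DoFn's state.
  private final boolean useStderr;

  PrintFn(boolean useStderr) {
    this.useStderr = useStderr;
  }

  @ProcessElement
  public void processElement(@Element String element) {
    // Resolve the stream at call time; it is never captured during serialization.
    (useStderr ? System.err : System.out).println(element);
  }
}
```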

Add fileformat=base64

Do we have production use cases in mind for text format, or do we expect to always use pubsub JSON? If we do have use cases for text format in production, I'm a bit concerned about text's lossiness (PubsubMessage payloads can contain newlines, which become multiple messages in text format).

Should we add input and output file formats for base64-encoded payloads without the PubsubMessage JSON wrapper?
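A minimal sketch of what base64 framing could look like, assuming one encoded payload per line and no attributes; the class and method names are illustrative:

```java
import java.util.Base64;
import java.util.HashMap;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;

class Base64FileFormat {

  // One base64 line per message; safe even when the raw payload contains newlines.
  static String encode(PubsubMessage message) {
    return Base64.getEncoder().encodeToString(message.getPayload());
  }

  // Rebuild a PubsubMessage from a line; attributes are dropped in this framing.
  static PubsubMessage decode(String line) {
    return new PubsubMessage(Base64.getDecoder().decode(line), new HashMap<>());
  }
}
```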

cc @relud

Add a mechanism to stage BigQuery and Cloud Storage partitions until complete

Problem: backfilling data after GCP ingestion downtime requires updating any derived data sets that may have read the partition before the backfill, including scheduled queries from re:dash.

Solution: for non-streaming tables (in both BigQuery and Cloud Storage), require that partitions be staged and then promoted atomically (or as close to atomic as possible), and document how to suspend promotion.

Implementation: deliver ingestion data to a staging location and use Airflow to schedule operations that promote completed partitions. Before promoting data, the operation should validate that the data is complete within some threshold. Make sure that both data ops and data platform engineers are trained to suspend promotion when data may be incomplete, such as after downtime.
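On the BigQuery side, promotion could be a copy job against a partition decorator, roughly as sketched below; the dataset and table names are hypothetical and the Airflow scheduling around it is omitted.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.CopyJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class PromotePartition {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // Partition decorators ($YYYYMMDD) let a single day be copied in one job.
    TableId staging = TableId.of("staging_dataset", "main_v4$20190101");
    TableId live = TableId.of("live_dataset", "main_v4$20190101");
    Job copy = bigquery.create(JobInfo.of(
        CopyJobConfiguration.newBuilder(live, staging)
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build()));
    copy.waitFor(); // promotion is complete (or has failed) once the copy job finishes
  }
}
```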

Test BigQuery output in ingestion-beam

BigQuery doesn't have an emulator, so I think this will have two parts.

  1. Use a fake HTTP service and hit localhost; this may not work.
  2. Use Cloud Build on restricted branches and test actual BigQuery output; this needs to generate unique table names and reliably clean up all resources after running (see the sketch below).
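A sketch of the table lifecycle for the second approach, assuming JUnit and the google-cloud-bigquery client; the dataset name is hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.TableId;
import java.util.UUID;
import org.junit.After;
import org.junit.Test;

public class BigQueryOutputTest {
  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  // Unique table name per run so concurrent builds don't collide.
  private final TableId table = TableId.of("test_dataset",
      "sink_output_" + UUID.randomUUID().toString().replace("-", "_"));

  @Test
  public void canWriteToBigQuery() {
    // ... run the pipeline with its output pointed at `table` and assert on query results ...
  }

  @After
  public void cleanUp() {
    // Always delete the table, even when the test fails.
    bigquery.delete(table);
  }
}
```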

Ensure that the edge spec covers what to do with "bad" submissions

I just closed Bug 1353076 since we won't do it on the AWS infra, but we should make sure the new edge spec covers:

  • What (if any) submissions are explicitly rejected by the edge
  • What (if any) submissions are accepted but published to topics without any consumers

We are not currently planning to drop anything at the edge, but we should document that. We may need to do so in the future if we see malicious submissions or activity.

Support Legacy Systems

In particular, sslreports and DSMO.

sslreports doesn't follow the URI spec; all requests hit /submit/sslreports. DSMO, in addition to not following the URI spec, also uses GET requests.

We'll probably implement these last as special cases, but the required changes should be minimal.

Fork google_check.xml

google_check.xml is pretty lax about which checks are warnings versus errors, and it's not configured to allow suppressions, which I think we will come to want.

Fix DecoderOptions.parse

An exception occured while executing the Java class. Method [parse] has multiple definitions [public static com.mozilla.telemetry.decoder.DecoderOptions$Parsed com.mozilla.telemetry.decoder.DecoderOptions.parse(com.mozilla.telemetry.decoder.DecoderOptions), public static com.mozilla.telemetry.options.SinkOptions$Parsed com.mozilla.telemetry.options.SinkOptions.parse(com.mozilla.telemetry.options.SinkOptions)] with different return types for [com.mozilla.telemetry.decoder.DecoderOptions]. -> [Help 1]

Consolidate error handling logic

We have several transforms now that catch exceptions and duplicate toError logic to convert the exception and payload into a PubsubMessage. We should factor that out to ensure consistent formatting of error outputs.
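A minimal sketch of a shared helper, with assumed attribute names rather than whatever we eventually settle on:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;

public class FailureMessage {

  /** Wrap the original payload and exception details into a single error PubsubMessage. */
  public static PubsubMessage of(String transformName, byte[] payload, Throwable error) {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("error_type", transformName);
    attributes.put("exception_class", error.getClass().getName());
    attributes.put("stack_trace", stackTrace(error));
    return new PubsubMessage(payload, attributes);
  }

  private static String stackTrace(Throwable error) {
    StringWriter sw = new StringWriter();
    error.printStackTrace(new PrintWriter(sw));
    return sw.toString();
  }
}
```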

Handle permanent failures when flushing ingestion-edge queue

If a transient error occurs before a permanent error, a message that fails permanently may be queued and then retried indefinitely.

We should:

  1. throw out permanently failing messages from the queue
    • or upload them to s3
  2. check that we surface permanent failures before queuing where reasonable

Periodically update schemas from mozilla-pipeline-schemas

Currently, we fetch mozilla-pipeline-schemas from GitHub at ingestion-beam build time and include the content in the jar. This means we can only get schema updates by rebuilding, draining the existing Dataflow job, and instantiating a new job with the new code.

We could try to spin up a periodic task (every 5 minutes, perhaps?) to get the latest content from GitHub and update the collection of schemas. Perhaps this could be expressed nicely as a side input?
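A minimal sketch of the slowly updating global-window side input pattern from the Beam documentation, with a hypothetical fetchSchemas() stub standing in for the GitHub fetch:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Latest;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;

public class SchemaSideInput {

  /** Hypothetical stub; the real version would fetch mozilla-pipeline-schemas from GitHub. */
  static Map<String, String> fetchSchemas() {
    return new HashMap<>();
  }

  /** Re-fetch schemas every 5 minutes and expose the latest copy as a singleton side input. */
  public static PCollectionView<Map<String, String>> schemasView(Pipeline pipeline) {
    return pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(5)))
        .apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
          @ProcessElement
          public void processElement(@Element Long tick, OutputReceiver<Map<String, String>> out) {
            out.output(fetchSchemas());
          }
        }))
        .apply(Window.<Map<String, String>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(Latest.globally())
        .apply(View.asSingleton());
  }
}
```

Transforms that validate messages would then take schemasView as a side input and always see the most recently fetched copy.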

Support optional doc id and auto-uuid generation for non-legacy doctypes

We're already planning on supporting this for some legacy systems. We might want to also support it for newer systems, since it would probably be low effort, and at least one person whose opinion on data is well respected thought that it was in fact optional.

This would be an update to the edge spec and possibly downstream systems, depending on where the document ID is assigned. It would make for non-idempotent reprocessing of such systems, but that may be acceptable.
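If we go this way, the assignment itself could be as simple as the sketch below; the attribute key is an assumption for illustration, not a confirmed schema field:

```java
import java.util.Map;
import java.util.UUID;

public class DocumentIdDefaults {

  /** Assign a server-side document ID when the submission URI did not include one. */
  public static String ensureDocumentId(Map<String, String> attributes) {
    // "document_id" is an assumed attribute key for illustration.
    return attributes.computeIfAbsent("document_id", key -> UUID.randomUUID().toString());
  }
}
```

Note that generating the ID server-side is exactly what makes reprocessing non-idempotent, as mentioned above.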
