mchmarny / pubsub-to-bigquery-pump

License: Apache License 2.0



Drain PubSub topic messages to BigQuery table

A simple utility that combines Cloud Run and Stackdriver metrics to drain JSON messages from a PubSub topic into a BigQuery table.

Prerequisites

If you don't have one already, start by creating a new project and configuring the Google Cloud SDK. Similarly, if you have not done so already, you will have to set up Cloud Run.

How to Use It

The quickest way to deploy this service is outlined in the steps below.

Configuration

The bin/config file includes many parameters, but the only ones you have to change are:

  • TOPIC_NAME - the name of the PubSub topic from which you want to drain messages into BigQuery
  • DATASET_NAME - name of an existing BigQuery dataset in your project
  • TABLE_NAME - name of an existing BigQuery table that resides in the above-defined dataset
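In bin/config these settings are plain shell variables. A minimal sketch (the variable names come from the list above; the values are placeholders, not real resources):

```shell
# bin/config (excerpt) -- the only three values you must change.
export TOPIC_NAME="my-events-topic"   # PubSub topic to drain
export DATASET_NAME="my_dataset"      # existing BigQuery dataset
export TABLE_NAME="my_table"          # existing table in that dataset
```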

Note: this service assumes your BigQuery table schema matches the names of the JSON message fields. Column names are not case-sensitive, and JSON fields not present in the table will be ignored. You can use this service to generate a BigQuery schema from a single JSON message on your PubSub topic.
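To illustrate the matching rules, consider a hypothetical message (field names invented for this example):

```json
{"device_id": "d-001", "temp": 21.5, "debug_note": "ignored"}
```

A table with columns `Device_ID` (STRING) and `Temp` (FLOAT) would receive the first two fields, since matching is case-insensitive; `debug_note`, which has no corresponding column, would be dropped.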

This service also uses two trigger metrics on the topic to decide when to batch-insert messages into the BigQuery table: the age of the oldest unacknowledged message (TOPIC_MAX_MESSAGE_AGE) and the maximum number of still-undelivered messages (TOPIC_MAX_MESSAGE_COUNT). There is some delay, but essentially as soon as either threshold is reached, the service is triggered.
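The trigger decision can be sketched roughly as follows (a simplified illustration in Go; the function and parameter names are assumptions for this sketch, not the service's actual code):

```go
package main

import "fmt"

// shouldTrigger reports whether a batch insert should run, based on the two
// metric values described above. The thresholds mirror the
// TOPIC_MAX_MESSAGE_AGE (seconds) and TOPIC_MAX_MESSAGE_COUNT settings:
// crossing either one triggers the drain.
func shouldTrigger(oldestUnackedAgeSec, undeliveredCount, maxAgeSec, maxCount int64) bool {
	return oldestUnackedAgeSec >= maxAgeSec || undeliveredCount >= maxCount
}

func main() {
	// A 15-minute-old unacked message against a 10-minute threshold: trigger.
	fmt.Println(shouldTrigger(900, 50, 600, 1000)) // true
	// Young messages, small backlog: no trigger yet.
	fmt.Println(shouldTrigger(30, 50, 600, 1000)) // false
}
```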

Additional parameters are set to reasonable defaults; change them as needed. I've provided comments for each to help you set them to optimal values for your use case.

Why Custom Service

Google Cloud already has an easy approach to draining your PubSub messages into BigQuery. Using the provided Dataflow template, you can create a job that will consistently and reliably stream your messages into BigQuery:

gcloud dataflow jobs run $JOB_NAME --region us-west1 \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --parameters "inputTopic=projects/${PROJECT}/topics/${TOPIC},outputTableSpec=${PROJECT}:${DATASET}.${TABLE}"

This approach solves many of the common issues related to back pressure, retries, and individual insert quota limits. If you are dealing with a constant stream of messages, or need your PubSub messages drained into BigQuery immediately, this is your best option.

The one downside of that approach is that, behind the scenes, Dataflow deploys VMs. While the machine types and the number of VMs are configurable, there will always be at least one VM. That means that, whether there are messages to process or not, you always pay for VMs.

However, if your message flow is infrequent, or you don't mind messages being written in scheduled batches, you can avoid that cost by using this service.

Building Image

The service uses a pre-built image from a public image repository, gcr.io/cloudylabs-public/pubsub-to-bigquery-pump. If you prefer to build your own image, you can submit a build to the Cloud Build service using the included Dockerfile; the result is a versioned, non-root container image URI, which will then be used to deploy your service to Cloud Run.

bin/image
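Under the hood, bin/image amounts to something like the following (a sketch only; the project ID and version tag are placeholders, and the script's exact flags may differ):

```shell
# Build and push a versioned image with Cloud Build.
PROJECT_ID=$(gcloud config get-value project)
gcloud builds submit \
  --tag "gcr.io/${PROJECT_ID}/pubsub-to-bigquery-pump:0.1.1" .
```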

Cleanup

To clean up all resources created by this sample, execute:

bin/cleanup

Disclaimer

This is my personal project and it does not represent my employer. I take no responsibility for issues caused by this code. I do my best to ensure that everything works, but if something goes wrong, my apologies is all you will get.

