Implementation of a Fully Serverless Twitter-Based Recommender System Using GCP

This repository is used as source code for the medium post about implementing a Twitter recommender system using GCP.

We used five main tools from Google Cloud Platform: Dataflow, Dataproc, AppEngine, BigQuery, and Cloud Storage.

AppEngine (GAE)

Basically, it all starts in the gae folder, where you will find the yaml definitions for both of our environments, standard and flexible. These are the yamls for the standard environment:

  1. cron.yaml: defines the cron executions and their respective times to run.
  2. queue.yaml: defines the rate at which queued tasks are executed.
  3. main.yaml: this is our main service; it receives the cron requests and builds scheduled tasks that are put into the queue. In this project, we worked with push queues.
  4. worker.yaml: this service executes our tasks in the background. The cron triggers main.py, which in turn enqueues the tasks that the push queue delivers to worker.py (a minimal sketch of this flow follows the list).
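The push-queue flow described in items 3 and 4 can be illustrated with a minimal sketch; the webapp2 handler, route, and queue names below are assumptions, not the repository's actual code:

# Sketch of the cron -> main -> push queue -> worker flow on GAE standard.
# Route and queue names are illustrative assumptions.
import webapp2
from google.appengine.api import taskqueue

class RunJobHandler(webapp2.RequestHandler):
    def get(self):
        # enqueue a push task; the queue then POSTs it to the worker service
        taskqueue.add(
            url='/export_customers',  # route handled by worker.py
            target='worker',          # service defined in worker.yaml
            queue_name='default',     # rate defined in queue.yaml
        )

app = webapp2.WSGIApplication([('/run_job/export_customers', RunJobHandler)])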

And finally, this is our only service defined in the flexible environment, since it uses Cython to speed up the computation of recommendations:

  1. recommender.yaml: Makes final recommendations for customers.

We have two distinct requirements files for GAE; requirements.txt targets the flexible deployment. Notice also that we have a config_template.py that should be used as a guide to create the file /gae/config.py.
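For illustration, this is roughly what /gae/config.py could look like; every key and value below is a hypothetical placeholder, so follow whatever config_template.py actually defines:

# Hypothetical /gae/config.py; names and values are placeholders, not the
# actual fields from config_template.py.
config = {
    'project_id': 'my-gcp-project',  # your GCP project (assumption)
    'bucket': 'my-bucket',           # GCS bucket for exports (assumption)
    'dataset': 'twitter',            # BigQuery dataset (assumption)
}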

Here we have the following available crons:

  • Export of customer data from BigQuery to Google Cloud Storage (GCS), defined under the route /export_customers in worker.py (see the sketch after this list).
  • Creation of the Dataproc cluster, execution of the DIMSUM algorithm, deletion of the cluster, and initialization of the Dataflow pipeline, in the route /dataproc_dimsum, also in worker.py.
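As a rough illustration of what the /export_customers route does, here is a hedged sketch using the google-cloud-bigquery client; the project, dataset, table, and bucket names are assumptions, not the repository's actual code:

# Sketch: export a BigQuery table to GCS, conceptually what /export_customers does.
from google.cloud import bigquery

client = bigquery.Client(project='my-gcp-project')        # assumption
table_ref = client.dataset('twitter').table('customers')  # assumption
extract_job = client.extract_table(
    table_ref,
    'gs://my-bucket/customers/customers-*.gz',            # assumption
    job_config=bigquery.ExtractJobConfig(compression='GZIP'),
)
extract_job.result()  # block until the export job finishes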

To deploy these files, just run:

gcloud app deploy app.yaml worker.yaml
gcloud app deploy cron.yaml
gcloud app deploy queue.yaml
gcloud app deploy recommender.yaml

Unit Tests for GAE

Running the unit tests in this project is quite simple; just install nox by running:

pip install nox-automation

We have both general unit tests and system tests (which make real connections to BigQuery and other services).

To run the regular unit tests, just run:

nox -s unit_gae

And system tests:

nox -s system_gae

  • The unit tests require gcloud to be installed in /gcloud-sdk/ to simulate the AppEngine server locally.
  • The system tests require a service key with Editor access to BigQuery and GCS.
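The session names above map to entries in the repository's nox configuration; as a minimal sketch, a nox-automation style nox.py could define them like this (test paths and dependencies are assumptions):

# Hypothetical nox.py sessions (nox-automation style); paths and dependencies
# are assumptions, not the repository's actual setup.
def session_unit_gae(session):
    session.install('pytest')
    session.run('pytest', 'gae/tests/unit')    # assumed test path

def session_system_gae(session):
    session.install('pytest', 'google-cloud-bigquery')
    session.run('pytest', 'gae/tests/system')  # assumed test path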

Dataproc

Right after the cron that exports data from BigQuery to GCS is executed, the Dataproc cron starts. The main subfolder in /dataproc is /jobs, where we have three main jobs:

  1. naive.py: this is the naive implementation, with O(mL^2) complexity (see the sketch after this list).
  2. df_naive.py: this was an attempt to implement the naive approach using Spark DataFrames. It failed ;)...
  3. dimsum.py: DIMSUM implementation in PySpark, following the work of Reza Zadeh.
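To make the O(mL^2) cost concrete, here is a hedged sketch of the naive all-pairs cosine similarity on binary ratings; the input path and line format are assumptions, not the actual job's interface:

# Naive all-pairs cosine similarity: each of the m user rows emits every
# pair of its up-to-L items, hence O(m * L^2). Assumed input format:
# one line per user, item ids separated by spaces.
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext()
rows = sc.textFile('gs://my-bucket/customers/*.gz').map(lambda line: line.split())

# dot products: co-occurrence counts of every item pair within a user row
dots = (rows
        .flatMap(lambda items: [((a, b), 1)
                                for a, b in combinations(sorted(set(items)), 2)])
        .reduceByKey(lambda x, y: x + y))

# squared column norms: with binary ratings, just the per-item user counts
norms = (rows.flatMap(lambda items: [(i, 1) for i in set(items)])
             .reduceByKey(lambda x, y: x + y)
             .collectAsMap())
norms_b = sc.broadcast(norms)

# cosine(a, b) = dot(a, b) / (||a|| * ||b||)
sims = dots.map(lambda kv: (kv[0], kv[1] /
                ((norms_b.value[kv[0][0]] * norms_b.value[kv[0][1]]) ** 0.5)))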

There's no config file here, as the setup is received as input from the cron job and the config in the GAE folder. For instance (as in the cron file):

/run_job/run_dimsum/?url=/dataproc_dimsum&target=worker&extended_args=--days_init=30,--days_end=1,--threshold=0.1&force=no

Important note: threshold is equivalent to the inverse of the "gamma" value discussed in Reza's paper; it sets a cutoff above which every similarity is guaranteed to converge with a relative error bound of 20%.
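To illustrate how such a request is interpreted, here is a small sketch of splitting extended_args into Spark job flags; the exact parameter handling in the repository may differ:

# Hypothetical parsing of the cron request's extended_args parameter.
from urllib.parse import parse_qs, urlparse

url = ('/run_job/run_dimsum/?url=/dataproc_dimsum&target=worker'
       '&extended_args=--days_init=30,--days_end=1,--threshold=0.1&force=no')
params = parse_qs(urlparse(url).query)
job_args = params['extended_args'][0].split(',')
print(job_args)  # ['--days_init=30', '--days_end=1', '--threshold=0.1']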

Unit Tests for Dataproc

Running the unit tests requires a local Spark cluster; the jupyter/pyspark-notebook Docker image is recommended. Just run it like:

docker run -it --rm -p 8888:8888 -e GRANT_SUDO=yes --user root jupyter/pyspark-notebook bash

Once there, you can clone the repository, install nox, and run the tests with the command:

nox -s system_dataproc

Dataflow

After the Dataproc job completes, a task is scheduled through the URL /prepare_datastore, which expects a Dataflow template to be available at the path specified in gae/config.py.

To create this template, make sure to install apache-beam by running:

pip install apache-beam
pip install apache-beam[gcp]

After that, make sure to create the file /dataflow/config.py, using /dataflow/config_template.py as a guideline, with values for your GCP project.

With all this done, just run:

python build_datastore_template.py

and the template will be saved where the config file specifies.
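For orientation, here is a minimal sketch of how a Python Beam pipeline becomes a template: passing template_location makes the Dataflow runner stage the pipeline to GCS instead of executing it. The project, bucket, paths, and read step below are all assumptions, not the script's real contents:

# Sketch of a template build; setting template_location stages the pipeline
# to GCS instead of running it. All names below are assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',                                        # assumption
    temp_location='gs://my-bucket/tmp',                              # assumption
    template_location='gs://my-bucket/templates/prepare_datastore',  # assumption
)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadSimilarities' >> beam.io.ReadFromText('gs://my-bucket/dimsum/part-*')  # assumption
     | 'ParseLine' >> beam.Map(lambda line: line.split(','))
     # the real job would build Datastore entities here and write them with
     # Beam's Datastore connector
    )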

Unit Tests for Dataflow

Just make sure to have nox installed and run:

nox -s system_dataflow

Having a service account is also required.
