
Content prioritisation data pipeline

This is a Python project that collects data from various sources and sends it to BigQuery. A mini data pipeline type of thing.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project.

If you plan on contributing to the project, please read the Code of conduct for this project and the Contribution guidelines for this project.

Prerequisites

You will need a Google Cloud account, the Google Cloud SDK and Docker. Make sure you have gcloud installed, then run gcloud auth configure-docker

Environments

Installing locally

To use a local development environment you will have to download a new service account keyfile that has read permission to Google Cloud Storage. You will also have to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the location of that keyfile, e.g. export GOOGLE_APPLICATION_CREDENTIALS=/path/to/file.json
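The Google client libraries pick up GOOGLE_APPLICATION_CREDENTIALS on their own, but checking it before the pipeline starts gives a clearer error. This helper is a hypothetical sketch, not part of the repository:

```python
import os

def check_credentials():
    """Fail fast if GOOGLE_APPLICATION_CREDENTIALS is unset or points
    at a missing file. Illustrative helper, not part of this repo."""
    keyfile = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not keyfile:
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set; export it "
            "to the path of your service account keyfile"
        )
    if not os.path.isfile(keyfile):
        raise RuntimeError(f"Keyfile not found: {keyfile}")
    return keyfile
```

Run once at start-up so a bad local setup fails with a readable message instead of a deep client-library traceback.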

Hosted environment on Google Cloud Run

A Dockerfile is used to define the hosted environment on Google Cloud Run.

The Dockerfile details all the required environment variables:

gcp_project - the Google Cloud project

bq_dataset - the dataset to send data to

advisernet_ga - used with ga_data.py to get GA data for Advisernet

public_ga - used with ga_data.py to get GA data for the Public site

all_ga - used with ga_data.py to get GA data for all sites

The contents of folders creds and store will not be committed to git or included in the Docker image. The intention is that creds can be used to locally store credential files and store can be used as a local store for data files.
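The variable names above come from the Dockerfile; how the code consumes them isn't shown here, but a minimal loader might look like the following. The load_config function and the dict it returns are illustrative, not part of the repository:

```python
import os

# Names taken from the Dockerfile's environment variables.
REQUIRED_VARS = ["gcp_project", "bq_dataset", "advisernet_ga", "public_ga", "all_ga"]

def load_config(env=os.environ):
    """Collect the pipeline's required settings, failing with a single
    readable error if any are missing. Illustrative sketch only."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing on all missing variables at once, rather than one at a time, saves repeated deploy-and-crash cycles on Cloud Run.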

Deployment to Google Cloud Run

Deployment is handled via the Makefile:

make build - Builds the image on Google Container Registry

make deploy - Deploys the image to Google Cloud Run

make dev-build - Builds a development image on Google Container Registry

make dev-deploy - Deploys the development image, overriding the BQ dataset env variable so data is written to test tables rather than the production tables
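The targets above roughly wrap gcloud commands; a hand-run equivalent might look like this. The project, image, service, region, and dataset names are placeholders, and the exact flags in the Makefile may differ:

```shell
# Build the image with Cloud Build and push it to the registry
# (placeholder project/image names -- the Makefile defines the real ones)
gcloud builds submit --tag gcr.io/my-project/cj-data

# Deploy that image to Cloud Run
gcloud run deploy cj-data \
  --image gcr.io/my-project/cj-data \
  --region europe-west1

# Development deploy: override the BQ dataset env var so writes
# go to test tables instead of production
gcloud run deploy cj-data-dev \
  --image gcr.io/my-project/cj-data \
  --set-env-vars bq_dataset=test_dataset
```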

The code

This section will explain how it all works, but it's yet to be written.

Authors

Ian Ansell - Initial work - Nyzl

See also the list of contributors who participated in this project.

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details

Acknowledgments

Alec Johnson for helping with the alpha of this codebase and for being a general sounding board throughout the development. Daniel Nissenbaum for help getting the code and documentation into something approaching maintainable.


cj-data's Issues

move GA profile ids out of env vars

Is your feature request related to a problem? Please describe.
Currently the profile ID is hard-coded in the env vars. This makes changes cumbersome, and all report parameters should live in one place, the report_list.

Describe the solution you'd like
Add the profile ID to the source_kwargs and have ga_data.py read it from there instead of the env vars.
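A sketch of the proposed change, with hypothetical report names and profile IDs (the real shape of report_list isn't shown in this README):

```python
# Hypothetical shape for report_list after the change: each report
# carries its own GA profile ID in source_kwargs, so ga_data.py no
# longer reads advisernet_ga / public_ga / all_ga from the environment.
report_list = [
    {"name": "advisernet", "source_kwargs": {"profile_id": "11111111"}},
    {"name": "public", "source_kwargs": {"profile_id": "22222222"}},
]

def profile_id_for(report):
    # ga_data.py would look the ID up on the report definition
    return report["source_kwargs"]["profile_id"]
```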

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.