
Content prioritisation data pipeline

This is a Python project that collects data from various sources and sends it to BigQuery. A mini data pipeline type of thing.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project.

If you plan on contributing to the project, please read the Code of conduct for this project and the Contribution guidelines for this project.

Prerequisites

You will need a Google Cloud account, the Google Cloud SDK and Docker. Make sure you have gcloud installed, then run gcloud auth configure-docker

Environments

Installing locally

To use a local development environment you will have to download a new service account keyfile that has read permission to Google Cloud Storage. You will also have to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the location of that keyfile, e.g. export GOOGLE_APPLICATION_CREDENTIALS=/path/to/file.json
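The Google client libraries pick up GOOGLE_APPLICATION_CREDENTIALS on their own, but checking it before the pipeline starts gives a clearer error. This helper is a hypothetical sketch, not part of the repository:

```python
import os

def check_credentials():
    """Fail fast if GOOGLE_APPLICATION_CREDENTIALS is unset or points
    at a missing file. Illustrative helper, not part of this repo."""
    keyfile = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not keyfile:
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set; export it "
            "to the path of your service account keyfile"
        )
    if not os.path.isfile(keyfile):
        raise RuntimeError(f"Keyfile not found: {keyfile}")
    return keyfile
```

Run once at start-up so a bad local setup fails with a readable message instead of a deep client-library traceback.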

Hosted environment on Google Cloud Run

A Dockerfile is used to define the hosted environment on Google Cloud Run.

The Dockerfile details all the required environment variables:

gcp_project - the Google Cloud project

bq_dataset - the dataset to send data to

advisernet_ga - used with ga_data.py to get GA data for Advisernet

public_ga - used with ga_data.py to get GA data for the Public site

all_ga - used with ga_data.py to get GA data for all sites

The contents of folders creds and store will not be committed to git or included in the Docker image. The intention is that creds can be used to locally store credential files and store can be used as a local store for data files.
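The variable names above come from the Dockerfile; how the code consumes them isn't shown here, but a minimal loader might look like the following. The load_config function and the dict it returns are illustrative, not part of the repository:

```python
import os

# Names taken from the Dockerfile's environment variables.
REQUIRED_VARS = ["gcp_project", "bq_dataset", "advisernet_ga", "public_ga", "all_ga"]

def load_config(env=os.environ):
    """Collect the pipeline's required settings, failing with a single
    readable error if any are missing. Illustrative sketch only."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing on all missing variables at once, rather than one at a time, saves repeated deploy-and-crash cycles on Cloud Run.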

Deployment to Google Cloud Run

Deployment is handled via the Makefile:

make build - Builds the image on Google Container Registry

make deploy - Deploys the image to Google Cloud Run

make dev-build - Builds a development image on Google Container Registry

make dev-deploy - Deploys the development image, overriding the BQ dataset env variable so data is written to test tables rather than the production tables
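The targets above roughly wrap gcloud commands; a hand-run equivalent might look like this. The project, image, service, region, and dataset names are placeholders, and the exact flags in the Makefile may differ:

```shell
# Build the image with Cloud Build and push it to the registry
# (placeholder project/image names -- the Makefile defines the real ones)
gcloud builds submit --tag gcr.io/my-project/cj-data

# Deploy that image to Cloud Run
gcloud run deploy cj-data \
  --image gcr.io/my-project/cj-data \
  --region europe-west1

# Development deploy: override the BQ dataset env var so writes
# go to test tables instead of production
gcloud run deploy cj-data-dev \
  --image gcr.io/my-project/cj-data \
  --set-env-vars bq_dataset=test_dataset
```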

The code

This section will explain how it all works, but it's yet to be written.

Authors

Ian Ansell - Initial work - Nyzl

See also the list of contributors who participated in this project.

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details

Acknowledgments

Alec Johnson for helping with the alpha of this codebase and for being a general sounding board throughout the development. Daniel Nissenbaum for help getting the code and documentation into something approaching maintainable.


cj-data's Issues

move GA profile ids out of env vars

Is your feature request related to a problem? Please describe.
Currently the profile ID is hard-coded in the env vars. This makes changes cumbersome, and all report parameters should live in one place, the report_list.

Describe the solution you'd like
Add the profile ID to the source_kwargs and have ga_data.py read it from there instead of the env vars.
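A sketch of the proposed change, with hypothetical report names and profile IDs (the real shape of report_list isn't shown in this README):

```python
# Hypothetical shape for report_list after the change: each report
# carries its own GA profile ID in source_kwargs, so ga_data.py no
# longer reads advisernet_ga / public_ga / all_ga from the environment.
report_list = [
    {"name": "advisernet", "source_kwargs": {"profile_id": "11111111"}},
    {"name": "public", "source_kwargs": {"profile_id": "22222222"}},
]

def profile_id_for(report):
    # ga_data.py would look the ID up on the report definition
    return report["source_kwargs"]["profile_id"]
```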

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.