Giter Site home page Giter Site logo

political-spending-infographic's Introduction

political-spending-infographic

An assignment for the Data Science module as a part of my MSc in Statistics. This provides a modern, fully extensible and portable modern data stack-in-a-box proof-of-concept / template for producing consistent, reproducible, and principled data workflows, namely for the production of reports, charts, and infographics.

The project is setup with one example infographic to generate which shows some charts relating to the US Federal Senate elections in 1980.

To generate the infographic, we suggest running it through docker (via the Dockerfile) or, if on VSCode, by opening the project in a Dev Container. Once in the container, the infographic can be run with just make. (Note: Installation of the arrow package can take a while; it's not unexpected for image building to take over 10 minutes as a result since it compiles it from source!).

Alternatively, if you create a python virtual environment (and maybe you'd like to do the same for R), if you manually install in R the tidyverse, arrow, here, and ggthemes packages, and manually install in python meltano, then you should be able to just run make from the root of the project directory also (you can also install meltano via pipx if preferred)!

Stack

We follow an approach akin to that shown in Modern data stack-in-a-box. That is, we use the following tools:

  • Meltano: ELT orchestrator.
  • DBT: Transformation orchestrator.
  • DuckDB: Local analytical database.
  • R: Analysis and infographic generation (via ggplot2).

I recommend consulting each of the above's (in particular first 2) documentation to get a gist for what they do and how so that the project structure isn't too foreign!

Structure

The structure of the project mirrors both that of the above article (essentially a standard Meltano project) and the recommended DBT best-practices. Each of the below has their own README, but roughly their contents are as follows:

  • analyze: Where analysis tools and scripts live.
  • data: Where non-extract data lives.
  • extract: Where data is extracted to and from.
  • output: Where analysis outputs live.
  • plugins: Meltano plugins directory.
  • transform: Where the DBT project lives. The project is fully documented with several data tests. Once in a container (or otherwise) and having built the project, you can run dbt docs generate && dbt docs serve to generate a full documentation site with a complete DAG representation.

Style

To enforce code-style and standards, we use:

If using as a template adapt the associated . files as required! We follow the SQL style as recommended by DBT.

Data

Potential improvements

  • We could replace meltano and just rely on the duckdb adapter for dbt for reading in extracted files. The downside to this is the loss of flexibility that meltano gives in being able to handle different data-sources. Furthermore, if you wanted to setup further orchestration (e.g. scheduled refreshes, etc.), then meltano provides an immediate solution, giving better extensibility.
  • We should have specified a schema for the CSVs. This was something I discovered too late but this substantially improves ingestion speeds.
  • We could easily add on evidence (hence my preference meltano) or some other BI tool at the end of the pipeline for more interactive exploration.
  • We should have used the SQLite3 database on DIME; the issue is that this database is nearly 1 TB in size which didn't fit on my local machine during development. It would be worth exploring cloud/remote-server approaches that could be taken, especially considering this project is already dockerized.
  • There's some bug with the duckdb R package which was preventing directly reading from the database in the analysis script, hence why we materialized the analysis models to .parquet files and rely on arrow. It would be extremely nice to not have to do this, and just use dbplyr (or otherwise) to directly query the analysis tables from the duckdb database for use in the plots, especially for later years where the data volume can become challenging to load fully into memory.

political-spending-infographic's People

Contributors

bd3dowling avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.