iFood Data Architect Test

This is my solution to the iFood Test, whose goal is to create a prototype data lake from JSON and CSV files; raw and trusted layers were required. The solution runs locally inside a Docker container with PySpark and all further necessary dependencies. The data is ingested, wrangled, processed, and finally exported as Parquet files. Partitioning was done according to the test requirements.
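As an illustration of that flow, the minimal PySpark sketch below ingests one JSON source and one CSV source and writes a partitioned Parquet output. The file paths, column names, and partition column are assumptions for illustration only; the real ones live in the notebooks and the src library.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-layer").getOrCreate()

# Ingest the semi-structured and flat sources (paths are illustrative).
orders = spark.read.json("data/source/order.json.gz")
consumers = spark.read.csv("data/source/consumer.csv.gz", header=True)

# Derive an illustrative partition column and export to the raw layer as Parquet.
orders = orders.withColumn("order_created_day", F.to_date("order_created_at"))
orders.write.mode("overwrite").partitionBy("order_created_day").parquet("data/raw/orders")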

The solution is split into two parts:

  1. Development: Jupyter Notebooks with a development walkthrough can be found here.
  2. Final script: main.py can be found here

The raw data had some duplicated records, which I decided to drop after inspecting a sample of them and confirming it was safe to do so. Data validation was accomplished by casting data types after manually reviewing each column. For cases where I was unsure, the data was left as string in order to avoid possible crashes. Regarding anonymization, I simply dropped all sensitive data columns, since their owners (customers and merchants) can still be identified via their unique IDs.
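The sketch below shows the shape of those cleaning steps on an orders DataFrame; the column names are assumptions for illustration, not the repo's actual schema.

from pyspark.sql import DataFrame, functions as F

def to_trusted(orders: DataFrame) -> DataFrame:
    """Dedup, validate, and anonymize a raw orders DataFrame (illustrative columns)."""
    return (
        orders
        .dropDuplicates(["order_id"])                            # drop duplicated records
        .withColumn("order_total_amount",                        # validate by casting;
                    F.col("order_total_amount").cast("double"))  # unclear columns stay string
        .drop("customer_name", "customer_phone_number")          # anonymize: unique IDs remain
    )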

Regarding data persistence, all storage is done on the mapped docker-compose volume, in this case locally at the relative path ./dev/docker-volume inside the directory where the repo was cloned - see below. Ideally the data should be made available on a shared storage file system, so multiple teams could have access to it. For simplicity, it is left as stated here.

volumes:
    - ./dev/docker-volume:/home/jovyan

The complete solution was run on my local laptop, which is why the Spark session has modest configurations. Once the final application script gets deployed to a proper development environment, such as the suggested Databricks, it should scale accordingly.
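For reference, a laptop-sized session could look like the sketch below; the exact values are illustrative rather than the repo's actual settings, and on Databricks the same code would simply pick up the cluster's configuration instead.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                           # run on all local cores
    .config("spark.driver.memory", "4g")          # modest, laptop-sized driver
    .config("spark.sql.shuffle.partitions", "8")  # small shuffle for small data
    .appName("ifood-datalake")
    .getOrCreate()
)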

How to Run

Requirements

  • docker >= 19.03.9
  • docker-compose >= 1.25.0

Step-by-step

  • Place your AWS credentials into ./dev/docker-volume/.aws/secrets.json according to the following format (a sketch of how they can be consumed follows this list):
{
    "PUBLIC_KEY": "<PUBLIC_KEY>",
    "SECRET_KEY": "<SECRET_KEY>"
}
  • Run docker-compose at the root directory:
docker-compose up --build
  • Access http://localhost:8888 in your browser

  • Then choose whether to run the production script or the development notebooks:

  1. Production: Open a terminal window within Jupyter Notebook and run python main.py.
  2. Development: Run the notebooks.
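As referenced in the first step above, the sketch below shows one way the credentials file could be loaded and handed to Spark for S3 access through hadoop-aws. The repo's src library may wire this up differently, so treat it as an assumption-laden illustration.

import json
from pyspark.sql import SparkSession

# Inside the container, ./dev/docker-volume is mounted at /home/jovyan (see volumes above).
with open("/home/jovyan/.aws/secrets.json") as f:
    secrets = json.load(f)

spark = SparkSession.builder.getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", secrets["PUBLIC_KEY"])
hadoop_conf.set("fs.s3a.secret.key", secrets["SECRET_KEY"])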

Test Scope

Please find the code challenge here.

Folder Organization

├── README.md                   <- The top-level README.
│
├── media                       <- Folder for storing media such as images.
│
├── dev                         <- Development folder where application scripts are stored.
│   ├── docker-volume           <- Shared volume with docker container.
│   │   ├── credentials         <- Folder for storing credentials; the only one required so far is the AWS account.
│   │   ├── data                <- Data for the file system; this is where the raw and trusted layers reside.
│   │   ├── notebooks           <- Development notebooks with a walkthrough of the development phase.
│   │   ├── src                 <- Custom library for storing required code.
│   │   └── main.py             <- Main application to run full pipeline from single script.
│   │
│   ├── Dockerfile              <- Defines docker container and installs dependencies.
│   ├── requirements.txt        <- Stores Python required libraries.
│   └── setup.py                <- Installs the `src` custom library.
│
├── .gitignore                  <- Keeps unnecessary files out of the repo.
│
├── docker-compose.yml          <- Docker-compose file for running docker containers with environment specs.
│
└── TestScope.md                <- Clone of original test scope in case the original repo gets deleted.
