MoMA data pipeline

The objective of this project is to build a pipeline for the dataset of MoMA's collection. The pipeline cleans the data and generates a visualization based on a problem defined beforehand.

The original dataset can be found here.

Problem definition

I am working at MoMA and would like to create a new temporary exhibition of artworks grouped by decade. The room I will use has limited capacity, so I need to know the distribution of the artworks by decade in order to organize this new project.

Q: What is the distribution of MoMA's collection by decade?

Process

To reach the final output I need:

  • An accurate date of creation for each artwork
  • A minimal set of accurate columns and data, in case further information is needed
  • A new Date Range column to sort the artworks into bins by decade
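The decade binning can be sketched with pandas like this (the column names and years are illustrative assumptions, not the actual dataset schema):

```python
import pandas as pd

# Hypothetical sample: one cleaned creation year per artwork
df = pd.DataFrame({"Date": [1901, 1934, 1956, 1958, 1999]})

# Decade edges covering the data, then bin with pd.cut;
# right=False makes each interval [1900, 1910), [1910, 1920), ...
decades = range(1900, 2011, 10)
labels = [f"{d}s" for d in decades[:-1]]
df["Date Range"] = pd.cut(df["Date"], bins=list(decades), labels=labels, right=False)
print(df["Date Range"].value_counts().sort_index())
```

Counting the values of that column gives the distribution per decade directly.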

Steps used in the data pipeline:

*(figure: moma data-pipeline process)*

Tools and way of working: I first did data exploration in a Jupyter notebook in order to test functions and pandas concepts. Then I created a function for each step in a Python file.

Python concepts used in the pipeline

Libraries used:

  • pandas
  • numpy
  • regex
  • matplotlib
  • seaborn

Concepts used:

  • drop_duplicates() and drop() to remove useless columns and rows
  • regex search() to extract date values and clean them
  • apply() to apply a function to a DataFrame
  • groupby() & agg()
  • nunique(), value_counts() and isna() to inspect values and missing data
  • loc[] and working with DataFrame indexes and columns
  • fillna() to fill NaN values with the guessed ones
  • pandas cut() function to create bins
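The regex and apply() steps above can be sketched roughly as follows (the `Date` strings and the `extract_year` helper are assumptions for illustration; the real MoMA dates are messier):

```python
import re
import numpy as np
import pandas as pd

# Hypothetical raw date strings as they might appear in the source CSV
df = pd.DataFrame({"Date": ["1896", "c. 1917", "1903-1905", "n.d."]})

def extract_year(value):
    """Grab the first four-digit year from a messy date string, else NaN."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else np.nan

# apply() runs the cleaning function on every row of the column
df["Year"] = df["Date"].apply(extract_year)
```

Because some rows yield NaN, the resulting column is float rather than int, which is exactly the kind of dtype issue mentioned in the Obstacles section.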

Results

By executing the Python file moma-pipeline.py you will generate a .png visualization answering the question asked in the Problem definition section.

*(figure: moma data-pipeline output)*

Obstacles

  • Went too far with my problem definition for the project and needed to scale it back a little.
  • Needed to handle data types depending on the function used (regex matching or computing a mean date), and NaN values were causing a lot of data type errors.
  • Complicated the project by replacing missing values with the mean date per artist, because I needed to match values between two different DataFrames.
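For what it's worth, the per-artist mean replacement can also be done inside a single DataFrame with groupby().transform(), which avoids matching values across two DataFrames (column names here are assumptions):

```python
import pandas as pd

# Hypothetical frame: one artwork per row, some years missing
df = pd.DataFrame({
    "Artist": ["A", "A", "B", "B"],
    "Year": [1900.0, None, 1950.0, 1960.0],
})

# transform("mean") broadcasts each artist's mean year back to every
# row of that artist, so fillna can use it positionally
df["Year"] = df["Year"].fillna(df.groupby("Artist")["Year"].transform("mean"))
```

Here the missing year for artist A is filled with 1900.0, the mean of A's known years.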

Lessons learned

  • NumPy is needed to deal with NaN values
  • Several arguments can be passed to the function when using apply()
  • Creating a copy of the DataFrame after each new function applied made the exploration in the Jupyter notebook easier
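As an example of the second point, extra positional arguments can be forwarded through apply()'s `args` parameter (the `clamp_year` helper is a made-up illustration):

```python
import pandas as pd

s = pd.Series([1895, 1917, 1963])

def clamp_year(year, lower, upper):
    """Clip a year into a plausible range."""
    return min(max(year, lower), upper)

# Arguments after the first are supplied via `args`
clamped = s.apply(clamp_year, args=(1900, 1950))
```

Keyword arguments can also be passed directly, e.g. `s.apply(clamp_year, lower=1900, upper=1950)`.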
