arkya-art / end-to-end-data-science-project

End-to-End Data Science Project (Integrating Python, SQL & Tableau)

Abstract

This project uses the Absenteeism at Work dataset, collected at a courier company in Brazil, which contains both qualitative and quantitative attributes. The input data is preprocessed, and predictive analysis is performed with a logistic regression model. The predicted outputs are stored in a MySQL database, and a Tableau dashboard is built for qualitative analysis of the inferred outputs. The entire pipeline is automated and packaged as a Python module.

Dataset

Absenteeism_at_work Dataset

The dataset was created from records of absenteeism at work collected from July 2007 to July 2010 at a courier company in Brazil.

Input Attributes : 11 Output Attributes: 1

It contains the following attributes:

  • ID (individual identification)
  • Reason for absence (stratified into 21 + 5 categories)
  • Month of absence
  • Day of the week (Monday = 2, Tuesday = 3, Wednesday = 4, Thursday = 5, Friday = 6)
  • Seasons (summer = 1, autumn = 2, winter = 3, spring = 4)
  • Transportation expense
  • Distance from residence to work (kilometres)
  • Service time
  • Age
  • Work load average per day
  • Hit target
  • Disciplinary failure (yes = 1; no = 0)
  • Education (high school = 1, graduate = 2, postgraduate = 3, master and doctor = 4)
  • Son (number of children)
  • Social drinker (yes = 1; no = 0)
  • Social smoker (yes = 1; no = 0)
  • Pet (number of pets)
  • Weight
  • Height
  • Body mass index
  • Absenteeism time in hours (target)

Folder Structure

  • Train Data: Absenteeism_data
  • Preprocessed Data: Absenteeism_preprocessed
  • Predicted Data: Absenteeism_predicted data

Data Preprocessing

  1. Dropped the ID column.
  2. One-hot encoded Reason for Absence with pd.get_dummies(); each record carries exactly one reason, and the 0th reason was dropped as the baseline.
  3. Removed the original Reason for Absence column from the initial dataset.
  4. Separated the encoded reasons into 4 groups.
  5. Concatenated the grouped reasons with the initial dataframe.
  6. Converted the Date column to datetime format, extracted Month and Day of the Week from it, then dropped Date.
  7. Encoded the Education column.
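The steps above can be sketched roughly as follows on toy data. The column names follow the dataset description; the group boundaries and the binary Education mapping are assumptions inferred from the "4 groups" and "encoded education" steps, not confirmed details of the module:

```python
import pandas as pd

# Toy rows standing in for the raw Absenteeism data (values illustrative)
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Reason for Absence": [0, 7, 23],
    "Date": ["07/07/2015", "14/07/2015", "15/07/2015"],
    "Education": [1, 3, 2],
})

df = df.drop(columns=["ID"])  # step 1: drop the ID column

# Step 2: one-hot encode the reason; reason 0 is dropped as the baseline
reasons = pd.get_dummies(df["Reason for Absence"], dtype=int).drop(columns=[0], errors="ignore")

# Steps 3-5: collapse the dummies into 4 broad groups (boundaries assumed)
groups = {"Reason_1": (1, 14), "Reason_2": (15, 17),
          "Reason_3": (18, 21), "Reason_4": (22, 28)}
grouped = pd.DataFrame({
    name: reasons[[c for c in reasons.columns if lo <= c <= hi]].max(axis=1)
    if any(lo <= c <= hi for c in reasons.columns) else 0
    for name, (lo, hi) in groups.items()
})
df = pd.concat([df.drop(columns=["Reason for Absence"]), grouped], axis=1)

# Step 6: datetime conversion, then Month / Day of the Week extraction
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Month"] = df["Date"].dt.month
df["Day of the Week"] = df["Date"].dt.weekday
df = df.drop(columns=["Date"])

# Step 7: binary-encode Education (high school vs any higher degree; assumed mapping)
df["Education"] = df["Education"].map({1: 0, 2: 1, 3: 1, 4: 1})
```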

Methodology

  • Created the binary target (extreme absenteeism) by flagging records whose Absenteeism Time in Hours exceeds the median cutoff (> 3 hours). Dropped the original target variable, Distance to Work, Daily Work Load Average and Day of the Week from the dataframe.

  • Scaled the column values, except the Reason and Education columns (already encoded)

  • Applied an 80:20 train-test split to the input dataframe

  • Fitted a logistic regression model and extracted the intercept and the coefficient of each feature

Input: scaled dataframe excluding the target output variable
Output: calculated absenteeism (in hours) of employees
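A minimal sketch of this methodology on synthetic data. The feature names and random values are placeholders; the real project derives the inputs and the median cutoff from the preprocessed Absenteeism dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "Transportation Expense": rng.integers(100, 400, n),
    "Age": rng.integers(25, 60, n),
    "Body Mass Index": rng.integers(19, 35, n),
})
hours = rng.integers(0, 12, n)

# Binary target: 1 when absenteeism exceeds the median (~3 hours in the real data)
targets = (hours > np.median(hours)).astype(int)

# Scale the inputs (the real pipeline skips the already-encoded Reason/Education dummies)
scaled = StandardScaler().fit_transform(X)

# 80:20 train-test split
x_train, x_test, y_train, y_test = train_test_split(
    scaled, targets, train_size=0.8, random_state=20)

reg = LogisticRegression()
reg.fit(x_train, y_train)
print("intercept:", reg.intercept_)
print("coefficient per feature:", dict(zip(X.columns, reg.coef_[0])))
```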

Processing

  • Saved the model by pickling both the model and the scaler files
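Saving the trained artefacts might look like the sketch below; the bare file names `model` and `scaler` are an assumption about the project's naming, and the tiny stand-in model exists only to make the snippet self-contained:

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# A tiny stand-in model and scaler (the real ones come from the training step)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 1.0], [2.0, 2.0]])
y = np.array([0, 1, 1, 0])
scaler = StandardScaler().fit(X)
reg = LogisticRegression().fit(scaler.transform(X), y)

# Pickle both files so the deployment module can reload them later
with open("model", "wb") as f:
    pickle.dump(reg, f)
with open("scaler", "wb") as f:
    pickle.dump(scaler, f)

# Reloading restores identical behaviour
with open("model", "rb") as f:
    reloaded = pickle.load(f)
```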

Model snapshot

Folder Structure

  • Jupyter Notebook -> Data Preprocessing & Machine Learning
  • Model Snapshots -> model & scaler weights

MySQL Database

Created a Python module that automates the data cleaning and prediction process

Python Module structure

Consists of the entire data preprocessing and prediction pipeline for new test data

  • Class CustomScaler (fit & transform for scaling new test data)
  • Class Absenteeism model (loads the dataset, applies the same preprocessing techniques as for the training data, and predicts the absenteeism in hours)
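One plausible shape for such a CustomScaler: wrap scikit-learn's StandardScaler so that only the listed columns are scaled and the already-encoded dummies pass through unchanged. This is a sketch; the exact implementation inside the module may differ:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    """Standard-scale only the given columns; pass the rest through unchanged."""

    def __init__(self, columns):
        self.columns = columns
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        return self

    def transform(self, X, y=None):
        scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                              columns=self.columns, index=X.index)
        untouched = X.drop(columns=self.columns)
        # Reassemble in the original column order
        return pd.concat([scaled, untouched], axis=1)[X.columns]

# Dummy Reason columns stay untouched; Age gets standardised
df = pd.DataFrame({"Age": [25.0, 40.0, 33.0], "Reason_1": [1, 0, 1]})
out = CustomScaler(columns=["Age"]).fit(df).transform(df)
```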

Python module

Integration

  • Loaded the Python module, created an instance with the saved model weights, passed the new test data to the module and obtained the predictions

  • Imported the pymysql library to connect Python with the MySQL database

  • Created the database schema in MySQL Workbench, defining each feature with its data type.

MySQL Database schemas

  • Executed SQL INSERT INTO statements in a for loop to insert all values predicted by the Python module into the database
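The insert loop could be sketched as below. Because connection credentials are environment-specific, the pymysql part is shown as comments and only the parameterised statement and row tuples are built here; the table and column names are assumptions, not the project's actual schema:

```python
import pandas as pd

# Stand-in for the module's predicted outputs
predictions = pd.DataFrame({
    "Age": [30, 45],
    "Probability": [0.72, 0.31],
    "Prediction": [1, 0],
})

# Parameterised query (placeholders avoid SQL injection and quoting bugs)
insert_query = ("INSERT INTO predicted_outputs (Age, Probability, Prediction) "
                "VALUES (%s, %s, %s)")
rows = [tuple(r) for r in predictions.itertuples(index=False)]

# With a live MySQL instance this loop would execute the inserts:
# import pymysql
# conn = pymysql.connect(host="localhost", user="root",
#                        password="...", database="predicted_outputs")
# cursor = conn.cursor()
# for row in rows:
#     cursor.execute(insert_query, row)
# conn.commit()
# conn.close()
```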

Python, SQL Integration

Tableau Visualization

Connected the MySQL database to Tableau, pulled the stored tables into the Tableau workbook, and plotted three important visualizations:

  • Age vs Probability
  • Reasons vs Probability
  • Transportation Expense vs Probability

Tableau Public Visualization
