
paranormal_distributions's Introduction

Hello world 👋

My name is Pedro Reis.

   Pee-Droh Reys

I'm Portuguese, I live in Lisbon, and I work in Markets - Trading & Origination - MRDT (Trading Data Analytics) @ Statkraft. I also have an MSc in Data Science & Advanced Analytics from Nova Information Management School - Universidade Nova de Lisboa.


💬 About me:

  • I love coffee, tea, the ocean, running and swimming
  • Passionate about technology, challenges and business
  • Also, I'm a fan of to-do lists and I really like to automate boring tasks 😄

⚡ Technical Skills:

  • Python
  • Machine Learning
  • Algorithms
  • Technical management


BONUS - Riddle me this!

If you like puzzles, here's a little cryptography problem :)

Guess my favorite hobby using the zip from the URL below and contact me with the answer.

If your answer is correct, I'll make your GitHub link appear in the Hall of Fame below!

https://drive.google.com/file/d/1h_KFGkV0c93JE0kYN-rA3aROt0xNj1e6/view
MD5 checksum: D0342E7A77087B53C35C52D3A31604B0
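
If you want to check your download before diving in, here's a minimal sketch using Python's standard hashlib (the local filename is a placeholder):

    import hashlib

    # Compute the MD5 checksum of the downloaded archive;
    # "puzzle.zip" is a placeholder name for the file from the link above.
    with open("puzzle.zip", "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()

    print(digest.upper())  # should match D0342E7A77087B53C35C52D3A31604B0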

Step 1. '4m_1_4_scr1pt_k1dd13?'

Hint: rockyou

Step 2. '0h_n0w_1_c4n_s33_cl34rly...'

Hint: steganography

Hall of Fame:

  • ...


You can also find me on:

GitHub · LinkedIn · Kaggle · ORCID · Stack Overflow · Gmail · HackerRank



paranormal_distributions's People

Contributors

kalrashid15, pedromlsreis

Stargazers

1 stargazer

Watchers

3 watchers

paranormal_distributions's Issues

Age

if "Birthday" in df.columns: df["Age"] = 2016 - df["Birthday"] del df["Birthday"]

meant to be? 2020 - df['Birthday']?
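
If the year ends up changing, a minimal sketch with the reference year pulled out as a constant (2020 is the value proposed in this issue, not a confirmed decision):

    REFERENCE_YEAR = 2020  # proposed; the current code uses 2016

    if "Birthday" in df.columns:
        # Birthday holds birth years, so subtraction gives age in years
        df["Age"] = REFERENCE_YEAR - df["Birthday"]
        del df["Birthday"]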

Split project.py into multiple <task>.py files

Split the project.py code into multiple .py files and call them from a main .py file - e.g. a dataextraction.py file, a preprocessing.py file, etc.

This would reduce the visual footprint of the code and let us make smaller, more reviewable changes to it.
It would also make it easier to refer to the code in the report.
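
A minimal sketch of what the entry point could look like after the split (the function names and data path are placeholders, not actual code from the repo):

    # main.py - orchestrates the pipeline; each step lives in its own module
    from dataextraction import load_data    # hypothetical helper
    from preprocessing import preprocess    # hypothetical helper

    def main():
        df = load_data("data/insurance.csv")  # placeholder path
        df = preprocess(df)
        # ...modelling / clustering steps would follow here

    if __name__ == "__main__":
        main()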

file structure

According to the lecture, the steps usually are:

Data preparation

  • Exploratory data analysis
  • Detecting outliers
  • Dealing with missing values
  • Data discretization
  • Imbalanced learning and data generation

Data preprocessing

  • The curse of dimensionality
  • Identifying informative attributes/features
  • Creating attributes/features
  • Dimensionality reduction
  • Relevancy
  • Redundancy
  • Data standardization

Should we organise the files accordingly?

Random Forest tuning

The commit 01f15d8 adds a Random Forest classifier to predict the missing values (NaNs) in the categorical columns.

We should look into the classifier's hyperparameters and dig a bit deeper here.
We can try splitting the dataframe into train/validation sets and using cross-validation to tune our RF classifier's hyperparameters.
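
A minimal sketch of that idea with scikit-learn's GridSearchCV (the parameter grid and the X_train/y_train names are illustrative assumptions):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Illustrative grid; the actual search space is still to be decided.
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,  # 5-fold cross-validation on the training split
    )
    search.fit(X_train, y_train)  # X_train/y_train assumed already defined
    print(search.best_params_)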

Remove outliers recurrently

We should remove outliers iteratively, until pca.explained_variance_ratio_ looks reasonable after PCA.
This will help the clustering process.

Clustering currently always returns a two-cluster plot with one giant cluster and one small one. This might be caused by outliers in our data, which end up grouped into the small cluster.
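
A rough sketch of the proposed loop, using a z-score cut-off as the (assumed) outlier rule and assuming the dataframe is numeric with no missing values at this stage:

    import numpy as np
    from sklearn.decomposition import PCA

    data = df.copy()
    for _ in range(5):  # cap the number of passes
        z = np.abs((data - data.mean()) / data.std())
        mask = (z < 3).all(axis=1)  # keep rows within 3 standard deviations
        if mask.all():
            break  # no outliers left under this rule
        data = data[mask]
        # Re-check how the variance spreads across components after each pass
        print(PCA().fit(data).explained_variance_ratio_[:3])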

First Policy before Birthday

There are around 2000 cases where a customer's first policy year precedes their birth year. Should we set these values to null and impute them?
This could be a huge issue!
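
A minimal sketch of the proposed handling (assuming both columns hold years, and nulling both fields since it's unclear which one is wrong):

    import numpy as np

    # Flag rows where the first policy year precedes the birth year
    bad = df["First_Policy"] < df["Birthday"]
    print(bad.sum())  # roughly 2000 cases reported in this issue

    # Null both fields so the imputation step can handle them later
    df.loc[bad, ["First_Policy", "Birthday"]] = np.nan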

t-SNE dimensionality reduction

We've applied Principal Component Analysis (PCA) to our data, in utils.preprocessing.

I think it'd be nice to compare the clustering results we get with PCA against those we get when using T-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.
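
A minimal sketch of the t-SNE side of that comparison with scikit-learn (the n_components and perplexity values are illustrative, and X stands for the preprocessed feature matrix):

    from sklearn.manifold import TSNE

    # Project the preprocessed data to 2D, then feed the embedding
    # to the same clustering step we currently run after PCA.
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    embedding = tsne.fit_transform(X)  # X assumed already defined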

k-Nearest Neighbors imputation

We've applied a Random Forest model to impute the missing categorical values in our data, in utils.preprocessing.

It'd be nice to try a KNN imputation as well, since we're talking about categorical data. I think we can try scikit-learn's KNNImputer.
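
A minimal sketch with KNNImputer; note that it only accepts numeric input, so the categorical columns would need to be encoded first (the encoding step is assumed here, and the encoding choice is open):

    from sklearn.impute import KNNImputer

    # "encoded" stands for the dataframe with categorical columns
    # already converted to numeric codes (assumed done elsewhere).
    imputer = KNNImputer(n_neighbors=5)
    imputed = imputer.fit_transform(encoded)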

Dealing with missing values

The df has many columns containing missing values/NaNs.

[In 53]:
df.isnull().sum()
[Out 53]:
First_Policy          30
Birthday              18
Education             17
Salary                36
Area                   1
Children              21
CMV                    0
Claims                 0
Motor                 34
Household              0
Health                43
Life                 104
Work_Compensation     86
dtype: int64

TODO: Figure out how to handle missing data.

It might be good to treat them individually, column by column.
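
As a starting point, a sketch of the column-by-column idea (each per-column strategy below is an assumption to be discussed, not a decision):

    # Illustrative per-column fill values; every choice is still open.
    fill_values = {
        "Salary": df["Salary"].median(),         # numeric: median
        "Education": df["Education"].mode()[0],  # categorical: most frequent
        "Life": 0,                               # premium column: does NaN mean "no product"?
    }
    df = df.fillna(fill_values)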
