
paranormal_distributions's Introduction

Hello world 👋

My name is Pedro Reis.

   Pee-Droh Reys

I'm Portuguese, I live in Lisbon, and I work in Markets - Trading & Origination - MRDT (Trading Data Analytics) @ Statkraft. I also have an MSc in Data Science & Advanced Analytics from Nova Information Management School - Universidade Nova de Lisboa.


💬 About me:

  • I love coffee, tea, the ocean, running and swimming
  • Passionate about technology, challenges and business
  • Also, I'm a fan of to-do lists and I really like to automate boring tasks 😄

⚡ Technical Skills:

  • Python
  • Machine Learning
  • Algorithms
  • Technical management


BONUS - Riddle me this!

If you like puzzles, here's a little cryptography problem :)

Guess my favorite hobby using the zip from the URL below and contact me with the answer.

If your answer is correct, I'll make your GitHub link appear in the Hall of Fame below!

https://drive.google.com/file/d/1h_KFGkV0c93JE0kYN-rA3aROt0xNj1e6/view
MD5 checksum: D0342E7A77087B53C35C52D3A31604B0
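
If you want to check your download before diving in, here's a minimal sketch using Python's standard hashlib (the local filename is a placeholder):

    import hashlib

    # Compute the MD5 checksum of the downloaded archive;
    # "puzzle.zip" is a placeholder name for the file from the link above.
    with open("puzzle.zip", "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()

    print(digest.upper())  # should match D0342E7A77087B53C35C52D3A31604B0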

Step 1. '4m_1_4_scr1pt_k1dd13?'

Hint: rockyou

Step 2. '0h_n0w_1_c4n_s33_cl34rly...'

Hint: steganography

Hall of Fame:

  • ...


You can also find me on:

GitHub · LinkedIn · Kaggle · ORCID · Stack Overflow · Gmail · HackerRank



paranormal_distributions's People

Contributors

kalrashid15, pedromlsreis

Stargazers

1 stargazer

Watchers

3 watchers

paranormal_distributions's Issues

Age

if "Birthday" in df.columns: df["Age"] = 2016 - df["Birthday"] del df["Birthday"]

meant to be? 2020 - df['Birthday']?
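
If the year ends up changing, a minimal sketch with the reference year pulled out as a constant (2020 is the value proposed in this issue, not a confirmed decision):

    REFERENCE_YEAR = 2020  # proposed; the current code uses 2016

    if "Birthday" in df.columns:
        # Birthday holds birth years, so subtraction gives age in years
        df["Age"] = REFERENCE_YEAR - df["Birthday"]
        del df["Birthday"]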

Split project.py into multiple <task>.py files

Split the project.py code into multiple .py files and call them from a main .py file - e.g. a dataextraction.py file, a preprocessing.py file, etc.

This would reduce the visual footprint of the code and let us make smaller, more reviewable changes to it.
It would also make it easier to refer to the code in the report.
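
A minimal sketch of what the entry point could look like after the split (the function names and data path are placeholders, not actual code from the repo):

    # main.py - orchestrates the pipeline; each step lives in its own module
    from dataextraction import load_data    # hypothetical helper
    from preprocessing import preprocess    # hypothetical helper

    def main():
        df = load_data("data/insurance.csv")  # placeholder path
        df = preprocess(df)
        # ...modelling / clustering steps would follow here

    if __name__ == "__main__":
        main()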

file structure

According to the lecture, the steps usually are:

Data preparation

  • Exploratory data analysis
  • Detecting outliers
  • Dealing with missing values
  • Data discretization
  • Imbalanced learning and data generation

Data preprocessing

  • The curse of dimensionality
  • Identifying informative attributes/features
  • Creating attributes/features
  • Dimensionality reduction
  • Relevancy
  • Redundancy
  • Data standardization

Should we organise the files accordingly?

Random Forest tuning

The commit 01f15d8 adds a Random Forest classifier to predict the missing values (NaNs) in the categorical columns.

We should look into the classifier's hyperparameters and dig a bit deeper here.
We can try splitting the dataframe into train/validation sets and using cross-validation to tune our RF classifier's hyperparameters.
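
A minimal sketch of that idea with scikit-learn's GridSearchCV (the parameter grid and the X_train/y_train names are illustrative assumptions):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Illustrative grid; the actual search space is still to be decided.
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10, 20],
        "min_samples_leaf": [1, 5],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=5,  # 5-fold cross-validation on the training split
    )
    search.fit(X_train, y_train)  # X_train/y_train assumed already defined
    print(search.best_params_)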

Remove outliers recurrently

We should remove outliers iteratively, until pca.explained_variance_ratio_ looks reasonable after PCA.
This will help the clustering process.

Clustering currently always returns a two-cluster plot with one giant cluster and one small one. This might be caused by outliers in our data, which end up grouped into the small cluster.
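
A rough sketch of the proposed loop, using a z-score cut-off as the (assumed) outlier rule and assuming the dataframe is numeric with no missing values at this stage:

    import numpy as np
    from sklearn.decomposition import PCA

    data = df.copy()
    for _ in range(5):  # cap the number of passes
        z = np.abs((data - data.mean()) / data.std())
        mask = (z < 3).all(axis=1)  # keep rows within 3 standard deviations
        if mask.all():
            break  # no outliers left under this rule
        data = data[mask]
        # Re-check how the variance spreads across components after each pass
        print(PCA().fit(data).explained_variance_ratio_[:3])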

First Policy before Birthday

There are around 2000 cases where a customer's first policy year precedes their birth year. Should we set these values to null and impute them?
This could be a huge issue!
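
A minimal sketch of the proposed handling (assuming both columns hold years, and nulling both fields since it's unclear which one is wrong):

    import numpy as np

    # Flag rows where the first policy year precedes the birth year
    bad = df["First_Policy"] < df["Birthday"]
    print(bad.sum())  # roughly 2000 cases reported in this issue

    # Null both fields so the imputation step can handle them later
    df.loc[bad, ["First_Policy", "Birthday"]] = np.nan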

t-SNE dimensionality reduction

We've applied Principal Component Analysis (PCA) to our data, in utils.preprocessing.

I think it'd be nice to compare the clustering results we get with PCA against those we get when using T-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.
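
A minimal sketch of the t-SNE side of that comparison with scikit-learn (the n_components and perplexity values are illustrative, and X stands for the preprocessed feature matrix):

    from sklearn.manifold import TSNE

    # Project the preprocessed data to 2D, then feed the embedding
    # to the same clustering step we currently run after PCA.
    tsne = TSNE(n_components=2, perplexity=30, random_state=42)
    embedding = tsne.fit_transform(X)  # X assumed already defined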

k-Nearest Neighbors imputation

We've applied a Random Forest model to impute the missing categorical values in our data, in utils.preprocessing.

It'd be nice to try a KNN imputation as well, since we're talking about categorical data. I think we can try scikit-learn's KNNImputer.
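
A minimal sketch with KNNImputer; note that it only accepts numeric input, so the categorical columns would need to be encoded first (the encoding step is assumed here, and the encoding choice is open):

    from sklearn.impute import KNNImputer

    # "encoded" stands for the dataframe with categorical columns
    # already converted to numeric codes (assumed done elsewhere).
    imputer = KNNImputer(n_neighbors=5)
    imputed = imputer.fit_transform(encoded)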

Dealing with missing values

The df has many columns containing missing values/NaNs.

[In 53]:
df.isnull().sum()
[Out 53]:
First_Policy          30
Birthday              18
Education             17
Salary                36
Area                   1
Children              21
CMV                    0
Claims                 0
Motor                 34
Household              0
Health                43
Life                 104
Work_Compensation     86
dtype: int64

TODO: Figure out how to handle missing data.

It might be good to treat them individually, column by column.
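
As a starting point, a sketch of the column-by-column idea (each per-column strategy below is an assumption to be discussed, not a decision):

    # Illustrative per-column fill values; every choice is still open.
    fill_values = {
        "Salary": df["Salary"].median(),         # numeric: median
        "Education": df["Education"].mode()[0],  # categorical: most frequent
        "Life": 0,                               # premium column: does NaN mean "no product"?
    }
    df = df.fillna(fill_values)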
