Data science | Project : "Titanic"

Project guide

Before you begin:

  • "README.txt" - Contains the project write-up with visualizations. It is meant to be read first, as the main entry point.
  • "DataFrames" - Contains all datasets.
  • "Images" - Contains all images used in "README.txt".
  • "Code_Visual" - Contains the code for all visualizations (charts and 3D plots).
  • "Code_Tree" - Contains the tree algorithm itself.

P.S. Comments have been added throughout the code to make it easier to read and understand.

Sections

  • Introduction
  • A task
  • Exploratory Data Analysis (data analysis; see Code_Visuals.py for details)
    • Data cleaning
    • Search for correlations
    • Visualization
    • Choice of parameters for learning
  • Classification Tree (classification algorithm; see Code_Tree.py for details)
    • Cleaning and preparing data for training
    • Initial training
    • Selection of optimal tree parameters
    • Testing
  • Total
  • Sources

Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the Titanic, widely considered "unsinkable", sank after colliding with an iceberg. There were not enough lifeboats on board for everyone, and 1,502 of the 2,224 passengers and crew died.

Although luck played some part in survival, some groups of people appear to have been more likely to survive than others.


A task

In this task, we build a predictive model that answers the question “Which types of people are more likely to survive?” using passenger data (name, age, gender, socioeconomic class, and so on). We also analyze which factors influenced survival and to what extent.

Three datasets are provided for this (see the "DataFrames" folder):

  • "gender_submission.csv"
  • "train.csv"
  • "test.csv"

P.S. While working on the project, I also created "clean" versions of the datasets with everything I did not need removed. I made them for my own convenience and did not upload them to Git, so you may see them referenced in some parts of the code.

EDA (Exploratory Data Analysis)

After cleaning the data, the next step was to identify the main dependencies, run a short analysis of the relationships, and make sure they make sense. I started with a graph that visualizes the correlations.

Correlations

Several correlations in this graph are immediately worth noting:

  • "Sex" (gender)
  • "Pclass" (class)
  • "Fare" (ticket price)
  • "Age" (age)

We ignore the correlation between "SibSp" and "Parch", which is expected because both describe members of the same family. What interests us most here is the "Survived" column.

[Figure: correlation graph]
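
As a minimal sketch of how such correlations can be computed, the snippet below uses a small hand-made table in place of "train.csv" (the values are hypothetical, and the 0/1 encoding of "Sex" is an assumption, not necessarily what Code_Visual does). It ranks the columns by the strength of their Pearson correlation with "Survived".

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 28.0, 54.0, 14.0],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 30.07],
})

# Correlation needs numbers, so encode "Sex" as 0/1 first (assumed encoding).
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Pearson correlation of every column with "Survived", strongest first.
corr_with_survived = (
    df.corr()["Survived"]
      .drop("Survived")
      .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_survived)
```

The sign matters as well as the size: a negative value for "Pclass" would mean that a higher class number (3rd class) goes with lower survival.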

Visualization of survivors by gender

Here we test the first major relationship. The graph shows that the probability of survival is approximately twice as high for female passengers.

[Figure: survivors by gender]
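
The ratio itself can be checked numerically. This is a sketch on a hypothetical toy table, so the numbers it prints say nothing about the real dataset; the column names "Sex" and "Survived" are the ones used in "train.csv".

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male",
                 "male", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# How many times more likely survival is for women than for men.
ratio = rate_by_sex["female"] / rate_by_sex["male"]
print(f"female/male survival ratio: {ratio:.2f}")
```

Running the same two lines on the real training set gives the exact ratio behind the chart.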

Percentage of survivors by class

This graph uses survival percentages rather than raw counts to compensate for the different number of passengers in each class. 1st class had a much higher probability of survival than 2nd and 3rd. Presumably this could be due to the 1st class cabins being located higher in the ship.

[Figure: percentage of survivors by class]
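
The difference between raw counts and percentages can be illustrated with a cross-tabulation. This is a sketch on hypothetical toy data; `pd.crosstab` with `normalize="index"` is one way to get row percentages and is not necessarily how the chart was produced.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Raw counts depend on how many passengers each class had.
counts = pd.crosstab(df["Pclass"], df["Survived"])

# normalize="index" divides each row by its total, so classes become comparable.
percent = pd.crosstab(df["Pclass"], df["Survived"], normalize="index") * 100

print(counts)
print(percent.round(1))
```

Each row of `percent` sums to 100, so a large class no longer dominates the picture simply because it has more passengers.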

Visualization of survivors by age

The correlation between age and survival left me with mixed impressions, so I decided to dig deeper with a 3D visualization.

[Figure: survivors by age]

General 3D visualization

From this visualization I concluded that age did influence survival, since most children survived, but it also had to be kept in mind that there were practically no elderly people on the ship. I still had a hunch that there was a relationship with class worth checking, so I continued.

[Figure: general 3D visualization]

3D visualization "Fare"

This correlation had to be checked because it raised the most questions for me. I did find it in the end, but only after trying several chart types: a few outliers made the standard charts hard to read.

[Figure: 3D visualization of "Fare"]

(In this chart, the most noticeable shift is upward along the "Fare" axis, which here is the z-axis.)
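
For readers who hit the same outlier problem, two common workarounds are sketched below: a log scale and capping at a percentile. Both are assumptions for illustration and are not the approach used in this project, which relied on 3D charts; the fare values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one extreme value.
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.07, 53.1, 71.28, 512.33],
                 name="Fare")

# Option 1: a log scale compresses large values (log1p also handles a fare of 0).
fare_log = np.log1p(fare)

# Option 2: cap everything above the 95th percentile.
cap = fare.quantile(0.95)
fare_clipped = fare.clip(upper=cap)

print(fare_log.round(2).tolist())
print(fare_clipped.round(2).tolist())
```

Either transformed column can be plotted in place of the raw "Fare" so that a single very expensive ticket no longer flattens the rest of the chart.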

Survivors by class versus age

I then plotted age distributions and survival rates for each class. As one would expect, the general trend was a shift toward older ages in the higher classes (that is, the higher the class, the older the passengers). This is logical: older people generally have more wealth, which allowed them to travel in a higher class. It also feeds into the correlation between age and survival, so age needs to be handled with care when setting up the algorithm.

[Figure: survivors by class versus age]

(Rows with an NA value in the "Age" column were excluded from all graphs by age.)
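
A minimal sketch of that filtering step, followed by a per-class summary, is shown below. The table is a hypothetical stand-in for "train.csv", and the summary columns (`median_age`, `survival_rate`, `passengers`) are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Age":      [54.0, 38.0, None, 30.0, 14.0, 22.0, None, 4.0],
    "Survived": [0, 1, 1, 1, 1, 0, 0, 1],
})

# Keep only rows where "Age" is known.
with_age = df.dropna(subset=["Age"])

# Median age, survival rate, and row count per class.
summary = with_age.groupby("Pclass").agg(
    median_age=("Age", "median"),
    survival_rate=("Survived", "mean"),
    passengers=("Age", "size"),
)
print(summary)
```

Comparing `median_age` across classes is a quick way to see the age shift described above without drawing a chart.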

Classification Tree

I had two options: build a single tree or a forest. A forest would presumably give a more accurate classification and would probably be the better solution. In this case, however, my personal goal was to build a single classification tree and make it as accurate as possible.

Cleaning and preparing data for training

Before starting the task, we were given three datasets (see the "DataFrames" folder):

  • "gender_submission.csv" - Passenger Ids with survival information.
  • "train.csv" - The set on which we train and test our algorithm.
  • "test.csv" - The goal of the task (the data to predict).

We first join "gender_submission.csv" with "train.csv" on the key "PassengerId" and remove every column except those singled out in the EDA (the ones that correlate with survival).
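
The join-and-select step can be sketched with pandas as follows. The two small frames are hypothetical stand-ins for the CSV files, the names `features`, `labels`, and `prepared` are introduced here for illustration, and the 0/1 encoding of "Sex" is an assumption about how the column was made numeric.

```python
import pandas as pd

# Toy frames standing in for the CSV files (hypothetical values).
features = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name":   ["A", "B", "C"],
    "Pclass": [3, 1, 3],
    "Sex":    ["male", "female", "female"],
    "Age":    [22.0, 38.0, 26.0],
    "Fare":   [7.25, 71.28, 7.93],
    "Ticket": ["x1", "x2", "x3"],
})
labels = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# Join the two frames on the shared key.
merged = features.merge(labels, on="PassengerId", how="inner")

# Keep only the columns singled out in the EDA, plus the target.
keep = ["Sex", "Pclass", "Fare", "Age", "Survived"]
prepared = merged[keep].copy()

# scikit-learn trees need numeric inputs, so encode "Sex" as 0/1 (assumed).
prepared["Sex"] = prepared["Sex"].map({"male": 0, "female": 1})
print(prepared)
```

An inner join keeps only the passengers present in both frames, so it is worth checking the row count after the merge.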

Initial training

After the initial training, as expected, we got an overfitted tree. Next, we adjust the parameters and restrict the tree's growth so that it cannot overfit.

[Figure: result of the initial training]
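
The effect is easy to reproduce. The sketch below assumes scikit-learn and uses synthetic data from `make_classification` in place of the prepared Titanic features, so the exact numbers will differ from the project's. It trains a tree with no limits and compares accuracy on the training split and on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=800, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# No depth limit: the tree keeps splitting until it fits the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two numbers is the usual sign that the tree has memorized the training data instead of learning a general rule.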

Selection of optimal tree parameters

For this step, we plot a graph that shows how the tree behaves under different parameter values.

[Figure: tree parameters]

Now we can see at what point the tree starts to overfit and correct for it. Based on this graph, we set the optimal depth and the minimum number of samples required to split a branch, and test again. The resulting accuracy is ≈ 0.83 (rounded), that is, 83%.
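
A parameter sweep of this kind can be sketched as a simple loop. The snippet assumes scikit-learn and synthetic data, and the fixed `min_samples_split=10` and the depth range 1 to 10 are illustrative values, not the ones chosen in Code_Tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=800, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train one tree per depth and record accuracy on both splits.
results = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_split=10,
                                  random_state=0)
    tree.fit(X_train, y_train)
    results.append((depth,
                    tree.score(X_train, y_train),
                    tree.score(X_test, y_test)))

for depth, train_acc, test_acc in results:
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")

# The depth with the best held-out accuracy is a reasonable starting point.
best_depth = max(results, key=lambda r: r[2])[0]
print("best max_depth on the held-out split:", best_depth)
```

Plotting the two accuracy columns against depth gives the kind of chart described above: the point where the training curve keeps rising while the test curve flattens or drops marks the onset of overfitting.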

Testing

We run 5-fold cross-validation to gather more statistics and see the trend, which helps avoid errors.

We get: 0.77094972, 0.78089888, 0.84831461, 0.79213483, 0.81460674.

Average accuracy: 0.8014 +/- 0.0552 (the interval is two standard deviations; one standard deviation is ≈ 0.0276).
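
These summary figures can be re-derived from the five fold scores. The sketch below uses only the Python standard library and suggests that the reported 0.0552 corresponds to twice the population standard deviation of the scores.

```python
from statistics import mean, pstdev

# Fold accuracies reported above.
scores = [0.77094972, 0.78089888, 0.84831461, 0.79213483, 0.81460674]

avg = mean(scores)
std = pstdev(scores)  # population standard deviation (divides by n)

print(f"mean accuracy: {avg:.4f}")
print(f"standard deviation: {std:.4f}")
print(f"two standard deviations: {2 * std:.4f}")
```

Reporting the mean plus or minus two standard deviations is a common way to give a rough range for the fold scores.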

Using a decision tree classification model, I predicted whether passengers would survive with ≈80% accuracy.

Total

I added a few more lines of code to visualize the tree itself and make everything clearer; the result is shown below. The prediction accuracy is not bad, considering that this tree can be applied to any randomly chosen sample of Titanic data. I tried to balance the size of the tree so that it would not overfit. The result is ≈80%: the algorithm classifies about 4 out of 5 passengers correctly. Higher classification scores are possible, but the algorithm would then be less universal.

[Figure: visualization of the final tree]

(You can view the full code in Code_Tree.py)
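
For a quick look at a tree without drawing an image, scikit-learn can also print the learned rules as text. This is a sketch under assumptions: the data is synthetic, the feature names are attached only as labels, and `max_depth=3` with `min_samples_split=10` are illustrative values rather than the ones in Code_Tree.py.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names are labels only; the data itself is synthetic (assumption).
feature_names = ["Sex", "Pclass", "Fare", "Age"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# A small, restricted tree stays readable when printed.
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                              random_state=0)
tree.fit(X, y)

# export_text renders the split rules as an indented text outline.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is one split condition, and the lines ending in `class:` are the leaves, which makes it easy to trace how a single passenger would be classified.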

Sources
