Simple statistical prediction of the survival chances of the passengers in the testing set, given certain conditions as input. Refer to README.md for more detail.

Titanic-Survival-Conditional-Probability

Completed by Mangaliso Makhoba.

Overview: This project uses the Titanic dataset to create a simple statistical model that returns the conditional survival probability of a passenger, given a condition on a numerical variable from the dataset.

Problem Statement: Build a model that returns a passenger's survival chance given the passenger's details as input.

Data: Titanic Kaggle Challenge

Deliverables: Probability

Topics Covered

  1. Statistical Modeling
  2. Imputation of Missing values
  3. Probability

Tools Used

  1. Scikit-learn
  2. Jupyter Notebook

Installation and Usage

Ensure that the following packages have been installed and imported.

pip install numpy
pip install pandas

Jupyter Notebook is required to run the IPython notebook (.ipynb) project file.

Follow the instructions at https://docs.anaconda.com/anaconda/install/ to install Anaconda with Jupyter.

Alternatively, VS Code can render Jupyter Notebooks.

Notebook Structure

The structure of this notebook is as follows:

  • First, we load the data to get a view of the predictor and response variables we will be modeling.
  • We determine the number of missing values for a specific column.
  • We then preprocess the data by imputing missing values: the mean for numerical features and the mode for categorical features.
  • Finally, we model the survival probability of a passenger given their age, class, gender and so on.
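The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `impute_missing` is a hypothetical helper name, the tiny DataFrame stands in for the Kaggle training data (which would presumably be loaded with `pd.read_csv`), and the sketch assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_missing(df):
    """Return a copy of df with missing values filled in.

    Hypothetical helper: numerical columns get their mean,
    non-numerical columns get their (first) mode.
    """
    clean = df.copy()
    for name in clean.columns:
        if not clean[name].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(clean[name]):
            fill_value = clean[name].mean()  # mean() skips NaN by default
        else:
            fill_value = clean[name].mode().iloc[0]  # most frequent value
        clean[name] = clean[name].fillna(fill_value)
    return clean


# Toy stand-in data (not the real Titanic set).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Embarked": ["S", None, "S"],
})
df_clean = impute_missing(toy)
print(df_clean)
```

Returning a copy leaves the raw data untouched, so the missing-value counts can still be checked against the original DataFrame.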

Function 1: Missing Values

Write a function that determines the number of missing entries for a specified column in the dataset. The function should return an int equal to the number of missing entries in that column.

Function Specifications:

  • Should take a pandas DataFrame and a column_name as input and return an int as output.
  • The int should be the number of missing entries in the column.
  • Should be generalised to be able to work on ANY dataframe.

Expected Outputs:

total_missing(df,'Age') == 177
total_missing(df,'Survived') == 0
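One possible way to meet this specification is sketched below. It is an illustration rather than the notebook's own solution, and the small DataFrame is made-up stand-in data, so the counts differ from the Titanic values quoted above.

```python
import numpy as np
import pandas as pd


def total_missing(df, column_name):
    """Return the number of missing (NaN/None) entries in one column."""
    # isna() marks each missing cell as True; summing the booleans counts them.
    return int(df[column_name].isna().sum())


# Toy stand-in data (not the real Titanic set).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Survived": [0, 1, 1, 0],
})
print(total_missing(toy, "Age"))       # 2
print(total_missing(toy, "Survived"))  # 0
```

The explicit `int(...)` cast matters because pandas returns a NumPy integer, while the specification asks for a plain int.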

Function 2: Imputation

Write a function that takes in as input a dataframe and a column name, and returns the mean for numerical columns and the mode for non-numerical columns.

Function Specifications:

  • The function should take two inputs: (df, column_name), where df is a pandas DataFrame, column_name is a str.
  • If the column_name does not exist in df, raise a ValueError.
  • Should return as output the mean if the specified column is numerical and return a list of the mode(s) otherwise.
  • The mean should be rounded to 2 decimal places.
  • If there is more than one mode for a given non-numerical column, the function should return a list of all modes.

Expected Outputs:

calc_mean_mode(df, 'Age') == 29.7
calc_mean_mode(df, 'Embarked') == ['S']

Function 3: Model

We ultimately want to predict the survival chances of the passengers in the testing set. We can start by building a simple model from the data we already have, using conditional probability. Write a function that returns the survival probability of a passenger, given a condition on a numerical variable from the dataset. The condition consists of a column_name, a boolean_operator and a value. Possible boolean operators are "<", ">" and "==". For example, column_name = "Age", boolean_operator = ">" and value = 40 together form the condition Age > 40.

Function specifications:

  • The function should make use of the df_clean DataFrame loaded earlier in this notebook.
  • It should take a numerical column_name string, a boolean_operator string, and a value of type string as input.
  • It should return a survival likelihood as a number between 0 and 1, rounded to 2 decimal places.
  • Assume that column_name exists in df_clean.

Expected Outputs:

survival_likelihood(df_clean,"Pclass","==","3") == 0.24
survival_likelihood(df_clean,"Age","<","15") == 0.58
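The conditional probability here is the share of survivors among the passengers who satisfy the condition, i.e. P(Survived = 1 | condition). The sketch below shows one way to compute it. It is not the notebook's own implementation, and it makes several assumptions: the `Survived` column holds 0/1 values, the string value is cast with `float`, an unsupported operator raises a ValueError, and NaN is returned when no passenger matches. The toy data is made up.

```python
import operator

import pandas as pd

# Map the operator strings from the specification to comparison functions.
_OPERATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def survival_likelihood(df, column_name, boolean_operator, value):
    """Estimate P(Survived = 1 | column_name <operator> value)."""
    if boolean_operator not in _OPERATORS:
        raise ValueError(f"Unsupported operator: {boolean_operator!r}")
    compare = _OPERATORS[boolean_operator]
    # value arrives as a string, so cast it before comparing with the column.
    mask = compare(df[column_name], float(value))
    matched = df.loc[mask, "Survived"]
    if matched.empty:
        return float("nan")  # assumption: no matching passengers, no estimate
    # With 0/1 labels, the mean is the fraction of matching passengers who survived.
    return round(float(matched.mean()), 2)


# Toy stand-in data (not the real Titanic set).
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 3, 2],
    "Age": [10.0, 40.0, 12.0, 50.0, 30.0],
    "Survived": [1, 0, 1, 0, 1],
})
print(survival_likelihood(toy, "Pclass", "==", "3"))  # 0.33
print(survival_likelihood(toy, "Age", "<", "15"))     # 1.0
```

Looking the operator up in a dictionary avoids calling `eval` on the input string, which keeps the function safe to use with untrusted input.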

Conclusion

Finding an appropriate strategy to impute missing values is very important to increasing the accuracy of the model you are building.

Contributing Authors

Authors: Mangaliso Makhoba, Explore Data Science Academy

Contact: [email protected]

Project Continuity

This project is complete.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

MIT
