The well-known Titanic dataset was used for data exploration, preparation, and modelling with a Logistic Regression model.

Four feature selection techniques were used to choose the best features to include in the model:
- RFE (Recursive Feature Elimination)
- Decision Trees Feature Selection
- Correlation Analysis
- Coefficient Feature Importance (using Logistic Regression)
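
As an illustration of one of these techniques, correlation analysis, the sketch below ranks features by the absolute value of their Pearson correlation with the target. It uses only the Python standard library so that it runs anywhere. The helper names (`pearson`, `rank_by_correlation`), the column names, and the toy rows are assumptions made for this example and are not taken from the repository.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Return (name, r) pairs sorted by |r| with the target, strongest first."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up toy rows for illustration only (not the real Titanic data).
features = {
    "is_female": [1, 1, 0, 0, 1, 0, 0, 1],
    "pclass": [1, 2, 3, 3, 1, 3, 2, 2],
    "age": [29, 35, 22, 40, 54, 19, 31, 27],
}
survived = [1, 1, 0, 0, 1, 0, 0, 1]

ranking = rank_by_correlation(features, survived)
for name, r in ranking:
    print(f"{name:10s} r = {r:+.3f}")
```

Features at the top of the ranking are the strongest candidates to keep. The other three techniques score features through a fitted model rather than a pairwise statistic, so they typically rely on a modelling library.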

The model obtained an accuracy of 100% after the pre-processing that Logistic Regression calls for, which included:
- Removing outliers.
- Removing multicollinearity - The model assumes that the feature variables are not strongly correlated with each other, so one feature from each highly correlated pair should be removed.
- Checking the linearity assumption - Logistic Regression assumes a linear relationship between each feature and the log-odds of the target variable. A log transformation is applied where that relationship is not present.
- Checking for normal distribution - The feature variables are expected to be approximately normally distributed. Where they are not, a log or Box-Cox transform is applied to bring them closer to normal.
- Feature scaling - Features may not share the same range of values, so features with large values can dominate the model and appear more important than the others. Scaling puts them on the same range, giving each feature an equal chance to contribute to the model.
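
Three of the steps above (removing multicollinearity, log-transforming a skewed feature, and feature scaling) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the repository's code: the 0.9 correlation threshold, the helper names, the `fare_doubled` column, and the toy values are all hypothetical. Outlier removal and Box-Cox are left out because the text does not say which rule or parameters were used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return kept


def log_transform(values):
    """Apply log(1 + x) to compress a right-skewed, non-negative feature."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]


# Made-up toy columns for illustration only (not the real Titanic data).
fare = [7.25, 71.28, 7.93, 53.1, 8.05, 512.33, 13.0, 26.55]
raw = {
    "fare": fare,
    "fare_doubled": [2 * f for f in fare],  # hypothetical redundant column
    "age": [22, 38, 26, 35, 35, 30, 2, 27],
}

# 1. Multicollinearity: drop the later feature of each highly correlated pair.
kept = drop_correlated(raw, threshold=0.9)

# 2. Log-transform the skewed feature, then 3. scale every kept feature.
prepared = {}
for name, col in kept.items():
    if name == "fare":  # assumed to be the right-skewed column in this toy data
        col = log_transform(col)
    prepared[name] = standardize(col)

print(list(prepared))
```

Here `fare_doubled` is dropped because it is perfectly correlated with `fare`, and each remaining column ends up with zero mean and unit variance, so no feature dominates purely because of its units.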

Source repository: mschlei-48 / titanic-data-exploration-preparation-and-modelling- (Titanic Dataset Exploration, Visualization and Modelling with Logistic Regression).