- Imputed the following categorical features with the most frequent value: `quantity`, `management_group`, `source_class`, `basin`, `payment`, `payment_type`, `permit`, `quantity_group`, `water_quality`, `quality_group`, `region`, `extraction_type_group`, `extraction_type`, `source`, `source_type`, `waterpoint_type`, `waterpoint_type_group`, `scheme_management`, `subvillage`, `ward`, `wpt_name`.
- Imputed the following categorical features with a constant value: `lga`, `installer`, `funder`, `extraction_type_class`, `management`.
- Imputed all numerical features with `mean`.
- Scaled all numerical features using `StandardScaler`.
- Created new features `year_recorder`, `yearly_week_recorder`, and `month_recorder` using `date_recorded` and `construction_year`.
- Created a new feature `age` by subtracting `construction_year` from `date_recorded`.
- Created two new features, `distance` and `angle`, using `latitude` and `longitude`.
- Created two new features, `distance_pca0` and `distance_pca1`, by applying Principal Component Analysis to `latitude` and `longitude`.
- Applied One-Hot Encoding to the following features: `quantity`, `management_group`, `source_class`.
- Applied Target Encoding to the following features: `lga`, `installer`, `funder`, `extraction_type_class`, `management`.
- Applied Label Encoding to the following features: `basin`, `payment`, `payment_type`, `permit`, `quantity_group`, `water_quality`, `quality_group`, `region`, `extraction_type_group`, `extraction_type`, `source`, `source_type`, `waterpoint_type`, `waterpoint_type_group`, `scheme_management`, `subvillage`, `ward`, `wpt_name`.
- Used Mutual Information scores as the feature selection technique to choose the most appropriate features for training the model.
- Initially tried `XGBClassifier`, `RandomForestClassifier`, and `CatBoostClassifier`. `RandomForestClassifier` was finally selected as the model for the experiments because it gave the best results in the selection process.
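
The date and location features in the list above could be derived roughly as follows. This is a minimal sketch using only the standard library, not the project's actual code. It assumes `date_recorded` is an ISO `YYYY-MM-DD` string, that the week feature is the ISO week number, that `age` is the recorded year minus `construction_year`, and that `distance` and `angle` are polar coordinates measured from the origin (0, 0). The helper names `date_features` and `polar_features` are hypothetical.

```python
import math
from datetime import date


def date_features(date_recorded: str, construction_year: int) -> dict:
    """Derive calendar parts and age from one record (assumed ISO date string)."""
    d = date.fromisoformat(date_recorded)
    return {
        "year_recorder": d.year,
        # Assumption: "yearly week" means the ISO week number.
        "yearly_week_recorder": d.isocalendar()[1],
        "month_recorder": d.month,
        # Assumption: age is the recorded year minus the construction year.
        "age": d.year - construction_year,
    }


def polar_features(latitude: float, longitude: float) -> dict:
    """Assumed definition: polar coordinates of the point relative to (0, 0)."""
    return {
        "distance": math.hypot(latitude, longitude),
        "angle": math.atan2(latitude, longitude),
    }


# Example with made-up values.
row = {**date_features("2011-03-14", 1999), **polar_features(-6.8, 39.2)}
```

If the project measures `distance` from a different reference point (for example the centroid of all waterpoints), only `polar_features` would need to change.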
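
The imputation, scaling, encoding, feature selection, and model steps listed above might be wired together with scikit-learn along these lines. This is a sketch under several assumptions rather than the project's pipeline: the toy data and its category values are made up, only one column per encoding group is shown, `OrdinalEncoder` stands in for label encoding of input features, target encoding is left out for brevity, and `k=3` and `n_estimators=50` are arbitrary illustrative values.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Made-up toy data with a few of the column names from the list above.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "latitude": rng.uniform(-11.0, -1.0, n),
    "longitude": rng.uniform(30.0, 40.0, n),
    "quantity": rng.choice(["q1", "q2", "q3"], n),
    "basin": rng.choice(["b1", "b2", "b3"], n),
}).astype({"quantity": object, "basin": object})
X.loc[0, "latitude"] = np.nan   # missing numerical value
X.loc[1, "quantity"] = np.nan   # missing categorical value
y = (X["quantity"] == "q3").astype(int)  # hypothetical binary target

preprocess = ColumnTransformer([
    # Numerical: mean imputation, then standard scaling.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), ["latitude", "longitude"]),
    # Categorical: most-frequent imputation, then one-hot encoding.
    ("onehot", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["quantity"]),
    # Categorical: most-frequent imputation, then integer codes
    # (OrdinalEncoder used here in place of label encoding).
    ("ordinal", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), ["basin"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    # Keep the k features with the highest mutual information with the target.
    ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=3)),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)
```

Keeping every step inside one `Pipeline` means the imputers, scaler, encoders, and selector are fitted only on the training data passed to `fit`, which avoids leaking statistics from validation folds during model comparison.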