This project involves building a machine learning model to predict life expectancy based on various health and socioeconomic factors. The steps include data loading, exploratory data analysis, data cleaning, normalization, encoding, model selection, and deployment.
- Import necessary libraries: Essential libraries for data manipulation, visualization, and modeling.
- Read Dataset: Load the dataset into a pandas DataFrame.
- Sanity check of data: Initial inspection of data to understand its structure and identify any obvious issues.
- Exploratory Data Analysis (EDA): Analyze and visualize the data to uncover patterns, correlations, and insights.
- Data Cleaning: Handle missing values, outliers, and duplicate entries.
- Normalization: Standardize the data to ensure all features contribute equally to the model.
- Encoding of Data: Convert categorical data into numerical format for machine learning algorithms.
- Selecting the model: Train and evaluate different models to select the best one.
- Deployment and Monitoring: Deploy the model and set up monitoring to ensure it performs well in production.
To run this project, you'll need to install the following libraries:
pip install numpy pandas matplotlib seaborn scikit-learn
The dataset used in this project is "Life Expectancy Data.csv", which contains various health and socioeconomic indicators for different countries and years.
Here's a brief overview of the key steps in the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Life Expectancy Data.csv")
print(df.head())
print(df.tail())
print(df.shape)
print(df.info())
df.describe()
for i in df.select_dtypes(include="number").columns:
sns.histplot(data=df, x=i, kde=True)
plt.show()
sns.boxplot(data=df, x=i)
plt.show()
sns.scatterplot(data=df, x=i, y="Life expectancy ")
plt.show()
s = df.select_dtypes(include="number").corr()
plt.figure(figsize=(15, 15))
sns.heatmap(s, annot=True)
df.isnull().sum()/df.shape[0]*100
df.drop_duplicates(inplace=True)
im = KNNImputer()
for i in df.select_dtypes(include="number").columns:
df[i] = im.fit_transform(df[[i]])
def wisker(col):
q1, q3 = np.percentile(col, [25, 75])
iqr = q3 - q1
lw = q1 - (1.5 * iqr)
uw = q3 + (1.5 * iqr)
return lw, uw
for i in df.select_dtypes(include="number").columns:
lw, uw = wisker(df[i])
df[i] = np.where(df[i] < lw, lw, df[i])
df[i] = np.where(df[i] > uw, uw, df[i])
sc = StandardScaler()
df['percentage expenditure'] = sc.fit_transform(df['percentage expenditure'].values.reshape(-1, 1))
df['Measles '] = sc.fit_transform(df['Measles '].values.reshape(-1, 1))
le = LabelEncoder()
df['Status'] = le.fit_transform(df['Status'])
df['Country'] = le.fit_transform(df['Country'])
X = df.drop("Life expectancy ", axis=1)
y = df["Life expectancy "]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Linear Regression")
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
dt = DecisionTreeRegressor()
dt.fit(df.drop("Life expectancy ", axis=1), df["Life expectancy "])
predictions = dt.predict(df.drop("Life expectancy ", axis=1))
mae = mean_absolute_error(df["Life expectancy "], predictions)
mse = mean_squared_error(df["Life expectancy "], predictions)
r2 = r2_score(df["Life expectancy "], predictions)
print("Decision Tree")
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
-
Linear Regression
- Mean Absolute Error: 2.63
- Mean Squared Error: 12.05
- R-squared: 0.86
-
Decision Tree
- Mean Absolute Error: 0.0
- Mean Squared Error: 0.0
- R-squared: 1.0
For deployment, you can use services like AWS SageMaker, Google AI Platform, or Azure ML. Monitoring can be set up using tools like Prometheus and Grafana to ensure the model performs well over time.
This project demonstrates a comprehensive approach to predicting life expectancy using machine learning. The steps include data preparation, model selection, and evaluation, ensuring a robust and accurate prediction model.
This project is licensed under the MIT License - see the LICENSE file for details.