A KNN classifier built with Python and Jupyter Notebooks.



KNN ALGORITHM.


About KNN classifiers.

A KNN classifier predicts the class of a given test observation by identifying the training observations that are nearest to it. Because of this, the scale of the variables in the dataset is very important: variables on a large scale have a larger effect on the distance between observations, and therefore on the KNN classifier.

An intuitive way to handle the scaling problem in KNN classification is to standardize the dataset so that every variable has a mean of zero and a standard deviation of one.

Training algorithm:

  1. Store all the data

Prediction Algorithm:

  1. Calculate the distance from x to all points in your data.
  2. Sort the points in your data by increasing the distance from x.
  3. Predict the majority label of the "k" closest points.
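
As a minimal sketch of the prediction steps above (my addition, not from the original notebook; the function name and arguments are illustrative and assume NumPy arrays), a brute-force KNN vote can be written directly in NumPy:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of one point x by majority vote of its k nearest neighbours."""
    # 1. Calculate the Euclidean distance from x to every training point.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # 2. Sort the training points by increasing distance and keep the k closest.
    nearest = np.argsort(distances)[:k]
    # 3. Predict the majority label of the k closest points.
    return Counter(y_train[nearest]).most_common(1)[0][0]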

About KNN, using tools such as numpy, pandas and ...

This project aims to classify observations with respect to a target variable stored in the last column. It is important to note that this is one of the anonymized datasets provided by clients, presumably because of the need to protect sensitive information.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
data = pd.read_csv('annonimizeddataset',index_col = 0)
data.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.913917 1.162073 0.567946 0.755464 0.780862 0.352608 0.759697 0.643798 0.879422 1.231409 1
1 0.635632 1.003722 0.535342 0.825645 0.924109 0.648450 0.675334 1.013546 0.621552 1.492702 0
2 0.721360 1.201493 0.921990 0.855595 1.526629 0.720781 1.626351 1.154483 0.957877 1.285597 0
3 1.234204 1.386726 0.653046 0.825624 1.142504 0.875128 1.409708 1.380003 1.522692 1.153093 1
4 1.279491 0.949750 0.627280 0.668976 1.232537 0.703727 1.115596 0.646691 1.463812 1.419167 1

So here the data is anonymized, with meaningless labels in place of the raw column names. The last column is the target class, which needs to be predicted.

Exploratory data analysis

data.columns
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
data.shape
(1000, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 11 columns):
WTT             1000 non-null float64
PTI             1000 non-null float64
EQW             1000 non-null float64
SBI             1000 non-null float64
LQE             1000 non-null float64
QWG             1000 non-null float64
FDJ             1000 non-null float64
PJF             1000 non-null float64
HQE             1000 non-null float64
NXJ             1000 non-null float64
TARGET CLASS    1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 93.8 KB
sns.heatmap(data.isnull(),yticklabels=False,cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1a25531a20>

(figure: heatmap of missing values in the dataset)

The heatmap above shows clearly that there is no missing data in the set.
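
For a non-graphical check (my addition, not in the original notebook), the per-column null counts give the same answer:

# Count missing values per column; every count should be zero for this dataset.
data.isnull().sum()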

Scaling Variables.

As pointed out earlier, scaling the variables is very important in KNN.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data.drop('TARGET CLASS',axis = 1))
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_feat = scaler.transform(data.drop('TARGET CLASS',axis = 1))
data_feat = pd.DataFrame(scaled_feat,columns=data.columns[:-1])
data_feat.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
0 -0.123542 0.185907 -0.913431 0.319629 -1.033637 -2.308375 -0.798951 -1.482368 -0.949719 -0.643314
1 -1.084836 -0.430348 -1.025313 0.625388 -0.444847 -1.152706 -1.129797 -0.202240 -1.828051 0.636759
2 -0.788702 0.339318 0.301511 0.755873 2.031693 -0.870156 2.599818 0.285707 -0.682494 -0.377850
3 0.982841 1.060193 -0.621399 0.625299 0.452820 -0.267220 1.750208 1.066491 1.241325 -1.026987
4 1.139275 -0.640392 -0.709819 -0.057175 0.822886 -0.936773 0.596782 -1.472352 1.040772 0.276510
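
As a quick sanity check (my addition), the scaled features should now have a mean of roughly zero and a standard deviation of roughly one in every column:

# Each scaled column should have mean ~0 and standard deviation ~1.
print(data_feat.mean().round(6))  # all ~0
print(data_feat.std().round(2))   # all ~1.0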

Splitting data into train and test sets

from sklearn.model_selection import train_test_split
X = data_feat
y = data['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
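
A small variation worth knowing (my addition): train_test_split also accepts a stratify argument, which keeps the class proportions identical in both splits:

# Stratified variant: the 0/1 balance in train and test mirrors the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)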

KNN model deployment.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')
predictions = knn.predict(X_test)

Model Evaluation

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))
[[134   8]
 [ 11 147]]
____________________________________________________________
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       142
           1       0.95      0.93      0.94       158

   micro avg       0.94      0.94      0.94       300
   macro avg       0.94      0.94      0.94       300
weighted avg       0.94      0.94      0.94       300

This gives an accuracy of 94%.
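
That figure follows directly from the confusion matrix: 134 + 147 = 281 correct predictions out of 300 test observations, i.e. 281/300 ≈ 0.94.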

Using the elbow method to improve the model.

This process aims to extract more information by choosing a better k value. It iterates over many different k values and plots their error rates, making it easy to see which k has the lowest error rate.

errorRate = []

for kvalue in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=kvalue)
    knn.fit(X_train,y_train)
    predictions = knn.predict(X_test)
    errorRate.append(np.mean(predictions != y_test)) # average error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40),errorRate,color = "blue",linestyle = "dashed",marker = 'o')
[<matplotlib.lines.Line2D at 0x1a2528e518>]

(figure: error rate for k = 1 to 39)
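
To read the best k off programmatically rather than by eye (my addition):

# k runs from 1 to 39, so list index 0 corresponds to k = 1.
best_k = int(np.argmin(errorRate)) + 1
print(best_k, errorRate[best_k - 1])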

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)  # recompute predictions with k=15
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))

Retraining with k = 15 gives a small improvement in accuracy.
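
As an alternative to the manual elbow loop (my addition, a sketch assuming the same X_train and y_train), scikit-learn can pick k by cross-validation:

from sklearn.model_selection import GridSearchCV

# Choose k by 5-fold cross-validation on the training set.
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': list(range(1, 40))},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)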
