A KNN classifier built with Python and Jupyter Notebooks.



KNN ALGORITHM.


About KNN classifiers.

A KNN classifier predicts the class of a given test observation by identifying the training observations that are nearest to it. Because of this, the scale of the variables in the dataset is very important: variables on a large scale have a larger effect on the distance between observations, and therefore on the KNN classifier.

An intuitive way to handle the scaling problem in KNN classification is to standardize the dataset so that every variable has a mean of zero and a standard deviation of one.

Training algorithm:

  1. Store all the data

Prediction Algorithm:

  1. Calculate the distance from x to all points in your data.
  2. Sort the points in your data by increasing the distance from x.
  3. Predict the majority label of the "k" closest points.
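
As a minimal sketch of the prediction steps above (my addition, not from the original notebook; the function name and arguments are illustrative and assume NumPy arrays), a brute-force KNN vote can be written directly in NumPy:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of one point x by majority vote of its k nearest neighbours."""
    # 1. Calculate the Euclidean distance from x to every training point.
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # 2. Sort the training points by increasing distance and keep the k closest.
    nearest = np.argsort(distances)[:k]
    # 3. Predict the majority label of the k closest points.
    return Counter(y_train[nearest]).most_common(1)[0][0]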

About KNN, using tools such as numpy, pandas and ...

This project aims to classify observations with respect to a target variable stored in the last column. It is important to note that this is one of the anonymized datasets provided by clients, presumably because of the need to protect sensitive information.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
data = pd.read_csv('annonimizeddataset',index_col = 0)
data.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ TARGET CLASS
0 0.913917 1.162073 0.567946 0.755464 0.780862 0.352608 0.759697 0.643798 0.879422 1.231409 1
1 0.635632 1.003722 0.535342 0.825645 0.924109 0.648450 0.675334 1.013546 0.621552 1.492702 0
2 0.721360 1.201493 0.921990 0.855595 1.526629 0.720781 1.626351 1.154483 0.957877 1.285597 0
3 1.234204 1.386726 0.653046 0.825624 1.142504 0.875128 1.409708 1.380003 1.522692 1.153093 1
4 1.279491 0.949750 0.627280 0.668976 1.232537 0.703727 1.115596 0.646691 1.463812 1.419167 1

So here the data is anonymized, with meaningless labels in place of the raw column names. The last column is the target class, which needs to be predicted.

Exploratory data analysis

data.columns
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
data.shape
(1000, 11)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 11 columns):
WTT             1000 non-null float64
PTI             1000 non-null float64
EQW             1000 non-null float64
SBI             1000 non-null float64
LQE             1000 non-null float64
QWG             1000 non-null float64
FDJ             1000 non-null float64
PJF             1000 non-null float64
HQE             1000 non-null float64
NXJ             1000 non-null float64
TARGET CLASS    1000 non-null int64
dtypes: float64(10), int64(1)
memory usage: 93.8 KB
sns.heatmap(data.isnull(),yticklabels=False,cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1a25531a20>

(figure: heatmap of missing values in the dataset)

The heatmap above shows clearly that there is no missing data in the set.
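
For a non-graphical check (my addition, not in the original notebook), the per-column null counts give the same answer:

# Count missing values per column; every count should be zero for this dataset.
data.isnull().sum()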

Scaling Variables.

As pointed out earlier, scaling the variables is very important in KNN.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data.drop('TARGET CLASS',axis = 1))
StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_feat = scaler.transform(data.drop('TARGET CLASS',axis = 1))
data_feat = pd.DataFrame(scaled_feat,columns=data.columns[:-1])
data_feat.head()
WTT PTI EQW SBI LQE QWG FDJ PJF HQE NXJ
0 -0.123542 0.185907 -0.913431 0.319629 -1.033637 -2.308375 -0.798951 -1.482368 -0.949719 -0.643314
1 -1.084836 -0.430348 -1.025313 0.625388 -0.444847 -1.152706 -1.129797 -0.202240 -1.828051 0.636759
2 -0.788702 0.339318 0.301511 0.755873 2.031693 -0.870156 2.599818 0.285707 -0.682494 -0.377850
3 0.982841 1.060193 -0.621399 0.625299 0.452820 -0.267220 1.750208 1.066491 1.241325 -1.026987
4 1.139275 -0.640392 -0.709819 -0.057175 0.822886 -0.936773 0.596782 -1.472352 1.040772 0.276510
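
As a quick sanity check (my addition), the scaled features should now have a mean of roughly zero and a standard deviation of roughly one in every column:

# Each scaled column should have mean ~0 and standard deviation ~1.
print(data_feat.mean().round(6))  # all ~0
print(data_feat.std().round(2))   # all ~1.0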

Splitting data into train and test sets

from sklearn.model_selection import train_test_split
X = data_feat
y = data['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
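
A small variation worth knowing (my addition): train_test_split also accepts a stratify argument, which keeps the class proportions identical in both splits:

# Stratified variant: the 0/1 balance in train and test mirrors the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)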

KNN model deployment.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')
predictions = knn.predict(X_test)

Model Evaluation

from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))
[[134   8]
 [ 11 147]]
____________________________________________________________
              precision    recall  f1-score   support

           0       0.92      0.94      0.93       142
           1       0.95      0.93      0.94       158

   micro avg       0.94      0.94      0.94       300
   macro avg       0.94      0.94      0.94       300
weighted avg       0.94      0.94      0.94       300

This gives an accuracy of 94%.
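
That figure follows directly from the confusion matrix: 134 + 147 = 281 correct predictions out of 300 test observations, i.e. 281/300 ≈ 0.94.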

Using the elbow method to improve the model.

This process aims to extract more information by choosing a better k value. It iterates over many different k values and plots their error rates, making it easy to see which k has the lowest error rate.

errorRate = []

for kvalue in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=kvalue)
    knn.fit(X_train,y_train)
    predictions = knn.predict(X_test)
    errorRate.append(np.mean(predictions != y_test)) # average error rate
plt.figure(figsize=(10,6))
plt.plot(range(1,40),errorRate,color = "blue",linestyle = "dashed",marker = 'o')
[<matplotlib.lines.Line2D at 0x1a2528e518>]

(figure: error rate for k = 1 to 39)
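
To read the best k off programmatically rather than by eye (my addition):

# k runs from 1 to 39, so list index 0 corresponds to k = 1.
best_k = int(np.argmin(errorRate)) + 1
print(best_k, errorRate[best_k - 1])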

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)  # recompute predictions with k=15
print(confusion_matrix(y_test,predictions))
print("___"*20)
print(classification_report(y_test,predictions))

Retraining with k = 15 gives a small improvement in accuracy.
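
As an alternative to the manual elbow loop (my addition, a sketch assuming the same X_train and y_train), scikit-learn can pick k by cross-validation:

from sklearn.model_selection import GridSearchCV

# Choose k by 5-fold cross-validation on the training set.
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': list(range(1, 40))},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)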
