agx01 / iris_kmeans

Implementing K-Means Clustering on the Iris data set using Euclidean and Manhattan metrics

License: Apache License 2.0

Python 100.00%
k-means-implementation-in-python k-means-clustering python3 iris-dataset iris-classification

K-Means Clustering on Iris Dataset

Problem Statement

Predicting species from the Iris data set is a classic example of a classification problem. We implement the K-Means clustering algorithm on this dataset and validate the accuracy of our model against the actual species labels.

Strategy

K-Means is initialized by using the first 3 records of the X values as centroids. We then calculate the distance from each point to every centroid and mark each record with its closest centroid. Next, we recompute each centroid as the mean of its cluster, recalculate the distances, and re-create the clusters based on the closest centroids. The change in the distances gives the error value; when the error reaches zero, the clusters have stopped changing and become the final clusters. We repeat this process with different metrics for the distance between the centroids and the sample points. For this project, I am using:

  1. Euclidean Distance
  2. Manhattan Distance
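
The loop described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's actual code: the function names `pairwise_distances` and `kmeans`, the `max_iter` cap, and the choice to measure the error as total centroid movement are all assumptions made for the example.

```python
import numpy as np

def pairwise_distances(X, centroids, metric="euclidean"):
    """Return an (n_samples, k) matrix of point-to-centroid distances."""
    diff = X[:, None, :] - centroids[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    raise ValueError(f"unknown metric: {metric}")

def kmeans(X, k=3, metric="euclidean", max_iter=100):
    """Hypothetical helper: K-Means seeded with the first k records of X."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()  # first k records as initial centroids
    for _ in range(max_iter):
        # Assign every record to its closest centroid.
        labels = pairwise_distances(X, centroids, metric).argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Assumed error measure: total movement of the centroids.
        error = np.abs(new_centroids - centroids).sum()
        centroids = new_centroids
        if error == 0:  # clusters no longer change
            break
    return labels, centroids
```

Calling `kmeans(X, k=3, metric="manhattan")` runs the same loop with the second metric; only the distance function changes.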

Folders:

data - the Iris dataset is stored in this folder

Choosing the right K value

For our experiment, we know to use 3 clusters because that is the number of classes in the dataset. In a real scenario, however, the number of groups in the data is not known in advance, so we use the Elbow method to choose the number of clusters.

For this, we run the sklearn KMeans algorithm for a range of cluster counts and measure the WCSS (within-cluster sum of squares) value for each.

Figure: The Elbow Method
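
A minimal sketch of the WCSS sweep might look like the following. It assumes scikit-learn is installed and loads the data with `load_iris` rather than from the `data` folder; the helper name `elbow_wcss` and the `max_k` and `seed` parameters are illustrative. scikit-learn exposes the WCSS of a fitted model as the `inertia_` attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

def elbow_wcss(X, max_k=10, seed=0):
    """Hypothetical helper: fit KMeans for k = 1..max_k and collect the WCSS."""
    wcss = []
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        wcss.append(model.inertia_)  # within-cluster sum of squares
    return wcss

X = load_iris().data  # assumption: bundled copy of the dataset
for k, value in enumerate(elbow_wcss(X), start=1):
    print(f"k={k}: WCSS={value:.2f}")
```

Plotting these values against k and looking for the point where the curve flattens gives the elbow.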

K-Means Clustering

The K-Means clustering algorithm is used to find natural groups in the data. The training method here is very simple, as it only stores the data. The predict method, however, is compute-intensive because it calculates the distances between the points and the centroids repeatedly.

Main challenges of K-means algorithms:

  1. Picking the right centroids
  2. Picking the right number of clusters

Results

Euclidean Distance Metric

Using the Euclidean distance metric, we get the following results. Plotting box plots of the actual classes and the predicted clusters shows which cluster corresponds to which label, and we use that relation as the mapping.

Species          Cluster Number
Iris-setosa      Cluster 2
Iris-virginica   Cluster 0
Iris-versicolor  Cluster 1

Actual vs Predicted Values Box Plot for the Features

Figure: Box Plot of Features (Euclidean Distance)

For this metric, we get an accuracy of 86.667%.

Figure: Results of Euclidean Metric
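
One way to turn a cluster-to-species mapping into an accuracy figure is sketched below. The function name `mapped_accuracy` and the toy inputs are made up for illustration; only the mapping dictionary mirrors the Euclidean table above, and the toy data is not taken from the Iris results.

```python
def mapped_accuracy(cluster_ids, true_species, cluster_to_species):
    """Hypothetical helper: share of records whose mapped cluster matches the label."""
    predicted = [cluster_to_species[c] for c in cluster_ids]
    correct = sum(p == t for p, t in zip(predicted, true_species))
    return correct / len(true_species)

# Mapping read off the box plots (Euclidean run).
euclidean_mapping = {2: "Iris-setosa", 0: "Iris-virginica", 1: "Iris-versicolor"}

# Toy inputs for illustration only, not real results.
toy_clusters = [2, 2, 1, 0, 0, 1]
toy_species = [
    "Iris-setosa", "Iris-setosa", "Iris-versicolor",
    "Iris-virginica", "Iris-versicolor", "Iris-versicolor",
]
print(mapped_accuracy(toy_clusters, toy_species, euclidean_mapping))
```

The same helper works for the Manhattan run by passing that run's mapping instead.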

Manhattan Distance Metric

Using the Manhattan distance metric, we get the following results. Plotting box plots of the actual classes and the predicted clusters shows which cluster corresponds to which label, and we use that relation as the mapping.

Species          Cluster Number
Iris-setosa      Cluster 2
Iris-virginica   Cluster 1
Iris-versicolor  Cluster 0

Actual vs Predicted Values Box Plot for the Features

Figure: Box Plot of Features (Manhattan Distance)

For this metric, we get an accuracy of 86.667%.

Figure: Results of Euclidean Metric
