Giter Site home page Giter Site logo

chaitanyakasaraneni / textclustering Goto Github PK

View Code? Open in Web Editor NEW
0.0 0.0 1.0 2.52 MB

Clustering the given text data using a variation of DBSCAN

Jupyter Notebook 51.58% Python 48.42%
dbscan bisecting-kmeans-clustring bisecting-kmeans dbscan-clustering

textclustering's Introduction

Text Clustering

Description

In this program, we will be implementing a variation of DBSCAN Clustering Algorithm from the scratch. For this, there are two steps involved:

  1. We'll cluster the input data using Sklearn's clustering methods such that we obtain > 100 clusters.
  2. In the second step, we'll implement a variation of the DBSCAN clustering algorithm that takes input as the clusters, rather than individual points. You can use any method to compute inter-cluster distances to figure out which points are core, border, or noise points

After that, assign each of the instances in the input data to K clusters identified from 1 to K. All objects in the training data set must be assigned to a cluster. Thus, you can either assign all noise points to cluster K+1 or apply post-processing after DBSCAN and assign noise points to the closest cluster.

Data Description

Input data (provided as training data) consists of 8580 text records in sparse format. No labels are provided.

Overview of the implementation

Usually DBSCAN doesn't perform well on very large datasets but performs well on smaller ones. Whereas k-Means is a simple algorithm that performs well on smaller as well as larger datasets. In this program, even the dataset is not that large, I used the k-means for the initial clustering. After getting the clusters, I passed the cluster centers to my variation of DBSCAN algorithm, computed the distances and based on the eps and minPts conditions divided the core, border and noise points. After the formation of these points, I calculated the distances from the other points in the dataset and merged them to the nearest clusters. The noise points were assigned to the K+1 cluster.

My Implementation for the DBSCAN

For the first step, I used the MiniBatchKMeans from sklearn.cluster.

from sklearn.cluster import MiniBatchKMeans


kmeans = MiniBatchKMeans(n_clusters=200,random_state = 0)
kmeans.fit(csrL2Normalized)

In the second step, I implemented my own version of DBSCAN, which is not so effificient and has an NMI (Normalized Mutual Information Score) of 0.4223

# Finding Core points
for i in range(len(points)):
        neighbors = []
        for p in range(0, len(points)):
            # If the distance is below eps, p is a neighbor
            if sp.spatial.distance.cosine(points[i] ,points[p]) <= eps:
                neighbors.append(p)
        neighborhoods.append(neighbors)
        # If neighborhood has at least minPts, i is a core point
        if len(neighbors) >= minPts :
            core.append(i)
            
# Finding Border Points
for i in range(len(points)):
        neighbors = neighborhoods[i]
        # Look at points that are not core points
        if len(neighbors) < minPts:
            for j in range(len(neighbors)):
                # If one of its neighbors is a core, it is also in the core point's neighborhood, 
                # thus it is a border point rather than a noise point
                if neighbors[j] in core:
                    border.append(i)
                    # Need at least one core point...
                    break
# Finding noise points
for i in range(len(points)):
    if i not in core and i not in border:
        noise.append(i)   

Note:

  • You can use any implementation to get the better NMI and NMI is an external evaluation metric, you can use any internal evaluation metric such as calinski_harabaz_score from sklearn.metrics.
  • In the bisecting k-means, I commented out the last part purposefully. While running that part of the code, make sure you close all other programs/applications as it may return a Memory Error.

Advantages

As stated above, DBSCAN doesn't perform well on very large datasets but performs well on smaller ones. Whereas k-Means is a simple algorithm that perform well on smaller as well as larger datasets. Hence, this program helps to cluster large datasets (and you can try your own implementations, this is just a very basic one).

References

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.