EDBSCAN

Enforced Density-Based Spatial Clustering of Applications with Noise.

This package is an extension on the DBSCAN algorithm to enable for pre-labeled data points in the clusters, or in other words to enforce certain cluster values and splits. It mimics the scikit-learn implementation of sklearn.cluster.DBSCAN.

Installation

You can either install the package through PyPI:

pip install edbscan

Or via this repository directly:

pip install git+https://github.com/RubenPants/EDBSCAN.git

Usage

The image below shows you the result of EDBSCAN on a given input. The image on the left shows you the raw input data, together with the few labeled samples. The image on the right shows the clusters found by EDBSCAN, where the light blue dots represent the detected noise.

# Load in the data
import numpy as np
data = np.load(open('data.npy'))
print(data.shape)  # (220, 2)
y = np.load(open('y.npy'))
print(y.shape)  # (220, )
print(y)  # array([None, None, …, -1, None, …, 0, None, …, 1, …], dtype=object)

# Run the algorithm
from edbscan import edbscan
core_points, labels = edbscan(X=data, y=y)
print(labels)  # array([-1, 2, 2, 4, -1, -1, 6, 3, 4, …])

As shown in the code snippet above, aside from the raw data (data), a target vector y is provided. This vector indicates the known (labeled) clusters. A None cluster label are those not yet known, that need to get clustered by the EDBSCAN algorithm.

For more detailed usages, see the notebooks present in the examples/ folder.

How EDBSCAN works

There are three concepts that define how EDBSCAN operates:

The DBSCAN algorithm on which this algorithm is based on, read the paper or the scikit-learn documentation for more.
Semi-supervised annotations, represented by the y vector in the Usage section. This vector contains three types of values:
- None if the given sample is not known to belong to a specific cluster and needs to get labeled by the EDBSCAN algorithm
- -1 if the given sample is known to be noise
- 0..N if the given sample is known to belong to cluster 0..N
Where DBSCAN expands its clusters in a FIFO fashion, will EDBSCAN expand its clusters in a most dense first fashion. In other words, the items that have the most detected nearest neighbours get expanded first. By doing so, the denser areas get assigned a cluster faster. This prevents two dense cluster that are near each other from merging if they are already assigned a different label.

Comparison

This section compares EDBSCAN to (1) other clustering algorithms as DBSCAN and HDBSCAN, and (2) on different clustering benchmarks.

1. DBSCAN, HDBSCAN, and EDBSCAN

This section compares the behaviour of the DBSCAN algorithm, the HDBSCAN and the EDBSCAN algorithm on the data shown in the Usage section. The input data looks as follows:

In each of the clustered results, light-blue data represents the detected noise.

Some observations on the DBSCAN result:

Green combined two clusters that should be separated
Purple combined two clusters that should be separated
Brown identified noise as a cluster

Some observations on the HDBSCAN result:

Yellow and Grey are now successfully separated
Brown and Pink are now successfully separated
Purple identified noise as a cluster

Some observations on the EDBSCAN result:

Grey and Orange are now successfully separated
Brown and Pink are now successfully separated
The noise that was previously detected as a cluster is now successfully identified as noise

2. Scikit-learn cluster benchmark

The following images show the results of the EDBSCAN algorithm on different scikit-learn clustering benchmarks.

rubenpants / edbscan Goto Github PK

edbscan's Introduction

EDBSCAN

Installation

Usage

How EDBSCAN works

Comparison

1. DBSCAN, HDBSCAN, and EDBSCAN

2. Scikit-learn cluster benchmark

edbscan's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent