Spherical X-Means

This is a proof of concept of a spherical version of X-Means, the algorithm outlined in "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Pelleg and Moore, CMU.

Distributed representations of documents tend to cluster similar texts along certain directions. Spherical X-Means can hence be used to cluster and identify topics/themes in document vector space. This algorithm is validated on a sample of normalized document vectors generated by training Gensim Doc2Vec on a larger corpus of news articles.

Algorithmic Updates

  1. The algorithm uses Spherical K-means instead of vanilla K-Means to cluster data on the unit hypersphere.

  2. At each level, the algorithm fits "maxClusters" Spherical K-means models and selects one of them based on their BIC scores.

  3. Arc length is the measure of distance between two points on a sphere. On a unit sphere, the arc length between two points equals the angle between them (in radians). Hence, the algorithm uses the Sum of Squared Angles (in degrees) instead of the Sum of Squared Errors to compute BIC.

  4. The algorithm takes a depth-first approach to clustering the data. Leaf nodes are formed when

    • the model with K = 1 is selected, or
    • the number of data points to be clustered is less than "maxClusters".
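
Points 1 and 3 above can be sketched as follows. This is a minimal illustration assuming NumPy, not the code in this repository: the function names spherical_kmeans and sum_squared_angles, their parameters, and the synthetic data are all hypothetical, and the BIC computation itself is not shown.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster unit-norm rows of X by cosine similarity (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the center with the largest cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Mean direction of the members, projected back onto the sphere.
                s = members.sum(axis=0)
                new_centers[j] = s / max(np.linalg.norm(s), 1e-12)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = np.argmax(X @ centers.T, axis=1)
    return centers, labels


def sum_squared_angles(X, centers, labels):
    """Sum of squared angles (in degrees) between each point and its center."""
    cos = np.clip(np.sum(X * centers[labels], axis=1), -1.0, 1.0)
    return float(np.sum(np.degrees(np.arccos(cos)) ** 2))


# Made-up data: two tight groups around orthogonal directions on the unit sphere.
rng = np.random.default_rng(0)
A = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((20, 3))
B = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((20, 3))
X = np.vstack([A, B])
X /= np.linalg.norm(X, axis=1, keepdims=True)

centers_k1, labels_k1 = spherical_kmeans(X, k=1)
centers_k2, labels_k2 = spherical_kmeans(X, k=2)
cost_k1 = sum_squared_angles(X, centers_k1, labels_k1)
cost_k2 = sum_squared_angles(X, centers_k2, labels_k2)
print(cost_k1, cost_k2)
```

On data like this, the angular cost for K = 2 should be far lower than for K = 1; a BIC-style score would weigh that improvement against the extra model parameters.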

Usage

# Find appropriate number of clusters in data array X (n_examples x n_features)
from XMeans import XMeansTraining

Centers,Labels = XMeansTraining(X,maxBranching = 5,norm = True) # set norm = False if X is not normalized
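
The norm flag above depends on whether the rows of X already have unit length. A minimal sketch, assuming NumPy, for checking this and normalizing X if needed; is_unit_normalized and l2_normalize are hypothetical helpers for illustration and are not part of XMeans.

```python
import numpy as np


def is_unit_normalized(X, tol=1e-6):
    """Return True if every row of X has L2 norm within tol of 1."""
    return bool(np.all(np.abs(np.linalg.norm(X, axis=1) - 1.0) < tol))


def l2_normalize(X):
    """Return a copy of X with each row scaled to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return X / np.clip(norms, 1e-12, None)


X_raw = np.array([[3.0, 4.0], [0.0, 2.0]])
X_unit = l2_normalize(X_raw)
print(is_unit_normalized(X_raw))   # False
print(is_unit_normalized(X_unit))  # True
```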

Example

"sampleData.json" provides 150-dimensional normalized document vectors of 11957 news articles along with their titles. Refer to "Example.ipynb" for implementation details. Spherical X-Means produces 26 clusters on this data. The "clusterID.txt" files in "./Clusters" contain the titles of the articles clustered together.
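
Grouping titles by cluster label, as in the per-cluster files, can be sketched as below. This is an illustration only: group_titles_by_cluster is a hypothetical helper, and the titles and labels are made up rather than taken from "sampleData.json".

```python
from collections import defaultdict


def group_titles_by_cluster(titles, labels):
    """Map each cluster ID to the list of titles assigned to it."""
    clusters = defaultdict(list)
    for title, label in zip(titles, labels):
        clusters[int(label)].append(title)
    return dict(clusters)


# Made-up titles and labels, for illustration only.
titles = ["Rocket launch", "New phone released", "Satellite reaches orbit"]
labels = [3, 7, 3]
grouped = group_titles_by_cluster(titles, labels)
print(grouped[3])  # ['Rocket launch', 'Satellite reaches orbit']
```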

A 3D t-SNE visualization of the clustered document vectors:

Hits

Some of the clusters whose themes could be identified clearly were:

| ClusterID | Titles | Themes |
|---|---|---|
| 0 | India's annual mean temperature rises by 1.2 degree since 1901 | Health |
| | Menopause is not a pause in your life | |
| | New disposable patch can help detect sleep apnea | |
| | How You Can Save More Lives By Eating Beef Instead of Chicken | |
| | Cycle your way to a healthier you | |
| 3 | Countdown begins for ISRO's GSLV-Mark III rocket launch - ANI News | Space |
| | ISRO launches India's heaviest rocket GSLV-Mk III from Sriharikota : Oneindia News | |
| | ISRO ready to power the Monster Rocket | |
| | NASA to launch new manned Mars rovers in 2020 | |
| | India's Heaviest Rocket GSLV-MkIII D1 & Heaviest Satellite GSAT-19 Successfully Launched; 4 Reasons It's A Big Deal | |
| 7 | Apple's iOS 11 Files app pops up on App Store ahead of WWDC | Smart Phones |
| | New iOS, Siri Speakers & More Expected at Apple's WWDC Tonight | |
| | Samsung Galaxy S8 'Pirates of the Caribbean' edition goes on sale; priced around Rs 56,700 | |
| | Moto X Play gets Android 7.1.1 Nougat in India; is it soak test or public roll-out? | |
| | Sony Xperia XA1 Ultra listed on the Sony India website; might launch soon | |
| 9 | England vs New Zealand, ICC Champions Trophy 2017: Eoin Morgan says England 'completely different' since 2015 World Cup | ICC Champions Trophy |
| | ICC Champions Trophy: Bangladesh Team heckled at Iftar Party in England : Oneindia News | |
| | India vs Pakistan: Vijay Goel congratulates Indian cricket team | |
| | Sensors Are Being Used In Bats In The Champions Trophy. Here's Everything You Need To Know About Them | |
| | ICC Champion trophy: Vijay Mallya spotted with Sunil Gavaskar during Ind VS Pak match : Oneindia News | |
| 12 | Rohr, Eagles skipper meet NFF bosses over cash | Football |
| | Super Eagles Tormentor Rantie, Six Other Bafana Stars Doubtful For Uyo Clash | |
| | Real Madrid: 'Insatiable' Real hailed masters of Europe | |
| | La Liga: Girona promoted to Spanish top flight for first time | |
| | Bhaichung backs franchise fee waiver for EB, MB | |
| 14 | After Padmavati, SLB and Salman will collaborate for a film | Bollywood |
| | JUST HOT! Prabhas' New Look Goes Viral; Anushka Shetty & Pooja Hegde's FIGHT For Saaho Continues! | |
| | Munna Michael trailer: 5 moments that prove Tiger Shroff-Nawazuddin Siddiqui have arrived with something exciting | |
| | Vidya Balan and her father to join hands with Farhan Akhtar | |
| | MOM APPROVED! Sara Ali Khan On A Dinner Date With Sushant Singh Rajput; Amrita Singh Joins Them Too | |
| 24 | Saudi shuts Al-Jazeera office in Qatar row: Ministry | Qatar Crisis |
| | Saudi Arabia, Bahrain, Egypt, UAE cut ties with Qatar for supporting terror groups | |
| | Lai Mohammed: 'Those behind fake news will be prosecuted', Minister says | |
| | Emirates' flights connecting Doha suspended after Saudi-Qatar row | |
| | Qatar denounces 'unjustified' cut of Gulf tiesQatar slams Saudi Arabia, UAE, Egypt's decision to cut diplomatic ties, calls it 'campaign of lies' | |

Misses

The following are some clusters whose themes were difficult to name:

| ClusterID | Titles | Probable Themes |
|---|---|---|
| 2 | Bangalore's free Wi-Fi services to entertain bus traveler during their journey | Current Affairs |
| | Trump Effect: IT Employees express interest in Agriculture | |
| | Prime Minister Narendra Modi's Tweets On World Environmental Day | |
| | Will it be the end of Bengaluru's iconic cinema house, Fame Shankarnag? | |
| | Railways to reduce dependence on diesel | |
| | Joseph Muscat: Labour party leader sworn in as Malta's prime minister | |
| 21 | Hyderabad: After Sonu Nigam, city complaints of Noise pollution | Environment |
| | Shiv Sena slams Maharashtra govt over farmers' stir | |
| | Kerala plants one crore saplings on environment day | |
| | MCC observes World Environment Day | |
| | State moots to make Aadhar mandatory for organ donors? | |

Contributors

elisa-aleman, sagarkurandwad
