
kaytony / single-pass-clustering-for-chinese-text


This project forked from 173325533/single-pass-clustering-for-chinese-text


Topic clustering for Chinese text, using the single-pass clustering algorithm.

License: MIT License

Python 100.00%

single-pass-clustering-for-chinese-text's Introduction

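# single-pass-clustering-for-chinese-text

For topic clustering, the single-pass clustering algorithm is more effective than K-means. It does not require the number of clusters to be specified in advance, and the similarity threshold controls how large the resulting clusters are.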

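Single-pass clustering is an incremental clustering algorithm. Each document flows through the algorithm only once, hence the name single-pass, which makes it more efficient than algorithms such as K-means or KNN.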

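The single-pass algorithm processes documents sequentially. The first document serves as a seed and establishes a new topic. Each subsequent document is compared with the existing topics and is added to the topic with which it has the highest similarity, provided that similarity exceeds a given threshold. If its similarity to every existing topic is below the threshold, the document becomes a new cluster seed and establishes a new topic category. The algorithm proceeds as follows: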

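(1) Use the first document as a seed to create a topic;

(2) Vectorize document D, for example with VSM (vector space model) or doc2vec;

(3) Compute the similarity between document D and every existing topic;

(4) Find the existing topic with the highest similarity to document D;

(5) If that similarity is greater than the threshold thres, add document D to the topic with the highest similarity and jump to (7);

(6) If that similarity is less than the threshold thres, document D does not belong to any existing topic: create a new topic category and assign the current document to it;

(7) Clustering for this document is complete; wait for the next document.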
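
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code, and it rests on several assumptions: `vectorize`, `cosine`, and `single_pass` are hypothetical names; documents are vectorized as character-bigram term counts (a crude VSM stand-in, where a real pipeline would more likely use word segmentation or doc2vec); each topic is represented by the sum of its members' vectors; and similarity is cosine similarity. A similarity exactly equal to `thres` is treated here as "not greater", so it starts a new topic.

```python
import math
from collections import Counter


def vectorize(text):
    """Assumed VSM stand-in: term counts over character bigrams."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return Counter(chars)
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def single_pass(docs, thres):
    """Cluster docs in one pass; return topics as lists of document indices."""
    topics = []  # each topic: {"vec": summed member vectors, "members": [indices]}
    for i, text in enumerate(docs):
        vec = vectorize(text)                              # step (2)
        sims = [cosine(vec, t["vec"]) for t in topics]     # step (3)
        best = max(range(len(sims)), key=sims.__getitem__, default=None)  # step (4)
        if best is not None and sims[best] > thres:        # step (5)
            topics[best]["vec"].update(vec)                # fold D into the topic vector
            topics[best]["members"].append(i)
        else:                                              # steps (1) and (6)
            topics.append({"vec": Counter(vec), "members": [i]})
    return [t["members"] for t in topics]                  # step (7): ready for next doc


docs = [
    "苹果发布新款手机",
    "苹果新款手机今日发布",
    "国足昨晚赢得比赛",
    "国足赢得关键比赛",
]
print(single_pass(docs, 0.3))  # → [[0, 1], [2, 3]]
print(single_pass(docs, 0.5))  # → [[0, 1], [2], [3]]
```

Raising `thres` from 0.3 to 0.5 splits the second topic, which illustrates how the threshold, rather than a preset cluster count, determines cluster granularity. Because documents are processed in order and never revisited, the result can depend on the order of `docs`.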

single-pass-clustering-for-chinese-text's People

Contributors

howard-hou
