
darenr / wordnet-clusters


Clustering a set of word/tags using K-Means with word2vec or wordnet distance

License: MIT License

Python 100.00%
clustering tags k-means-clustering k-means-implementation-in-python word2vec wordnet-clusters

wordnet-clusters's Introduction

Tag Clustering using wordnet and word2vec distance metrics

Clusters a set of WordNet synsets using k-means. The pairwise WordNet distance (semantic relatedness) between word senses, computed with the edge-counting method of Wu & Palmer (1994), is mapped to Euclidean distance so that k-means can converge while preserving the original pairwise relationships.
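
As a rough illustration of the idea, the sketch below computes Wu & Palmer similarity on a tiny hand-made taxonomy and represents each word as a row of distances to every other word, so that rows can be compared with Euclidean distance. The taxonomy, the function names and the `1 - similarity` mapping are assumptions made for this example, not the code in generate-tag-clusters.py.

```python
# Toy taxonomy (hypothetical): child -> parent, with "entity" as the root.
PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # First node on b's path to the root that is also an ancestor of a.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def make_distance_rows(words):
    """Row i holds the distance (1 - similarity) from words[i] to every word."""
    return [[1.0 - wu_palmer(x, y) for y in words] for x in words]


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length rows."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5


words = ["dog", "cat", "car"]
rows = make_distance_rows(words)
```

In this toy example the rows for "dog" and "cat" end up closer to each other than either is to the row for "car", which is the property k-means relies on when it groups rows by Euclidean distance.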

Setting use_wordnet = False (instead of True) switches the distance metric between words to the word2vec similarity value from a GloVe model, glove.6B.300d_word2vec.txt (this file must be in the word2vec format).
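
The word2vec text format differs from the raw GloVe text format by a first line holding the vocabulary size and the vector dimension. A minimal sketch of adding that header is shown here; the function name `glove_to_word2vec` and the sample file names are made up for illustration, and the two sample vectors are not real GloVe data.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <vector_size>' header."""
    vocab_size = 0
    vector_size = 0
    # First pass: count the rows and read the dimension from the first row.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                vector_size = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header followed by the unchanged vectors.
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {vector_size}\n")
        with open(glove_path, encoding="utf-8") as src:
            for line in src:
                dst.write(line)
    return vocab_size, vector_size


# Demo on a tiny stand-in file (two 3-dimensional vectors, invented for this example).
with tempfile.TemporaryDirectory() as tmp:
    glove_path = os.path.join(tmp, "glove_sample.txt")
    w2v_path = os.path.join(tmp, "glove_sample_word2vec.txt")
    with open(glove_path, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    counts = glove_to_word2vec(glove_path, w2v_path)
    with open(w2v_path, encoding="utf-8") as f:
        header = f.readline().strip()
```

A file without this header makes a word2vec-format loader try to parse the first word as an integer, which matches the "invalid literal for int()" failure pattern.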

The extras folder contains proof-of-concept experiments.

To Use:

  • create a newline-delimited file with a list of wordnet senses (e.g. data/example_tags.txt)
  • to use wordnet set use_wordnet=True, to use word2vec set use_wordnet=False
  • python generate-tag-clusters.py data/example_tags.txt 25 0.7
    • 25 is the number of clusters to segment the list of wordnet senses into.
    • 0.7 is the similarity threshold; below this, words are considered not similar
  • results are placed in the results folder as a JSON file
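
The input handling and the threshold can be sketched roughly as follows. This is a hypothetical stand-in: the real script presumably validates senses against WordNet itself, whereas the pattern below only checks for names shaped like dog.n.01, and mapping below-threshold similarities to 0.0 is an assumption about what "not similar" means.

```python
import re

# Assumed shape of a WordNet sense name: lemma, part of speech, sense number.
SENSE_PATTERN = re.compile(r"^[\w'\-.]+\.[nvasr]\.\d{2}$")


def parse_senses(lines):
    """Split newline-delimited input into accepted and rejected entries."""
    accepted, rejected = [], []
    for raw in lines:
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        (accepted if SENSE_PATTERN.match(name) else rejected).append(name)
    return accepted, rejected


def apply_threshold(similarity, threshold):
    """Treat any similarity below the threshold as 'not similar' (0.0)."""
    return similarity if similarity >= threshold else 0.0


# Sample input lines, invented for this example.
accepted, rejected = parse_senses(["dog.n.01\n", "cat.n.01\n", "\n", "not a sense\n"])
```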


wordnet-clusters's Issues

Traceback error (invalid literal)

Hi

I am getting the following error:

* loaded 2 wordnet senses, 0 rejected
 * generating dataset...
 * loading wordvector model ( glove.6B.300d_word2vec.txt )...
Traceback (most recent call last):
  File "generate-tag-clusters.py", line 192, in <module>
    data, labels = make_data_matrix(words, t)
  File "generate-tag-clusters.py", line 139, in make_data_matrix
    wordvector.append(word_to_word_distance(word_x, word_y, t))
  File "generate-tag-clusters.py", line 128, in word_to_word_distance
    distances.append(wv(w1, w2, t / 2.5))
  File "generate-tag-clusters.py", line 66, in wv
    wvmodel = KeyedVectors.load_word2vec_format(modelFile, binary=False)
  File "C:\Users\SIN\.conda\envs\scrapy-env\lib\site-packages\gensim\models\keyedvectors.py", line 1498, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "C:\Users\SIN\.conda\envs\scrapy-env\lib\site-packages\gensim\models\utils_any2vec.py", line 344, in _load_word2vec_format
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
  File "C:\Users\SIN\.conda\envs\scrapy-env\lib\site-packages\gensim\models\utils_any2vec.py", line 344, in <genexpr>
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: 'the'

HOME error

Hi

I am trying to run the clustering but I got the following error:

  File "generate-tag-clusters.py", line 23, in <module>
    modelFile = os.environ['HOME'] + "/models/" + "glove.6B.300d_word2vec.txt"
  File "C:\Users\SIN\.conda\envs\scrapy-env\lib\os.py", line 669, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME' 

Could you help me with this issue?
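
One possible workaround, offered as an assumption rather than the maintainer's fix: the HOME environment variable is typically not set on Windows, so the model path could be built with os.path.expanduser instead. The variable names below are illustrative.

```python
import os

MODEL_NAME = "glove.6B.300d_word2vec.txt"

# os.path.expanduser("~") resolves the home directory on Windows as well
# as on Unix-like systems, whereas os.environ['HOME'] raises KeyError
# when the variable is not defined.
model_file = os.path.join(os.path.expanduser("~"), "models", MODEL_NAME)
```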
