Giter Site home page Giter Site logo

nlpub / watset-java Goto Github PK

View Code? Open in Web Editor NEW
28.0 4.0 6.0 14 MB

An implementation of the Watset clustering algorithm in Java.

Home Page: https://doi.org/10.1162/COLI_a_00354

License: Apache License 2.0

Java 99.91% Dockerfile 0.09%
clustering nlp community-detection graph-clustering watset chinese-whispers maxmax markov-clustering semantics fuzzy-clustering graph-embeddings embeddings

watset-java's Introduction

Watset

This is an open source implementation of the Watset algorithm for fuzzy graph clustering (aka soft clustering). This implementation is multi-threaded, extensible, contains several unit tests, and can be easily embedded into any Java-based graph processing pipeline. Also, it offers a convenient command-line tool for running Watset over the edge lists in text files.

Watset

This package, written in Java, also includes implementations of the Chinese Whispers (2006), Markov Clustering (2000), and MaxMax (2013) graph clustering algorithms.

Build Status Maven Central Maintainability Javadoc

Usage

There are three ways to obtain watset-java:

  1. download the recently released version (look at the watset.jar file in the Assets section);

  2. compile the master branch from source by cloning the repository and running mvn package in the repository root (or mvn package -Dshade to build a dependency-less jar file as in the previous option);

  3. get it from Maven Central, GitHub Packages, or JitPack.

This tool has several kinds of activity titled the commands. Such commands as watset, cw, mcl, and maxmax invoke different clustering algorithms, while senses invokes graph-based word sense induction.

There are two global command-line arguments: -i (or --input) that specifies the path of input file and -o (or --output) that specifies the path of the output file. The default values for these parameters are STDIN and STDOUT, correspondingly.

The input for this tool is an undirected weighted graph represented in the ABC format. This format is a tab-separated edge list: word1 word2 weight. The output of the graph clustering algorithm is a tab-separated file containing three columns: the cluster identifier, the cluster size, and the list of elements belonging to the cluster.

a	b	1
b	c	0.5
c	d	2

The default output format is a tab-separated file, each line of which contains three elements: id, size and a comma-and-space-separated list of elements.

In fact, watset-java is not only a tool, it also features a complete API for graph clustering. This API can easily be embedded into your application or library, see examples. The complete Javadoc is available at https://nlpub.github.io/watset-java/.

Dummy Clustering

As the baseline, it is possible to use the following four dummy clustering “algorithms”:

  • empty: the empty output regardless of the input;
  • singleton: each node of the input graph is a different cluster;
  • together: one cluster that keeps all the elements of the input graph;
  • components: each cluster is a connected component of the input graph.
$ java -jar watset.jar -i graph.txt -o output.tsv singleton

k Spanning Tree Clustering

The k spanning tree clustering algorithm performs graph single-linkage clustering using minimal spanning trees.

$ java -jar watset.jar -i graph.txt -o output.tsv kst -k 2

Spectral Clustering

The spectral clustering algorithm performs clustering of the graph's spectral embedding with the k-means algorithm.

$ java -jar watset.jar -i graph.txt -o output.tsv spectral -k 2

Since this algorithm computes node embeddings, it is possible to output them in the well-known textual Word2Vec format.

$ java -jar watset.jar -i graph.txt -o output.w2v embed -k 2

Chinese Whispers

Chinese Whispers (CW) is a hard clustering algorithm that resembles a popular children's game. This tool offers three different variations of this algorithm that can be set using the -m (--mode) option:

  • top: the node weighting approach that chooses the label with the highest total edge weight in the neighborhood;
  • lin: the top mode divided by the node degree (also known as nolog in some references);
  • log: the top mode divided by the logarithm of the node degree.
$ java -jar watset.jar -i graph.txt -o output.tsv cw -m top

Markov Clustering

Markov Clustering (MCL) is a hard clustering algorithm that simulates random walks on the graph. It is possible to specify two options of this algorithm:

  • the number of expansion operations -e (the default value is 2);
  • the power coefficient -r (the default value is also 2).
$ java -jar watset.jar -i graph.txt -o output.tsv mcl -e 2 -r 2

This implementation is not optimized, so the processing of large graphs will likely be quite slow. So, for large graphs it is recommended to use the original implementation of the MCL algorithm, which is written in C and thus is really fast.

$ java -jar watset.jar -i graph.txt -o output.tsv mcl-bin -r 2 --bin /path/to/mcl

Watset

Watset is a local-global meta-algorithm for fuzzy graph clustering. It builds an intermediate undirected graph that addresses the element ambiguity by considering different senses of each element in the input graph. This package offers the best-performing variation of Watset known as Simplified Watset.

Watset requires two different clustering algorithms to be selected: the local clustering algorithm for sense induction that is specified using -l (--local) and the global algorithm for the final clustering that is specified using -g (--global). It is possible to configure the algorithms using the similar options as shown above using the -lp (--local-params) and -gp (--global-params) options, which can be inserted multiple times.

$ java -jar watset.jar -i graph.txt -o output.tsv watset -l mcl -lp e=1 -lp r=3 -g cw -gp mode=lin

In practice, the default parameters for MCL and CW work well enough, so the -lp and -gp arguments can be omitted in this example.

Watset: Word Sense Induction

Since Watset performs curvature-based word sense induction, it is possible to extract the built sense inventory using the special command of this tool.

$ java -jar watset.jar -i graph.txt -o inventory.tsv senses -l mcl

The output of this operation is a tab-separated file containing three columns: the target word, the word sense identifier, and the set of related disambiguated senses in the given word sense (the context). Each sense is represented as word#id.

Watset: Word Sense Graph

It is also possible to print the intermediate word sense graph built by Watset.

$ java -jar watset.jar -i graph.txt -o sense-graph.txt graph -l mcl

The output file will be written virtually in the same ABC format as the input graph, but each node will be provided with a numerical sense identifier preceded by the suffix #. This feature simplifies the integration into other graph processing pipelines.

Watset: Word Sense Graph Embeddings

It might be useful to obtain spectral embeddings of the Watset sense graph nodes in the Word2Vec textual format.

$ java -jar watset.jar -i graph.txt -o sense-graph.w2v embedsenses -l mcl -k 2

MaxMax

MaxMax is a soft clustering algorithm for undirected graphs that constructs intermediate representation of the input graph as the directed graph. This algorithm has no options.

$ java -jar watset.jar -i graph.txt -o output.tsv maxmax

Evaluation

This package provides implementations of two fuzzy clustering evaluation approaches: pairwise F-score and normalized modified purity. Given the set of clusters to evaluate and the set of gold clusters (“classes”), it is possible to study the clustering quality.

$ java -jar watset.jar -i output.tsv pairwise -g gold.tsv
$ java -jar watset.jar -i output.tsv purity -g gold.tsv -n -m

Citation

@article{Ustalov:19:cl,
  author    = {Ustalov, Dmitry and Panchenko, Alexander and Biemann, Chris and Ponzetto, Simone Paolo},
  title     = {{Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction}},
  journal   = {Computational Linguistics},
  year      = {2019},
  volume    = {45},
  number    = {3},
  pages     = {423--479},
  doi       = {10.1162/COLI_a_00354},
  publisher = {MIT Press},
  issn      = {0891-2017},
  language  = {english},
}

Tutorial

Watset will be featured in the Graphs, Computation, and Language class at the 33rd European Summer School in Logic, Language and Information (ESSLLI 2022)!

Contributing

  1. Fork it;
  2. Create your feature branch (git checkout -b my-new-feature);
  3. Commit your changes (git commit -am 'Added some feature');
  4. Push to the branch (git push origin my-new-feature);
  5. Create new Pull Request.

Copyright

Copyright (c) 2016–2022 Dmitry Ustalov. See LICENSE for details.

watset-java's People

Contributors

dependabot[bot] avatar dustalov avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.