Giter Site home page Giter Site logo

cwtsleiden / networkanalysis Goto Github PK

View Code? Open in Web Editor NEW
138.0 11.0 33.0 608 KB

Java package that provides data structures and algorithms for network analysis.

License: MIT License

Java 100.00%
network-analysis clustering-algorithm community-detection clustering java layout layout-algorithm leiden-algorithm louvain-algorithm mapping

networkanalysis's Introduction

networkanalysis

Build master branch License: MIT Latest release Maven Central version DOI

Introduction

This Java package provides algorithms and data structures for network analysis. Currently, the package focuses on clustering (or community detection) and layout (or mapping) of networks. In particular, the package contains an implementation of the Leiden algorithm and the Louvain algorithm for network clustering and the VOS technique for network layout. Only undirected networks are supported.

The networkanalysis package was developed by Nees Jan van Eck, Vincent Traag, and Ludo Waltman at the Centre for Science and Technology Studies (CWTS) at Leiden University.

Documentation

Documentation is provided in the source code in javadoc format. The documentation is also available in a compiled format.

Installation

Maven

<dependency>
    <groupId>nl.cwts</groupId>
    <artifactId>networkanalysis</artifactId>
    <version>1.3.0</version>
</dependency>

Gradle

implementation group: 'nl.cwts', name: 'networkanalysis', version: '1.3.0'

Usage

The networkanalysis package requires Java 8 or higher. The latest version of the package is available as a pre-compiled jar on Maven Central and GitHub Packages. Instructions for compiling the source code of the package are provided below.

To run the clustering algorithms, the command-line tool RunNetworkClustering is provided. The tool can be run as follows:

java -cp networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering

If no further arguments are provided, the following usage notice will be displayed:

RunNetworkClustering version 1.3.0
By Vincent Traag, Ludo Waltman, and Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University

Usage: RunNetworkClustering [options] <filename>

Identify clusters (also known as communities) in a network using either the
Leiden or the Louvain algorithm.

The file in <filename> is expected to contain a tab-separated edge list
(without a header line). Nodes are represented by zero-index integer numbers.
Only undirected networks are supported. Each edge should be included only once
in the file.

Options:
-q --quality-function {CPM|Modularity} (default: CPM)
    Quality function to be optimized. Either the CPM (constant Potts model) or
    the modularity quality function can be used.
-n --normalization {none|AssociationStrength|Fractionalization} (Default: none)
    Method for normalizing edge weights in the CPM quality function.
-r --resolution <resolution> (default: 1.0)
    Resolution parameter of the quality function.
-m --min-cluster-size <min. cluster size> (default: 1)
    Minimum number of nodes per cluster.
-a --algorithm {Leiden|Louvain} (default: Leiden)
    Algorithm for optimizing the quality function. Either the Leiden or the
    Louvain algorithm can be used.
-s --random-starts <random starts> (default: 1)
    Number of random starts of the algorithm.
-i --iterations <iterations> (default: 10)
    Number of iterations of the algorithm.
--randomness <randomness> (default: 0.01)
    Randomness parameter of the Leiden algorithm.
--seed <seed> (default: random)
    Seed of the random number generator.
-w --weighted-edges
    Indicates that the edge list file has a third column containing edge
    weights.
--sorted-edge-list
    Indicates that the edge list file is sorted. The file should be sorted based
    on the nodes in the first column, followed by the nodes in the second
    column. Each edge should be included in both directions in the file.
--input-clustering <filename> (default: singleton clustering)
    Read the initial clustering from the specified file. The file is expected to
    contain two tab-separated columns (without a header line), first a column of
    nodes and then a column of clusters. Nodes and clusters are both represented
    by zero-index integer numbers. If no file is specified, a singleton
    clustering (in which each node has its own cluster) is used as the initial
    clustering.
-o --output-clustering <filename> (default: standard output)
    Write the final clustering to the specified file. If no file is specified,
    the standard output is used.

To run the layout algorithm, the command-line tool RunNetworkLayout is provided. The tool can be run as follows:

java -cp networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkLayout

If no further arguments are provided, the following usage notice will be displayed:

RunNetworkLayout version 1.3.0
By Nees Jan van Eck and Ludo Waltman
Centre for Science and Technology Studies (CWTS), Leiden University

Usage: RunNetworkLayout [options] <filename>

Determine a layout for a network using the gradient descent VOS layout
algorithm.

The file in <filename> is expected to contain a tab-separated edge list
(without a header line). Nodes are represented by zero-index integer numbers.
Only undirected networks are supported. Each edge should be included only once
in the file.

Options:
-q --quality-function {VOS|LinLog} (default: VOS)
    Quality function to be optimized. Either the VOS (visualization of
    similarities) or the LinLog quality function can be used.
-n --normalization {none|AssociationStrength|Fractionalization} (Default: none)
    Method for normalizing edge weights in the VOS quality function.
-a --attraction <attraction> (Default: 2)
    Attraction parameter of the VOS quality function.
-r --repulsion <repulsion> (Default: 1)
    Repulsion parameter of the VOS quality function.
-s --random-starts <random starts> (default: 1)
    Number of random starts of the gradient descent algorithm.
-i --max-iterations <max. iterations> (default: 1000)
    Maximum number of iterations of the gradient descent algorithm.
--initial-step-size <initial step size> (default: 1.0)
    Initial step size of the gradient descent algorithm.
--min-step-size <min. step size> (default: 0.001)
    Minimum step size of the gradient descent algorithm.
--step-size-reduction <step size reduction> (default: 0.75)
    Step size reduction of the gradient descent algorithm.
--required-quality-value-improvements <required quality value improvements>
        (default: 5)
    Required number of quality value improvements of the gradient descent
    algorithm.
--seed <seed> (default: random)
    Seed of the random number generator.
-w --weighted-edges
    Indicates that the edge list file has a third column containing edge
    weights.
--sorted-edge-list
    Indicates that the edge list file is sorted. The file should be sorted based
    on the nodes in the first column, followed by the nodes in the second
    column. Each edge should be included in both directions in the file.
--input-layout <filename> (default: random layout)
    Read the initial layout from the specified file. The file is expected to
    contain three tab-separated columns (without a header line), first a column
    of nodes, then a column of x coordinates, and finally a column of
    y coordinates. Nodes are represented by zero-index integer numbers. If no
    file is specified, a random layout (in which each node is positioned at
    random coordinates) is used as the initial layout.
-o --output-layout <filename> (default: standard output)
    Write the final layout to the specified file. If no file is specified,
    the standard output is used.

Example

The following example illustrates the use of the RunNetworkClustering and RunNetworkLayout tools. Consider this network:

    0-----1
     \   /
      \ /
       2
       |
       3
      / \
     /   \
    4-----5

The network is encoded as an edge list that is saved in a text file containing two tab-separated columns:

0	1
1	2
2	0
2	3
3	5
5	4
4	3

Nodes must be represented by integer numbers starting from 0.

Assuming that the edge list has been saved in the file network.txt, the RunNetworkClustering tool can be run as follows:

java -cp networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering -r 0.2 -o clusters.txt network.txt

In this case, clusters are identified using the Leiden algorithm. The CPM (constant Potts model) quality function is used without normalizing edge weights. A value of 0.2 is used for the resolution parameter. The resulting clustering is saved in the text file clusters.txt that contains two tab-separated columns:

0	0
1	0
2	0
3	1
4	1
5	1

The file clusters.txt shows that two clusters have been identified. The first column in the file represents a node, and the second column represents the cluster to which the node belongs. Cluster 0 includes nodes 0, 1, and 2. Cluster 1 includes nodes 3, 4, and 5.

The RunNetworkLayout tool can be run as follows:

java -cp networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkLayout -o layout.txt network.txt

In this case, the default parameter values are used for the VOS layout technique. The resulting layout is saved in the text file layout.txt containing three tab-separated columns:

0	-0.8690519467788094	-0.04001496992603245
1	-0.8690620214452673	0.040038034108640194
2	-0.4603890908313338	-2.5793522310420543E-5
3	0.46031975105512185	-1.6403462331212636E-5
4	0.8690853506388282	0.04007029704233864
5	0.86909795736146	-0.04005116424030402

The first column in the file layout.txt represents a node, and the second and third column represent the x and y coordinates of the node.

In the above example, the edges in the file network.txt have not been sorted. To provide a sorted edge list as input, include the edges in both directions and use the option --sorted-edge-list. Furthermore, edge weights can be provided by adding a third column to the file network.txt and by using the option --weighted-edges.

License

The networkanalysis package is distributed under the MIT license.

Issues

If you encounter any issues, please report them using the issue tracker on GitHub.

Contribution

You are welcome to contribute to the development of the networkanalysis package. Please follow the typical GitHub workflow: Fork from this repository and make a pull request to submit your changes. Make sure that your pull request has a clear description and that your source code has been properly tested.

Development and deployment

The latest stable version of the source code is available in the master branch on GitHub. The most recent version of the source code, which may be under development, is available in the develop branch.

Compilation

To compile the source code of the networkanalysis package, a Java Development Kit needs to be installed on your system (version 8 or higher). Having Gradle installed is optional as the Gradle Wrapper is also included in this repository.

On Windows, the source code can be compiled as follows:

gradlew build

On Linux and MacOS, use the following command:

./gradlew build

The compiled class files will be output to the directory build/classes. The compiled jar file will be output to the directory build/libs. The compiled javadoc files will be output to the directory build/docs.

There are two main methods, one in the class nl.cwts.networkanalysis.run.RunNetworkClustering and one in the class nl.cwts.networkanalysis.run.RunNetworkLayout. After compiling the source code, the RunNetworkClustering tool can be run as follows:

java -cp build/libs/networkanalysis-<version>.jar nl.cwts.networkanalysis.run.RunNetworkClustering

The RunNetworkLayout tool can be run as follows:

java -cp build/libs/networkanalysis-<version>.jar nl.cwts.networkanalysis.run.RunNetworkLayout

References

Traag, V.A., Waltman, L., & Van Eck, N.J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z

Van Eck, N.J., Waltman, L., Dekker, R., & Van den Berg, J. (2010). A comparison of two techniques for bibliometric mapping: Multidimensional scaling and VOS. Journal of the American Society for Information Science and Technology, 61(12), 2405-2416. https://doi.org/10.1002/asi.21421

Waltman, L., Van Eck, N.J., & Noyons, E.C.M. (2010). A unified approach to mapping and clustering of bibliometric networks. Journal of Informetrics, 4(4), 629-635. https://doi.org/10.1016/j.joi.2010.07.002

Van Eck, N.J., & Waltman, L. (2009). How to normalize co-occurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635-1651. https://doi.org/10.1002/asi.21075

networkanalysis's People

Contributors

ludowaltman avatar neesjanvaneck avatar vtraag avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

networkanalysis's Issues

Seed doesn't affect the results

Tried to use seed option, but results every run are slightly different. Could you please help me?

all_knn <- RcppHNSW::hnsw_knn(expression_scaled, k = k, distance = 'l2',
                               n_threads = n_threads)
ind <- all_knn$idx

# Parallel Jaccard metric
links <- FastPG::rcpp_parallel_jce(ind)

links <- FastPG::dedup_links(links)
links[,1] <- links[,1] - 1
links[,2] <- links[,2] - 1

jar_path <- "networkanalysis-1.1.0.jar"
network_path <- paste0(tempdir(), "/network.txt")
clusters_path <- paste0(tempdir(), "/clusters.txt")
withr::with_options(c(scipen = 10), write.table(links, network_path, row.names = FALSE, col.names = FALSE, sep = "\t"))
system(paste("java  -Xmx30g -cp", jar_path, "nl.cwts.networkanalysis.run.RunNetworkClustering -q Modularity --seed 5024 --weighted-edges -o", clusters_path, network_path))

error message : duplicate values while creating network

Great tool !!!

I get the following error when I try to run the RunNetworkClustering.jar a large network .

** Error while creating network: For each node, corresponding elements of neighbors array must not include duplicate values.**

May I propose to make the error message more explicit ? Does it means that the program found at least one node (source) with same target multiple times or something else ?

Just for clarification : I double checked my input file. It does not have any duplicates (distinct) nor self-loops (source_node<> source_target).

setResolution does not set resolution for local moving algorithm

The resolution is not set for the localMovingAlgorithm for the LeidenAlgorithm and the LouvainAlgorithm. We should simply add

    public void setResolution(double resolution)
    {
        super.setResolution(resolution);
        this.localMovingAlgorithm.resolution = resolution;
    }

for both classes.

eighbors array must not include duplicate values.

Why it report ""Error while creating network: For each node, corresponding elements of neighbors array must not include duplicate values.""?I believe there is no duplicate values in neighbors array
image

Question about clustering results

Hi,

Thanks for this fantastic package!
I have an issue related to the output of the clustering algorithm.
I have runned the following code in python:

os.system("java -cp /Applications/networkanalysis/networkanalysis-1.3.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering"
                      " -n AssociationStrength -r 1 -m 20 --sorted-edge-list"
                      " -o " "clusters.txt"
                      " data_net.txt")

The output message announces 829 clusters and 760 clusters after removing clusters consisting of fewer than 20 nodes.
However when I open the file clusters.txt:

  • the number of clusters is 681 (the clusters number from 0 to 759 with holes).
  • many clusters contain less than 20 nodes.

Many thanks in advance for your explanation.

Self-loops are not properly considered when using modularity

We ignore all self-loops upon reading a particular network. Although this works fine for CPM (as self-loops have no effect there) this is not entirely correct for modularity. In particular, the problem is that the nodeWeights are calculated using getTotalEdgeWeightPerNodeHelper, but this is called only after we have removed the self-loops. In other words, the nodeWeights reflect the degree without the self-loops.

this.nodeWeights = (nodeWeights != null) ? nodeWeights.clone() : (setNodeWeightsToTotalEdgeWeights ? getTotalEdgeWeightPerNodeHelper() : nl.cwts.util.Arrays.createDoubleArrayOfOnes(nNodes));

In addition, we do consider self-loops when calculating the "proper" resolution parameter:

double resolution2 = useModularity ? (resolution / (2 * network.getTotalEdgeWeight() + network.getTotalEdgeWeightSelfLinks())) : resolution;

Both issues should be corrected in order to make modularity work correctly when self-loops are present in a network.

Can't get results

Hello! I deployed the jar file using the following command: java -cp networkanalysis-1.1.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering -r 1 -o out.txt input.txt

and I get this error: "Error while creating network: Elements of neighbors array must have non-negative values."

Could you please help out? Thank you!

Clustering with normalization methods

Hello Mr. Traag,

Thank you very much for give your implementation of the Leiden algorithm.

I'm using this package to cluster COVID-19 co-occurrence data.

Firstly, I used the jar archive to cluster my documents. I have observed that I obtain the same results than VOSViewer only with non-normalized edges (parameter: -w). I have also developed a unit test for that. However, with normalized data as input, I have the same number of clusters as the number of documents and poor quality. The input data and the VOS viewer output are accessible below.

In this version 1.0, is the jar only clusters data with non-normalized edges?

After that, I have analyzed the code. I have seen that the raw (non-normalized) edges are needed as input. Also methods are not used (eg. createNormalizedNetworkUsingAssociationStrength).

I plan to develop this functionality by following the steps below:

  1. Add a new parameter after -w with one of these values: {No normalization, Association strength, Fractionalization, Lin/log modularity} like VOSViewer
  2. Modify the RunNetworkClustering class and particularly the readEdgeList method
  3. Use the already prepared method (eg. createNormalizedNetworkUsingAssociationStrength) in the Network class

Can you please give me some advice on the implementation I plan to do?

Thank you very much.

Technical details:
RunNetworkClustering.jar version 1.0.0
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Windows 10 Professional

Link to data:
https://drive.switch.ch/index.php/s/FxICtNgJO8B7s74

java.lang.ArrayIndexOutOfBoundsException when using -q Modularity

The following command
java -cp ~/Bioinformatics/networkanalysis/build/libs/networkanalysis-1.1.0-5-ga3f342d.jar nl.cwts.networkanalysis.run.RunNetworkClustering -q CPM -m 1 -w -o /tmp/edges_clustering.txt edges.txt
runs just fine. However, if the -q option is set to "Modularity", I get the following crash:

RunNetworkClustering version 1.1.0
By Vincent Traag, Ludo Waltman, and Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
Reading edge list from 'edges.txt'.
Reading edge list took 0s.
Network consists of 18386 nodes and 423926 edges with a total edge weight of 24546.50103795767.
Using singleton initial clustering.
Running Leiden algorithm.
Quality function: Modularity
Resolution parameter: 1.0
Minimum cluster size: 1
Number of random starts: 1
Number of iterations: 10
Randomness parameter: 0.01
Random number generator seed: random
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 5 out of bounds for length 5
at nl.cwts.networkanalysis.LocalMergingAlgorithm.findClustering(LocalMergingAlgorithm.java:190)
at nl.cwts.networkanalysis.LeidenAlgorithm.improveClusteringOneIteration(LeidenAlgorithm.java:228)
at nl.cwts.networkanalysis.LeidenAlgorithm.improveClusteringOneIteration(LeidenAlgorithm.java:276)
at nl.cwts.networkanalysis.IterativeCPMClusteringAlgorithm.improveClustering(IterativeCPMClusteringAlgorithm.java:91)
at nl.cwts.networkanalysis.run.RunNetworkClustering.main(RunNetworkClustering.java:413)

I've attached the input file here
edges.txt
I'm not sure what property of this input file would violate the preconditions of the software. The nodes are 0-indexed numbers, there are no duplicate edges, the lefthand node ID is always less than the righthand node ID, etc.

Warning: NetworkClustering fills the gap nodes from your input network as isolated clusters

Input edgelists ; output cluster information for each nodes.

I have isolates in the network and so the edgelists will not contain any information about them.
However the RunNetworkClustering.jar would assign cluster numbers sequentially for all nodes including those that do not exist in the input edgelists. That is, NetworkClustering will fill the gap nodes as isolated clusters from your input network.
I have attached my edgelist input (node ranges from 0 to 76 without 36) and clusters output here (contain cluster for node 36).

Networkcluster warning example-edgelist.txt
clusters.txt

Documentation Error: Output Formatting

FYI, It seems that the single-column output format specified in your README does not match the actual output format (2 tab-delimited columns: node, cluster).

Effect of edge weights in resulting communities

Hi,

Thank you for developing this nice library! I have been trying to understand how weights affect the resulting detected communities and I found something which I did not expect.

Using the example graph you provided in the readme file, I created two versions containing weights on the edges - the edge weights in one version are scaled by a constant relative to the other version:

v1:

0	1	1
1	2	1
2	0	1
2	3	0.5
3	5	1
5	4	1
4	3	1

v2:

0	1	4
1	2	4
2	0	4
2	3	2
3	5	4
5	4	4
4	3	4

I ran the Leiden algorithm using the following options
java -cp networkanalysis-1.1.0.jar nl.cwts.networkanalysis.run.RunNetworkClustering -r 0.2 -w

For v1, the same communities are detected as for the example in the readme file:

0	0
1	0
2	0
3	1
4	1
5	1

For v2, all nodes are part of one community:

0	0
1	0
2	0
3	0
4	0
5	0

I found this a bit surprising, as I didn't expect the scale of the weights to be so important and I expected to obtain the same results for these two networks.

I can see that in v1.1 of the library, two normalization methods have been introduced for the edge weights, but I couldn't find an explanation for why / when this might be needed and how the different normalization methods might impact the result.

Would it be possible to provide a brief explanation for this?

Thank you.

Best,
Paula Tataru

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.