The cluster_leiden() function can be heavily reliant on a random seed. If no seed is g

cluster_leiden() implicit seed dependency about rigraph HOT 4 CLOSED

TimBMK commented on August 23, 2024

cluster_leiden() implicit seed dependency

from rigraph.

Comments (4)

szhorvat commented on August 23, 2024

> As this dependency is not made explicit in the documentation, it can be confusing to the user and make later reproduction of results impossible. Setting a random seed manually solves the issue.

Be careful here. "Reproducibility" in this dogmatic sense is not useful, but dangerous.

Many of the heuristic community detection methods are stochastic in nature, which is a great benefit, as it allows you to check if your results are sane. If multiple runs of the algorithm do indeed give significantly different results, then can we say that there is a community structure in the network? No, we cannot.

> This issue seemingly becomes more pronounced for larger and more complex networks.

This is not true. If you observe this, then your networks likely do not possess much of a community structure.

Unfortunately, there is a tremendous amount of misuse of community detection by empirical researchers, especially in biology. People obtain a single clustering with a single random seed, go with that, and then call it "reproducible" because the seed was fixed. Do not fall into this trap: this is not a reproducible result! First you need to check that the result is valid, for example:

- Do you get similar clusters regardless of the seed? If not, we have no basis to claim a community structure.
- Does the value of the objective function drop in a statistically significant way if you randomize the network? (You can randomize using the same null model that the objective function is based on, e.g. degree-preserving rewiring for modularity or full rewiring for CPM.) If it does not, then we have no basis to claim a community structure.
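
The first check can be made quantitative by comparing the membership vectors from several runs with a label-independent similarity score. The following is a minimal sketch in standard-library Python, assuming the memberships have been exported as plain lists; `adjusted_rand_index` and `pairwise_stability` are hypothetical helper names, not igraph functions. (In R, igraph's `compare()` with `method = "adjusted.rand"` computes the same kind of score directly.)

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two membership vectors of equal length.

    1.0 means the partitions are identical (label names are ignored);
    values near 0 mean agreement no better than chance.
    """
    if len(a) != len(b):
        raise ValueError("membership vectors must have the same length")
    n = len(a)
    if n < 2:
        return 1.0  # nothing to disagree about
    # Count vertex pairs that share a community in both, in a, and in b.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def pairwise_stability(memberships):
    """Return (minimum, mean) adjusted Rand index over all pairs of runs."""
    if len(memberships) < 2:
        raise ValueError("need at least two runs to compare")
    scores = [adjusted_rand_index(x, y)
              for x, y in combinations(memberships, 2)]
    return min(scores), sum(scores) / len(scores)


# Made-up memberships from three hypothetical runs on a 6-vertex graph.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first, labels swapped
    [0, 0, 1, 1, 1, 1],  # one vertex assigned differently
]
lowest, average = pairwise_stability(runs)
```

A minimum score close to 1 across many seeds suggests a stable partition; what counts as "close enough" is a judgment call that depends on the network and the application.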

The result should only be considered reproducible if it can be reproduced without reliance on very specific settings such as a particular seed. It should only be considered valid if it passes basic sanity checks like the above two.
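
The randomization check can be summarized with a one-sided empirical p-value: how often does a randomized network reach an objective value at least as high as the real one? Below is a minimal sketch in standard-library Python. It assumes the objective values (for example, modularity) have already been computed for the original network and for a set of randomized copies; `null_model_test` is a hypothetical helper name and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def null_model_test(observed, null_values):
    """Compare an observed objective value against randomized networks.

    Returns (z_score, p_value). p_value is the one-sided empirical
    probability that a randomized network scores at least as high as the
    observed one; the +1 correction keeps the estimate above zero.
    """
    if len(null_values) < 2:
        raise ValueError("need at least two randomized samples")
    at_least_as_high = sum(1 for v in null_values if v >= observed)
    p_value = (at_least_as_high + 1) / (len(null_values) + 1)
    spread = stdev(null_values)
    # The z-score is undefined when all randomized values are identical.
    z_score = (observed - mean(null_values)) / spread if spread > 0 else float("nan")
    return z_score, p_value


# Made-up values: objective on the real network vs. five randomized copies.
observed_quality = 0.42
randomized_quality = [0.10, 0.12, 0.08, 0.11, 0.09]
z, p = null_model_test(observed_quality, randomized_quality)
```

With only five randomized copies the smallest attainable p-value is 1/6, so a real analysis would need many more samples before any claim of significance.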

I recommend you read this: https://arxiv.org/abs/1608.00163

Thus a seed parameter will not be added, as it is not necessary. It would only reinforce flawed usage patterns that typically lead to dubious results.

Neither do we guarantee that igraph's stochastic community detection functions will continue to return the same result for the same seed across versions, even bugfix versions.

TimBMK commented on August 23, 2024

Thank you for the clarification. And I agree, having vastly different results in repeated runs points to problems in the network structure or problematic assumptions about the algorithm. However, I do not think that making the function results non-reproducible by default will solve any of these issues. For example, multiple runs will almost always yield slightly different results, even when the general community structure is stable. The fluctuating numerical labels assigned to the communities can be problematic when different researchers intend to hand-label the communities independently. Also, the differences in group sizes can be confusing to users. And this is the case even in the reprex above which utilizes an algorithm for network generation specifically intended to produce community structures. At the very least, a clear pointer to the non-deterministic nature of the algorithm or a clearer distinction between stochastic and non-stochastic community detection algorithms in the documentation would be helpful. I believe that arguing that "users should know better", rather than pointing them at potential flaws in their assumptions about the tools they are using, will lead to even more dubious results.
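
The label fluctuation mentioned above can be handled without fixing a seed by relabelling each result in a canonical order. The following is a minimal sketch in standard-library Python, assuming the membership vector has been exported as a plain list; `canonical_labels` is a hypothetical helper, not part of igraph.

```python
from collections import defaultdict


def canonical_labels(membership):
    """Relabel communities so that equal partitions get equal label vectors.

    Communities are numbered 1, 2, ... by decreasing size, with ties broken
    by the smallest vertex index in each community.
    """
    members = defaultdict(list)
    for vertex, label in enumerate(membership):
        members[label].append(vertex)
    # Vertices are appended in increasing order, so vs[0] is the smallest.
    ordered = sorted(members.values(), key=lambda vs: (-len(vs), vs[0]))
    relabelled = [0] * len(membership)
    for new_label, vertices in enumerate(ordered, start=1):
        for v in vertices:
            relabelled[v] = new_label
    return relabelled


# Two made-up runs that found the same partition under different labels.
run_a = canonical_labels([3, 3, 1, 1, 1, 2])
run_b = canonical_labels([7, 7, 9, 9, 9, 4])
```

This only removes arbitrary label permutations. When two runs differ slightly in membership, community sizes can change rank and the labels may still swap, so matching communities by overlap would be needed in that case.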

szhorvat commented on August 23, 2024

> However, I do not think that making the function results non-reproducible by default will solve any of these issues.

You are using the word "reproducible", but what you propose has nothing to do with scientific reproducibility. It's no different from saying that "I get the same experimental results for as long as I always take the third test tube from the left on the shelf, but not otherwise." You wouldn't put that in a paper, right? A result can be called reproducible if it can be repeated robustly without any arcane dependence on your tools or uncontrollable circumstances.

> Also, the differences in group sizes can be confusing to users.

You seem to want a stochastic algorithm to appear deterministic, even though it is not (the correct term here is deterministic, not reproducible). This is what would be, and already is, confusing to inexperienced people.

Do keep in mind that you can set a random seed whenever you have a valid technical reason for needing determinism. However, I strongly caution you against doing so when you are looking to obtain publishable scientific results. There's no way around the fact that correct use of these methods requires the kind of analysis I recommended in my answer above.

> At the very least, a clear pointer to the non-deterministic nature of the algorithm or a clearer distinction between stochastic and non-stochastic community detection algorithms in the documentation would be helpful. I believe that arguing that "users should know better", rather than pointing them at potential flaws in their assumptions about the tools they are using, will lead to even more dubious results.

The igraph documentation cannot try to be a textbook on network analysis. There are many good textbooks and other written resources you can refer to. I linked to one above.

That said, we are acutely aware of the fact that many people's first exposure to network analysis is through experimenting with software tools (often without even reading the documentation) rather than through reading a book or attending a lecture that explains the basics. Within limits, we do try to design the library to protect people from shooting themselves in the foot. In part, this is done by making sure that users won't miss potential surprises, such as the stochastic nature of these algorithms.

Notice that I did not argue that you should know better. Instead I took the time to explain how to do a robust analysis, and linked to further relevant resources.

We also welcome PRs that improve the documentation, which is a lot of work to write. igraph has four interfaces, each with its own documentation.

TimBMK commented on August 23, 2024

Yes, and again, thank you for taking the time to explain the reasoning behind your design decisions. I 100% agree with you that a) multiple runs should be done to ensure stability of the results, and b) in order to publish results that utilize a certain algorithm, the author should familiarize themselves with the algorithm and its limitations. My point concerned less the reproducibility of scientific results and more the reliability of the tool in producing identical results on multiple runs if needed. This concerns the reproduction of results already checked for reliability, in order to reliably hand-label communities, reproduce results for peer review, etc. My suggestion was to make the reliance on a random seed more explicit, and potentially offer a shortcut for a number of niche cases where the exact reproduction of results is desired. However, I can understand your concern in terms of making the algorithm appear deterministic and, in doing so, deepening misconceptions of its nature. Thank you for taking the time for this discussion and the clarifications!
