rafguns / linkpred
Easy link prediction tool
License: Other
Hi, I am learning the project's code, but I don't understand what the labels in the input file mean. Some labels look like this:
1 "Pereira, JCR"
2 "Peters, HPF"
3 "Widhalm, C"
4 "Verbeek, A"
5 "Salvador, P"
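For what it's worth, these lines look like the vertex section of a Pajek .net file: each row pairs a numeric node id with a quoted node label (here, author names). A minimal sketch, assuming the file follows the Pajek format (the demo file and its contents are made up):

```python
import networkx as nx

# Minimal Pajek file: two ids with quoted labels, then one edge (assumed format).
pajek = '*Vertices 2\n1 "Pereira, JCR"\n2 "Peters, HPF"\n*Edges\n1 2 1\n'
with open("demo.net", "w") as f:
    f.write(pajek)

G = nx.read_pajek("demo.net")
print(sorted(G.nodes()))  # the quoted labels become the node names
```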
I am getting an "undefined" error when plotting the ROC curve of an evaluation using the following code:
import linkpred
from matplotlib import pyplot as plt

n = len(test)  # test: the held-out test network; cn_results: predictor output
num_universe = n * (n - 1) // 2
test_set = set(linkpred.evaluation.Pair(u, v) for u, v in test.edges())
evaluation = linkpred.evaluation.EvaluationSheet(cn_results, test_set, num_universe)
plt.plot(evaluation.fallout(), evaluation.recall())
plt.show()
Nose has been unmaintained for several years. It makes sense to move the test suite to pytest.
Dependabot couldn't authenticate with https://pypi.python.org/simple/.
You can provide authentication details in your Dependabot dashboard by clicking into the account menu (in the top right) and selecting 'Config variables'.
Hi Rafguns,
I get an error when trying to run linkpred from the mac terminal. The error says: object of type 'generator' has no len().
The complete error is this:
Stefans-MacBook-Air:lib stefan$ linkpred /Users/stefan/linkprediction/network1.graphml -p SimRank --output recall-precision
14:50:06 - INFO - Reading file '/Users/stefan/linkprediction/network1.graphml'...
14:50:06 - INFO - Successfully read file.
14:50:06 - INFO - Starting preprocessing...
Traceback (most recent call last):
File "/Users/stefan/anaconda3/bin/linkpred", line 64, in
main()
File "/Users/stefan/anaconda3/bin/linkpred", line 57, in main
linkpred.preprocess()
File "/Users/stefan/anaconda3/lib/python3.7/site-packages/linkpred/linkpred.py", line 169, in preprocess
self.training = preprocessed(self.training)
File "/Users/stefan/anaconda3/lib/python3.7/site-packages/linkpred/linkpred.py", line 163, in
without_selfloops(G), minimum=self.config['min_degree'])
File "/Users/stefan/anaconda3/lib/python3.7/site-packages/linkpred/preprocess.py", line 85, in without_selfloops
"Removing...".format(len(loops)))
TypeError: object of type 'generator' has no len()
Stefans-MacBook-Air:lib stefan$
Do you perhaps know what I am doing wrong here?
Kind regards,
Stefan Bloemheuvel
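The traceback points at len() being called on the result of networkx's self-loop lookup, which became a generator in networkx 2.x. A minimal reproduction and workaround sketch (the toy graph is made up):

```python
import networkx as nx

G = nx.Graph([(1, 1), (1, 2)])      # graph with one self-loop
loops = nx.selfloop_edges(G)        # a generator in networkx >= 2.0
# len(loops) would raise: object of type 'generator' has no len()
loops = list(loops)                 # materialize before taking len()
print(len(loops))                   # 1
```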
sklearn.metrics is in a way much simpler, using plain functions. Can we do something analogous, or even depend on scikit-learn for things like ROC, recall-precision, etc.?
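For illustration, a hedged sketch of what the scikit-learn route could look like, with made-up labels and scores (linkpred itself would have to supply y_true and y_score):

```python
from sklearn.metrics import roc_curve, precision_recall_curve

# Made-up ground truth and prediction scores for five candidate links.
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.4, 0.7, 0.2, 0.1]

fpr, tpr, _ = roc_curve(y_true, y_score)                  # ROC curve points
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(list(fpr), list(tpr))
```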
The code given in the README raises AssertionError: Predicted link (981, 981) is a self-loop!, but the terminal command with the exact same training file does the job.
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
in
7
8 simrank = linkpred.predictors.SimRank(training, excluded=training.edges())
----> 9 simrank_results = simrank.predict(c=0.5)
10
11 evaluation = linkpred.evaluation.EvaluationSheet(simrank_results, test.edges())
~/anaconda3/envs/social-media/lib/python3.7/site-packages/linkpred/predictors/base.py in predict_and_postprocess(*args, **kwargs)
65 for u, v in self.excluded:
66 try:
---> 67 del scoresheet[(u, v)]
68 except KeyError:
69 pass
~/anaconda3/envs/social-media/lib/python3.7/site-packages/linkpred/evaluation/scoresheet.py in __delitem__(self, key)
193
194 def __delitem__(self, key):
--> 195 return dict.__delitem__(self, Pair(key))
196
197 def process_data(self, data, weight='weight'):
~/anaconda3/envs/social-media/lib/python3.7/site-packages/linkpred/evaluation/scoresheet.py in __init__(self, *args)
125 "__init__() takes 1 or 2 arguments in addition to self")
126 # For link prediction, a and b are two different nodes
--> 127 assert a != b, "Predicted link (%s, %s) is a self-loop!" % (a, b)
128 self.elements = self._sorted_tuple((a, b))
129
AssertionError: Predicted link (381, 381) is a self-loop!
The networkx version is too old.
At the moment, the listeners in linkpred.evaluation.listeners use fixed file names (well, they change depending on dataset, predictor, and time stamp, but it's not really possible to specify your own name). That sucks.
My original thought was that the base class Listener would just accept an extra argument in its ctor, which could then be used by all descendant classes. The problematic cases, however, are CachePredictionListener and CacheEvaluationListener, since they can actually generate multiple files (e.g., one per predictor). Possibilities:
The file misc.py located in linkpred/predictors/ contains a typo when importing generate_dendrogram from the community package. Line 24 reads from community import generate_dendogram, partition_at_level, but there's a missing "r" in dendogram.
Obs: I have installed linkpred using pip install linkpred
I want to know the meaning of the third column.
It should be possible to save Scoresheets to a CSV-like format and easily create them as well. This would also allow us to replace the code in CachePredictionListener.on_prediction_finished with:
def on_prediction_finished(self, scoresheet, dataset, predictor):
    self.fname = _timestamped_filename("%s-%s-predictions" % (dataset, predictor))
    scoresheet.to_file(self.fname)
Added bonus: subclasses of Scoresheet could change the serialization; it would no longer be presupposed that scoresheet keys are tuples.
I don't think I understand the "universe" term that is used as a parameter, or how to choose it, in linkpred/evaluation/static/StaticEvaluation() and also in EvaluationSheet(); you stated that this parameter is important for returning the accuracy.
Also, how do I get the confusion matrix, recall, precision, and accuracy?
Concerning the accuracy, do I pick the max value, like this: evaluation.accuracy().max(), or is that wrong? Or should I do this: acc = (sum(evaluation.tp + evaluation.tn)) / (sum(evaluation.tp + evaluation.tn + evaluation.fp + evaluation.fn)) (I also imported division from __future__)?
I want to use sklearn, but what's confusing me is how to retrieve y_true and y_pred from a graph for sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None).
How do I get these data from the graph to use them in other machine learning algorithms such as SVM?
This is my full code:
import linkpred
import random
from matplotlib import pyplot as plt
random.seed(100)
# Read network
G = linkpred.read_network('BUP_full.net')
# Create test network
test = G.subgraph(random.sample(G.nodes(), 33))
# Exclude test network from learning phase
training = G.copy()
training.remove_edges_from(test.edges())
simrank = linkpred.predictors.SimRank(training, excluded=training.edges())
simrank_results = simrank.predict(c=0.5)
test_set = set(linkpred.evaluation.Pair(u, v) for u, v in test.edges())
evaluation = linkpred.evaluation.EvaluationSheet(simrank_results, test_set, simrank_results)
plt.plot(evaluation.recall(), evaluation.precision())
Thank you
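On the sklearn question: one way to build y_true and y_score is to treat every scored node pair as one sample. The sketch below uses a plain dict of {pair: score} as a stand-in for a linkpred scoresheet; the pair names, scores, and threshold are all assumptions for illustration:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Stand-in for a scoresheet: candidate link -> predicted score (made up).
scores = {("a", "b"): 0.9, ("a", "c"): 0.3, ("b", "c"): 0.7}
test_pairs = {("a", "b"), ("b", "c")}             # links present in the test network

y_true = [1 if pair in test_pairs else 0 for pair in scores]
y_score = [scores[pair] for pair in scores]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed decision threshold

print(confusion_matrix(y_true, y_pred))           # rows: true 0/1, cols: predicted 0/1
print(roc_auc_score(y_true, y_score))             # 1.0 on this toy data
```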
#12 showed that the intended use of excluded can be misunderstood. As I write there:
Note that the excluded argument to a predictor (SimRank in this case) is intended to exclude certain edges from appearing in the results, not to exclude them during training.
So I think two things need to happen:
I think this would simply need a CommonNeighbors = CommonNeighbours somewhere.
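The alias itself is a one-liner; a hedged sketch, with a placeholder class standing in for linkpred's real predictor:

```python
# Placeholder standing in for linkpred's CommonNeighbours predictor class.
class CommonNeighbours:
    pass

# Proposed US-spelling alias: both names refer to the same class object.
CommonNeighbors = CommonNeighbours

print(CommonNeighbors is CommonNeighbours)  # True
```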
See #12 (comment)
SimRank expects that G.nodes() returns a list, whereas it actually returns a NodeView now.
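In networkx >= 2.0, G.nodes() returns a NodeView, which iterates fine but does not behave like a list under positional indexing (indexing a NodeView looks up a node's attribute dict instead). A minimal sketch of the obvious fix:

```python
import networkx as nx

G = nx.path_graph(3)
view = G.nodes()          # NodeView in networkx >= 2.0, not a list
nodelist = list(view)     # materialize so nodelist[i] works positionally
print(nodelist)           # [0, 1, 2]
```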
Our current implementation is based on equations and an algorithm in Antonellis et al. (2008) (see Appendix A and Algorithm 1). While it yields fairly good results, it seems that this is not equal to SimRank as defined by Jeh and Widom (2002). Unit tests disagree with results obtained therein.
On both 3.8 and 3.11. Here's the output for 3.11:
=================================== FAILURES ===================================
__________________________________ test_katz ___________________________________
    def test_katz():
        G = nx.Graph()
        G.add_weighted_edges_from(
            [(1, 2, 1), (0, 2, 5), (2, 3, 1), (0, 4, 2), (1, 4, 1), (3, 5, 1), (4, 5, 3)]
        )
        beta = 0.01
        I = np.identity(6)
        for weight in ("weight", None):
            katz = Katz(G).predict(beta=beta, weight=weight)
            nodes = list(G.nodes())
            M = nx.to_numpy_array(G, nodelist=nodes, weight=weight)
            K = np.linalg.matrix_power(I - beta * M, -1) - I
            x, y = np.asarray(K).nonzero()
            for i, j in zip(x, y):
                if i == j:
                    continue
                u, v = nodes[i], nodes[j]
>               assert K[i, j] == pytest.approx(katz[(u, v)], abs=1e-5)
E               assert 0.010038160831933126 == 0.010101010100000004 ± 1.0e-05
E               comparison failed
E               Obtained: 0.010038160831933126
E               Expected: 0.010101010100000004 ± 1.0e-05
tests/test_predictors_path.py:28: AssertionError
----------------------------- Captured stdout call -----------------------------
Computing matrix powers: [............................................................] 0/5
Computing matrix powers: [############................................................] 1/5
Computing matrix powers: [########################....................................] 2/5
Computing matrix powers: [####################################........................] 3/5
Computing matrix powers: [################################################............] 4/5
Computing matrix powers: [############################################################] 5/5
Weirdly enough, this test passes on my computer. Perhaps this is due to a difference in the version of some package, like numpy?
I would like to use your link prediction library, but I am missing a complete example, including evaluation code. For now I have the following code:
import linkpred
import random
# Read network
G = linkpred.read_network('linkpred-master/examples/inf1990-2004.net')
# Create test network
test = G.subgraph(random.sample(G.nodes(), 300))
# Exclude test network from learning phase
simrank = linkpred.predictors.SimRank(G, excluded=test.edges())
simrank_results = simrank.predict(c=0.5)
Could you please provide a full example, i.e., how to calculate precision, recall, the ROC curve, etc.?
I have decided to put linkpred in maintenance mode. Here's some information on what that means as well as the reasons behind this decision.
What can you expect? I will fix critical bugs and may also fix minor bugs if feasible. Furthermore, I will also try to keep the package usable in a modern Python installation, i.e. make changes to keep it working under newer Python versions and recent versions of numpy, networkx etc. The general idea is that if you have been using linkpred before, you shouldn't be required to keep an ancient installation lying around just to be able to use it.
What not to expect? Most likely, I will not implement any new features. Neither will I make big architectural changes to how linkpred works.
Why? Several reasons:
I followed the example provided in #12, using an edgelist as input and common neighbours as predictor, but the ROC plot is empty. Maybe I don't pass the correct arguments to ROCPlotter, but I couldn't find any example. My code is this:
import linkpred
import random
from matplotlib import pyplot as plt
import math
random.seed(100)
# Read network
G = linkpred.read_network('FollowGraph.edgelist')
testSize = math.ceil(len(G.nodes())*0.2)
# Create test network
test = G.subgraph(random.sample(G.nodes(), testSize))
# Exclude test network from learning phase
training = G.copy()
training.remove_edges_from(test.edges())
cn = linkpred.predictors.AdamicAdar(training, excluded=training.edges())
cn_results = cn.predict()
test_set = set(linkpred.evaluation.Pair(u, v) for u, v in test.edges())
evaluation = linkpred.evaluation.EvaluationSheet(cn_results, test_set)
linkpred.evaluation.listeners.ROCPlotter(evaluation)
plt.show()
Any suggestion?
Thank you in advance
We should give a full standalone code example in the README, including evaluation.
Spinoff from #12. The example I gave there is this (slightly reworked to take advantage of a7121f8):
import linkpred
import random
from matplotlib import pyplot as plt
random.seed(100)
# Read network
G = linkpred.read_network('examples/inf1990-2004.net')
# Create test network
test = G.subgraph(random.sample(G.nodes(), 300))
# Exclude test network from learning phase
training = G.copy()
training.remove_edges_from(test.edges())
simrank = linkpred.predictors.SimRank(training, excluded=training.edges())
simrank_results = simrank.predict(c=0.5)
evaluation = linkpred.evaluation.EvaluationSheet(simrank_results, test.edges())
plt.plot(evaluation.recall(), evaluation.precision())
Dice is a monotone transformation of Jaccard, so the two produce identical rankings, but it might be nice to have both as a convenience.
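To make the relationship concrete: for two sets, Dice D = 2J / (1 + J), where J is the Jaccard coefficient, so one is a monotone function of the other. A small self-contained check with made-up sets:

```python
def jaccard(a, b):
    # |intersection| / |union|
    return len(a & b) / len(a | b)

def dice(a, b):
    # 2 * |intersection| / (|a| + |b|)
    return 2 * len(a & b) / (len(a) + len(b))

a, b = {1, 2, 3}, {2, 3, 4}
j, d = jaccard(a, b), dice(a, b)
print(j, d)                          # 0.5 and 2/3
assert abs(d - 2 * j / (1 + j)) < 1e-12
```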
The prediction part is fairly straightforward; however, the evaluation part is terribly convoluted. Much of the heavy lifting is done by the LinkPred object. I can think of two steps: make LinkPred easier to use, for instance by allowing LinkPred() without any arguments.
Hi,
I used this library, and what I really want to do is load a large graph and use SimRank for link prediction, but I get the following error:
Traceback (most recent call last):
File "pred.py", line 12, in <module>
results = model.predict(c=0.4)
File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/predictors/base.py", line 64, in predict_and_postprocess
scoresheet = func(*args, **kwargs)
File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/predictors/eigenvector.py", line 88, in predict
sim = simrank(self.G, nodelist, c, num_iterations, weight)
File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/network/algorithms.py", line 73, in simrank
M = raw_google_matrix(G, nodelist=nodelist, weight=weight)
File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/network/algorithms.py", line 87, in raw_google_matrix
weight=weight)
File "/home/danial/Envs/graph/lib/python3.6/site-packages/networkx/convert_matrix.py", line 369, in to_numpy_matrix
M = np.zeros((nlen,nlen), dtype=dtype, order=order) + np.nan
MemoryError
Could you help me?
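The traceback shows the failure inside raw_google_matrix at np.zeros((nlen, nlen)), i.e. allocating a dense n x n matrix, so memory grows quadratically with the number of nodes. A back-of-the-envelope estimate (the node count is illustrative):

```python
# Dense n x n float64 matrix: memory grows quadratically with node count.
n = 100_000                          # illustrative node count
bytes_needed = n * n * 8             # 8 bytes per float64 entry
print(bytes_needed / 1e9, "GB")      # 80.0 GB
```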
I am trying to use it, but when I call it, it shows the following error:
/usr/bin/env: ‘python\r’: No such file or directory
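The stray '\r' in ‘python\r’ usually means the installed script's shebang line has Windows (CRLF) line endings. A hedged sketch of stripping them with sed, demonstrated on a throwaway file (GNU sed shown; on macOS use sed -i '' instead):

```shell
# Create a demo script with CRLF line endings, then strip the carriage returns.
printf '#!/usr/bin/env python\r\nprint("hi")\r\n' > demo_crlf.py
sed -i 's/\r$//' demo_crlf.py
head -1 demo_crlf.py   # shebang now ends cleanly, without \r
```

The same substitution applied to the real linkpred entry-point script should fix the error.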
Hi, I am a newcomer to the field of link prediction. I want to know how to use the tool linkpred to evaluate related indicators. I found that there are a lot of built-in evaluation functions, but when I call them I don't get a single value; rather, I get a series of values. For example, precision returns a list of precisions. I don't know what this means. I found in another question that you can use the sklearn package for evaluation, but I don't know how to do it. Finally, does this tool have related documentation or a manual?
def _sorted_tuple(t):
    a, b = t
    return (a, b) if a > b else (b, a)
TypeError: '>' not supported between instances of 'str' and 'int'
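That TypeError indicates the graph mixes node types (e.g. int and str), which Python 3 refuses to order. A hedged workaround, not an official fix: relabel every node to a string before running prediction.

```python
import networkx as nx

G = nx.Graph()
G.add_edge(1, "a")                   # mixed node types trigger the error
H = nx.relabel_nodes(G, {n: str(n) for n in G})
print(sorted(H.nodes()))             # now all nodes are comparable strings
```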
It probably doesn't work in Python 3 and is untested. Smokesignal (https://github.com/shaunduncan/smokesignal) looks like it's a modern, maintained and simple signaling package.
Depending on smokesignal might even help with issue #3?
It seems that the current default can lead to erroneous results, like negative weights.
Something like waf may do the trick. I'm thinking of e.g.:
# Create a tarball
git archive --format=tar.gz --prefix=linkpred/ HEAD > linkpred.tar.gz
# Create an installer
pyinstaller linkpred.spec
# Do all tests and count coverage
nosetests --with-doctest --with-coverage --cover-package=linkpred
linkpred/linkpred/predictors/neighbour.py, line 28 in ae7adc8
In AdamicAdar I got ZeroDivisionError: float division by zero.
Any idea on how to fix this? Thank you.
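Adamic-Adar sums 1 / log(degree(z)) over the common neighbours z of a candidate pair, so if a neighbour's (possibly weighted) degree works out to exactly 1, log(1) == 0 and the division fails. A minimal illustration (the skip-zero guard is an assumption, not linkpred's own fix):

```python
import math

# 1 / log(degree) blows up when a neighbour's degree is exactly 1.
degrees = [1, 3, 5]
safe = [1 / math.log(d) for d in degrees if math.log(d) != 0]
print(safe)  # contributions for degrees 3 and 5 only; degree 1 is skipped
```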
We should add tests for the Community predictor. This is easier now that python-louvain is pip-installable (the URL should be updated in the docstring as well).