vhranger / nodevectors

Fastest network node embeddings in the west

License: MIT License

Python 81.43% C 18.57%

nodevectors's Issues

When saving a large graph, creating a temporary folder can exhaust system disk resources.

with tempfile.TemporaryDirectory() as temp_dir:
    joblib.dump(self, os.path.join(temp_dir, self.f_model), compress=True)
    with open(os.path.join(temp_dir, self.f_mdata), 'w') as f:
        json.dump(meta_data, f)
    filename = shutil.make_archive(filename, 'zip', temp_dir)

It should be written directly to the destination.
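One way to avoid the temporary copy is to stream both files straight into the destination archive. The following is a minimal sketch, not the library's implementation: it uses the standard-library pickle in place of joblib, and the member names model.pckl and metadata.json are assumptions standing in for self.f_model and self.f_mdata.

```python
import io
import json
import pickle
import zipfile


def save_model_zip(model, meta_data, dest,
                   f_model="model.pckl", f_mdata="metadata.json"):
    """Write model and metadata straight into a zip at `dest`.

    `dest` may be a path or a writable binary file object.
    The member names are hypothetical placeholders.
    """
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Stream the pickle into the archive member instead of writing it
        # to a temporary directory first; force_zip64 allows members > 2 GiB.
        with zf.open(f_model, mode="w", force_zip64=True) as member:
            pickle.dump(model, member, protocol=pickle.HIGHEST_PROTOCOL)
        zf.writestr(f_mdata, json.dumps(meta_data))


# Usage: an in-memory buffer stands in for the destination file.
buffer = io.BytesIO()
save_model_zip({"weights": [1, 2, 3]}, {"kind": "demo"}, buffer)
```

With this layout the only disk space used is the final archive itself.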

Old parameter shows up in Word2Vec call

nodevectors/nodevectors/node2vec.py

Line 132 in e98df00

size=self.n_components,

This line refers to the old size parameter in gensim Word2Vec. It looks like the parameter was renamed to vector_size.

Getting this error:

    129 # Train gensim word2vec model on random walks
--> 130 self.model = gensim.models.Word2Vec(
    131     sentences=self.walks,
    132     size=self.n_components,
    133     **self.w2vparams)
    134 if not self.keep_walks:
    135     del self.walks

TypeError: __init__() got an unexpected keyword argument 'size'

When running with gensim==4.3.2
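A version check can keep the call working on both sides of the rename. This is a hedged sketch: the helper name word2vec_kwargs is hypothetical, and it relies on gensim 4 having renamed size to vector_size and iter to epochs.

```python
def word2vec_kwargs(gensim_version, n_components, epochs):
    """Return dimension/epoch keyword arguments for the given gensim version.

    Hypothetical helper: gensim >= 4 expects vector_size/epochs,
    older releases expect size/iter.
    """
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        return {"vector_size": n_components, "epochs": epochs}
    return {"size": n_components, "iter": epochs}


# Intended use (not executed here):
#   gensim.models.Word2Vec(sentences=walks,
#                          **word2vec_kwargs(gensim.__version__, 32, 5))
new_style = word2vec_kwargs("4.3.2", 32, 5)
old_style = word2vec_kwargs("3.8.3", 32, 5)
```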

Node2Vec IndexError: list index out of range

Hi,

After the fix for VHRanger/CSRGraph#3, I was able to load my dataset in CSRGraph. But when I run the following commands, I get an error:
from nodevectors import Node2Vec
g2v = Node2Vec()
g2v.fit(G)

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
in
      3 # way faster than other node2vec implementations
      4 # Graph edge weights are handled automatically
----> 5 g2v.fit(G)

~/SageMaker/CSRGraph/nodevectors/nodevectors/node2vec.py in fit(self, nxGraph)
     93         node_names = list(nxGraph)
     94         G = cg.csrgraph(nxGraph, threads=self.threads)
---> 95         if type(node_names[0]) not in [int, str, np.int32, np.uint32,
     96                                        np.int64, np.uint64]:
     97             raise ValueError("Graph node names must be int or str!")

IndexError: list index out of range

The IDs in my data file are of int64 type. Interestingly, the following commands run successfully:
from nodevectors import GGVec
ggvec_model = GGVec()
embeddings = ggvec_model.fit_transform(G)
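The traceback shows node_names[0] being read from an empty list, so iterating over the input produced no node names. A guard along these lines would turn that into a clearer error. It is only a sketch: check_node_names is a hypothetical helper, not part of nodevectors.

```python
import numbers


def check_node_names(node_names):
    """Validate node names before fitting (hypothetical helper).

    Raises a descriptive ValueError for an empty node list instead of
    letting node_names[0] fail with IndexError.
    """
    node_names = list(node_names)
    if not node_names:
        raise ValueError("Graph has no iterable node names; nothing to embed.")
    first = node_names[0]
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(first, bool) or not isinstance(first, (str, numbers.Integral)):
        raise ValueError("Graph node names must be int or str!")
    return node_names


def raises_value_error(candidate):
    """Return True if check_node_names rejects `candidate`."""
    try:
        check_node_names(candidate)
    except ValueError:
        return True
    return False
```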

Node2Vec in a large graph

Hi,
Thanks for the clarification that solved issue number 27. It works fine now that I have upgraded csrgraph to version 0.1.27. The next issue is that I get a segmentation fault while running node2vec. Is there any suggestion to fix this?

G = csr_matrix(G)
n2v_model = nodevectors.Node2Vec()
n2v_model.fit(G)

Segmentation fault

Do I need to update the nodevectors package as well after updating csrgraph? If so, which version is needed?

Thanks in advance !

TypeError: 'NoneType' object is not subscriptable in Node2Vec

Hi,

Thanks for this great module. I have a large sparse CSR graph of 10GB and I want to learn the node embeddings using Node2Vec. However, I keep getting this error:
TypeError: 'NoneType' object is not subscriptable

To reproduce this error in my machine here is my toy script:

from scipy.sparse import csr_matrix
import numpy as np
import nodevectors
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 1, 1, 1, 1, 1])
G = csr_matrix((data, (row, col)), shape=(3, 3))
n2v_model = nodevectors.Node2Vec()
n2v_model.fit(G)

Isn't it true that the Node2Vec() module works directly with csr_matrix? I even tried converting the CSR matrix to a CSRGraph but still get the same error. Any help would be great.

import csrgraph as cg
G = cg.csrgraph(G)
n2v_model = nodevectors.Node2Vec()
n2v_model.fit(G)

TypeError: 'NoneType' object is not subscriptable

TypeError: 'method' object is not iterable

I am getting this error when trying to run the unit tests.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-b3bd86b0b5de> in <module>()
      1 # Fit embedding model to graph
      2 g2v = Node2Vec()
----> 3 g2v.fit(G)

/home/dionysis/Documents/git_repos/graph2vec/graph2vec/graph.py in fit(self, nxGraph, verbose)
    356             Whether to print output while working
    357         """
--> 358         node_names = list(nxGraph.nodes)
    359         if type(node_names[0]) not in [int, str, np.int32, np.int64]:
    360             raise ValueError("Graph node names must be int or str!")

TypeError: 'method' object is not iterable

Do you think this line:
https://github.com/VHRanger/graph2vec/blob/8474f7ccf5d9b34d82fbf5ac16f04bcc37143cd6/graph2vec/graph.py#L358

Should change to this:

 node_names = list(nxGraph.nodes())

Problem with underlying Word2vec

Hello,

I just tried to fit a Node2Vec object and got an error:

129         # Train gensim word2vec model on random walks
130         self.model = gensim.models.Word2Vec(
131             sentences=self.walks,
132             size=self.n_components,

I found some advice saying that, to use Word2Vec from current gensim, the parameter must be named vector_size instead of size:

https://stackoverflow.com/questions/53195906/getting-init-got-an-unexpected-keyword-argument-document-this-error-in

Setting value of seed to make Node2vec embedding repeatable.

Running nodevectors.Node2Vec.fit on the same nx_graph gives a different embedding each time.

word2vec parameters changed

Hi!

In node2vec.py, you should modify the 'iter' parameter to 'epochs' and the 'size' parameter to 'vector_size'.

(And thank you for the library, I use it extensively in my research!)

defining random state or seed option parameters

I need an option to set a random state or seed value to get stable results. I don't think there is such an option.
Unfortunately, my attempts to set the global seeds, listed below, did not solve the problem.

import random
random.seed(1)
from numpy.random import seed
seed(1)

What can be done about it? Do you have any advice?
thanks in advance
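Global seeds may not reach every source of randomness here. Walk generation that runs in compiled code may keep its own generator state, and gensim documents that Word2Vec is only fully deterministic with a fixed seed, workers=1, and a fixed PYTHONHASHSEED. The general principle is to thread an explicit seeded generator through the walk code. The sketch below shows that idea in pure Python; random_walks is a hypothetical stand-in, not the library's walk routine.

```python
import random


def random_walks(adjacency, walklen, epochs, seed):
    """Uniform random walks from every node, repeatable for a given seed.

    `adjacency` maps each node to a list of neighbors.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    walks = []
    for _ in range(epochs):
        for start in sorted(adjacency):  # fixed start order keeps runs comparable
            walk = [start]
            while len(walk) < walklen:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
first_run = random_walks(triangle, walklen=5, epochs=2, seed=1)
second_run = random_walks(triangle, walklen=5, epochs=2, seed=1)
```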

NetworkX 3.0 removed adj_matrix

remove adj_matrix from linalg/graphmatrix.py (#5753)

some problem on the accuracy

Hi,

I am testing your code on the BlogCatalog dataset (downloaded from the OpenNE repository on GitHub).
Additionally, I compared the multi-class classification results with those from the OpenNE code.
In my test, using 10 percent of the data for training, your code gives
{'micro': 0.25313039723661485, 'macro': 0.12076017464146425};

while the OpenNE code achieves
{'micro': 0.2903713298791019, 'macro': 0.1674684546080052};

I am confused by this, because both implementations use gensim. My simple guess is that the problem lies in the node sampling.

I haven't dug deeper yet, but your code really inspired me, since it could be used on graphs with a huge number of nodes. A good spark for graph processing.

Best regards,
Tade

fit_transform tries to query non-existent node "0"

from nodevectors import Node2Vec
import networkx as nx

G = nx.Graph()
G.add_edge("1", "2")
n2v = Node2Vec(n_components=128)
n2v.fit_transform(G)

Output:

Making walks... Done, T=3.98
Mapping Walk Names... Done, T=0.07
Training W2V... WARNING: gensim word2vec version is unoptimizedTry version 3.6 if on windows, versions 3.7 and 3.8 have had issues
Done, T=0.39
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-43e45de9791e> in <module>
      2 G.add_edge("1", "2")
      3 n2v = Node2Vec(n_components=128)
----> 4 n2v.fit_transform(G)

~/miniconda3/envs/graphs/lib/python3.7/site-packages/nodevectors/node2vec.py in fit_transform(self, G)
    151             pd.DataFrame.from_records(
    152             pd.Series(np.arange(len(G.nodes)))
--> 153               .apply(self.predict)
    154               .values)
    155         )

~/miniconda3/envs/graphs/lib/python3.7/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   4106             else:
   4107                 values = self.astype(object)._values
-> 4108                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4109 
   4110         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

~/miniconda3/envs/graphs/lib/python3.7/site-packages/nodevectors/node2vec.py in predict(self, node_name)
    166         if type(node_name) is not str:
    167             node_name = str(node_name)
--> 168         return self.model.wv.__getitem__(node_name)
    169 
    170     def save_vectors(self, out_file):

~/miniconda3/envs/graphs/lib/python3.7/site-packages/gensim/models/keyedvectors.py in __getitem__(self, entities)
    351         if isinstance(entities, string_types):
    352             # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
--> 353             return self.get_vector(entities)
    354 
    355         return vstack([self.get_vector(entity) for entity in entities])

~/miniconda3/envs/graphs/lib/python3.7/site-packages/gensim/models/keyedvectors.py in get_vector(self, word)
    469 
    470     def get_vector(self, word):
--> 471         return self.word_vec(word)
    472 
    473     def words_closer_than(self, w1, w2):

~/miniconda3/envs/graphs/lib/python3.7/site-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    466             return result
    467         else:
--> 468             raise KeyError("word '%s' not in vocabulary" % word)
    469 
    470     def get_vector(self, word):

KeyError: "word '0' not in vocabulary"

Fitting and then predicting works fine:

n2v.fit(G)

for node in G:
    print(n2v.predict(node))

Output:

Making walks... Done, T=0.00
Mapping Walk Names... Done, T=0.06
Training W2V... WARNING: gensim word2vec version is unoptimizedTry version 3.6 if on windows, versions 3.7 and 3.8 have had issues
Done, T=0.38
[ 0.01669522  0.01119813 -0.00566072 -0.0134473   0.01121703  0.00379648
  0.01170088 -0.0121789  -0.01429367 -0.00849178  0.00943886 -0.00981773
  0.00337284 -0.0013884  -0.01287963 -0.00460479 -0.00217993 -0.01019352
  0.00615602 -0.00658679  0.01679845 -0.00747446  0.0019177  -0.00912566
 -0.01688758  0.00983168  0.00286994  0.00739604  0.01249113  0.00116864
  0.00235101 -0.01515406 -0.00786685 -0.01675885 -0.01421799 -0.00829282
 -0.00385966 -0.00779916 -0.00067812  0.01312324  0.0154448  -0.0107193
 -0.00059914 -0.00439935 -0.01970238 -0.00585162 -0.01741348 -0.00118494
 -0.01365886 -0.007099    0.00806013 -0.00448715 -0.00633816 -0.009869
  0.01835089  0.01462685  0.00408294  0.01042183  0.00773886  0.00500051
  0.00697436 -0.00052141 -0.00307364  0.00916708 -0.0059573  -0.00794462
  0.00316458 -0.01120937  0.00820292 -0.00175512 -0.00426679  0.00403081
  0.0036373  -0.00538955  0.00169757 -0.00476247  0.00011785 -0.00015604
 -0.02005355  0.00293106 -0.00457922  0.01199162 -0.01039407 -0.00975906
 -0.00386479  0.00380202  0.0150509   0.00117078  0.01009431 -0.01518334
 -0.01550014 -0.00316153 -0.01638743  0.00911983 -0.00656796 -0.01130522
  0.00696332  0.00222521 -0.01348531  0.01745371 -0.01043333  0.00377076
  0.00168364 -0.01029514 -0.01187336 -0.00047892  0.01747731  0.01539742
 -0.00317966  0.01036133  0.00348293  0.00357884  0.01691393 -0.01314759
 -0.00387712  0.01349622  0.00886216  0.01269572 -0.014981    0.01047694
 -0.01591979  0.00815849  0.0053769  -0.01705019  0.00478466 -0.00967307
  0.00100743 -0.00627678]
[ 1.74459908e-02  9.29250382e-03 -5.62654436e-03 -1.58256646e-02
  6.62352284e-03 -1.04596815e-03  7.46087125e-03 -1.52283600e-02
 -1.47760203e-02 -4.99586575e-03  8.37715156e-03 -1.14215305e-02
  8.03218782e-03 -4.57122130e-03 -1.37374401e-02 -6.70122309e-03
  5.60258329e-03 -1.36625227e-02  2.69854977e-03 -2.01221928e-03
  1.41100660e-02 -1.21530667e-02  7.38256099e-03 -7.29203923e-03
 -1.45003749e-02  8.89602769e-03 -1.07536477e-03  1.66074419e-03
  7.48369843e-03  8.18155764e-04  3.80413979e-03 -1.41491415e-02
 -1.12004904e-03 -1.57257933e-02 -1.23076690e-02 -9.28518735e-03
 -5.15399221e-03 -5.42826438e-03  9.19695070e-04  9.03129764e-03
  1.57911442e-02 -5.36569115e-03 -1.36574614e-03 -2.82609137e-03
 -1.89300030e-02 -5.67972986e-03 -1.65421404e-02 -3.22455773e-04
 -1.18535999e-02 -7.90045224e-03  9.72144585e-03 -7.91174080e-03
 -4.45207767e-03 -1.19799254e-02  1.93504207e-02  1.06750363e-02
  4.26934101e-03  1.17199738e-02  6.25003641e-03  1.98470801e-03
  4.88949660e-03  7.53012951e-04 -8.29974841e-03  6.85363356e-03
 -2.72968784e-03 -5.58869634e-03  1.48452440e-04 -8.40961654e-03
  3.35645187e-03 -3.52724968e-03  3.98239447e-03 -2.40911031e-03
  4.06429684e-03 -3.92150227e-03  6.94983220e-03 -8.35845713e-03
  9.88924527e-04 -1.79716619e-03 -1.90840866e-02  2.46768352e-03
 -4.37452644e-03  1.30511560e-02 -6.40019309e-03 -1.33609995e-02
  3.72520881e-04  5.42262476e-03  1.41993044e-02  7.35963322e-03
  1.08134123e-02 -1.49347940e-02 -1.22990599e-02 -9.69778374e-03
 -1.74602009e-02  8.74316972e-03 -5.31877764e-03 -7.91502465e-03
  3.98375420e-03  4.59250668e-03 -1.26426788e-02  1.60577614e-02
 -1.03733260e-02  4.70442930e-03  6.72380021e-03 -1.34339379e-02
 -1.50517235e-02  3.45687894e-03  1.50700649e-02  1.58219878e-02
  4.28991532e-03  9.33015719e-03  7.03065936e-03  3.41207208e-03
  1.49237625e-02 -1.07398266e-02 -1.00340396e-02  9.12039913e-03
  1.27081424e-02  1.08739929e-02 -1.16528282e-02  4.42440435e-03
 -1.53663196e-02  3.64650693e-03  5.37529076e-03 -1.76296048e-02
  3.67483153e-05 -7.88922701e-03 -5.40610822e-03 -1.80462585e-03]
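The traceback suggests that fit_transform looks vectors up by position (np.arange(len(G.nodes))) rather than by the graph's own node names, which fails when no node is called "0". A sketch of the name-based lookup follows, with a plain dict standing in for the trained model's key-to-vector mapping. The helper embedding_rows is hypothetical.

```python
def embedding_rows(node_names, lookup):
    """Return (names, rows) with one vector per node, keyed by real node names.

    `lookup` is any mapping from string node name to vector; here a dict
    stands in for the trained word-vector table.
    """
    names = [str(name) for name in node_names]  # predict() also casts names to str
    rows = [lookup[name] for name in names]
    return names, rows


# Stand-in vectors for the two-node graph from the report.
vectors = {"1": [0.1, 0.2], "2": [0.3, 0.4]}
names, rows = embedding_rows(["1", "2"], vectors)

# Looking up positions 0..n-1 instead reproduces the KeyError.
try:
    embedding_rows(range(2), vectors)
    positional_lookup_failed = False
except KeyError:
    positional_lookup_failed = True
```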

Embedding a VERY LARGE graph, upcoming?

node2vec uses CBOW instead of skip-gram

Node2vec and DeepWalk original proposals are built upon the skip-gram model. By default, nodevectors does not set the parameter w2vparams["sg"] to 1, therefore the underlying Word2Vec model uses the default value of 0, which means using CBOW instead of skip-gram. This has major consequences in the quality of the embeddings.
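Until the default changes, skip-gram can be requested through w2vparams. If the constructor stores the passed dict as given, supplying only {'sg': 1} would drop the other defaults, so merging is safer. In this sketch only "batch_words": 128 is confirmed by the issues on this page; the other default values are assumptions, and with_skipgram is a hypothetical helper.

```python
# Assumed defaults; only "batch_words": 128 is quoted in the issue tracker.
DEFAULT_W2V_PARAMS = {"window": 10, "negative": 5, "iter": 10, "batch_words": 128}


def with_skipgram(overrides=None, defaults=DEFAULT_W2V_PARAMS):
    """Merge user overrides into the defaults and select skip-gram (sg=1)."""
    params = dict(defaults)          # copy so the shared default is not mutated
    params.update(overrides or {})
    params.setdefault("sg", 1)       # keep an explicit user choice if present
    return params


# Intended use (not executed here):
#   Node2Vec(n_components=32, w2vparams=with_skipgram({"batch_words": 10000}))
merged = with_skipgram({"batch_words": 10000})
```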

Rename repo/package

Unfortunately, graph2vec has already been used in 2017 in a paper on representation learning for whole graphs (not nodes). Link: https://arxiv.org/abs/1707.05005, implementations at https://github.com/MLDroid/graph2vec_tf (author) and https://github.com/benedekrozemberczki/graph2vec (reimplementation)

Also, graph2vec has already been taken on PyPI by another project https://pypi.org/project/graph2vec, but I think you're aware of this.

I think the solution was to rename this to graph2vec-learn, but I would encourage you to pick a more informative name because this doesn't alleviate the original name conflict.

Either way, could you please update the name of this repo so the PyPI project matches the repo and folder inside the repo?

Error reading in CSR graph

I am trying to load a 150MB edgelist with csrgraph using the command G = cg.read_edgelist("samplelist.edgelist", sep="\t")
but I get the following error:
ValueError Traceback (most recent call last)
in
2 import nodevectors
3
----> 4 G = cg.read_edgelist("samplelist.edgelist", sep="\t")

~/anaconda3/envs/python3/lib/python3.6/site-packages/csrgraph/graph.py in read_edgelist(f, sep, header, **readcsvkwargs)
457 SRC: {elist.src.max()}, {elist.src.min()}
458 DST: {elist.dst.max()}, {elist.dst.min()}
--> 459 """)
460 elist.src = elist.src.astype(np.uint32)
461 elist.dst = elist.dst.astype(np.uint32)

ValueError:
Invalid uint32 value in node IDs. Max/min :
SRC: 8278237827, 15830
DST: 8237827382738273827382, 2111364
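The reported IDs exceed the uint32 range (the largest DST value does not fit in 64 bits either), so one workaround is to relabel nodes with contiguous integers before loading. A minimal sketch, assuming the edge list fits in memory as pairs; remap_node_ids is a hypothetical helper.

```python
UINT32_MAX = 2**32 - 1


def remap_node_ids(edges):
    """Map arbitrary node labels to contiguous integers 0..n-1.

    Returns the relabeled edge list and the label-to-index mapping,
    which is needed later to translate embeddings back to original IDs.
    """
    index = {}
    remapped = []
    for src, dst in edges:
        s = index.setdefault(src, len(index))  # new labels get the next free index
        d = index.setdefault(dst, len(index))
        remapped.append((s, d))
    if len(index) - 1 > UINT32_MAX:
        raise ValueError("more distinct nodes than uint32 can index")
    return remapped, index


edges = [(8278237827, 15830),
         (8237827382738273827382, 2111364),
         (15830, 8278237827)]
new_edges, mapping = remap_node_ids(edges)
```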

Node2Vec Segmentation Fault

Hi,

Thanks for solving the previous issue #19. However, I now get a segmentation fault when running
from nodevectors import Node2Vec
g2v = Node2Vec()
g2v.fit(G)

Additionally, when I pip3 install CSRGraph and nodevectors, the installation completes, but when I import them I get a "No module found" error.

Minor issues with the new release

There seems to be an __init__.py missing in the evaluation folder, which causes an error on import.

Additionally, umap is missing from the requirements.

Also, a small suggestion - when I ran into this issue today I tried installing the last version that worked for me (0.1.12), which is also broken since you don't specify package versions in your requirements. In this case your other package CSRGraph created a compatibility issue, so maybe you only need to specify the CSRGraph version since you're frequently updating it.

I really appreciate the work you've put into this package, when I was looking for a node2vec implementation many months ago yours was by far the cleanest and fastest. Thanks!

Load into W2V does not work

Awesome work! Unfortunately, when I load my bin file, I get the following error message:
ValueError: invalid vector on line 0 (is this really the text format?)

Any suggestions? There are spaces in the node names (e.g., 'Leonardo da Vinci').
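The word2vec text format is whitespace-delimited: a header line with the count and dimension, then one line per key followed by its values. A name such as 'Leonardo da Vinci' therefore splits into extra fields, which is one plausible cause of this error (a text/binary format mismatch on load is another). The sketch below writes the text format with spaces in names replaced. write_word2vec_text is a hypothetical helper, not the library's save_vectors.

```python
import io


def write_word2vec_text(vectors, out, space_token="_"):
    """Write {name: vector} to `out` in word2vec text format.

    Whitespace inside names is replaced by `space_token` so each line
    has exactly one key followed by the vector values.
    """
    items = list(vectors.items())
    dim = len(items[0][1]) if items else 0
    out.write(f"{len(items)} {dim}\n")  # header: vocabulary size and dimension
    for name, vec in items:
        token = space_token.join(str(name).split())
        out.write(token + " " + " ".join(repr(float(x)) for x in vec) + "\n")


buf = io.StringIO()
write_word2vec_text({"Leonardo da Vinci": [0.5, -1.0]}, buf)
```

The original names can be recovered by reversing the replacement, provided the chosen token does not already occur in any name.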

How to get node's list

I have trained and saved the model with

    import csrgraph as cg
    import nodevectors
    G = cg.read_edgelist("edges.txt", directed=False, sep=' ')
    ggvec_model = nodevectors.GGVec()
    embeddings = ggvec_model.fit_transform(G)
    ggvec_model.save("embeddings.emb")

Now I want to load and iterate over the embeddings but I'm unable to find any method that returns the nodes list.

import nodevectors
ggvec_model = nodevectors.GGVec()
ggvec_model.load("embeddings.emb.zip")

Dealing with unseen nodes

Hi! Thanks for your library!

I'm using it to vectorize a network graph: a graph of IPs that communicated with each other. What do you think might be an approach for dealing with new, previously unseen IPs (nodes)?
It seems like there is no option other than retraining the n2v model from scratch.

In my case skipping them is not an option, and I don't see how I can use tricks from NLP such as substituting synonyms for the unseen word.

I would be grateful for any thoughts or suggestions.

Cheers,
Alex

Print training progression (node2vec)?

Hi, is there any way to monitor training progression? Even with verbose=True, nothing gets printed out after "Mapping Walk Names... Done" (and if the training can be expected to take several hours, it's a bit annoying to have no idea if anything is actually happening).

G.mat is a non-square sparse matrix

Hello! Thanks for the great work!
I encountered an issue while using nodevectors to train ProNE embeddings.
I ran:
G = cg.read_edgelist("..", directed=True, sep=',')
g2v = ProNE()
g2v.fit(G)

and I got:

ValueError Traceback (most recent call last)
Input In [34], in <cell line: 2>()
1 g2v = ProNE()
----> 2 g2v.fit(G)

File ~/miniforge3/envs/alphaA/lib/python3.8/site-packages/nodevectors/prone.py:82, in ProNE.fit(self, graph)
78 G = cg.csrgraph(graph)
79 features_matrix = self.pre_factorization(G.mat,
80 self.n_components,
81 self.exponent)
---> 82 vectors = ProNE.chebyshev_gaussian(
83 G.mat, features_matrix, self.n_components,
84 step=self.step, mu=self.mu, theta=self.theta)
85 self.model = dict(zip(G.nodes(), vectors))

File ~/miniforge3/envs/alphaA/lib/python3.8/site-packages/nodevectors/prone.py:154, in ProNE.chebyshev_gaussian(G, a, n_components, step, mu, theta)
151 return a
152 print(G.shape)
--> 154 A = sparse.eye(nnodes) + G
155 DA = preprocessing.normalize(A, norm='l1')
156 # L is graph laplacian

File ~/miniforge3/envs/alphaA/lib/python3.8/site-packages/scipy/sparse/base.py:414, in spmatrix.__add__(self, other)
412 elif isspmatrix(other):
413 if other.shape != self.shape:
--> 414 raise ValueError("inconsistent shapes")
415 return self._add_sparse(other)
416 elif isdense(other):

ValueError: inconsistent shapes

I checked further and found that G.mat is a non-square sparse matrix with shape (830421 x 830420).
Could you please give me any clue about this?
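A shape that is off by one between rows and columns is what one would expect if each dimension is inferred separately from the largest source and destination index, for example when the highest-numbered node in a directed graph only ever appears as a source. Whether csrgraph infers the shape this way is an assumption. The sketch below computes a square shape that could be passed explicitly, for instance as shape= to scipy's csr_matrix. square_shape is a hypothetical helper.

```python
def square_shape(src, dst):
    """Return an (n, n) shape covering every node index in src and dst."""
    n_nodes = max(max(src), max(dst)) + 1
    return (n_nodes, n_nodes)


# Node 3 only appears as a source, so per-axis inference gives 4 rows x 3 columns.
src = [0, 1, 3]
dst = [1, 2, 0]
inferred = (max(src) + 1, max(dst) + 1)
fixed = square_shape(src, dst)
```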

How to increase number of components(features) in output vectors

Currently, n_components is set to 32 in all available algorithms such as Node2Vec and GGVec. How can I increase it to 128? I tried modifying the .py files of these algorithms to change 32 to 128, but it did not work: after setting n_components=128 in the .py files and importing the package again, the algorithms still output vectors with 32 components.

ProNE option: "inconsistent shapes" error

I get:

raise ValueError("inconsistent shapes")

from:
./nodevectors/prone.py line 61, in fit_transform
./nodevectors/prone.py line 152, in chebyshev_gaussian
.../scipy/sparse/_base.py line 471, in __add__

The defaults work for small graphs (tens of thousands of nodes) but fail for a graph with 7M nodes and 50M edges.

Issue with gensim 4.0.0+

It appears one of the argument names has changed in the newly released version of gensim. This has also caused some pain in other libraries using this package for node2vec implementations (e.g., krishnanlab/PecanPy#16)

Traceback (most recent call last):
  File "embed_nodevectors.py", line 150, in <module>
    main()
  File "/Users/cthoyt/.virtualenvs/indra/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/cthoyt/.virtualenvs/indra/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/cthoyt/.virtualenvs/indra/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/cthoyt/.virtualenvs/indra/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "embed_nodevectors.py", line 137, in main
    model.fit(graph)
  File "/Users/cthoyt/.virtualenvs/indra/lib/python3.8/site-packages/nodevectors/node2vec.py", line 130, in fit
    self.model = gensim.models.Word2Vec(
TypeError: __init__() got an unexpected keyword argument 'size'

Node2Vec: About return_weight and neighbor_weight

Dear author,
I read the source code of Node2Vec and found that the default values of return_weight and neighbor_weight are both 1. Isn't that DeepWalk?
However, if I change the values of return_weight and neighbor_weight, it becomes very slow. I want to customize the embedding between BFS and DFS; how can I keep it fast?

Bug when training gensim word2vec model on random walks

import networkx as nx
from nodevectors import Node2Vec
# the edgelist file has 895608 lines
G = nx.read_weighted_edgelist('edgelist', create_using=nx.DiGraph)
g2v = Node2Vec(n_components=dimension,verbose=True)
g2v.fit(G)

Here is the error trace.

  File "./lib/utils/twitter_data.py", line 410, in _learn_node2vec_nodevectors
    g2v.fit(G)
  File "./venv/lib64/python3.6/site-packages/nodevectors/node2vec.py", line 133, in fit
    **self.w2vparams)
  File "./venv/lib64/python3.6/site-packages/gensim/models/word2vec.py", line 591, in __init__
    self.wv = Word2VecKeyedVectors(size)
  File "./venv/lib64/python3.6/site-packages/gensim/models/keyedvectors.py", line 380, in __init__
    super(WordEmbeddingsKeyedVectors, self).__init__(vector_size=vector_size)
  File "./venv/lib64/python3.6/site-packages/gensim/models/keyedvectors.py", line 218, in __init__
    self.vectors = zeros((0, vector_size), dtype=REAL)
TypeError: 'str' object cannot be interpreted as an integer

Jupyter notebook kernel dies while computing ggvec embeddings

I am loading about 7 million edges into a graph object using networkx and then running
import nodevectors
ggvec_model = nodevectors.GGVec()
embeddings = ggvec_model.fit_transform(G)

After running for a few minutes, the Jupyter notebook kernel dies. Is there any way forward in this scenario?

numba dependency is not pinned; latest version of numba breaks nodevectors

Recently numba removed jitclass from its module __init__. This breaks the import of jitclass.
numba/numba@4976953

Nodevectors imports jitclass so the dependency needs to be pinned.

Pinning to 0.51.2 fixes the import of nodevectors.

w2vparams["batch_words"] default parameter cripples node2vec's performance

The Node2Vec class constructor sets the default value of w2vparams["batch_words"] to 128. The default value in gensim's lib is 10000. According to their docs:

batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines).(Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)

I don't know what exactly it does behind the scenes, but using the current default value of 128 severely affects the training performance.

Line of code:

nodevectors/nodevectors/node2vec.py

Line 28 in 5acc519

"batch_words":128}):

Does this work for weighted graphs?

As the title says.

I run the given short example... (partial success)

Hello Mr. Matt Ranger,

I installed the nodevectors package on my Mac OS Sierra, verified with 'pip list' that all the required Python packages are available, and then tried to run the given short example as a filename.py file. Here is the command-line trace:

% python networkx-test.py
Making walks... Done, T=2.94
Mapping Walk Names... Done, T=0.08
Training W2V... Done, T=0.85
Traceback (most recent call last):
  File "networkx-test.py", line 19, in <module>
    g2v = Node2vec.load('node2vec.pckl') # it gets blocked at this point.
NameError: name 'Node2vec' is not defined

...any hint/feedback/re-testing would be appreciated.
Thank you, BR

Tuning model

Is there a way to pass parameters (e.g., epochs=100) on the command line?

For example:
g2v = Node2Vec()
g2v.fit(GCC, walklen=30, epochs=100 )

Thanks!

ProNE multithread

Just want to know if ProNE is multithreaded? Is there a way to control the number of threads like the implemented Node2Vec?

enable export walks

Very nice project. Here is a suggestion: Would be great to be able to call n2v.walks and get a list of all generated random walks after running the fit(). I think it should be an easy upgrade :)

incorrect parameter mentioned in Node2Vec docstring: "explore_weight"

Node2Vec accepts neighbor_weight parameter, however docstring mentions it as explore_weight parameter. Doc needs to be updated probably.

About painting

Hello, could you share the drawing code for Wikipedia 6M.png and 3d graph.png?

Why is generating walks so slow with non-default parameters?

I initially arrived at this code via your blog post https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-models-fastest-random-walks-on-a-graph/#note-3-692 - and indeed the speedup with default parameters (q=1,p=1) is impressive.

But as you also mention in the readme, much of that is lost when using non-default parameters. I have a network of 100k nodes and 1M edges, and the "default" walk generation takes 14 seconds, while trying different parameters takes well over 10 hours. Is there anything that can be done to improve speed for different values of p and q? Much of the flexibility of Node2Vec comes from being able to capture local vs. global information by tuning the parameters, and even the Node2Vec paper shows that the best results are usually obtained with values for p and q that are different from 1.
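For context on where the extra cost comes from: with p = q = 1 the next step depends only on the current node, but for other values each step must also compare every candidate against the previous node and its neighborhood. The sketch below spells out the transition weights as defined in the node2vec paper, in plain Python. It is illustrative only and says nothing about how the library implements its walks; both function names are hypothetical.

```python
import random


def transition_weights(adjacency, prev, cur, p, q):
    """Unnormalized node2vec weights for stepping from `cur`, having come from `prev`."""
    weights = []
    for nxt in adjacency[cur]:
        if nxt == prev:
            weights.append(1.0 / p)   # return to the previous node
        elif nxt in adjacency[prev]:
            weights.append(1.0)       # stay at distance 1 from the previous node
        else:
            weights.append(1.0 / q)   # move further away
    return weights


def node2vec_step(adjacency, prev, cur, p, q, rng):
    """Sample the next node of a second-order (p, q) biased walk."""
    weights = transition_weights(adjacency, prev, cur, p, q)
    return rng.choices(adjacency[cur], weights=weights, k=1)[0]


graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
uniform = transition_weights(graph, prev=0, cur=1, p=1.0, q=1.0)
biased = transition_weights(graph, prev=0, cur=1, p=2.0, q=0.5)
next_node = node2vec_step(graph, 0, 1, 2.0, 0.5, random.Random(0))
```

When p = q = 1 every weight collapses to 1, so the membership test against the previous node's neighbors can be skipped entirely, which is why the default case is so much cheaper.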

Handling edge weights?

Hello,

First, thanks for a great package. The performance boost compared to other implementations is pretty incredible.

One thing I don't see is support for using edge weights in the input graph. Is there a way to do this now, or are there plans to add this functionality?

All the best,
Chad

Could not broadcast input array

Hi VHRanger,
Thanks for your great work. I am trying to run node2vec on a graph with around 4 million nodes and more than 48 million edges, but I got the error below. Can you give me some advice for dealing with a graph this big?

sys:1: DtypeWarning: Columns (1) have mixed types. Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "node2vec/graph_builder/csr.py", line 26, in <module>
    gr = CSRGraphNode2Vec()
  File "node2vec/graph_builder/csr.py", line 9, in __init__
    self.graph = cg.read_edgelist(file_path, directed=False, sep=',')
  File "/data/quocpbc/anaconda3/lib/python3.8/site-packages/csrgraph/graph.py", line 523, in read_edgelist
    G = methods._edgelist_to_graph(
  File "/data/quocpbc/anaconda3/lib/python3.8/site-packages/csrgraph/methods.py", line 31, in _edgelist_to_graph
    new_src[1:] = np.cumsum(np.bincount(src, minlength=nnodes))
ValueError: could not broadcast input array from shape (2147483649) into shape (4790294)

pypi

Hi,
Your package is great, but you should really put it on PyPi to make the installation easier.

Reading the edgelist using CSRGraphs.

Thanks for this great work.

I have a big graph of about 10 GB. I use CSRGraph to load the edgelist and compute the node embeddings using node2vec, but I run into a problem while reading the graph. Here is the error I encountered:

import csrgraph as cg
G = cg.read_edgelist("karate.txt",sep = "\t")
TypeError: sort_values() got an unexpected keyword argument 'ignore_index'

Any suggestions to fix this?
Thanks in advance.

Suggestion: support corpus_file parameter

Hi.

It would be great if nodevectors could support the word2vec's corpus_file parameter that allows for file-based fast training.

What do the devs think about that?

is it possible to split n2v to generate walks only?

Hi! ,

I am using node2vec to generate walks on graphs, which I then pass to a different gensim build modified by another tool (this is for alignment of temporal models).

Given the speed of carrying out walks with nodevectors, is it possible to separate the walks from the .fit method? That is, is there an option to ONLY carry out the walks without fitting the model, so that I can save the walks and take them on to the next tool?

thanks!

Has node2vec implementation been updated to use skip-gram as default?

Hi!,

Related to #40

I was wondering if node2vec now uses skip-gram by default (I cannot see it anywhere in the source code, but I am sure I am missing it!)

If it doesn't, does adding w2vparams as in the following code set sg=1?

n2v = Node2Vec(n_components=32, walklen=80, epochs=100, keep_walks=True, w2vparams={'sg':1}) 
n2v.fit(nx_graph)

I want to be sure this is correct, because when I set {'sg': 50} (just a very silly example to provoke an error), no error is thrown, so I wonder whether w2vparams={'sg':1} is actually selecting skip-gram instead of CBOW or whether I am doing something incorrectly. Any advice (or the right way to do it) is appreciated :)

Secondly: instead of saving embeddings and then loading them as keyedvectors with word2vec - is there a way of converting the fitted object (n2v above) directly to a Word2Vec gensim object?

Thank you!

Continue fitting process

Hi,
Can I update node embeddings given an already trained model? I want to fit a model, then periodically update the network and the node embeddings without starting from scratch.
