monarch-initiative / embiggen

🍇 Embiggen is the Python Graph Representation learning, Prediction and Evaluation submodule of the GRAPE library.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

graph graph-representation-learning machine-learning

embiggen's Introduction

🍇 Embiggen

Embiggen is the graph machine learning submodule of the 🍇 GRAPE library.

How to install Embiggen

To install the complete GRAPE library, run:

pip install grape

To install only the Embiggen package, run:

pip install embiggen

Cite GRAPE

Please cite the following paper if it was useful for your research:

@misc{cappelletti2021grape,
  title={GRAPE: fast and scalable Graph Processing and Embedding}, 
  author={Luca Cappelletti and Tommaso Fontana and Elena Casiraghi and Vida Ravanmehr and Tiffany J. Callahan and Marcin P. Joachimiak and Christopher J. Mungall and Peter N. Robinson and Justin Reese and Giorgio Valentini},
  year={2021},
  eprint={2110.06196},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

embiggen's People

callahantiff deepakunni3 justaddcoffee leopompidou realmarcin caufieldjh smartniz cthoyt lukematic remylau vishalbelsare bjoernbuth

embiggen's Issues

AttributeError, link prediction failed on Sumner

The code ran for 1 day and 3 hours, but failed due to the error:

Traceback (most recent call last):
File "/projects/robinson-lab/vidar/Node2vec_1/N2V/runLinkPrediction_ppi.py", line 53, in
model.train(display_step=2)
File "/projects/robinson-lab/vidar/Node2vec_1/N2V/xn2v/word2vec.py", line 421, in train
self.run_optimization(batch_x, batch_y)
File "/projects/robinson-lab/vidar/Node2vec_1/N2V/xn2v/word2vec.py", line 399, in run_optimization
gradients = g.gradient(loss, [self.embedding, self.nce_weights, self.nce_biases])
File "/home/ravanv/.local/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 980, in gradient
unconnected_gradients=unconnected_gradients)
File "/home/ravanv/.local/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py", line 76, in imperative_grad
compat.as_str(unconnected_gradients.value))
AttributeError: 'RefVariable' object has no attribute '_id'

error in test_version.py

Does anyone know what tests/test_version.py tests? Do we need it? I get an error when I run it.

ERROR: Failure: ModuleNotFoundError (No module named 'validate_version_code')

Thank you.

[item for sublist in sentences for item in sublist] not working

flat_list_of_words = [item for sublist in sentences for item in sublist]

This seems to be turning texts into single letters, and the unit tests are failing.
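One possible cause, assuming `sentences` sometimes arrives as a flat list of strings rather than a list of token lists: iterating over a string yields its characters, so the comprehension splits words into letters. A minimal sketch of the behavior and a guard against it (the `flatten` helper is hypothetical, not part of the package):

```python
def flatten(sentences):
    """Flatten a list of token lists; treat a bare string as a single token."""
    flat = []
    for sublist in sentences:
        if isinstance(sublist, str):
            # Iterating a str would yield single characters, so keep it whole.
            flat.append(sublist)
        else:
            flat.extend(sublist)
    return flat


nested = [["the", "cat"], ["sat"]]
plain = ["the", "cat"]

# The original comprehension works on nested input ...
nested_result = [item for sublist in nested for item in sublist]
# ... but splits strings into letters when the input is already flat.
plain_result = [item for sublist in plain for item in sublist]

print(nested_result)
print(plain_result)
print(flatten(plain))
```

If the input is always meant to be a list of lists, checking the type of `sentences[0]` in the caller may be the simpler fix.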

Unable to run test suite

Hello,

I am Luca, a PhD student from the AnacletoLab (University of Milan). I have started to take a look at the code, and I began by running the existing test suite.

However, it fails because it tries to import the class Graph from hetnode2vec at line 6 of the file test_node2vec.py:

from hn2v.hetnode2vec import Graph

I am guessing that the class was previously called Graph and was later renamed to N2vGraph, and that the test was not updated. I have therefore tried to replace the class, but the two classes accept a different number of parameters.

To avoid the test suite breaking in the future I would propose using Travis-CI, which is free to use for private repositories for students and professors for research purposes. I have experience in setting up a test pipeline with Travis-CI, if required.

Need file for testing read_databz2 of class TextEncoder

I need a bz2 file to test the read_databz2 method of the class TextEncoder.

Thank you!

model_to_dot

We should explore the model_to_dot facilities of keras for generating decent-looking graphics for publication, e.g.,

from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
SVG(model_to_dot(model, show_shapes=True, show_layer_names=False, 
                 rankdir='TB').create(prog='dot', format='svg'))

Cannot execute parse: missing test file

Hello everyone,

May I ask where I can find the file small_g2d_test.txt?

I am trying to extend the test suite and, without it, I cannot run the parse method of the n2vParser class.

rationale for the max connected component and effects on embedding and prediction tasks

@pnrobinson @vidarmehr we are curious about the rationale for selecting the max component of the graph as input for the random walk. The BioNEV paper does not seem to do that. I believe that picking the max component dropped about 5k genes, which is quite a few. Presumably this drops the less connected portions of the graph, but those are also harder for link prediction because they are sparser. There may also be an effect on the negative sampling of edges, because the edges in components smaller than the max component will have been dropped.

Along these lines, we are thinking of metrics to apply for the various steps in the pipeline so we can better track the data flow and effects.

Check number of edges of the Karate training graph

There are 75 lines in karate.train. So, I think the total number of edges (using CSFgraph) should be 150. But, it is 148.
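One way to investigate, assuming each line of karate.train is one undirected edge that the graph stores in both directions: a repeated edge (in either orientation) or a self-loop would make the total smaller than twice the line count. The sketch below is a stand-alone check, not the CSFgraph loader, and `count_directed_edges` is a hypothetical helper name.

```python
def count_directed_edges(lines):
    """Count distinct directed edges when every line is stored in both directions."""
    directed = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        a, b = fields[0], fields[1]
        directed.add((a, b))
        directed.add((b, a))
    return len(directed)


# "2 1" repeats "1 2" in the other orientation, so it adds nothing new.
with_reverse_duplicate = count_directed_edges(["1 2", "2 1", "2 3"])
# A self-loop contributes one directed edge instead of two.
with_self_loop = count_directed_edges(["1 1"])
print(with_reverse_duplicate, with_self_loop)
```

Under this assumption, a single duplicated edge among the 75 lines would be enough to explain 148 instead of 150, but the file itself would need to be checked.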

Parallel code not working

This is the new code

for [orig_node, alias_node] in pool.map(self._get_alias_node, g.nodes()):
    alias_nodes[orig_node] = alias_node
    dateTimeObj = datetime.now()
    print(dateTimeObj)
    print("Processed %d nodes" % len(alias_node))
pool.close() # added just now but does not help

However, when using a "realistic" graph with 20,000 nodes and 400,000 edges, the code gets stuck here. It outputs the following:

(...19,996 other times)
2020-01-17 14:19:20.817591
Processed 2 nodes
2020-01-17 14:19:20.817595
Processed 2 nodes
2020-01-17 14:19:20.817600
Processed 2 nodes

htop shows that 8 processors are working full steam. However, python appears to create 20,000 processes, which surely is not the best way of doing things. I am not sure when this will be finished but I let this go for 3 hours yesterday and did not get past this point.

click error messages are not helpful

Running runAnalysis.py -t gives the error message that the option -t is not recognized, but it should say that a command (e.g. disease-gene-embedding) is required.

Also, can we delete the click branch?

trying a more fair training/testing split like 80/20

@vidarmehr would it be possible for you to try an 80/20 training/testing split for your link prediction runs? This will help avoid overfitting and please reviewers, and help us compare to the BioNEV paper.
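A minimal sketch of such a split, assuming the positive edges are available as a list of node pairs. The `split_edges` function and its parameters are hypothetical, and the sketch does not check that the training graph stays connected after test edges are removed.

```python
import random


def split_edges(edges, train_fraction=0.8, seed=42):
    """Shuffle the edges reproducibly and split them into train and test lists."""
    edges = list(edges)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rng.shuffle(edges)
    cut = int(len(edges) * train_fraction)
    return edges[:cut], edges[cut:]


edges = [(i, i + 1) for i in range(10)]
train, test = split_edges(edges)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which makes results easier to compare.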

We missed you on the call today -- some of the discussion is pasted in gitter so you can take a look.

TestSkipGramWord2Vec

This currently does not actually test anything. It is hard to test the actual production of embedded vectors, but we should move the production of the vectors to a setup method and then test a few things such as the number of walks etc.

Publish embiggen on PyPI

Luca has plenty of experience with this, but for reference I located some relevant tutorials.

How to publish on PyPI:
How to Publish an Open-Source Python Package to PyPI
very thorough step-by-step guide

How to write your own Python Package and publish it on PyPi

How to organize a Python package:
Python Application Layouts: A Reference

Python Packaging User Guide

How to handle logging:
logging — Logging facility for Python
part of standard library, doc includes links to tutorials

test_raw_probs

Check this unit test -- we commented it out.

Move to Monarch GitHub

We cannot use CI on the JAX github and thus we will need to move this repo. I will do so later on today -- please shout if there are any issues

add get batch from list of lists for CBOW

Consider adding this to superclass

extend or append

should this be extend or append?? -- next_batch (SG)

        batch = np.append(batch, current_batch)
        labels = np.append(labels, current_labels, axis=0)
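For reference, NumPy arrays have no `extend` method; `np.append` returns a new array, and its `axis` argument decides whether the inputs are flattened or stacked. A small sketch with made-up shapes (the real batch and label shapes in `next_batch` are an assumption here):

```python
import numpy as np

batch = np.array([1, 2])
labels = np.array([[10], [20]])
current_batch = np.array([3, 4])
current_labels = np.array([[30], [40]])

# Without axis, np.append flattens both inputs and returns a new 1-D array.
new_batch = np.append(batch, current_batch)
# With axis=0, rows are stacked and the column dimension is preserved.
new_labels = np.append(labels, current_labels, axis=0)

print(new_batch.shape, new_labels.shape)

# np.append copies on every call; collecting chunks in a Python list and
# concatenating once gives the same result with a single copy.
chunks = [batch, current_batch]
concatenated = np.concatenate(chunks)
```

So for arrays this behaves like `extend` on the batch and like row-wise stacking on the labels.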

Create link prediction scoring funcs utility file

Creating this issue as part of the conversation in #105.

TODO - Add Link Prediction Scoring functions to Utility Script

- Common Neighbors
- Jaccard’s Coefficient
- Adamic-Adar
- Preferential Attachment

See here for additional scoring functions worth considering, including:
- Degree Product
- Sørensen Similarity
- Leicht-Holme-Newman Similarity
- Shortest Path
- Resource Allocation
- Katz
- SimRank
- Rooted Page Rank
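The first four scores can be sketched in plain Python over an adjacency dictionary that maps each node to its set of neighbors. This is an illustration under that assumption; the function names and the `adj` structure are hypothetical and not the utility file's eventual API.

```python
import math


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over shared neighbors, skipping degree-1 nodes."""
    return sum(
        1.0 / math.log(len(adj[w]))
        for w in adj[u] & adj[v]
        if len(adj[w]) > 1  # log(1) == 0 would divide by zero
    )


def preferential_attachment(adj, u, v):
    """Product of the two node degrees."""
    return len(adj[u]) * len(adj[v])


# Small undirected toy graph: edges a-b, a-c, b-c, b-d, c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
scores = {
    "common_neighbors": common_neighbors(adj, "a", "d"),
    "jaccard": jaccard(adj, "a", "d"),
    "adamic_adar": adamic_adar(adj, "a", "d"),
    "preferential_attachment": preferential_attachment(adj, "a", "d"),
}
print(scores)
```

Keeping every score behind the same `(adj, u, v)` signature would let the link prediction code swap scoring functions without other changes.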

@vidarmehr - Does that cover everything?

remove/move test file

tests/test_n2v.py

This file is causing a Travis error with the coverage because url.request is not working. The file is testing import functions that we want to move out of this project anyway. I would suggest we move the test elsewhere, @vidarmehr.

New branch?

@pnrobinson

Hi Peter,
Did you create a new branch called develop? I don't see it. Could you please check in the new branch?
Thank you.

How to format input data for GloVe

Hi everybody,
I am adapting a keras version of GloVe to use the tf keras (only small changes needed).
I am wondering if we want to have a standard way of passing the data to the models, and if we
are planning on changing anything in the present constructors for word2vec.
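This is an excerpt of what the function looks like:

import numpy as np
from scipy.sparse import lil_matrix
from tensorflow.keras.preprocessing.text import Tokenizer

# `docs` is the list of input texts, defined elsewhere
v_size = 3000
tokenizer = Tokenizer(num_words=v_size, oov_token='UNK')
tokenizer.fit_on_texts(docs)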

def generate_cooc_matrix(text, tokenizer, window_size, n_vocab, use_weighting=True):
    sequences = tokenizer.texts_to_sequences(text)

    cooc_mat = lil_matrix((n_vocab, n_vocab), dtype=np.float32)
    for sequence in sequences:
        for i, wi in zip(np.arange(window_size, len(sequence) - window_size), sequence[window_size:-window_size]):
            context_window = sequence[i - window_size: i + window_size + 1]
            distances = np.abs(np.arange(-window_size, window_size + 1))
            distances[window_size] = 1.0
            nom = np.ones(shape=(window_size * 2 + 1,), dtype=np.float32)
            nom[window_size] = 0.0

            if use_weighting:
                cooc_mat[wi, context_window] += nom / distances  # Update element
            else:
                cooc_mat[wi, context_window] += nom
    return cooc_mat

I am a little worried this might not be efficient enough for the datasets that we want to use, but hopefully we can speed things up iteratively.

Various CI service integration

Currently, it is not possible to integrate the repository with various CI services to automate testing and code evaluation. How is the process of approval by the admins going?

I believe that any person that is part of TheJacksonLaboratory organization on GitHub should be able to add the integrations, maybe Peter could be added to it?

add support in CsfGraph for instantiating using a networkx graph

Per convo with @deepakunni3 and @realmarcin, it'd be nice if we could instantiate a CsfGraph by handing it a networkx graph, so we could then leverage KGX to easily import graphs in any format KGX supports (currently includes rdf, owl, pandas, json, tsv, csv, networkx, neo4j)

Glad to do this if it sounds reasonable

EarlyStopping

Consider using
keras.callbacks.EarlyStopping
However, since we are training in minibatches, there will be a lot of fluctuation, so we need to have a long memory (a large patience value).
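In Keras, the `patience` and `min_delta` arguments of `EarlyStopping` control how long the callback waits for an improvement. The idea of a "long memory" over noisy minibatch losses can also be sketched without any framework: smooth the loss with an exponential moving average and stop only after several steps without improvement. The class below is a hypothetical stand-in for illustration, not the Keras callback.

```python
class SmoothedEarlyStopper:
    """Stop when the smoothed loss has not improved for `patience` updates."""

    def __init__(self, patience=3, min_delta=0.0, smoothing=0.9):
        self.patience = patience
        self.min_delta = min_delta
        self.smoothing = smoothing  # weight given to the previous smoothed value
        self.smoothed = None
        self.best = float("inf")
        self.wait = 0

    def update(self, loss):
        """Feed one minibatch loss; return True when training should stop."""
        if self.smoothed is None:
            self.smoothed = loss
        else:
            self.smoothed = (
                self.smoothing * self.smoothed + (1 - self.smoothing) * loss
            )
        if self.smoothed < self.best - self.min_delta:
            self.best = self.smoothed
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience


# smoothing=0.0 disables the moving average so the example is easy to follow.
stopper = SmoothedEarlyStopper(patience=2, smoothing=0.0)
decisions = [stopper.update(loss) for loss in [1.0, 0.5, 0.6, 0.7]]
print(decisions)
```

A higher `smoothing` value damps minibatch noise at the cost of reacting more slowly to a real plateau.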

travis complaining

test_word2vec.py, line 76, is causing problems because Travis cannot find the input file. I switched the relative path to a computed absolute path, but this did not fix the problem. I am unsure what the issue is; does anybody have an idea?

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/monarch-initiative/N2V/tests/test_word2vec.py", line 76, in setUp
    training_graph = CSFGraph(pos_train)

  File "/home/travis/build/monarch-initiative/N2V/xn2v/csf_graph/csf_graph.py", line 18, in __init__
    raise TypeError("Could not find graph file {}".format(filepath))

TypeError: Could not find graph file /home/travis/build/monarch-initiative/N2V/data/ppismall/pos_train_edges

generate_hn2v_input_file "run pairdc" message

I'm trying to run the generate_hn2v_input_file.py script and it triggers the following message:

You need to run pairdc and put its output file (e.g., g2d_associations_training_6_2014.tsv) into the data directory.
Could not find data/g2d_associations_training_4_2014.tsv

I know that I need to do something to the downloaded GTEx data file, but what is "pairdc"? From the printed message, I am unclear on how to complete the task. If this script is included in the package, I think this message might also be a source of confusion for end users. Perhaps we can add some additional detail to the message?

Method `predict_links` fails on execution

Some methods from the class LinkPrediction fail the execution in a basic example (that might be wrong) that I tried to setup in the class test suite.

The methods are:

Are the parameters used for the example wrong?

Class ContinuousBagOfWordsWord2Vec is never used

Since the class ContinuousBagOfWordsWord2Vec is never used, I was wondering whether it ever will be; if not, we should remove it.

If it is useful, which parameters could I use for adding a simple test for the class?

Error in running link prediction on entire graph

An error occurred when I ran link prediction on the entire graph. The entire graph has almost 200 components. I am working on this issue now.

Codacy issue on assert tf.version >= "2.0"

I created a branch called "check_tensorflow_version". I have added
assert tf.__version__ >= "2.0" to word2vec.py to check the version of tensorflow. When I create a pull request, codacy gives an issue:

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

B101: Test for use of assert
This plugin test checks for the use of the Python assert keyword. It was discovered that some projects used assert to enforce interface constraints. However, assert is removed with compiling to optimised byte code (python -o producing *.pyo files). This caused various protections to be removed. The use of assert is also considered as general bad practice in OpenStack codebases.

Please see https://docs.python.org/2/reference/simple_stmts.html#the-assert-statement for more info on assert

Does anyone know what the correct way of checking the version of tensorflow is? Can we just write 'tensorflow>=2.0' in setup.py in install_requires?
Thank you.
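Declaring `tensorflow>=2.0` in `install_requires` is the usual way to state the dependency, although it is enforced at install time rather than at run time. If a run-time check is still wanted, one option is to raise an exception instead of using assert, and to compare numeric tuples rather than strings. A minimal sketch, where `parse_version` and `require_min_version` are hypothetical helpers:

```python
import re


def parse_version(version, width=2):
    """Turn a version string such as '2.1.0rc1' into a tuple of ints, e.g. (2, 1)."""
    parts = []
    for token in version.split(".")[:width]:
        match = re.match(r"\d+", token)  # leading digits only, ignores 'rc1' etc.
        parts.append(int(match.group()) if match else 0)
    while len(parts) < width:
        parts.append(0)  # pad short versions such as '2'
    return tuple(parts)


def require_min_version(version, minimum=(2, 0)):
    """Raise RuntimeError if `version` is older than `minimum`."""
    if parse_version(version, len(minimum)) < minimum:
        wanted = ".".join(str(part) for part in minimum)
        raise RuntimeError(f"version {version} found, but >= {wanted} is required")


# String comparison is lexicographic, so it misorders multi-digit versions.
string_compare_is_wrong = "10.0" < "2.0"
tuple_compare_is_right = parse_version("10.0") > parse_version("2.0")

require_min_version("2.1.0")  # passes silently
try:
    require_min_version("1.15.2")
    old_version_rejected = False
except RuntimeError:
    old_version_rejected = True
```

In word2vec.py the call would presumably be `require_min_version(tf.__version__)`; unlike assert, the check survives `python -O`.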

Graph class too dependent on node names

Right now the graph class expects the type of each node to be encoded in the first character of its name. We should consider keeping that for now, but deactivating the heterogeneous algorithm if any node name does not start with a letter.

CSFGraph is returning the string representations of nodes but we want ints

Write a function similar to

def neighbors(self, source):

but it should return ints
Call it

def neighbors_as_ints(self, source):

After this is tested, it should become the main method, and the original method should be renamed get_neighbors_as_string.
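A minimal sketch of the idea, using a toy class rather than the real CSFGraph (`TinyGraph` and its attributes are hypothetical): node names are mapped to consecutive integers once at construction, the integer method is the primary one, and the string method is a thin wrapper.

```python
class TinyGraph:
    """Toy undirected graph that stores neighbors as integer indices."""

    def __init__(self, edges):
        nodes = sorted({name for edge in edges for name in edge})
        self.index_to_node = nodes
        self.node_to_index = {name: i for i, name in enumerate(nodes)}
        self._adj = {i: set() for i in range(len(nodes))}
        for a, b in edges:
            ia, ib = self.node_to_index[a], self.node_to_index[b]
            self._adj[ia].add(ib)
            self._adj[ib].add(ia)

    def neighbors_as_ints(self, source):
        """Return the sorted integer indices of the neighbors of index `source`."""
        return sorted(self._adj[source])

    def neighbors_as_strings(self, source):
        """Return neighbor names for a node given by name."""
        index = self.node_to_index[source]
        return [self.index_to_node[i] for i in self.neighbors_as_ints(index)]


graph = TinyGraph([("g1", "g2"), ("g1", "p1")])
int_neighbors = graph.neighbors_as_ints(graph.node_to_index["g1"])
string_neighbors = graph.neighbors_as_strings("g2")
print(int_neighbors, string_neighbors)
```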

Adding type checking to tests

Wanted to get your thoughts on using mypy or pytype. pytype is from Google and, while mentioned in the Style Guide, is not explicitly stated as the only type checker that can be used.

mypy: Once we have added type hints to all scripts, we could extend the existing tests to include type checking. The documentation seems pretty straightforward (example from RealPython):

# headlines.py

def headline(text: str, align: bool = True) -> str:
    if align:
        return f"{text.title()}\n{'-' * len(text)}"
    else:
        return f" {text.title()} ".center(50, "o")

print(headline("python type checking"))
print(headline("use mypy", align="center"))

$ mypy headlines.py
headlines.py:10: error: Argument "align" to "headline" has incompatible
                        type "str"; expected "bool"

click vs. argparse?

@vidarmehr and @pnrobinson - Just wanted to check in and see whether a decision was made about using argparse or click?

Happy to help make the conversion once a decision is made.

Callback to tensorboard

to visualize embeddings using tensorboard

See example https://www.tensorflow.org/guide/keras/train_and_evaluate

Use requirements.txt

Per documentation here

Although possibly there is a reason we aren't using requirements.txt

Add better support for node types

@pnrobinson I think we're near the point where we can test some KGs emitted by kg_covid_19.

These obviously will be heterogeneous graphs though, so we probably should make a decision about node types - i.e. how to refactor node types to not rely on the first character of the node name (node[0]). (See also #26)

A proposal:

- make a node class with an attr called node_type.
- change CsfGraph to optionally accept another file nodes.tsv with two columns, node\tnode_type, and use the second column to populate node_type in the node class. Nodes not mentioned in this file are set to node_type None, and all nodes are None if the file is not present.
- refactor all methods that deal with node types (e.g. __preprocess_transition_probs_xn2v()) to look for this info in the node class
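The nodes.tsv part of the proposal could look roughly like the sketch below. It assumes a headerless two-column, tab-separated file; `read_node_types` and `node_type` are hypothetical helper names, and a plain dictionary stands in for the proposed node class.

```python
import csv
import io


def read_node_types(handle):
    """Read node<TAB>node_type rows into a dict; skip blank or malformed rows."""
    node_types = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0]:
            continue
        node_types[row[0]] = row[1] or None  # empty type column becomes None
    return node_types


def node_type(node_types, node):
    """Return the type of `node`, or None when the node is not listed."""
    return node_types.get(node)


# An in-memory file stands in for nodes.tsv here.
sample = io.StringIO("g1\tgene\nd1\tdisease\n")
types = read_node_types(sample)
print(types, node_type(types, "p1"))
```

When no nodes.tsv is given, passing an empty dictionary yields None for every node, which matches the proposed default.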

Peter, others, thoughts?

pickle the state of the CSFGraph following creation of the alias probabilities

Make code that pickles the current state and also re-imports it. Should be unit tested. This could be a method of the CSFGraph class.

https://www.journaldev.com/15638/python-pickle-example
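A minimal round-trip sketch with the standard pickle module. A plain dictionary stands in for the graph state, and the `alias_nodes` / `alias_edges` keys and their contents are assumptions for illustration, not the actual CSFGraph attributes.

```python
import os
import pickle
import tempfile


def save_state(state, path):
    """Write `state` to `path` using the most compact pickle protocol available."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(path):
    """Read a previously pickled state back from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for the alias tables computed after graph creation.
state = {"alias_nodes": {0: ([1, 0], [0.5, 1.0])}, "alias_edges": {}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph_state.pkl")
    save_state(state, path)
    restored = load_state(path)

print(restored == state)
```

A unit test can assert that the restored state equals the original, as in the last line. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.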

Logging

More housekeeping 😄

@justaddcoffee - do you know if there is a consensus about how to handle within-file logging? I have noticed that some files have it, some don't, and others have it, but it's currently commented out.

How would you like me to proceed going forward?

started to implement CBOW

The size of stacked_embeddings is causing an error (word2vec.py, line 418).
Something seems to differ between the TF 1 and TF 2 APIs.

Adding pylint to travis

Probably could/should add pylint to our travis CI (unless there are any objections)

generate random negative edges

I think the generate random negative edges method does not generate completely "random" edges. IT.combination generates all combinations of edges, but I am not sure how random the edges we get from it are.
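For comparison, here is a sketch of uniform sampling of negative edges by rejection: draw random node pairs and keep those that are not positive edges. It assumes an undirected graph whose positive edges contain no self-loops and only use nodes from `nodes`; `sample_negative_edges` is a hypothetical name, not the existing method.

```python
import random


def sample_negative_edges(nodes, positive_edges, n_samples, seed=0):
    """Sample `n_samples` distinct node pairs that are not positive edges."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    positives = {frozenset(edge) for edge in positive_edges}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    if n_samples > max_negatives:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        pair = frozenset((u, v))
        if pair not in positives:
            negatives.add(pair)
    # Sort for a deterministic, readable result.
    return sorted(tuple(sorted(pair)) for pair in negatives)


positive_edges = [(0, 1), (1, 2)]
negatives = sample_negative_edges(range(5), positive_edges, n_samples=3)
print(negatives)
```

Rejection sampling avoids materializing every combination of nodes, which matters for graphs with tens of thousands of nodes, but it slows down when the graph is dense.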

key error when running runLinkPrediction_ppi.py

running runLinkPrediction_ppi.py on edges generated from the first 10k lines of the String DB, I'm getting a key error here:

https://github.com/monarch-initiative/N2V/blob/e159e15bedc89dacc5e8e30a2e53f8bfc8586daf/xn2v/hetnode2vec.py#L62

Walk iteration:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1415, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/jtr4v/PycharmProjects/N2V/runLinkPrediction_ppi.py", line 144, in <module>
    main(args)
  File "/Users/jtr4v/PycharmProjects/N2V/runLinkPrediction_ppi.py", line 137, in main
    walks = pos_train_g.simulate_walks(args.num_walks, args.walk_length)
  File "/Users/jtr4v/PycharmProjects/N2V/xn2v/hetnode2vec.py", line 85, in simulate_walks
    walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))
  File "/Users/jtr4v/PycharmProjects/N2V/xn2v/hetnode2vec.py", line 63, in node2vec_walk
    walk.append(cur_nbrs[self.alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])
KeyError: 4285

Seems like a bug - cur is an integer, but alias_nodes is a dict with gene IDs as keys, so I don't understand how alias_nodes[cur] on line 63 will ever work. (Possibly I'm running the code incorrectly though...)

Utility functions?

Should we add a file with various utility functions for working with the embeddings?
For instance, something like this to get the most similar words?

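import numpy as np

def get_cosine_sim(emb, valid_words, top_k):
    # L2-normalize each embedding row so that dot products are cosine similarities
    norm = np.sqrt(np.sum(emb**2, axis=1, keepdims=True))
    norm_emb = emb / norm
    in_emb = norm_emb[valid_words, :]
    similarity = np.dot(in_emb, np.transpose(norm_emb))
    # sort by descending similarity; column 0 is normally the query word itself, so skip it
    sorted_ind = np.argsort(-similarity, axis=1)[:, 1:top_k+1]
    return sorted_ind, valid_words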

Consider refactor with keras style

See Part II: Writing your own training & evaluation loops from scratch
https://www.tensorflow.org/guide/keras/train_and_evaluate

xn2v_parser

Remove this class -- it does not make sense to have it here. We still have the code in the original IDG2KG management repository.

SkipGram fails on embedding a protein-protein interaction file

I created a new branch called link_prediction_ppi in which I am testing graph embedding and link prediction on a protein-protein interaction graph. In the first step, I just tried to embed the training graph with SkipGram. The algorithm fails, while it runs successfully on the karate.train graph.
The error is:

Traceback (most recent call last):
File "/Users/ravanv/PycharmProjects/N2V/runLinkPrediction_ppi.py", line 52, in
model.train(display_step=2)
File "/Users/ravanv/PycharmProjects/N2V/xn2v/word2vec.py", line 414, in train
batch_x, batch_y = self.next_batch_from_list_of_lists(walkcount, self.num_skips, self.skip_window)
File "/Users/ravanv/PycharmProjects/N2V/xn2v/word2vec.py", line 377, in next_batch_from_list_of_lists
current_batch, current_labels = self.next_batch(sentence, batch_count, num_skips, skip_window)
File "/Users/ravanv/PycharmProjects/N2V/xn2v/word2vec.py", line 341, in next_batch
batch[i * num_skips + j] = buffer[skip_window]
ValueError: invalid literal for int() with base 10: 'ENSP00000264028'

Does anyone know what causes this issue? In karate.train, nodes are integers, but in the PPI graph, nodes are Ensembl IDs like ENSP00000264028. Do you think it might have caused the error?
Thank you.
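The traceback is consistent with a string node ID being written into an integer NumPy array. One possible fix, sketched under that assumption, is to map node names to integer indices before building batches. `build_vocabulary` and `encode_walks` are hypothetical helpers, and `NODE_B` / `NODE_C` are placeholder names rather than real identifiers.

```python
import numpy as np


def build_vocabulary(walks):
    """Assign a consecutive integer index to each distinct node name."""
    vocab = {}
    for walk in walks:
        for node in walk:
            vocab.setdefault(node, len(vocab))
    return vocab


def encode_walks(walks, vocab):
    """Convert each walk of node names into an int32 array of indices."""
    return [np.array([vocab[node] for node in walk], dtype=np.int32) for walk in walks]


walks = [["ENSP00000264028", "NODE_B"], ["NODE_B", "NODE_C", "ENSP00000264028"]]
vocab = build_vocabulary(walks)
encoded = encode_walks(walks, vocab)

# Writing a string ID straight into an integer array is rejected by NumPy.
batch = np.zeros(2, dtype=np.int32)
try:
    batch[0] = "ENSP00000264028"
    direct_assignment_failed = False
except (TypeError, ValueError):
    direct_assignment_failed = True

print(vocab, [walk.tolist() for walk in encoded])
```

Keeping the inverse mapping (index to name) allows the embedding rows to be reported with their original Ensembl IDs afterwards.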

Use keras tokenizer

Replace the self-made tokenizer.
This will reduce the verbosity of our code.
See example 4 here:
https://www.programcreek.com/python/example/106871/keras.preprocessing.text.Tokenizer

Absolute paths in code relative to Peter's computer