monarch-initiative / embiggen

🍇 Embiggen is the Python Graph Representation learning, Prediction and Evaluation submodule of the GRAPE library.

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
graph graph-representation-learning machine-learning

embiggen's Introduction

🍇 Embiggen

Embiggen is the graph machine learning submodule of the 🍇 GRAPE library.

How to install Embiggen

To install the complete GRAPE library, run:

pip install grape

To install only the Embiggen package, run:

pip install embiggen

Cite GRAPE

Please cite the following paper if it was useful for your research:

@misc{cappelletti2021grape,
  title={GRAPE: fast and scalable Graph Processing and Embedding}, 
  author={Luca Cappelletti and Tommaso Fontana and Elena Casiraghi and Vida Ravanmehr and Tiffany J. Callahan and Marcin P. Joachimiak and Christopher J. Mungall and Peter N. Robinson and Justin Reese and Giorgio Valentini},
  year={2021},
  eprint={2110.06196},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

embiggen's People

Contributors

callahantiff, caufieldjh, cmungall, deepakunni3, dependabot[bot], justaddcoffee, lucacappelletti94, pnrobinson, realmarcin, vidarmehr, zommiommy

embiggen's Issues

AttributeError, link prediction failed on Sumner

The code ran for 1 day and 3 hours, but failed due to the error:

Traceback (most recent call last):
File "/projects/robinson-lab/vidar/Node2vec_1/N2V/runLinkPrediction_ppi.py", line 53, in
model.train(display_step=2)
File "/projects/robinson-lab/vidar/Node2vec_1/N2V/xn2v/word2vec.py", line 421, in train
self.run_optimization(batch_x, batch_y)
File "/projects/robinson-lab/vidar/Node2vec_1/N2V/xn2v/word2vec.py", line 399, in run_optimization
gradients = g.gradient(loss, [self.embedding, self.nce_weights, self.nce_biases])
File "/home/ravanv/.local/lib/python3.6/site-packages/tensorflow/python/eager/backprop.py", line 980, in gradient
unconnected_gradients=unconnected_gradients)
File "/home/ravanv/.local/lib/python3.6/site-packages/tensorflow/python/eager/imperative_grad.py", line 76, in imperative_grad
compat.as_str(unconnected_gradients.value))
AttributeError: 'RefVariable' object has no attribute '_id'

error in test_version.py

Does anyone know what tests/test_version.py tests? Do we need it? I get an error when I run it.

ERROR: Failure: ModuleNotFoundError (No module named 'validate_version_code')

Thank you.

Unable to run test suite

Hello,

I am Luca, a PhD student from the AnacletoLab (University of Milan). I have started looking at the code, beginning by running the existing test suite.

However, it fails because it tries to import the class Graph from hetnode2vec at line 6 of the file test_node2vec.py:

from hn2v.hetnode2vec import Graph

I am guessing that the class was previously called Graph and was later renamed to N2vGraph, while the test was not updated. I tried replacing the class name in the test, but the two classes accept a different number of parameters.

To avoid the test suite breaking in the future I would propose using Travis-CI, which is free to use for private repositories for students and professors for research purposes. I have experience in setting up a test pipeline with Travis-CI, if required.

model_to_dot

We should explore the model_to_dot facilities of keras for generating decent-looking graphics for publication, e.g.,

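from IPython.display import SVG  # needed to render the SVG in a notebook
from keras.utils.vis_utils import model_to_dot

# `model` is an already-built Keras model
SVG(model_to_dot(model, show_shapes=True, show_layer_names=False,
                 rankdir='TB').create(prog='dot', format='svg'))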

Cannot execute parse: missing test file

Hello everyone,

May I ask where I can find the file small_g2d_test.txt?

I am trying to extend the test suite, and without it I cannot run the parse method of the n2vParser class.

rationale for the max connected component and effects on embedding and prediction tasks

@pnrobinson @vidarmehr we are curious about the rationale for selecting the max component of the graph as input for the random walk. The BioNEV paper does not seem to do that. I believe that picking the max component dropped about 5k genes, which is a fair number. Presumably this drops the less connected portions of the graph, but those are also harder for link prediction because they are more sparse. There may also be an effect on the negative sampling of edges, because some edges outside the max component will have been dropped.
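
One way to put numbers on this is to report how many nodes and edges fall outside the largest connected component before the random walks start. The following is a minimal sketch, assuming an undirected edge list of node-name pairs. The function names and the toy edges are hypothetical and not part of the repository.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return a list of node sets, one per connected component (undirected)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        components.append(component)
    return components


def largest_component_report(edges):
    """Summarize what is kept and dropped when only the largest component is used."""
    components = connected_components(edges)
    largest = max(components, key=len)
    all_nodes = set().union(*components)
    kept_edges = [(u, v) for u, v in edges if u in largest and v in largest]
    return {
        "n_components": len(components),
        "nodes_kept": len(largest),
        "nodes_dropped": len(all_nodes) - len(largest),
        "edges_kept": len(kept_edges),
        "edges_dropped": len(edges) - len(kept_edges),
    }


# Toy graph (made up): one component of four nodes and one of two nodes.
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g5", "g6")]
report = largest_component_report(toy_edges)
print(report)
```

Logging such a report at each pipeline step would make it easier to see whether the dropped portion is large enough to affect the negative sampling.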

Along these lines, we are thinking of metrics to apply for the various steps in the pipeline so we can better track the data flow and effects.

Parallel code not working

This is the new code

for [orig_node, alias_node] in pool.map(self._get_alias_node, g.nodes()):
    alias_nodes[orig_node] = alias_node
    dateTimeObj = datetime.now()
    print(dateTimeObj)
    print("Processed %d nodes" % len(alias_node))
pool.close()  # added just now but does not help

However, when using a 'realistic' graph with 20,000 nodes and 400,000 edges, the code gets stuck here. It outputs the following:

(...19,996 other times)
2020-01-17 14:19:20.817591
Processed 2 nodes
2020-01-17 14:19:20.817595
Processed 2 nodes
2020-01-17 14:19:20.817600
Processed 2 nodes

htop shows that 8 processors are working full steam. However, python appears to create 20,000 processes, which surely is not the best way of doing things. I am not sure when this will be finished but I let this go for 3 hours yesterday and did not get past this point.
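
Two details in the snippet may be worth checking. First, pool.map only returns once every task has finished, so the loop body and its printing run after all the work is done. Second, the printed value is len(alias_node), the length of a single result, which would explain the constant "Processed 2 nodes". The sketch below uses the standard-library multiprocessing.Pool with an explicit worker count and chunksize. It assumes the per-node work can be written as a module-level function; normalize_weights and compute_alias_nodes are hypothetical stand-ins for _get_alias_node, not the project's code.

```python
from multiprocessing import Pool


def normalize_weights(item):
    """Hypothetical per-node task: turn neighbor weights into probabilities."""
    node, weights = item
    total = float(sum(weights))
    return node, [w / total for w in weights]


def compute_alias_nodes(node_weights, processes=4, chunksize=256):
    """Run the per-node task in a fixed-size pool and collect results by node."""
    results = {}
    # The context manager shuts the pool down when the block exits.
    with Pool(processes=processes) as pool:
        # imap_unordered yields results as they complete; chunksize batches
        # many small tasks per dispatch to reduce inter-process overhead.
        tasks = pool.imap_unordered(normalize_weights, node_weights.items(),
                                    chunksize=chunksize)
        for done, (node, probabilities) in enumerate(tasks, start=1):
            results[node] = probabilities
            if done % 1000 == 0:
                print("Processed %d nodes" % done)
    return results
```

A call such as compute_alias_nodes(weights, processes=8) would normally sit under an if __name__ == "__main__": guard, which multiprocessing requires on platforms that start workers by spawning a fresh interpreter.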

click error messages are not helpful

Running runAnalysis.py -t gives the error message that the option -t is not recognized, but it should say that a command (e.g. disease-gene-embedding) is required.

Also, can we delete the click branch?

trying a more fair training/testing split like 80/20

@vidarmehr would it be possible for you to try an 80/20 training/testing split for your link prediction runs? This will help avoid overfitting, please reviewers, and help us compare to the BioNEV paper.
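
For reference, a reproducible 80/20 split of an edge list can be sketched with the standard library alone. The function name, the seed, and the toy edges are assumptions for illustration, not the project's splitting code.

```python
import random


def split_edges(edges, train_fraction=0.8, seed=42):
    """Shuffle edges reproducibly and split them into train and test lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy edge list (made up): a path of ten edges.
toy_edges = [(i, i + 1) for i in range(10)]
train, test = split_edges(toy_edges)
print(len(train), len(test))
```

A plain random split like this can leave some test nodes without any training edges, which matters for embedding-based prediction. The sketch does not handle that case.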

We missed you on the call today -- some of the discussion is pasted in gitter so you can take a look.

TestSkipGramWord2Vec

This currently does not actually test anything. It is hard to test the actual production of embedded vectors, but we should move the production of the vectors to a setup method and then test a few things, such as the number of walks.
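
A rough sketch of that structure with unittest is below. It uses setUpClass so the fixture is built once per class rather than once per test. simulate_walks here is a hypothetical uniform random-walk generator standing in for the project's walk code, and the tiny graph is made up.

```python
import random
import unittest


def simulate_walks(adjacency, num_walks, walk_length, seed=0):
    """Hypothetical stand-in for the project's walk generator (uniform walks)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks


class TestWalks(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Build the expensive fixture once for all tests in the class.
        cls.adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
        cls.num_walks = 5
        cls.walk_length = 4
        cls.walks = simulate_walks(cls.adjacency, cls.num_walks, cls.walk_length)

    def test_number_of_walks(self):
        expected = self.num_walks * len(self.adjacency)
        self.assertEqual(len(self.walks), expected)

    def test_walk_length(self):
        self.assertTrue(all(len(w) == self.walk_length for w in self.walks))

    def test_steps_follow_edges(self):
        for walk in self.walks:
            for a, b in zip(walk, walk[1:]):
                self.assertIn(b, self.adjacency[a])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWalks)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Structural properties such as walk count, walk length, and edge validity can be asserted exactly, even though the embedding values themselves cannot.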

Publish embiggen on PyPI

Luca has plenty of experience with this, but for reference I located some relevant tutorials.

How to publish on PyPI:
How to Publish an Open-Source Python Package to PyPI
very thorough step-by-step guide

How to write your own Python Package and publish it on PyPi

How to organize a Python package:
Python Application Layouts: A Reference

Python Packaging User Guide

How to handle logging:
logging - Logging facility for Python
part of standard library, doc includes links to tutorials

Move to Monarch GitHub

We cannot use CI on the JAX github and thus we will need to move this repo. I will do so later on today -- please shout if there are any issues

extend or append

should this be extend or append?? -- next_batch (SG)

        batch = np.append(batch, current_batch)
        labels = np.append(labels, current_labels, axis=0)
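
For NumPy arrays the relevant distinction is the axis argument rather than extend versus append. np.append without axis flattens both inputs first, while axis=0 joins them along the first dimension. A small sketch follows; the batch shapes are assumptions for illustration and are not taken from next_batch.

```python
import numpy as np

batch = np.array([1, 2, 3])
current_batch = np.array([4, 5])

labels = np.array([[10], [20], [30]])
current_labels = np.array([[40], [50]])

# Without axis, inputs are flattened and joined into a 1-D array.
flat = np.append(batch, current_batch)
print(flat.shape)  # (5,)

# With axis=0, rows are stacked and the column dimension is preserved.
stacked = np.append(labels, current_labels, axis=0)
print(stacked.shape)  # (5, 1)

# Dropping axis on 2-D labels would silently lose the column dimension.
flattened_labels = np.append(labels, current_labels)
print(flattened_labels.shape)  # (5,)
```

np.append copies the whole array on every call, so collecting the pieces in a Python list and calling np.concatenate once at the end is usually cheaper for long loops.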

Create link prediction scoring funcs utility file

Creating this issue as part of the conversation in #105.

TODO - Add Link Prediction Scoring functions to Utility Script

  • Common Neighbors
  • Jaccard's Coefficient
  • Adamic-Adar
  • Preferential Attachment
  • See here for additional scoring functions worth considering, including:
    • Degree Product
    • Sorenson Similarity
    • Leicht-Holme-Newman Similarity
    • Shortest Path
    • Resource Allocation
    • Katz
    • SimRank
    • Rooted Page Rank

@vidarmehr - Does that cover everything?
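
As a starting point for the utility file, the first four scores can be sketched over a plain adjacency mapping. This assumes an undirected simple graph (no self-loops) and distinct query nodes; the function names and the toy edges are hypothetical.

```python
import math


def build_adjacency(edges):
    """Build an undirected adjacency mapping: node -> set of neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard_coefficient(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over shared neighbors."""
    # For u != v in a simple graph, a shared neighbor has degree >= 2,
    # so log(degree) is strictly positive.
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])


# Toy graph (made up): a and b share the neighbors c and d.
adj = build_adjacency([("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("a", "e")])
print(common_neighbors(adj, "a", "b"))
print(jaccard_coefficient(adj, "a", "b"))
print(adamic_adar(adj, "a", "b"))
print(preferential_attachment(adj, "a", "b"))
```

Keeping every score behind the same (adj, u, v) signature would let the remaining functions in the list be added without changing the calling code.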

remove/move test file

tests/test_n2v.py

  • this file is causing a travis error with the coverage because url.request is not working. The file is testing import functions that we want to move out of this project anyway. I would suggest we move the test elsewhere @vidarmehr

New branch?

@pnrobinson

Hi Peter,
Did you create a new branch called develop? I don't see it. Could you please check in the new branch?
Thank you.

How to format input data for GloVe

Hi everybody,
I am adapting a keras version of GloVe to use the tf keras (only small changes needed).
I am wondering if we want to have a standard way of passing the data to the models, and if we
are planning on changing anything in the present constructors for word2vec.
This is an excerpt of what the function looks like

v_size = 3000
tokenizer = Tokenizer(num_words=v_size, oov_token='UNK')
tokenizer.fit_on_texts(docs)

def generate_cooc_matrix(text, tokenizer, window_size, n_vocab, use_weighting=True):
    sequences = tokenizer.texts_to_sequences(text)

    cooc_mat = lil_matrix((n_vocab, n_vocab), dtype=np.float32)
    for sequence in sequences:
        for i, wi in zip(np.arange(window_size, len(sequence) - window_size), sequence[window_size:-window_size]):
            context_window = sequence[i - window_size: i + window_size + 1]
            distances = np.abs(np.arange(-window_size, window_size + 1))
            distances[window_size] = 1.0
            nom = np.ones(shape=(window_size * 2 + 1,), dtype=np.float32)
            nom[window_size] = 0.0

            if use_weighting:
                cooc_mat[wi, context_window] += nom / distances  # Update element
            else:
                cooc_mat[wi, context_window] += nom
    return cooc_mat

I am a little worried this might not be efficient enough for the datasets that we want to use, but hopefully we can speed things up iteratively

Various CI service integration

Currently, it is not possible to integrate the repository with various CI services to automate testing and code evaluation. How is the process of approval by the admins going?

I believe that anyone who is part of TheJacksonLaboratory organization on GitHub should be able to add the integrations. Maybe Peter could be added to it?

EarlyStopping

Consider using
keras.callbacks.EarlyStopping
However, since we are training in minibatches, there will be a lot of fluctuation so we need to have a long memory
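
The "long memory" idea can be illustrated without any framework: smooth the minibatch losses with a moving average and stop only after many steps without improvement. This is a sketch of the logic, not the Keras callback itself. The class name, the window and patience values, and the toy loss curve are all assumptions.

```python
from collections import deque


class SmoothedEarlyStopper:
    """Stop when the moving-average loss has not improved for `patience` steps."""

    def __init__(self, window=50, patience=200, min_delta=0.0):
        self.recent = deque(maxlen=window)  # last `window` minibatch losses
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.steps_without_improvement = 0

    def update(self, loss):
        """Record one minibatch loss; return True if training should stop."""
        self.recent.append(loss)
        smoothed = sum(self.recent) / len(self.recent)
        if smoothed < self.best - self.min_delta:
            self.best = smoothed
            self.steps_without_improvement = 0
        else:
            self.steps_without_improvement += 1
        return self.steps_without_improvement >= self.patience


# Toy loss curve: decreases for 100 steps, then plateaus with small jitter.
losses = [1.0 / (1 + step) for step in range(100)]
losses += [0.01 + 0.001 * (step % 2) for step in range(500)]

stopper = SmoothedEarlyStopper(window=10, patience=20)
stopped_at = None
for step, loss in enumerate(losses):
    if stopper.update(loss):
        stopped_at = step
        break
print(stopped_at)
```

With keras.callbacks.EarlyStopping, the corresponding knobs are the patience and min_delta arguments. Monitoring a per-epoch metric instead of per-batch loss is another way to reduce the fluctuation.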

travis complaining

Line 76 of test_word2vec.py is causing problems because Travis cannot find the input file. I switched the relative path to a calculated absolute path, but this did not fix the problem. I am unsure what the issue is. Does anybody have an idea?

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/monarch-initiative/N2V/tests/test_word2vec.py", line 76, in setUp
    training_graph = CSFGraph(pos_train)

  File "/home/travis/build/monarch-initiative/N2V/xn2v/csf_graph/csf_graph.py", line 18, in __init__
    raise TypeError("Could not find graph file {}".format(filepath))

TypeError: Could not find graph file /home/travis/build/monarch-initiative/N2V/data/ppismall/pos_train_edges

generate_hn2v_input_file "run pairdc" message

I'm trying to run the generate_hn2v_input_file.py script, and it triggers the following message:

You need to run pairdc and put its output file (e.g., g2d_associations_training_6_2014.tsv) into the data directory.
Could not find data/g2d_associations_training_4_2014.tsv

I know that I need to do something to the downloaded GTEx data file, but what is "pairdc"? From the printed message, I am unclear on how to complete the task. If this script is included in the package, I think this message might also be a source of confusion for end users. Perhaps we can add some additional detail to the message?

Codacy issue on assert tf.__version__ >= "2.0"

I created a branch called "check_tensorflow_version". I have added
assert tf.__version__ >= "2.0" to word2vec.py to check the version of tensorflow. When I create a pull request, codacy gives an issue:

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

B101: Test for use of assert
This plugin test checks for the use of the Python assert keyword. It was discovered that some projects used assert to enforce interface constraints. However, assert is removed with compiling to optimised byte code (python -o producing *.pyo files). This caused various protections to be removed. The use of assert is also considered as general bad practice in OpenStack codebases.

Please see https://docs.python.org/2/reference/simple_stmts.html#the-assert-statement for more info on assert

Does anyone know what the correct way of checking the version of tensorflow is? Can we just write 'tensorflow>=2.0' in setup.py in install_requires?
Thank you.
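
Listing 'tensorflow>=2.0' in install_requires is a valid requirement specifier and is enforced by pip at install time. If a runtime check is still wanted, two points apply: an explicit raise avoids the Codacy warning, and comparing version strings directly is lexicographic, so it misorders multi-digit components. Below is a minimal standard-library sketch. The helper names are hypothetical, and in practice the installed version would come from tf.__version__.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("Unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least(installed, minimum):
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def require_min_version(name, installed, minimum):
    """Raise RuntimeError (instead of using assert) if the version is too old."""
    if not is_at_least(installed, minimum):
        raise RuntimeError(
            "%s >= %s is required, found %s" % (name, minimum, installed)
        )


# String comparison is lexicographic, so "10.0" sorts before "2.0".
print("10.0" >= "2.0")             # False
print(is_at_least("10.0", "2.0"))  # True

# In word2vec.py this would be called with tf.__version__ as `installed`.
require_min_version("tensorflow", "2.1.0", "2.0")
```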

Graph class too dependent on node names

Right now the graph class expects the class of each node to be encoded in the first character of its name. We should consider keeping that for now, but deactivating the heterogeneous algorithm if any node name does not start with a letter.

Adding type checking to tests

Wanted to get your thoughts on using mypy or pytype. pytype is from Google and, while mentioned in the Style Guide, is not explicitly stated as the only type checker that can be used.

mypy: Once we have added type hints to all scripts, we could extend the existing tests to include type checking. The documentation seems pretty straightforward (example from RealPython):

# headlines.py

def headline(text: str, align: bool = True) -> str:
    if align:
        return f"{text.title()}\n{'-' * len(text)}"
    else:
        return f" {text.title()} ".center(50, "o")

print(headline("python type checking"))
print(headline("use mypy", align="center"))

$ mypy headlines.py
headlines.py:10: error: Argument "align" to "headline" has incompatible
                        type "str"; expected "bool"

Add better support for node types

@pnrobinson I think we're near the point where we can test some KGs emitted by kg_covid_19.

These obviously will be heterogeneous graphs though, so we probably should make a decision about node types - i.e. how to refactor node types to not rely on the first character of the node name (node[0]). (See also #26)

A proposal:

  • make a node class with an attr called node_type.
  • change CsfGraph to optionally accept another file nodes.tsv with two columns:
    node\tnode_type
    and use the second column to populate node_type in the node class. Nodes not mentioned in this file are set to node_type None, and all nodes are None if the file is not present.
  • refactor all methods that deal with node types (e.g. __preprocess_transition_probs_xn2v()) to look for this info in the node class

Peter, others, thoughts?
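
To make the proposal concrete, here is a rough sketch of the optional nodes.tsv lookup. It assumes a two-column, tab-separated file with no header row, and it uses a plain dict in place of the proposed node class. The function names and the sample rows are hypothetical.

```python
import csv
import io


def load_node_types(handle):
    """Read `node<TAB>node_type` rows from an open file into a dict."""
    node_types = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        node = row[0]
        node_type = row[1] if len(row) > 1 and row[1] else None
        node_types[node] = node_type
    return node_types


def node_type_of(node, node_types=None):
    """Return the node's type, or None if no file was given or it is unlisted."""
    if node_types is None:
        return None
    return node_types.get(node)


# Stand-in for an optional nodes.tsv file (contents are made up).
nodes_tsv = io.StringIO("g1\tgene\nd1\tdisease\np1\tprotein\n")
types = load_node_types(nodes_tsv)
print(node_type_of("g1", types))  # gene
print(node_type_of("x9", types))  # None
print(node_type_of("g1"))         # None
```

Routing every type lookup through one accessor like node_type_of would let the methods that currently inspect node[0] switch over without knowing where the type information comes from.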

Logging

More housekeeping 😄

@justaddcoffee - do you know if there is a consensus about how to handle within-file logging? I have noticed that some files have it, some don't, and others have it, but it's currently commented out.

How would you like me to proceed going forward?

started to implement CBOW

The size of stacked_embeddings is causing an error (word2vec.py, line 418).
Something seems to differ between the TF 1 and TF 2 APIs.

generate random negative edges

I think the generate random negative edges method does not generate completely "random" edges. IT.combination generates all combinations of edges, but I am not sure how random the edges we get are.
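
If IT refers to itertools, note that itertools.combinations yields pairs in a fixed lexicographic order, so taking the first k pairs favors nodes that come early in the input. One alternative is rejection sampling of uniformly drawn node pairs. The sketch below assumes an undirected graph whose positive edges all connect nodes in the given node list and contain no self-loops; the function name and toy data are hypothetical.

```python
import random


def sample_negative_edges(nodes, positive_edges, n_samples, seed=0):
    """Draw distinct node pairs uniformly at random that are not known edges."""
    nodes = sorted(nodes)
    # Store undirected edges as frozensets so (u, v) and (v, u) match.
    known = {frozenset(edge) for edge in positive_edges}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(known)
    if n_samples > max_negatives:
        raise ValueError("Requested more negative edges than exist")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_samples:
        u, v = rng.sample(nodes, 2)  # two distinct nodes, so no self-loops
        pair = frozenset((u, v))
        if pair not in known:
            negatives.add(pair)
    return [tuple(sorted(pair)) for pair in negatives]


# Toy graph (made up): six nodes and three positive edges.
positives = [(0, 1), (1, 2), (2, 3)]
negatives = sample_negative_edges(range(6), positives, n_samples=5)
print(negatives)
```

Rejection sampling is efficient when the graph is sparse, which is the usual case for these networks. For dense graphs, enumerating the non-edges and shuffling them would be the safer choice.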

key error when running runLinkPrediction_ppi.py

running runLinkPrediction_ppi.py on edges generated from the first 10k lines of the String DB, I'm getting a key error here:

https://github.com/monarch-initiative/N2V/blob/e159e15bedc89dacc5e8e30a2e53f8bfc8586daf/xn2v/hetnode2vec.py#L62

Walk iteration:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1415, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/jtr4v/PycharmProjects/N2V/runLinkPrediction_ppi.py", line 144, in <module>
    main(args)
  File "/Users/jtr4v/PycharmProjects/N2V/runLinkPrediction_ppi.py", line 137, in main
    walks = pos_train_g.simulate_walks(args.num_walks, args.walk_length)
  File "/Users/jtr4v/PycharmProjects/N2V/xn2v/hetnode2vec.py", line 85, in simulate_walks
    walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))
  File "/Users/jtr4v/PycharmProjects/N2V/xn2v/hetnode2vec.py", line 63, in node2vec_walk
    walk.append(cur_nbrs[self.alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])
KeyError: 4285

Seems like a bug - cur is an integer, but alias_nodes is a dict with gene IDs as keys, so I don't understand how alias_nodes[cur] on line 63 will ever work. (Possibly I'm running the code incorrectly though...)

Utility functions?

Should we add a file with various utility functions for working with the embeddings?
For instance, something like this to get the most similar words?

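import numpy as np

def get_cosine_sim(emb, valid_words, top_k):
    # L2-normalize each embedding row so dot products equal cosine similarity
    norm = np.sqrt(np.sum(emb**2, axis=1, keepdims=True))
    norm_emb = emb / norm
    # rows for the query words, compared against every word in the vocabulary
    in_emb = norm_emb[valid_words, :]
    similarity = np.dot(in_emb, np.transpose(norm_emb))
    # sort by descending similarity; skip column 0, which is normally the word itself
    sorted_ind = np.argsort(-similarity, axis=1)[:, 1:top_k + 1]
    return sorted_ind, valid_words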

xn2v_parser

Remove this class -- it does not make sense to have it here. We still have the code in the original IDG2KG management repository

SkipGram fails on embedding a protein-protein interaction file

I created a new branch called link_prediction_ppi in which I am testing graph embedding and link prediction on a protein-protein interaction graph. As a first step, I just tried to embed the training graph with SkipGram. The algorithm fails, while it runs successfully on the karate.train graph.
The error is:

Traceback (most recent call last):
File "/Users/ravanv/PycharmProjects/N2V/runLinkPrediction_ppi.py", line 52, in
model.train(display_step=2)
File "/Users/ravanv/PycharmProjects/N2V/xn2v/word2vec.py", line 414, in train
batch_x, batch_y = self.next_batch_from_list_of_lists(walkcount, self.num_skips, self.skip_window)
File "/Users/ravanv/PycharmProjects/N2V/xn2v/word2vec.py", line 377, in next_batch_from_list_of_lists
current_batch, current_labels = self.next_batch(sentence, batch_count, num_skips, skip_window)
File "/Users/ravanv/PycharmProjects/N2V/xn2v/word2vec.py", line 341, in next_batch
batch[i * num_skips + j] = buffer[skip_window]
ValueError: invalid literal for int() with base 10: 'ENSP00000264028'

Does anyone know what causes this issue? In karate.train, nodes are integers, but in the ppi graph, nodes are Ensembl IDs like ENSP00000264028. Do you think that might have caused the error?
Thank you.
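
The traceback suggests that batch is an integer array, so assigning a string identifier triggers the failed int() conversion. A common workaround is to map node names to integer indices before building batches and to map back afterwards. This is a minimal sketch with hypothetical helper names; apart from the identifier taken from the traceback, the node names are made up.

```python
def build_node_index(walks):
    """Assign a stable integer index to every node name seen in the walks."""
    node_to_index = {}
    for walk in walks:
        for node in walk:
            if node not in node_to_index:
                node_to_index[node] = len(node_to_index)
    index_to_node = list(node_to_index)  # position i holds the name for index i
    return node_to_index, index_to_node


def encode_walks(walks, node_to_index):
    """Replace node names with their integer indices."""
    return [[node_to_index[node] for node in walk] for walk in walks]


# The first identifier is from the traceback; the others are placeholders.
walks = [
    ["ENSP00000264028", "ENSP_B", "ENSP_C"],
    ["ENSP_C", "ENSP00000264028"],
]
node_to_index, index_to_node = build_node_index(walks)
encoded = encode_walks(walks, node_to_index)
print(encoded)  # [[0, 1, 2], [2, 0]]
```

Keeping index_to_node alongside the embedding matrix lets row i of the embeddings be reported under its original identifier.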

Absolute paths in code relative to Peter's computer

Hello,

I was trying to base some tests on the run* files in the root of the repository, and one of them, runDiseaseLinkPrediction.py, contains absolute paths to some files. Are these files too big to be in the repository? How big are they? Is there an equivalent in the test repository?

Another absolute path is present in runDiseaseGeneEmbedding.py.
