opentreeoflife / treemachine

Source tree graph database

License: Other

Python 11.28% Shell 0.56% Java 87.92% Makefile 0.23%

treemachine's Introduction

treemachine-LITE

Description

treemachine-LITE is a pared-down version of the original treemachine, which was used to generate synthetic phylogenetic trees for the Open Tree of Life project. Synthetic analyses are now performed by other tools. The role of treemachine-LITE is simply to construct a neo4j database and serve such trees.

Installing dependencies

treemachine-LITE is managed by Maven v.3 (including the dependencies). In order to compile and build treemachine-LITE, it is easiest to let Maven v.3 do the hard work.

maven

On Debian-based Linux distributions you can install Maven v.3 with:

sudo apt-get install maven

On Mac OS, Maven v.3 can be installed with Homebrew:

brew install maven

jade and ot-base

Once Maven v.3 is installed, the treemachine-LITE dependencies themselves can be installed with:

sh mvn_install_dependencies.sh

neo4j

The DB constructed by treemachine-LITE is meant to be served by neo4j. We are currently using neo4j-community-v1.9.5. To obtain and decompress neo4j:

$ curl http://files.opentreeoflife.org/neo4j/neo4j-community-1.9.5-unix.tar.gz > neo4j-community-1.9.5.tar.gz
$ tar xzf neo4j-community-1.9.5.tar.gz

Alternatively, there is a make target for neo4j:

make neo4j

You can move the neo4j directory wherever you like.

Compiling treemachine-LITE

NOTE: The script for compiling the server plugins will delete any command-line jar in the target directory (and the opposite is true - compiling the command-line jar will delete the server plugins from the target directory). You can rebuild either just by running those scripts again.

Command-line version

To compile the command-line version (which you can then use to build a database):

sh mvn_cmdline.sh

This creates treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar in the target directory (and deletes the server plugin jar if it existed).

Server plugins

To compile the server plugins for interacting with the graph over REST calls:

sh mvn_serverplugins.sh

This creates treemachine-neo4j-plugins-0.0.1-SNAPSHOT.jar in the target directory (and deletes the command-line jar if it existed). Then, copy the .jar file into the neo4j plugins directory:

cp -p target/treemachine-neo4j-plugins-0.0.1-SNAPSHOT.jar "$NEO4J_HOME/plugins"

Usage

Constructing a DB

First, compile the command-line version (see above). Then, to build the DB:

java -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar ingestsynth newick_tree json_annotations tsv_taxonomy DB_name

where:

newick_tree is the labelled_supertree/labelled_supertree.tre from the synthesis procedure
json_annotations is the annotated_supertree/annotations.json from the synthesis procedure
tsv_taxonomy is the taxonomy.tsv file from the Open Tree of Life Taxonomy (OTT)
DB_name is the name for the directory that will hold the generated DB (you choose)

Serving a DB with Neo4j

Compile the server plugins (see 'Server plugins' above). Before starting neo4j, you will need to modify the file $NEO4J_HOME/conf/neo4j-server.properties, where NEO4J_HOME is the directory in which you unpacked neo4j. Set org.neo4j.server.database.location to the full path of the DB directory constructed by treemachine-LITE.

After you have loaded content into your DB, you can run the neo4j HTTP server with the command:

neo4j-community-1.9.5/bin/neo4j start

If you add the neo4j-community-1.9.5/bin directory to your path, you can just say neo4j start.

Running the tests

To make sure everything is running ok, run the web service tests:

cd ws-tests
./run_tests.sh host:apihost=http://localhost:7474 host:translate=true


treemachine's Issues

Add ingroup node from nexson to jadetree

Add new nexson metadata property (id ot:focalClade) to jadetree.

Add boolean "remap" option to treemachine for pgloadind

When loading a study into treemachine via pgloadind, add an option to disregard the existing taxon mapping, so that treemachine attempts the mapping itself using the existing TNRS functionality.

Allow treemachine to accept "ot:ottolid" or "ot:ottId"

For compatibility of new and old nexson vocabularies.

add node id to treeview arguson

Method for reading Nexson and adding to graph

Basically we need to connect NexsonReader to MainRunner.

The Nexson file could come either from the file system, if some independent process has fetched it from phylografter into a local file, or it could be fetched directly from phylografter using the web service - this is an undecided design question. I would say that for debugging purposes doing the local file would be the best first step since we'll have files available for testing in advance of deployment of the service in phylografter.

This is downstream of issue #7

add source tree taxa to metadata node

make sure that the original source tree taxa long array is added as a property to the metadata node.

pull nexson trees from phylografter and add to graph

need to be able to pull nexson trees from phylografter so they can be added to graph. this will be a server plugin.

taxonomy elements that trigger the "-bad path branch"

I'm not sure if this is an issue or not, but here is what I'm seeing...
In TreeLoader.addAdditionalTaxonomyTableIntoGraph there is a branch of code to deal with taxa that have an array of parents that eventually ends in null (around lines 292-308 in commit 3f88466).

It looks like the intent is to avoid adding disconnected nodes to the graph.
Question 1: if the taxon is part of a bad path, this logic will keep the TAXCHILDOF relationship from being made between the taxonomy nodes. Do we also want to go back and remove the nodes from the graph? (currently they are left disconnected from the rest of the taxonomy).

Question 2: I don't understand how the ancestral nodes in these "bad paths" ever get TAXCHILDOF relationships. But on the dev server, some of them at least do have relationships (e.g. search for the name "saltmarsh%20clone%20ensemble" in the taxNamedNodes index). On my local instance, after running inittax on COL and addtax for NCBI, this node exists in the graph but has no relationships. Perhaps the relationships on the dev server are being created after other treemachine commands. Do we want to standardize the behavior of these nodes?

Nexson import of ot:dataDeposit property

I fixed a problem NexsonReader.java had with reading in "@href" property values. Now it is stuck on "ot:dataDeposit". Specifically:

Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String

I don't have an example file with this field handy. I will try to get back to this in a few hours.

Knuckles in extracted synthetic tree

Synthetic tree retains "knuckles" or "knees" (nodes with a single child) present in taxonomy. These are not handled by any analyses wanting to use the trees (although viewers like Dendroscope and FigTree can display them). For now, just use scripts to remove these. Eventually this should be an export option.
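As a rough illustration of what such a removal script has to do, here is a minimal sketch that collapses every single-child node into its child. The class KnuckleSketch is a hypothetical stand-in, not treemachine's JadeNode; dropping the knuckle's own label and keeping the child's is an assumption, and branch lengths and Newick quoting are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node; NOT the treemachine JadeNode class.
final class KnuckleSketch {
    final String name;
    final List<KnuckleSketch> children = new ArrayList<>();

    KnuckleSketch(String name, KnuckleSketch... kids) {
        this.name = name;
        for (KnuckleSketch k : kids) {
            children.add(k);
        }
    }

    // Returns a copy of the subtree in which every single-child node
    // (a "knuckle") has been replaced by its child.
    static KnuckleSketch removeKnuckles(KnuckleSketch node) {
        while (node.children.size() == 1) {
            node = node.children.get(0); // assumption: the knuckle's label is dropped
        }
        KnuckleSketch copy = new KnuckleSketch(node.name);
        for (KnuckleSketch child : node.children) {
            copy.children.add(removeKnuckles(child));
        }
        return copy;
    }

    // Writes a bare Newick-like string (no branch lengths, quoting, or trailing semicolon).
    String toNewick() {
        if (children.isEmpty()) {
            return name;
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(children.get(i).toNewick());
        }
        return sb.append(')').append(name).toString();
    }

    public static void main(String[] args) {
        KnuckleSketch tree = new KnuckleSketch("life",
            new KnuckleSketch("A"),
            new KnuckleSketch("K",                      // knuckle: exactly one child
                new KnuckleSketch("G",
                    new KnuckleSketch("B"),
                    new KnuckleSketch("C"))));
        System.out.println(tree.toNewick());                 // (A,((B,C)G)K)life
        System.out.println(removeKnuckles(tree).toNewick()); // (A,(B,C)G)life
    }
}
```

Whether the parent's or the child's label should survive is a policy decision for the real export option; the sketch only shows the traversal.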

run_example.sh fails with java.util.NoSuchElementException

This is from a clean installation according to the README. Suggestions?

$ ./run_example.sh 
+ java -Dlog4j.configuration=debuglog4j.properties -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar inittax example/ncbi_primates.txt test.db
things will happen here
initializing taxonomy from example/ncbi_primates.txt to test.db
Exception in thread "main" java.util.NoSuchElementException: the taxonomy file appears to be missing some fields.
    at opentree.GraphInitializer.processTaxInputLine(GraphInitializer.java:275)
    at opentree.GraphInitializer.addInitialTaxonomyTableIntoGraph(GraphInitializer.java:182)
    at opentree.MainRunner.taxonomyLoadParser(MainRunner.java:71)
    at opentree.MainRunner.main(MainRunner.java:1730)
+ exit

should return the source tree's metadata when we ask for that tree

I can try to work on this later today...

update synthesis filters to have explicit include/exclude

Currently not an option. Functionally this is not much of a limitation (just reverse the filter criteria), but for the sake of clarity, we should require the filter direction to be explicit. Should be a minor change, but one that requires some architecting.
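One possible shape for this, sketched under assumptions: the caller must pass a direction alongside the criterion, so no filter has an implicit meaning. ExplicitFilterSketch and its Mode enum are hypothetical names, not existing treemachine classes, and the study years in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; not part of the treemachine code base.
final class ExplicitFilterSketch {
    enum Mode { INCLUDE, EXCLUDE }

    // Keeps items that match the criterion (INCLUDE) or that do not match it (EXCLUDE).
    static <T> List<T> apply(List<T> items, Predicate<T> criterion, Mode mode) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            boolean matches = criterion.test(item);
            if (matches == (mode == Mode.INCLUDE)) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> years = Arrays.asList(2005, 2010, 2013); // made-up study years
        System.out.println(apply(years, y -> y >= 2010, Mode.INCLUDE)); // [2010, 2013]
        System.out.println(apply(years, y -> y >= 2010, Mode.EXCLUDE)); // [2005]
    }
}
```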

getSyntheticTree arguson wrapped in list

This is mostly a question for Jim, I think.

The arguson from the getSyntheticTree service is wrapped in a list, but it seems like there is only one thing in that list, which is the entire rest of the json output. It looks something like this:

[
{ "name": "life", }
]

Is this correct? If it's just a fluke, I can easily change it now to not be contained in a list. Otherwise I will preserve this.

Make service to return identifying info about current draft tree

From Jim: "it would be great to have a simple treemachine service that just returns the latest ("current") synthetic tree id and its node id for 'life'"

Rename ISCALLED to something else

I brought this up before, but ... any chance of getting ISCALLED changed to something else?

(1) A taxon is a class of organisms, not a name for one.
(2) The taxon and the corresponding phylogenetically defined group may be mostly coincident but will never be exactly coincident.

Let A = a phylogenetically defined group = the MRCA of some organisms/groups together with all of the MRCA's descendants
Let B = a taxonomically defined group, according to some circumscription, perhaps based on physical characters (and known, one would hope, by whomever coined or meaningfully used the taxon name; just not known to us)
We are talking about the name for the relationship between A and B.

Semantically the relationship is that A is a subclass of B, i.e. all members of A are members of B.

The taxon B will in general be a proper superclass of A, not identical to A. If you consider ancestral organisms, B (if defined by characters and monophyletic) will have emerged historically before the point of bifurcation that defines A, so for some time there would have been Bs that were not As. It is also easy to imagine (say) genera that contain extant species that simply didn't participate in a phylogenetic analysis; A = the MRCA of all the species that were surveyed is only a subclass of B = the genus, not the same as the genus.

Maybe BELONGSTO, or NAMEDTAXON, etc. Not sure what to suggest.

This may never be seen in the UI or documentation, but it does show up in the code, and people will read the code.

store metadata in SYNTHCHILDOF relationships

supporting phylografter study ids for each relationship

Service to return list of the studies that were input to synthesis

It would be useful to have a service that returns, given the name of a synthetic tree, the list of studies (and trees too) that contributed to it.

Applications:

(1) Provide feedback to curators, who are often left wondering which of their studies made the cut.

(2) Needed for credits page, so that we can make an accurate bibliography. (DOIs would be useful for this purpose, but not essential since we can get them from Phylografter.)

It occurs to me it might also be useful to return, for each tree, the id of the root node of that tree, so that you can go from browsing a list of studies to browsing the synthetic tree or particular tree. But maybe that should be a separate service.

Many "isolated" taxa in treemachine (bad Lucene index?)

There are currently lots of taxa that appear alone in argus -- no ancestors or descendant nodes, just one lonely dot. For example, search for Hominidae.

Most nodes close to the root (life) node are intact, eg,
http://dev.opentreeoflife.org/opentree/otol.draft.22@3103299/Metazoa-Monosiga-ovata

...but treemachine fails when retrieving a subtree for Chordata:
http://dev.opentreeoflife.org/opentree/otol.draft.22@417744/Chordata

The error message describes a Lucene error, where multiple items are found in the index where one is expected.

It appears that all descendant nodes (within the clade Chordata) are isolated, e.g.:
http://dev.opentreeoflife.org/opentree/argus/ottol@947318/Craniata

JSON string export - are we asking for trouble?

We seem to just be composing JSON by hand, which is typically easy. We could use something like: http://jettison.codehaus.org/apidocs/org/codehaus/jettison/json/JSONStringer.html

In many cases we know that the strings are safe for JSON (in which case it is probably faster to not check to see if we need to escape them). I may be the most guilty party here, as I just worked on spitting out metadata for studies. We could easily have quotes or unicode in those strings associated with references. In which case my code would generate bogus JSON.

That data came from phylografter via JSON, so I'm hoping that it is clean.

Deciding that the db will hold strings that are already escaped for JSON would make life easier. Have we decided that? (if so are we checking this on ingesting of trees?).

Broken link to Treemachine Wiki in readme

https://github.com/OpenTreeOfLife/opentree-treemachine/wiki
should be
https://github.com/OpenTreeOfLife/treemachine/wiki

What is JadeNode.number for?

It's not clear from the documentation in JadeNode.java what the semantics of "number" is intended to be. Some explanation would be helpful.

make plugin to export JSON for argus to view the synthetic tree

Things the plugin should export in the JSON

phylografter study ids for studies supporting each relationship (ot:studyId)
doi for study linkout (ot:studyPublication)
treebase id for treebase linkout (ot:dataDeposit)

java.lang.OutOfMemoryError with large phylografter tree import

Tested upgrade on treemachine end. Big studies (e.g. 438) coming through without hanging or timing out.

On treemachine end, currently have to up the memory (e.g. -Xmx2g) or suffer:

java.lang.OutOfMemoryError: Java heap space

Not sure if there is a leak, etc. Will look into this.

Add taxonomy ids to the treeview arguson output

From JAR: The taxonomy source info (NCBI id and/or GBIF id) has to be made part of the appropriate web service response i.e. when Argus asks treemachine for 'arguson' for a part of the tree to display. The info is already there as a property of the node, I believe, it just needs to be sent to Argus.

https://trello.com/card/treemachine-side-support-for-links-from-taxon-nodes-to-ncbi-and-or-gbif/4fb665cce706648863c3cc90/96

Secure neo4j

Currently access on port 7474 lets you do anything you want to the graph db: add records, add and delete indexes, etc. That's probably fine for as long as we're in development but before an alpha release this needs to be tightened up.

The easiest way to do this would be using Apache ProxyPass, which, if Apache and neo4j were running on the same physical server, would have negligible overhead. Safe URIs could be web-public, with others limited to localhost. See also http://docs.neo4j.org/chunked/stable/security-server.html

Parse Nexson study file

There has to be a way to read in a phylogenetic tree coming from Phylografter and represented in Nexson (i.e. NeXML in JSON) format.

[Renamed issue, was "Nexson tree import"]

deal with nexson otus that don't have ottol ids before adding to graph

We need to be able to deal with these before they are added to the graph. The idea would be to:

(1) TNRS the names that are there using taxomachine
- For ones that match that weren't matched before, we can assign the ottol id

(2) For remaining names, if there are any, retrieve the scope determined by the TNRS
- Add these remaining names as names subtending the scope name to the taxonomy WITHIN treemachine
- Add these new names to an index that identifies these as new names and give a temp ottol id. This index will serve as an easy way to update once we have new official ids from taxomachine and will serve to easily update taxomachine

Pruning overlapping tips

A problem with pruning overlapping taxa. Occurs when an incompletely specified taxon (e.g. Foo sp.) is mapped to the genus, but other congeners are also present in the tree. In this case, we want to prune the taxon mapped to the genus, and retain all others.

Here is an example of how it shouldn't behave:

pgloadind overlapping tips
pgloadind overlapping retained name "Leiognathus" nexsonid "node1057971"
pgloadind overlapping pruned name "Leiognathus striatus" nexsonid "node1057972"

Where "Leiognathus striatus" is a valid taxon.

Instead, it should behave like:

pgloadind overlapping tips
pgloadind overlapping retained name "Equulites stercorarius" nexsonid "node1057988"
pgloadind overlapping pruned name "Equulites" nexsonid "node1057993"
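The intended rule can be sketched as: a tip is pruned when it is a taxonomic ancestor of another tip in the same tree. This is only an illustration under assumptions, not the pgloadind implementation; OverlapPruneSketch is a hypothetical name, the taxonomy is modelled as a plain child-to-parent map of names (assumed acyclic), and the family name in the example map is supplied purely for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; the real code works on graph nodes, not name strings.
final class OverlapPruneSketch {

    // parentOf maps each taxon name to its parent taxon name (assumed acyclic).
    // Returns the tips that are ancestors of other tips and should be pruned.
    static Set<String> tipsToPrune(Collection<String> tips, Map<String, String> parentOf) {
        Set<String> tipSet = new HashSet<>(tips);
        Set<String> prune = new TreeSet<>();
        for (String tip : tips) {
            // Walk from this tip toward the root; any ancestor that is itself a tip overlaps.
            String ancestor = parentOf.get(tip);
            while (ancestor != null) {
                if (tipSet.contains(ancestor)) {
                    prune.add(ancestor);
                }
                ancestor = parentOf.get(ancestor);
            }
        }
        return prune;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Equulites stercorarius", "Equulites");
        parentOf.put("Equulites", "Leiognathidae");
        parentOf.put("Leiognathus striatus", "Leiognathus");
        parentOf.put("Leiognathus", "Leiognathidae");

        List<String> tips = Arrays.asList(
            "Equulites stercorarius", "Equulites", "Leiognathus striatus");
        // The genus-level tip is pruned; both species-level tips are retained.
        System.out.println(tipsToPrune(tips, parentOf)); // [Equulites]
    }
}
```

With tips "Leiognathus" and "Leiognathus striatus", the same rule prunes "Leiognathus" and retains the species, which is the opposite of the faulty log above.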

Correct JSON export for getStudyIngestMessagesForNexSON

A service in the GoLS plugin class, used for exporting study readiness metadata.

Currently this class double-wraps a JSON string and exports it. It should serialize the JSON properly to avoid encoding problems and the need to post-process all the escaped characters.

One option is the RepresentationConverter classes.

This issue split off of #32.

Update README and provide an example script

README and run_example.sh are out of date. They use old syntax for inittax and use addtree rather than addnexson.
See #54

regression of getSourceTreeIDs service

I'm getting HTTP error 500 from POST to /db/data/ext/GoLS/graphdb/getSourceTreeIDs

I'm probably the culprit. I'll tackle it today...

Correct JSON export for arguson treeview plugins

Currently this builds the JSON strings by hand and then double-JSON-izes them to export. We can correct this by defining RepresentationConverter methods that will serialize the JSON during export.

Not a high priority.

What is contained in the "jsonprint" attached Objects for JadeNodes?

I came across this when converting the json serialization methods. There is a passage in JadeNode.getJSON() method where an object called "jsonprint" that is attached to a JadeNode is concatenated onto the JSON as it is built. This code has been copied into the ArgusonRepresentationConverter.getArgusonRepresentationForJadeNode() method but is currently commented out, since it was not being added as an element with a value (presumably it is pre-constructed and pre-formatted JSON?). It can be added back if necessary but not sure how best to do that for now.

add mrp matrix dump for supertree building

Filter nexson trees

Skip trees flagged as deprecated in the nexson passed from phylografter. Regards OpenTreeOfLife/phylografter#64

JadeNode.getJSON could be significantly more efficient

It's using concatenation += to build the result string, meaning a copy of the string is made for every token added. Running time could be linear instead of quadratic by using a StringBuffer or StringWriter to build up the result.
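A minimal sketch of the suggested change, assuming nothing about the real JadeNode beyond a name and a list of children: one StringBuilder is created at the top and threaded through the recursion, so each token is appended once. SketchNode is a hypothetical stand-in class, and names are written without escaping to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JadeNode; only a name and children are modelled.
final class SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name) {
        this.name = name;
    }

    SketchNode addChild(SketchNode child) {
        children.add(child);
        return this;
    }

    // Entry point: allocate a single builder and share it across the whole traversal.
    String toJson() {
        StringBuilder sb = new StringBuilder();
        appendJson(sb);
        return sb.toString();
    }

    // Appends this subtree to the shared builder; no intermediate strings are created.
    // NOTE: name is not escaped here; real code must escape it.
    private void appendJson(StringBuilder sb) {
        sb.append("{\"name\":\"").append(name).append('"');
        if (!children.isEmpty()) {
            sb.append(",\"children\":[");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                children.get(i).appendJson(sb);
            }
            sb.append(']');
        }
        sb.append('}');
    }

    public static void main(String[] args) {
        SketchNode root = new SketchNode("life")
            .addChild(new SketchNode("Metazoa"))
            .addChild(new SketchNode("Fungi"));
        // prints {"name":"life","children":[{"name":"Metazoa"},{"name":"Fungi"}]}
        System.out.println(root.toJson());
    }
}
```

StringBuilder is used rather than StringBuffer because no synchronization is needed within a single call.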

Service to return list of the studies that were input to synthesis

Need a table or JSON structure that has an entry for each study that was input to the synthesis of any tree in the graph db.

This is needed in order to generate a bibliography or 'credits' page for the web site.

Each entry needs at least the study id. If more information, such as the DOI, is handy, it would be convenient to have that as well. (Other information can be obtained from Crossref or Phylografter as needed.)

Bad JSON for node Aves (due to un-encoded quotes in string value)

Opentree web-app can't show Aves properly:

http://dev.opentreeoflife.org/opentree/argus/otol.draft.22@474108/Aves

The error is from a string value (a study reference) with quotes in it, and they weren't properly encoded:

"S. J. Hackett, R. T. Kimball, S. Reddy, R. C. K. Bowie, E. L. Braun, M. J. Braun, J. L. Chojnowski, W. A. Cox, K.-L. Han, J. Harshman, C. J. Huddleston, B. D. Marks, K. J. Miglia, W. S. Moore, F. H. Sheldon, D. W. Steadman, C. C. Witt, T. Yuri. 2008. "A Phylogenomic Study of Birds Reveals Their Evolutionary History". Science. 320(5884). 1763-1768."

Since the JSON sent to the browser is encoded, these quotes would need to be double-encoded, travelling in the response as \\", instead of \". (There are many gotchas like this, so it's almost never a good idea to build JSON from scratch.) It looks like JadeNode.getJSON uses a semi-manual approach that doesn't properly handle quotes within strings:

https://github.com/OpenTreeOfLife/treemachine/blob/cc452aa9e19addf90ce57be806e03970107c3fb5/src/main/java/jade/tree/JadeNode.java

As a workaround, consider adding a method (or enhancing JSONExporter.escapeString) to encode quotes within a string, and call this whenever we're adding a string whose content might include quotes or other problematic characters (colon, others?).
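A sketch of what such an escaping method could look like. JsonEscapeSketch is a hypothetical class, not the existing JSONExporter.escapeString. It escapes the characters the JSON grammar requires inside a string: the double quote, the backslash, and control characters below U+0020. A colon needs no escaping inside a JSON string. This covers one level of encoding only; a proven serializer remains the safer choice.

```java
// Hypothetical helper; not the existing JSONExporter.escapeString.
final class JsonEscapeSketch {

    // Returns s with the characters that are illegal inside a JSON string escaped.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the six-character \ uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String reference = "2008. \"A Phylogenomic Study of Birds\". Science.";
        // The embedded quotes come out preceded by a backslash.
        System.out.println("{\"reference\":\"" + escape(reference) + "\"}");
    }
}
```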

Regarding previous discussion in #32, #27, #5: I'm definitely a fan of using a proven JSON serializer rather than attempting "manual" JSON.

Is the GetJsons plugin in treemachine obsolete?

This class appears to be a relict of a code transfer with taxomachine, and contains services that appear to be entirely unused (and some only partially functional?). The corresponding services all exist within taxomachine, where they are functional.

If this class is obsolete then I will remove it from treemachine.

Also mentioned in #32.

Incomplete installation instructions

In the instructions after sh mvn_cmdline.sh
which yielded "BUILD SUCCESS" I got stuck here:
java -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar
There is no such jar file.
There is a jar file in target/ but it isn't runnable ("no main manifest attribute").

Then the instructions say to run neo4j, but didn't say previously that neo4j had to be installed.

ingest treebase id into neo4j

get subtree for taxon list

to be ported/refactored from taxomachine

Create Travis CI Config + Basic Tests

This setup will use Maven to install the necessary dependencies and run some basic tests that verify treemachine compiles and isn't completely borked.

LICENSE file missing

What is the license of treemachine?

Synthesis aimed directly at posterior distribution of trees

Add synthesis methods aimed specifically at tree distributions where taxon sampling is completely overlapping, i.e. posterior distributions (and possibly bootstrap samples). Methods such as concordance trees, agreement subtrees, etc.

nested input taxa

Need to raise an exception for nested taxon names in the input source trees. For example, (Homo, Homo_sapiens) needs to be rejected because Homo_sapiens is nested within Homo, yet the two are sisters in the tree.

Store jadetree metadata (from nexson) in graph

Nexson trees successfully read in, including metadata. Need to add these properties to the graph.

Should ot:studyYear be stored as a Long (it seems like it should be an int)

Is this correct? I had expected it to be an integer but I get ClassCastException errors when trying to cast it as an Integer; it says it is a Long.

For now I will treat this as a Long. It is easy to switch if it is changed to an integer (there is an enum).
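For reference, a small sketch of a type-tolerant way to read such a value. It assumes the property arrives as some java.lang.Number (many JSON parsers produce Long for integer literals, which is why a direct cast to Integer fails). StudyYearSketch is a hypothetical helper, not existing treemachine code.

```java
// Hypothetical helper; not part of the treemachine code base.
final class StudyYearSketch {

    // Accepts whatever numeric type the parser produced (Long, Integer, ...)
    // and converts it to int, failing loudly if the value does not fit.
    static int yearAsInt(Object value) {
        if (value instanceof Number) {
            return Math.toIntExact(((Number) value).longValue());
        }
        throw new IllegalArgumentException("ot:studyYear is not numeric: " + value);
    }

    public static void main(String[] args) {
        Object fromParser = Long.valueOf(2008L); // what a Long-producing parser would hand back
        System.out.println(yearAsInt(fromParser)); // 2008
    }
}
```

Going through Number avoids the ClassCastException regardless of whether the stored type is later switched between Long and Integer.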

Internal Server Error: More than one element in ...

I'm getting errors from http://dev.opentreeoflife.org/opentree/argus/otol.draft.22@434521/Homininae
and others:

Whoops! The call to get the tree around a node did not work out the way we were hoping it would. Show details

[error] Internal Server Error

{ "message" : "More than one element in org.neo4j.index.impl.lucene.LuceneIndex$1@6bcca3da. First element is 'Node[3103424]' and the second element is 'Node[3103423]'", "exception" : "NoSuchElementException", "stacktrace" : [ "org.neo4j.helpers.collection.IteratorUtil.singleOrNull(IteratorUtil.java:118)", "org.neo4j.index.impl.lucene.IdToEntityIterator.getSingle(IdToEntityIterator.java:88)", "org.neo4j.index.impl.lucene.IdToEntityIterator.getSingle(IdToEntityIterator.java:32)", "opentree.GraphExplorer.reconstructSyntheticTreeHelper(GraphExplorer.java:2278)", "opentree.GraphExplorer.reconstructSyntheticTree(GraphExplorer.java:1933)", "opentree.plugins.GoLS.getSyntheticTree(GoLS.java:207)", "java.lang.reflect.Method.invoke(Method.java:601)", "org.neo4j.server.plugins.PluginMethod.invoke(PluginMethod.java:57)", "org.neo4j.server.plugins.PluginManager.invoke(PluginManager.java:168)", "org.neo4j.server.rest.web.ExtensionService.invokeGraphDatabaseExtension(ExtensionService.java:300)", "org.neo4j.server.rest.web.ExtensionService.invokeGraphDatabaseExtension(ExtensionService.java:122)", "java.lang.reflect.Method.invoke(Method.java:601)" ] }

From: Stephen
That is odd because they are in the newick that I created. I wonder what the problem is. If you go from Homo sapiens you can see that it is in there, that the parent is from a source tree, and that Homininae is a parent, but you can't seem to navigate to it. I don't think that is an issue with the synthetic tree (or if it is, it clearly isn't causing a problem for generating the newick string, which traverses the same relationships). You can get the newick from here https://www.dropbox.com/s/hq9e06z7xxut0ls/test.tar.gz