
treemachine's Introduction


treemachine-LITE

Description

treemachine-LITE is a pared-down version of the original treemachine which was used to generate synthetic phylogenetic trees for the Open Tree of Life project. Synthetic analyses are now performed by other tools. The role of treemachine-LITE is simply to construct a neo4j database and serve such trees.

Installing dependencies

treemachine-LITE is managed by Maven v.3 (including the dependencies). In order to compile and build treemachine-LITE, it is easiest to let Maven v.3 do the hard work.

maven

On Linux you can install Maven v.3 with:

sudo apt-get install maven

On Mac OS, Maven v.3 can be installed with Homebrew:

brew install maven

jade and ot-base

Once Maven v.3 is installed, the treemachine-LITE dependencies themselves can be installed with:

sh mvn_install_dependencies.sh

neo4j

The DB constructed by treemachine-LITE is meant to be served by neo4j. We are currently using neo4j-community-v1.9.5. To obtain and decompress neo4j:

$ curl http://files.opentreeoflife.org/neo4j/neo4j-community-1.9.5-unix.tar.gz > neo4j-community-1.9.5.tar.gz
$ tar xzf neo4j-community-1.9.5.tar.gz

Alternatively, there is a make target for neo4j:

make neo4j

You can move the neo4j directory wherever you like.

Compiling treemachine-LITE

NOTE: The script for compiling the server plugins will delete any command-line jar in the target directory (and the opposite is true - compiling the command-line jar will delete the server plugins from the target directory). You can rebuild either just by running those scripts again.

Command-line version

To compile the command-line version (which you can then use to build a database):

sh mvn_cmdline.sh

This creates treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar in the target directory (and deletes the server plugin jar if it existed).

Server plugins

To compile the server plugins for interacting with the graph over REST calls:

sh mvn_serverplugins.sh

This creates treemachine-neo4j-plugins-0.0.1-SNAPSHOT.jar in the target directory (and deletes the command-line jar if it existed). Then, copy the .jar file into the neo4j plugins directory:

cp -p target/treemachine-neo4j-plugins-0.0.1-SNAPSHOT.jar "$NEO4J_HOME/plugins"

Usage

Constructing a DB

First, compile the command-line version (see above). Then, to build the DB:

java -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar ingestsynth newick_tree json_annotations tsv_taxonomy DB_name

where:

  • newick_tree is the labelled_supertree/labelled_supertree.tre from the synthesis procedure
  • json_annotations is the annotated_supertree/annotations.json from the supertree procedure
  • tsv_taxonomy is the taxonomy.tsv file from the Open Tree of Life Taxonomy (OTT)
  • DB_name is the name for the directory that will hold the generated DB (you choose)

Serving a DB with Neo4j

Compile the server plugins (see 'Server plugins', above). Before starting neo4j, you will need to modify the file $NEO4J_HOME/conf/neo4j-server.properties. Put the full path of the DB directory constructed by treemachine-LITE as the value for the org.neo4j.server.database.location setting.

After you have loaded content into your DB, you can run the neo4j HTTP server with the command:

neo4j-community-1.9.5/bin/neo4j start

If you add the neo4j-community-1.9.5/bin directory to your path, you can just say neo4j start.

Running the tests

To make sure everything is running ok, run the web service tests:

cd ws-tests
./run_tests.sh host:apihost=http://localhost:7474 host:translate=true

treemachine's People

Contributors

blackrim, chinchliff, jar398, jimallman, josephwb, kcranston, mtholder, rhr


treemachine's Issues

Method for reading Nexson and adding to graph

Basically we need to connect NexsonReader to MainRunner.

The Nexson file could come either from the file system, if some independent process has fetched it from phylografter into a local file, or it could be fetched directly from phylografter using the web service - this is an undecided design question. I would say that for debugging purposes doing the local file would be the best first step since we'll have files available for testing in advance of deployment of the service in phylografter.

This is downstream of issue #7

taxonomy elements that trigger the "-bad path branch"

I'm not sure if this is an issue or not, but here is what I'm seeing...
In TreeLoader.addAdditionalTaxonomyTableIntoGraph there is a branch of code to deal with taxa that have an array of parents that eventually ends in null (around lines 292-308 in commit 3f88466).

It looks like the intent is to avoid adding disconnected nodes to the graph.
Question 1: if the taxon is part of a bad path, this logic will keep the TAXCHILDOF relationship from being made between the taxonomy nodes. Do we also want to go back and remove the nodes from the graph? (currently they are left disconnected from the rest of the taxonomy).

Question 2: I don't understand how the ancestral nodes in these "bad paths" ever get TAXCHILDOF relationships. But on the dev server, some of them at least do have relationships (e.g. search for the name "saltmarsh%20clone%20ensemble" in the taxNamedNodes index). On my local instance, after running inittax on COL and addtax for NCBI, this node exists in the graph but has no relationships. Perhaps the relationships on the dev server are being created after other treemachine commands are run. Do we want to standardize the behavior of these nodes?

Nexson import of ot:dataDeposit property

I fixed a problem NexsonReader.java had with reading in "@href" property values. Now it is stuck on "ot:dataDeposit". Specifically:

Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String

I don't have an example file with this field handy. I will try to get back to this in a few hours.
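The exception above is what Java raises when a value that the JSON parser produced as a number is cast directly to String. A minimal sketch of a cast-free conversion follows. It assumes property values arrive as plain Object instances; the class name, method name, and sample values are hypothetical and are not taken from NexsonReader.java.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not code from NexsonReader.java.
class PropertyValueSketch {

    // Converts any parsed JSON value to a String without casting.
    // A direct cast such as (String) value fails when the parser produced a Long.
    static String asString(Object value) {
        return value == null ? null : String.valueOf(value);
    }

    public static void main(String[] args) {
        Map<String, Object> meta = new HashMap<>();
        meta.put("ot:dataDeposit", 12345L);               // assumed: parser yielded a number
        meta.put("@href", "http://example.org/deposit");  // assumed: parser yielded a string

        for (Map.Entry<String, Object> e : meta.entrySet()) {
            System.out.println(e.getKey() + " = " + asString(e.getValue()));
        }
    }
}
```

Converting with String.valueOf keeps the reader tolerant of either representation, at the cost of no longer noticing when a field has an unexpected type.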

Knuckles in extracted synthetic tree

The synthetic tree retains "knuckles" or "knees" present in the taxonomy. These are not handled by any analyses wanting to use the trees (although viewers like Dendroscope and FigTree can display them). For now, just use scripts to remove these. Eventually, this should become an export option.
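Removing knuckles amounts to replacing every node that has exactly one child with that child. The following is a minimal sketch under that assumption; the Node class is a hypothetical stand-in (JadeNode is richer), branch lengths are ignored, and the label of a removed knuckle is simply dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree; not the JadeNode/JadeTree classes.
class KnuckleSketch {

    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) children.add(k);
        }
    }

    // Modifies the tree in place and returns the (possibly different) subtree root,
    // with every single-child node replaced by its only child.
    static Node removeKnuckles(Node node) {
        while (node.children.size() == 1) {
            node = node.children.get(0);   // skip the knuckle
        }
        for (int i = 0; i < node.children.size(); i++) {
            node.children.set(i, removeKnuckles(node.children.get(i)));
        }
        return node;
    }

    // Serializes to newick without branch lengths or the trailing semicolon.
    static String toNewick(Node node) {
        if (node.children.isEmpty()) return node.name;
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < node.children.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(toNewick(node.children.get(i)));
        }
        return sb.append(')').append(node.name).toString();
    }

    public static void main(String[] args) {
        // ((A)X,(B,C)Y)R : X is a knuckle above A
        Node root = new Node("R",
                new Node("X", new Node("A")),
                new Node("Y", new Node("B"), new Node("C")));
        System.out.println(toNewick(removeKnuckles(root)) + ";");  // prints (A,(B,C)Y)R;
    }
}
```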

run_example.sh fails with java.util.NoSuchElementException

This is from a clean installation according to the README. Suggestions?

$ ./run_example.sh 
+ java -Dlog4j.configuration=debuglog4j.properties -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar inittax example/ncbi_primates.txt test.db
things will happen here
initializing taxonomy from example/ncbi_primates.txt to test.db
Exception in thread "main" java.util.NoSuchElementException: the taxonomy file appears to be missing some fields.
    at opentree.GraphInitializer.processTaxInputLine(GraphInitializer.java:275)
    at opentree.GraphInitializer.addInitialTaxonomyTableIntoGraph(GraphInitializer.java:182)
    at opentree.MainRunner.taxonomyLoadParser(MainRunner.java:71)
    at opentree.MainRunner.main(MainRunner.java:1730)
+ exit

update synthesis filters to have explicit include/exclude

Currently not an option. Functionally this is not much of a limitation (just reverse the filter criteria), but for the sake of clarity, we should require the filter direction to be explicit. Should be a minor change, but one that requires some architecting.
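One possible shape for an explicit direction is an enum passed alongside the filter criteria. This is only a sketch: the class, the enum, and the study ids are hypothetical, and the existing synthesis filter code may be organized quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of an explicit filter direction.
class FilterDirectionSketch {

    enum Mode { INCLUDE, EXCLUDE }

    // Keeps the sources that match the criteria (INCLUDE)
    // or the sources that do not match them (EXCLUDE).
    static List<String> apply(List<String> sources, Set<String> matched, Mode mode) {
        List<String> kept = new ArrayList<>();
        for (String source : sources) {
            boolean hit = matched.contains(source);
            boolean keep = (mode == Mode.INCLUDE) ? hit : !hit;
            if (keep) kept.add(source);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList("study_A", "study_B", "study_C");
        Set<String> matched = new HashSet<>(Arrays.asList("study_B"));

        System.out.println(apply(sources, matched, Mode.INCLUDE));
        System.out.println(apply(sources, matched, Mode.EXCLUDE));
    }
}
```

Making the mode a required argument means a caller can no longer rely on an implicit default and has to state which direction is intended.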

getSyntheticTree arguson wrapped in list

This is mostly a question for Jim, I think.

The arguson from the getSyntheticTree service is wrapped in a list, but it seems like there is only one thing in that list, which is the entire rest of the json output. It looks something like this:

[
{ "name": "life", }
]

Is this correct? If it's just a fluke, I can easily change it now to not be contained in a list. Otherwise I will preserve this.

Rename ISCALLED to something else

I brought this up before, but ... any chance of getting ISCALLED changed to something else?

(1) A taxon is a class of organisms, not a name for one.
(2) The taxon and the corresponding phylogenetically defined group may be mostly coincident but will never be exactly coincident.

Let A = a phylogenetically defined group = the MRCA of some organisms/groups together with all of the MRCA's descendants
Let B = a taxonomically defined group, according to some circumscription, perhaps based on physical characters (and known, one would hope, by whomever coined or meaningfully used the taxon name; just not known to us)
We are talking about the name for the relationship between A and B.

Semantically the relationship is that A is a subclass of B, i.e. all members of A are members of B.

The taxon B will in general be a proper superclass of A, not identical to A. If you consider ancestral organisms, B (if defined by characters and monophyletic) will have emerged historically before the point of bifurcation that defines A, so for some time there would have been Bs that were not As. It is also easy to imagine (say) genera that contain extant species that simply didn't participate in a phylogenetic analysis; A = the MRCA of all the species that were surveyed is only a subclass of B = the genus, not the same as the genus.

Maybe BELONGSTO, or NAMEDTAXON, etc. Not sure what to suggest.

This may never be seen in the UI or documentation, but it does show up in the code, and people will read the code.

Service to return list of the studies that were input to synthesis

It would be useful to have a service that returns, given the name of a synthetic tree, the list of studies (and trees too) that contributed to it.

Applications:

(1) Provide feedback to curators, who are often left wondering which of their studies made the cut.

(2) Needed for credits page, so that we can make an accurate bibliography. (DOIs would be useful for this purpose, but not essential since we can get them from Phylografter.)

It occurs to me it might also be useful to return, for each tree, the id of the root node of that tree, so that you can go from browsing a list of studies to browsing the synthetic tree or particular tree. But maybe that should be a separate service.

Many "isolated" taxa in treemachine (bad Lucene index?)

There are currently lots of taxa that appear alone in argus -- no ancestors or descendant nodes, just one lonely dot. For example, search for Hominidae.

Most nodes close to the root (life) node are intact, eg,
http://dev.opentreeoflife.org/opentree/otol.draft.22@3103299/Metazoa-Monosiga-ovata

...but treemachine fails when retrieving a subtree for Chordata:
http://dev.opentreeoflife.org/opentree/otol.draft.22@417744/Chordata

The error message describes a Lucene error, where multiple items are found in the index where one is expected.

It appears that all descendant nodes (within the clade Chordata) are isolated, eg:
http://dev.opentreeoflife.org/opentree/argus/ottol@947318/Craniata

JSON string export - are we asking for trouble?

We seem to just be composing JSON by hand, which is typically easy. We could use something like: http://jettison.codehaus.org/apidocs/org/codehaus/jettison/json/JSONStringer.html

In many cases we know that the strings are safe for JSON (in which case it is probably faster to not check to see if we need to escape them). I may be the most guilty party here, as I just worked on spitting out metadata for studies. We could easily have quotes or Unicode in those strings associated with references, in which case my code would generate bogus JSON.

That data came from phylografter via JSON, so I'm hoping that it is clean.

Deciding that the DB will hold strings that are already escaped for JSON would make life easier. Have we decided that? (If so, are we checking this when ingesting trees?)

What is JadeNode.number for?

It's not clear from the documentation in JadeNode.java what the semantics of "number" is intended to be. Some explanation would be helpful.

Secure neo4j

Currently access on port 7474 lets you do anything you want to the graph db: add records, add and delete indexes, etc. That's probably fine for as long as we're in development but before an alpha release this needs to be tightened up.

The easiest way to do this would be to use Apache's ProxyPass, which would have negligible overhead if Apache and neo4j were running on the same physical server. Safe URIs could be exposed publicly on the web, with all others limited to localhost. See also http://docs.neo4j.org/chunked/stable/security-server.html

Parse Nexson study file

There has to be a way to read in a phylogenetic tree coming from Phylografter and represented in Nexson (i.e. NeXML in JSON) format.

[Renamed issue, was "Nexson tree import"]

deal with nexson otus that don't have ottol ids before adding to graph

We need to be able to deal with these before they are added to the graph. The idea would be to

  • TNRS the names that are there using taxomachine
    • For ones that match that weren't matched before, we can assign the ottol id
  • For remaining names, if there are any, retrieve the scope determined by the TNRS
    • Add these remaining names as names subtending the scope name to the taxonomy WITHIN treemachine
    • Add these new names to an index that identifies them as new names and gives each a temp ottol id. This index will serve as an easy way to update treemachine once we have new official ids from taxomachine, and will also make it easy to update taxomachine

Pruning overlapping tips

A problem with pruning overlapping taxa. Occurs when an incompletely specified taxon (e.g. Foo sp.) is mapped to the genus, but other congeners are also present in the tree. In this case, we want to prune the taxon mapped to the genus, and retain all others.

Here is an example of how it shouldn't behave:

pgloadind overlapping tips
pgloadind overlapping retained name "Leiognathus" nexsonid "node1057971"
pgloadind overlapping pruned name "Leiognathus striatus" nexsonid "node1057972"

Where "Leiognathus striatus" is a valid taxon.

Instead, it should behave like:

pgloadind overlapping tips
pgloadind overlapping retained name "Equulites stercorarius" nexsonid "node1057988"
pgloadind overlapping pruned name "Equulites" nexsonid "node1057993"
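The intended rule can be sketched as: prune a tip only when another tip in the same tree maps to a more specific taxon inside it. The sketch below approximates "inside" by a name-prefix test (genus name followed by a space), which is an assumption made purely for illustration; real code would presumably consult the taxonomy in the graph. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the pruning code used by pgloadind.
class OverlapPruneSketch {

    // Returns the tips to retain: a tip is pruned only if some other tip
    // carries a more specific name within it (e.g. "Equulites" is pruned
    // when "Equulites stercorarius" is also present).
    static List<String> retainMostSpecific(List<String> tipNames) {
        List<String> retained = new ArrayList<>();
        for (String candidate : tipNames) {
            boolean hasMoreSpecificTip = false;
            for (String other : tipNames) {
                if (other.startsWith(candidate + " ")) {
                    hasMoreSpecificTip = true;
                    break;
                }
            }
            if (!hasMoreSpecificTip) retained.add(candidate);
        }
        return retained;
    }

    public static void main(String[] args) {
        List<String> tips = Arrays.asList(
                "Leiognathus", "Leiognathus striatus",
                "Equulites", "Equulites stercorarius");
        System.out.println(retainMostSpecific(tips));
    }
}
```

With this rule both genus-level tips are pruned and both species are retained, which matches the second (desired) log excerpt rather than the first.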

Correct JSON export for getStudyIngestMessagesForNexSON

A service in the GoLS plugin class, used for exporting study readiness metadata.

Currently this class double-wraps a JSON string and exports it. It should serialize the JSON properly to avoid encoding problems and the need to post-process all the escaped characters.

One option is the RepresentationConverter classes.

This issue split off of #32.

Correct JSON export for arguson treeview plugins

Currently this builds the JSON strings by hand and then double-JSON-izes them to export. We can correct this by defining RepresentationConverter methods that will serialize the JSON during export.

Not a high priority.

What is contained in the "jsonprint" attached Objects for JadeNodes?

I came across this when converting the json serialization methods. There is a passage in JadeNode.getJSON() method where an object called "jsonprint" that is attached to a JadeNode is concatenated onto the JSON as it is built. This code has been copied into the ArgusonRepresentationConverter.getArgusonRepresentationForJadeNode() method but is currently commented out, since it was not being added as an element with a value (presumably it is pre-constructed and pre-formatted JSON?). It can be added back if necessary but not sure how best to do that for now.

JadeNode.getJSON could be significantly more efficient

It's using concatenation += to build the result string, meaning a copy of the string is made for every token added. Running time could be linear instead of quadratic by using a StringBuffer or StringWriter to build up the result.
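The difference can be sketched in isolation. The example uses StringBuilder, the unsynchronized counterpart of the StringBuffer suggested above; the class and method names are hypothetical and the token list merely stands in for whatever JadeNode.getJSON emits.

```java
// Hypothetical illustration; not code from JadeNode.java.
class JsonBuildSketch {

    // Each += allocates a new String and copies everything built so far,
    // so total work grows quadratically with the output length.
    static String joinWithConcatenation(String[] tokens) {
        String out = "";
        for (String token : tokens) {
            out += token;
        }
        return out;
    }

    // StringBuilder appends into a growable buffer, so total work is linear.
    static String joinWithBuilder(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = { "{", "\"name\"", ":", "\"life\"", "}" };
        System.out.println(joinWithConcatenation(tokens));
        System.out.println(joinWithBuilder(tokens));
    }
}
```

Both methods return the same string; only the cost of producing it differs, which matters for large subtrees.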

Service to return list of the studies that were input to synthesis

Need a table or JSON structure that has an entry for each study that was input to the synthesis of any tree in the graph db.

This is needed in order to generate a bibliography or 'credits' page for the web site.

Each entry needs at least the study id. If more information, such as the DOI, is handy, it would be convenient to have that as well. (Other information can be obtained from Crossref or Phylografter as needed.)

Bad JSON for node Aves (due to un-encoded quotes in string value)

Opentree web-app can't show Aves properly:

http://dev.opentreeoflife.org/opentree/argus/otol.draft.22@474108/Aves

The error is from a string value (a study reference) with quotes in it, and they weren't properly encoded:

"S. J. Hackett, R. T. Kimball, S. Reddy, R. C. K. Bowie, E. L. Braun, M. J. Braun, J. L. Chojnowski, W. A. Cox, K.-L. Han, J. Harshman, C. J. Huddleston, B. D. Marks, K. J. Miglia, W. S. Moore, F. H. Sheldon, D. W. Steadman, C. C. Witt, T. Yuri. 2008. "A Phylogenomic Study of Birds Reveals Their Evolutionary History". Science. 320(5884). 1763-1768."

Since the JSON sent to the browser is encoded, these quotes would need to be double-encoded, travelling in the response as \\", instead of \". (There are many gotchas like this, so it's almost never a good idea to build JSON from scratch.) It looks like JadeNode.getJSON uses a semi-manual approach that doesn't properly handle quotes within strings:

https://github.com/OpenTreeOfLife/treemachine/blob/cc452aa9e19addf90ce57be806e03970107c3fb5/src/main/java/jade/tree/JadeNode.java

As a workaround, consider adding a method (or enhancing JSONExporter.escapeString) to encode quotes within a string, and call this whenever we're adding a string whose content might include quotes or other problematic characters (colon, others?).
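A minimal sketch of such an escaping method is shown below. It is not the existing JSONExporter.escapeString; the class and method names are hypothetical. Under the JSON grammar, the characters that must be escaped inside a string are the double quote, the backslash, and control characters below U+0020, so a colon inside a quoted string needs no special handling.

```java
// Hypothetical illustration; not the existing JSONExporter.escapeString.
class JsonEscapeSketch {

    // Escapes a raw string so it can be placed between double quotes in JSON.
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String reference = "2008. \"A Phylogenomic Study of Birds Reveals Their Evolutionary History\". Science.";
        System.out.println("\"" + escape(reference) + "\"");
    }
}
```

A proven JSON serializer remains the more robust choice; a helper like this only narrows the gap while hand-built JSON is still in use.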

Regarding previous discussion in #32, #27, #5: I'm definitely a fan of using a proven JSON serializer rather than attempting "manual" JSON.

Is the GetJsons plugin in treemachine obsolete?

This class appears to be a relic of a code transfer with taxomachine, and contains services that appear to be entirely unused (and some only partially functional?). The corresponding services all exist within taxomachine, where they are functional.

If this class is obsolete then I will remove it from treemachine.

Also mentioned in #32.

Incomplete installation instructions

In the instructions after sh mvn_cmdline.sh
which yielded "BUILD SUCCESS" I got stuck here:
java -jar target/treemachine-0.0.1-SNAPSHOT-jar-with-dependencies.jar
There is no such jar file.
There is a jar file in target/ but it isn't runnable ("no main manifest attribute").

Then the instructions say to run neo4j, but didn't say previously that neo4j had to be installed.

Create Travis CI Config + Basic Tests

This setup will use Maven to install the necessary dependencies and run some basic tests that verify treemachine compiles and isn't completely borked.

Synthesis aimed directly at posterior distribution of trees

Add synthesis methods aimed specifically at tree distributions where taxon sampling is completely overlapping, i.e. posterior distributions (and possibly bootstrap samples). Methods such as concordance trees, agreement subtrees, etc.

nested input taxa

We need to raise an exception for nested taxon names in the input source trees. For example, (Homo, Homo_sapiens) needs to be rejected because Homo_sapiens is nested within Homo, yet the two are sisters in the tree.
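One way to sketch the check: for each set of sister tips, walk up the taxonomy from every tip and reject the tree if the walk reaches one of its sisters. The child-to-parent map below is a hypothetical stand-in for the taxonomy stored in the graph, and the class and method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the taxonomy is modelled as a child -> parent map.
class NestedTaxaSketch {

    // True if 'ancestor' lies on the path from 'taxon' up to the taxonomy root.
    static boolean isAncestor(Map<String, String> parentOf, String ancestor, String taxon) {
        String current = parentOf.get(taxon);
        while (current != null) {
            if (current.equals(ancestor)) return true;
            current = parentOf.get(current);
        }
        return false;
    }

    // Throws if any sister is taxonomically nested within another sister.
    static void checkSisters(Map<String, String> parentOf, List<String> sisters) {
        for (String a : sisters) {
            for (String b : sisters) {
                if (!a.equals(b) && isAncestor(parentOf, a, b)) {
                    throw new IllegalArgumentException(
                            b + " is nested within its sister " + a);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Homo_sapiens", "Homo");
        parentOf.put("Homo", "Homininae");
        parentOf.put("Pan", "Homininae");

        checkSisters(parentOf, Arrays.asList("Homo", "Pan"));   // passes silently
        try {
            checkSisters(parentOf, Arrays.asList("Homo", "Homo_sapiens"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```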

Internal Server Error: More than one element in ...

I'm getting errors from http://dev.opentreeoflife.org/opentree/argus/otol.draft.22@434521/Homininae
and others:

Whoops! The call to get the tree around a node did not work out the way we were hoping it would. Show details

[error] Internal Server Error

{ "message" : "More than one element in org.neo4j.index.impl.lucene.LuceneIndex$1@6bcca3da. First element is 'Node[3103424]' and the second element is 'Node[3103423]'", "exception" : "NoSuchElementException", "stacktrace" : [ "org.neo4j.helpers.collection.IteratorUtil.singleOrNull(IteratorUtil.java:118)", "org.neo4j.index.impl.lucene.IdToEntityIterator.getSingle(IdToEntityIterator.java:88)", "org.neo4j.index.impl.lucene.IdToEntityIterator.getSingle(IdToEntityIterator.java:32)", "opentree.GraphExplorer.reconstructSyntheticTreeHelper(GraphExplorer.java:2278)", "opentree.GraphExplorer.reconstructSyntheticTree(GraphExplorer.java:1933)", "opentree.plugins.GoLS.getSyntheticTree(GoLS.java:207)", "java.lang.reflect.Method.invoke(Method.java:601)", "org.neo4j.server.plugins.PluginMethod.invoke(PluginMethod.java:57)", "org.neo4j.server.plugins.PluginManager.invoke(PluginManager.java:168)", "org.neo4j.server.rest.web.ExtensionService.invokeGraphDatabaseExtension(ExtensionService.java:300)", "org.neo4j.server.rest.web.ExtensionService.invokeGraphDatabaseExtension(ExtensionService.java:122)", "java.lang.reflect.Method.invoke(Method.java:601)" ] }

From: Stephen
That is odd because they are in the newick that I created. I wonder what the problem is. You can see if you go from Homo sapiens that it is in there and the parent is from a source tree and Homininae is a parent but you can't seem to navigate to it. I don't think that is an issue with the synthetic tree (or if it is, it clearly isn't causing a problem for generating the newick string which traverses the same relationships). You can get the newick from here https://www.dropbox.com/s/hq9e06z7xxut0ls/test.tar.gz
