The treematcher from etetoolkit

ensure python 2 and python 3 compatibility

Just like the other ete3 modules, which use six for portability.

Implement a cache system for complex queries

Implement cache functions for common operations (e.g. species content, duplications, etc.). For instance, if the pattern contains statements that check node size, you can precompute a dictionary such as node2size (take a look at the current Tree.get_cached_content) and use it later. If many things are cached, a single dict could be shared to save memory: node2cache = {node: [size, species, leaves]}. Alternatively, we can keep just a node2descendants dictionary and calculate any value on the fly, without traversing the tree back and forth many times.
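A minimal sketch of the shared-cache idea. It assumes a hypothetical `Node` class with `name` and `children` attributes rather than the real ete3 `Tree`. The helper name `build_node2cache` and the `[size, leaves]` layout are illustrative only.

```python
class Node(object):
    """Hypothetical stand-in for a tree node (not the ete3 class)."""

    def __init__(self, name="", children=None):
        self.name = name
        self.children = children or []


def build_node2cache(root):
    """Map every node to [size, leaf_names] in a single post-order pass."""
    node2cache = {}

    def visit(node):
        if not node.children:
            leaves = frozenset([node.name])
        else:
            leaves = frozenset()
            for child in node.children:
                leaves |= visit(child)
        # One shared entry per node instead of one dict per property.
        node2cache[node] = [len(leaves), leaves]
        return leaves

    visit(root)
    return node2cache


# Example tree: ((a,b),c)
ab = Node(children=[Node("a"), Node("b")])
root = Node(children=[ab, Node("c")])
cache = build_node2cache(root)
```

The recursive traversal keeps the sketch short. For very deep trees an iterative post-order traversal would avoid Python's recursion limit.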

find_match should be an iterator returning all found matches.

The find_match function currently returns only the first match of the queried pattern.
Ideally this function would yield all the matches, and could have a parameter to limit the number of matches the user may want.

something like:
find_match(self, tree, local_vars, maxhits=1)
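One possible shape for this, sketched over a plain list of candidates instead of a real tree. The `is_match` callback and the `iter_matches` helper are assumptions made for illustration, and `maxhits=None` (meaning "no limit") is a design choice of this sketch rather than the default proposed above.

```python
from itertools import islice


def iter_matches(nodes, is_match):
    """Lazily yield every node for which is_match(node) is true."""
    for node in nodes:
        if is_match(node):
            yield node


def find_match(nodes, is_match, maxhits=None):
    """Return an iterator over matches, stopping after maxhits if given."""
    return islice(iter_matches(nodes, is_match), maxhits)


names = ["a", "ab", "b", "abc"]
first_only = list(find_match(names, lambda n: n.startswith("a"), maxhits=1))
all_hits = list(find_match(names, lambda n: n.startswith("a")))
```

Because the result is lazy, a caller who wants only the first hit never pays for the rest of the search.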

do not document internal variables as docstrings

Docstrings are intended for users, so there is no need to describe in detail how internal variables are used. It is also odd that function arguments are named in the form '__name'; I do not think that is necessary.

ete3 merge

Once all other issues are finished, especially the one regarding the file structure, we will need to merge the repo into a new ete3 branch.

@unode has some experience on this and I (@jhcepas) can also help

not sure what smart_lineages function is used for

It needs some review and examples in the refactored code

check and improve `ete search` tool

basic functionality of the treematcher module should be accessible from the API through an ete tool called ete-search which should be implemented as ete_search.py

Desired features:

review and close this issue: #17
read newick pattern from the command line as newick string or from a file
read target trees from the command line as newick strings or from files
dump matches in newick format or print them as ascii art
any other stuff here?
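The desired features could map onto command line options roughly as follows. This is only a sketch: every option name here (`--pattern`, `--pattern-file`, `--trees`, `--output`) is hypothetical and not taken from the existing tool.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical ete-search command line."""
    parser = argparse.ArgumentParser(prog="ete-search")

    # The pattern comes either inline or from a file, never both.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--pattern", help="pattern as a newick string")
    source.add_argument("--pattern-file", help="file containing the pattern")

    parser.add_argument("--trees", nargs="+", default=[],
                        help="target trees as newick strings or file paths")
    parser.add_argument("--output", choices=["newick", "ascii"],
                        default="newick", help="how to report matches")
    return parser


args = build_parser().parse_args(
    ["--pattern", "(a,b);", "--trees", "((a,b),c);", "--output", "ascii"]
)
```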

docstring documentation is missing in TreeMatcher class and functions.

remove the preprocessing from shortcut functions

review unit tests

While refactoring I found a few test cases that were returning True when the result was actually incorrect. Let's double check all the unit tests little by little.
It is very important that we can trust them.

python 2/3 compatibility

Make sure code is python 2/3 compatible. ETE uses the module six for keeping compatible syntax

Incomplete Code Segment in ete_treematcher.py

In line number 156 the code is like this
if li yield line
which doesn't make any sense!

make caching optional

Caching is currently compulsory. This may not be the best option when processing many small trees, and may be intractable, in terms of memory consumption, for very large trees.

A good real case example

Having a nice real case example using the command line tool would round off the project very nicely. I am assigning this to milestone 1.1, as it is not as urgent as docs and features, but I think it is something Alisha could use for a tutorial-like wiki page.

strange behavior of regex with wild char

It does not fail for all trees, but, for this one in particular...

from treematcher import TreePattern
from ete3 import Tree
t = Tree('((aaaaaaaaad:1,(aaaaaaaaae:1,aaaaaaaaaf:1)1:1)1:1,((aaaaaaaaag:1,aaaaaaaaah:1)1:1,((aaaaaaaaai:1,aaaaaaaaaj:1)1:1,(aaaaaaaaaa:1,(aaaaaaaaab:1,aaaaaaaaac:1)1:1)1:1)1:1)1:1);')

print(t)
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
pt = '((aaaaaaaaaa)*,(aaaaaaaaaj)*)@;'
tpt = TreePattern(pt)
print(next(tpt.find_match(t)))

"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
pt = '((aaaaNOTaaaaaa)*,(aaaaaaNOTaaaj)*)@;'
tpt = TreePattern(pt)
print(next(tpt.find_match(t)))
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""

update documentation in README

it can probably be reviewed and used for the tutorial. The README and this repo will disappear once merged with ete3

README and basic usage documentation

Add a few lines of documentation to the README file.

Program description
How to use the command line tool, with a couple of examples.
How to use the API version (basic example)

clean up and update unit tests

Place all tests and examples in test_treematcher.py, which should use the Python unittest module, as other ete3 modules do.

progressively expand test cases
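A skeleton for such a test module might look like this. The `contains` function is a hypothetical placeholder for whatever treematcher function is under test. Each positive check is paired with a negative one, so a test cannot pass merely because the function always returns True.

```python
import unittest


def contains(leaf_names, wanted):
    """Hypothetical function under test: are all wanted names present?"""
    return set(wanted).issubset(leaf_names)


class TestTreeMatcher(unittest.TestCase):
    def test_contains_all(self):
        self.assertTrue(contains(["a", "b", "c"], ["a", "c"]))

    def test_contains_missing(self):
        # Negative case: must fail when one wanted name is absent.
        self.assertFalse(contains(["a", "b"], ["a", "z"]))


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(TestTreeMatcher)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```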

shortcuts for common queries

Some queries are very common. We should provide shortcut functions. Some candidates:
contains_species(@, [a, b, c, ...])
contains(@, [a, b, c, ...])
number_of_species(@)
is_duplication(@) # checks node.evol_type inferred with PhyloTree.get_descendant_evol_events()
is_speciation(@) # checks
size(@)
New function TreePattern.has_match() → return true if the target tree has at least one match
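A rough sketch of three of these shortcuts. It assumes that `evol_type` holds 'D' for duplications and 'S' for speciations, and that matches arrive as an iterator. `FakeNode` is a stand-in for illustration, not an ete3 class.

```python
class FakeNode(object):
    """Stand-in node carrying only an optional evol_type attribute."""

    def __init__(self, evol_type=None):
        if evol_type is not None:
            self.evol_type = evol_type


def is_duplication(node):
    """True if the node is annotated as a duplication event ('D')."""
    return getattr(node, "evol_type", None) == "D"


def is_speciation(node):
    """True if the node is annotated as a speciation event ('S')."""
    return getattr(node, "evol_type", None) == "S"


def has_match(matches):
    """True if the iterator of matches yields at least one item."""
    return any(True for _ in matches)


dup, spe, plain = FakeNode("D"), FakeNode("S"), FakeNode()
```

Using `getattr` with a default means nodes without an event annotation are simply neither a duplication nor a speciation, rather than raising an error.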

command line tool

The command line tool is now usable, but with very basic options.
More features should be included: for instance, options for reading target trees and pattern expressions from files, reporting matches visually or as newick, reporting just the number of hits, etc.

`pattern.find_matches(tree)` or `treematcher.find_matches(tree, pattern)`?

We need to decide which strategy is best (or maybe keep both), and organize the code in treematcher accordingly. Pros? Cons?

p = Tree('(A+,B);') 
t = Tree('((A,B), C);')
treematcher.find_matches(t, p)

p = Tree('(A+,B);') 
t = Tree('((A,B), C);')
p.find_matches(t)
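If both styles are kept, the module-level function can be a thin wrapper around the method, so the logic lives in one place. The sketch below uses a placeholder `TreePattern` that matches names in a list; the real matching logic is not shown.

```python
class TreePattern(object):
    """Placeholder pattern: matches items equal to a wanted name."""

    def __init__(self, wanted_name):
        self.wanted_name = wanted_name

    def find_matches(self, names):
        """Method style: pattern.find_matches(target)."""
        return (n for n in names if n == self.wanted_name)


def find_matches(names, pattern):
    """Function style: delegates to the method, adds no logic of its own."""
    return pattern.find_matches(names)


p = TreePattern("A")
t = ["A", "B", "A", "C"]
via_method = list(p.find_matches(t))
via_function = list(find_matches(t, p))
```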

review examples in README

I did not test the examples in the README after refactoring. Some will need small changes in the syntax I guess.

treematcher tutorial

A small tutorial section on how to use the treematcher API is missing.
Use ete3/sdoc/tutorial/tutorial_ncbi* as example

title in documentation

Not sure that "regular-expression like queries" is accurate any more. We basically use Python syntax (good, because it gives us a lot of power, but technically we depart from the idea of a high-level syntax).

Functionality standards

Desired functionality
Accept or Reject a relation or property

Type of connection

connected with zero or more intermediate nodes.
connected with one or more intermediate nodes: implemented.
connected with zero or one intermediate nodes.
in different subtrees ( = connected with zero or one intermediate nodes or they are sister nodes).

Type of connection, secondary ( variations of first two).

connected with zero until a N maximum number of nodes.
connected with at least a minimum N number of nodes.
connected with at least N and until M number of nodes.
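All of these connection types reduce to one check: count the intermediate nodes between a descendant and an ancestor, then compare the count with a lower and an upper bound. A sketch, assuming the tree is given as a child-to-parent dictionary (the function and parameter names are hypothetical):

```python
def intermediate_nodes(descendant, ancestor, parent):
    """Count nodes strictly between descendant and ancestor.

    Returns None if ancestor is not above descendant.
    """
    count = 0
    node = parent.get(descendant)
    while node is not None and node != ancestor:
        count += 1
        node = parent.get(node)
    return count if node == ancestor else None


def connected(descendant, ancestor, parent, min_nodes=0, max_nodes=None):
    """True if the number of intermediate nodes lies in [min_nodes, max_nodes]."""
    n = intermediate_nodes(descendant, ancestor, parent)
    if n is None:
        return False
    return n >= min_nodes and (max_nodes is None or n <= max_nodes)


# child -> parent map for the tree (((a,b)x)y,c)root
parent = {"a": "x", "b": "x", "x": "y", "y": "root", "c": "root"}
```

With this helper, "zero or more" is `min_nodes=0`, "one or more" is `min_nodes=1`, "zero or one" is `max_nodes=1`, and the N-to-M variants set both bounds.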

Properties

is root
is leaf

Sets

Reference

has the same attributes as the node it refers to.

restructure repository in preparation for merging with `ete3`

the directories and file names should match ete3 structure:

ete3/test/test_treematcher.py
ete3/treematcher/__init__.py
ete3/treematcher/treematcher.py
ete3/sdoc/tutorial/tutorial_treematcher.rst
ete3/tools/ete_search.py

only one event label per node

there is no need for a loop here:

treematcher/treematcher.py

Line 368 in 870cfbe

for event in __pattern.evol_events:

and here:

treematcher/treematcher.py

Line 390 in 870cfbe

for event in __pattern.evol_events:

The functions should receive a target node and return true if it is a duplication ('target.etype=="D"') or a speciation ('S'). Actually, the purpose of this was more for things like contains_duplications(_target) or, even better, duplications_below(_target), which should read from the cached content. Something like:

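def duplications_under(node):
    # get_cached_attr is assumed to return the list of 'etype' values
    # for all nodes under `node`, read from the cached content.
    events = get_cached_attr(node, 'etype')
    return events.count('D')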

relaxed matches (regular expression like syntax)

some proposals for the relaxed matching syntax (using Perl symbols):

Find nodes where (a,b) is connected to (c,d) by one or more intermediate parent nodes:

( ((a,b))+, (c,d));

Find nodes where (a,b) and (c,d) exist in the same tree connected by any number of nodes:

( ((a,b))*, ((c,d))*);

Ideally, this should also allow us to combine with basic node searches. For instance, find nodes where (a,b) is connected to (c,d), by one or more duplication parent nodes:

( ((a,b))'+is_duplication(@) == True', (c,d));

Syntax looks ugly, I know, but once the functionality is in place we could think of a more readable method
