treematcher's People

Contributors

alishamechtley, dpliakos, fransua, jhcepas, unode

treematcher's Issues

Implement a cache system for complex queries

Implement some cache functions for common operations (e.g. species content, duplications, etc.). For instance, if the pattern contains statements that check node size, you can precompute a dictionary such as node2size (take a look at the current Tree.get_cached_content) and reuse it later. If many things are cached, a single dict could be shared to save memory: node2cache = {node: [size, species, leaves]}. Alternatively, we can keep just a node2descendants dictionary and calculate any value on the fly, without traversing the tree back and forth many times.
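A minimal sketch of the shared-cache idea, assuming a tiny stand-in Node class rather than the real ete3 tree, and assuming that "size" means the number of leaves under a node. The names Node and build_cache are hypothetical. Everything is filled in during a single post-order pass, so each node is visited once:

```python
class Node(object):
    """Tiny stand-in for a tree node (hypothetical, not the ete3 class)."""
    def __init__(self, name="", species=None, children=None):
        self.name = name
        self.species = species
        self.children = children or []


def build_cache(root):
    """Return {node: (size, species, leaves)} using one post-order pass."""
    cache = {}
    # Iterative post-order: a node is finalised only after all its children.
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        if not visited:
            stack.append((node, True))
            stack.extend((child, False) for child in node.children)
            continue
        if not node.children:
            cache[node] = (1, frozenset([node.species]), (node,))
        else:
            parts = [cache[child] for child in node.children]
            size = sum(p[0] for p in parts)
            species = frozenset().union(*(p[1] for p in parts))
            leaves = tuple(leaf for p in parts for leaf in p[2])
            cache[node] = (size, species, leaves)
    return cache


a = Node("a", "human")
b = Node("b", "mouse")
c = Node("c", "human")
inner = Node("ab", children=[a, b])
root = Node("root", children=[inner, c])
node2cache = build_cache(root)
```

Pattern checks can then read node2cache[node] instead of re-traversing the subtree for every constraint.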

find_match should be an iterator returning all found matches.

The find_match function now returns only the first match of the queried pattern.
Ideally, this function would yield all matches and take a parameter limiting the number of matches the user wants.

something like:
find_match(self, tree, local_vars, maxhits=1)
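A rough sketch of how such an iterator could behave, written as a free function over a stand-in Node class. The is_match callable is a hypothetical placeholder for the real pattern test, and the local_vars argument is left out for brevity. Matches are produced lazily, and itertools.islice enforces the optional limit:

```python
from itertools import islice


class Node(object):
    """Stand-in node with a pre-order traversal (hypothetical, not ete3)."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def traverse(self):
        yield self
        for child in self.children:
            for node in child.traverse():
                yield node


def find_match(tree, is_match, maxhits=None):
    """Lazily yield nodes of `tree` accepted by `is_match`."""
    hits = (node for node in tree.traverse() if is_match(node))
    # islice stops after `maxhits` items; None means "no limit".
    return islice(hits, maxhits)


tree = Node("r", [Node("x1"), Node("y"), Node("x2")])
starts_with_x = lambda node: node.name.startswith("x")
all_hits = [node.name for node in find_match(tree, starts_with_x)]
first_hit = [node.name for node in find_match(tree, starts_with_x, maxhits=1)]
```

With maxhits=1 the traversal stops as soon as the first match is found, so the current "first match only" behaviour stays cheap.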

do not document internal variables as docstrings

Docstrings are intended for users, so there is no need to describe in detail how internal variables are used. Also, it is odd that function arguments are in the form '__name'. I think that is not necessary.

ete3 merge

Once all other issues are finished, especially the one regarding the file structure, we will need to merge the repo into a new ete3 branch.

@unode has some experience on this and I (@jhcepas) can also help

check and improve `ete search` tool

basic functionality of the treematcher module should be accessible from the API through an ete tool called ete-search which should be implemented as ete_search.py

Desired features:

  • review and close this issue: #17
  • read newick pattern from the command line as newick string or from a file
  • read target trees from the command line as newick strings or from files
  • dump matches in newick format or print them as ascii art
  • any other stuff here?
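One possible shape for the argument handling, as a sketch only. The option names (-p/--pattern, -t/--trees, --output) and the helper read_newick_arg are assumptions made for illustration, not the existing ete tool interface. Each value is treated as a file path if such a file exists, and as a literal newick string otherwise:

```python
import argparse
import os


def read_newick_arg(value):
    """Return newick text: the file contents if `value` is a path, else `value`."""
    if os.path.isfile(value):
        with open(value) as handle:
            return handle.read().strip()
    return value.strip()


def build_parser():
    # All option names below are hypothetical.
    parser = argparse.ArgumentParser(prog="ete-search")
    parser.add_argument("-p", "--pattern", required=True,
                        help="newick pattern, or a file containing it")
    parser.add_argument("-t", "--trees", nargs="+", required=True,
                        help="target trees as newick strings or files")
    parser.add_argument("--output", choices=["newick", "ascii"],
                        default="newick", help="how to report matches")
    return parser


args = build_parser().parse_args(
    ["-p", "((a,b),c);", "-t", "((a,b),(c,d));", "--output", "ascii"])
pattern = read_newick_arg(args.pattern)
targets = [read_newick_arg(value) for value in args.trees]
```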

review unit tests

While refactoring, I found a few test cases that were returning True when the result was actually incorrect. Let's double-check all the unit tests little by little.
It is very important that we can trust them.

python 2/3 compatibility

Make sure the code is Python 2/3 compatible. ETE uses the six module to keep the syntax compatible.

make caching optional

Caching is now compulsory. This may not be the best option for many small trees, and may simply be intractable, in terms of memory consumption, for very large trees.

A good real case example

Having a nice real-case example that uses the command line tool would round off the project very nicely. I am assigning this to milestone 1.1, as it is not as urgent as docs and features, but I think it is something Alisha could use for a tutorial-like wiki page.

strange behavior of regex with wild char

It does not fail for all trees, but, for this one in particular...

from treematcher import TreePattern
from ete3 import Tree
t = Tree('((aaaaaaaaad:1,(aaaaaaaaae:1,aaaaaaaaaf:1)1:1)1:1,((aaaaaaaaag:1,aaaaaaaaah:1)1:1,((aaaaaaaaai:1,aaaaaaaaaj:1)1:1,(aaaaaaaaaa:1,(aaaaaaaaab:1,aaaaaaaaac:1)1:1)1:1)1:1)1:1);')

print(t)
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
pt = '((aaaaaaaaaa)*,(aaaaaaaaaj)*)@;'
tpt = TreePattern(pt)
print(next(tpt.find_match(t)))

"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
pt = '((aaaaNOTaaaaaa)*,(aaaaaaNOTaaaj)*)@;'
tpt = TreePattern(pt)
print(next(tpt.find_match(t)))
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""

README and basic usage documentation

Add a few lines of documentation to the README file.

  • Program description
  • How to use the command line tool, with a couple of examples.
  • How to use the API version (basic example)

clean up and update unit tests

Place all tests and examples in test_treematcher.py, which should use the Python unittest module, as other ete3 modules do.

progressively expand test cases
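A small sketch of the kind of test layout this suggests, using a toy search function (find_leaf_names, hypothetical) in place of the real pattern matching. The point it illustrates is comparing against the exact expected result with assertEqual, so a test cannot pass just because something truthy came back:

```python
import unittest


def find_leaf_names(names, prefix):
    """Hypothetical stand-in for a pattern search over leaf names."""
    return [name for name in names if name.startswith(prefix)]


class TestSearch(unittest.TestCase):
    def test_exact_hits(self):
        hits = find_leaf_names(["aa", "ab", "ba"], "a")
        # Compare the full result, not just its truthiness.
        self.assertEqual(hits, ["aa", "ab"])

    def test_no_hits(self):
        self.assertEqual(find_leaf_names(["aa"], "z"), [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSearch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```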

shortcuts for common queries

Some queries are very common. We should provide shortcut functions. Some candidates:
contains_species(@, [a, b, c, ...])
contains(@, [a, b, c, ...])
number_of_species(@)
is_duplication(@) # checks node.evol_type inferred with PhyloTree.get_descendant_evol_events()
is_speciation(@) # checks
size(@)
New function TreePattern.has_match() → return true if the target tree has at least one match
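A sketch of how a few of these shortcuts might look when written against a minimal stand-in Node class (hypothetical, not the ete3 API). It assumes that contains_species and contains require all listed items to be present, and that size counts leaves. Both are assumptions, since the issue does not define them:

```python
class Node(object):
    """Stand-in node: leaves carry a name and a species."""
    def __init__(self, name="", species=None, children=None):
        self.name = name
        self.species = species
        self.children = children or []

    def leaves(self):
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def species_of(node):
    return set(leaf.species for leaf in node.leaves())


def contains_species(node, species):
    # Assumption: every listed species must appear under the node.
    return set(species) <= species_of(node)


def contains(node, names):
    # Assumption: every listed leaf name must appear under the node.
    return set(names) <= set(leaf.name for leaf in node.leaves())


def number_of_species(node):
    return len(species_of(node))


def size(node):
    # Assumption: size is the number of leaves under the node.
    return len(node.leaves())


tree = Node(children=[
    Node(children=[Node("a", "human"), Node("b", "mouse")]),
    Node("c", "human"),
])
```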

command line tool

The command line tool is now usable, but with very basic options.
Some more options should be included: for instance, reading target trees and pattern expressions from files, reporting matches visually or as newick, reporting just the number of hits, etc.

review examples in README

I did not test the examples in the README after refactoring. I guess some will need small syntax changes.

treematcher tutorial

A small tutorial section on how to use the treematcher API is missing.
Use ete3/sdoc/tutorial/tutorial_ncbi* as example

title in documentation

Not sure whether "regular-expression-like queries" is still accurate. We basically use Python syntax now (good, because it gives us a lot of power, but technically we depart from the idea of a high-level syntax).

Functionality standards

Desired functionality
Accept or Reject a relation or property

Type of connection

  • connected with zero or more intermediate nodes.
  • connected with one or more intermediate nodes: implemented.
  • connected with zero or one intermediate nodes.
  • in different subtrees ( = connected with zero or one intermediate nodes or they are sister nodes).

Type of connection, secondary ( variations of first two).

  • connected with zero up to a maximum of N nodes.
  • connected with at least N nodes.
  • connected with at least N and at most M nodes.
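All of the connection types above can be read as bounds on the number of intermediate nodes between an ancestor and a descendant. A minimal sketch under that reading, using a stand-in Node class with a parent pointer. The names intermediate_nodes and connected are hypothetical:

```python
class Node(object):
    """Stand-in node that knows its parent through `up`."""
    def __init__(self, name, children=None):
        self.name = name
        self.up = None
        self.children = children or []
        for child in self.children:
            child.up = self


def intermediate_nodes(ancestor, descendant):
    """Count nodes strictly between the two, or None if not related."""
    count = 0
    node = descendant.up
    while node is not None:
        if node is ancestor:
            return count
        count += 1
        node = node.up
    return None


def connected(ancestor, descendant, min_nodes=0, max_nodes=None):
    """True if the number of intermediate nodes lies within the bounds."""
    n = intermediate_nodes(ancestor, descendant)
    if n is None or n < min_nodes:
        return False
    return max_nodes is None or n <= max_nodes


leaf = Node("leaf")
y = Node("y", [leaf])
x = Node("x", [y])
root = Node("root", [x])
```

Here "zero or more" is connected(a, b), "one or more" is connected(a, b, min_nodes=1), "zero or one" is connected(a, b, max_nodes=1), and the N and M variants set both bounds.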

Properties

  • is root
  • is leaf

Sets

  • meets "that" possible requirements.
  • not in that set
  • logical OR
  • logical AND
  • logical NOT

Reference

  • has the same attributes as the node it refers to.

only one event label per node

there is no need for a loop here:

for event in __pattern.evol_events:

and here:

for event in __pattern.evol_events:

The functions should receive a target node and return True if it is a duplication ('target.etype=="D"') or a speciation ('S'). Actually, the purpose of this was more for things like contains_duplications(_target) or, even better, duplications_bellow(_target), which should read from the cached content. Something like:

def duplications_under(node):
    events = get_cached_attr(node, 'etype')
    return events.count('D')

relaxed matches (regular expression like syntax)

some proposals for the relaxed matching syntax (using Perl symbols):

Find nodes where (a,b) is connected to (c,d) by one or more intermediate parent nodes:

( ((a,b))+, (c,d));

Find nodes where (a,b) and (c,d) exist in the same tree connected by any number of nodes:

( ((a,b))*, ((c,d))*);

Ideally, this should also allow us to combine with basic node searches. For instance, find nodes where (a,b) is connected to (c,d), by one or more duplication parent nodes:

( ((a,b))'+is_duplication(@) == True', (c,d));

Syntax looks ugly, I know, but once the functionality is in place we could think of a more readable method
