etetoolkit / treematcher
Search flexible patterns within tree structures using regular expression like syntax.
just as other ete3 modules. We use six
for portability.
Implement some cache functions for common operations (e.g. species content, duplications, etc.). For instance, if the pattern contains statements checking node size, you can precompute a dictionary like node2size (take a look at the current Tree.get_cached_content) that is used later. If many things are cached, a single dict could be shared, saving memory: node2cache = {node: [size, species, leaves]}. Alternatively, we can just keep a node2descendants dictionary, so that any value is calculated on the fly without traversing the tree back and forth many times.
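A minimal sketch of the shared-dictionary idea, assuming "size" means the number of leaves under a node. The Node class and build_node2cache are hypothetical stand-ins written for illustration; they are not part of ete3 or treematcher.

```python
class Node(object):
    """Hypothetical minimal tree node (a stand-in, not ete3.Tree)."""

    def __init__(self, name="", species=None, children=None):
        self.name = name
        self.species = species
        self.children = children or []

    def is_leaf(self):
        return not self.children


def build_node2cache(root):
    """Return {node: [size, species, leaves]} filled in one postorder pass."""
    node2cache = {}

    def visit(node):
        if node.is_leaf():
            node2cache[node] = [1, {node.species}, [node]]
            return
        size, species, leaves = 0, set(), []
        for child in node.children:
            visit(child)  # children are cached before their parent
            c_size, c_species, c_leaves = node2cache[child]
            size += c_size
            species |= c_species
            leaves.extend(c_leaves)
        node2cache[node] = [size, species, leaves]

    visit(root)
    return node2cache


# Toy tree: ((a1, b1), c1) with a species label on every leaf
a = Node("a1", "human")
b = Node("b1", "mouse")
c = Node("c1", "human")
inner = Node(children=[a, b])
root = Node(children=[inner, c])
cache = build_node2cache(root)
```

Each node is visited once, so later checks on size or species content become dictionary lookups instead of repeated traversals.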
The find_match function currently returns only the first match of the queried pattern.
Ideally this function would yield all the matches, with a parameter to limit the number of matches the user wants,
something like:
find_match(self, tree, local_vars, maxhits=1)
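One way this could look, sketched on a toy tree of nested tuples rather than on the real TreePattern class. The free function below, its is_match argument and the iter_nodes helper are assumptions made for illustration only.

```python
from itertools import islice


def iter_nodes(tree):
    """Preorder traversal of a nested-tuple toy tree; leaves are strings."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree:
            for node in iter_nodes(child):
                yield node


def find_match(tree, is_match, maxhits=None):
    """Lazily yield matching nodes; stop after maxhits (None means no limit)."""
    hits = (node for node in iter_nodes(tree) if is_match(node))
    return islice(hits, maxhits)


tree = (("A", "B"), ("C", ("A", "D")))


def is_a(node):
    return node == "A"


first = list(find_match(tree, is_a, maxhits=1))
all_hits = list(find_match(tree, is_a))
```

Because the result is an iterator, a caller who only wants the first hit pays only for the part of the tree visited up to that hit.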
Docstrings are intended for users, so there is no need to describe in detail how internal variables are used. Also, it is odd that function arguments have the form '__name'; I think that is not necessary.
It needs some review and examples in the refactored code
basic functionality of the treematcher module should be accessible from the API through an ete tool called ete-search which should be implemented as ete_search.py
Desired features:
While refactoring I found a few test cases that were returning True when the result was actually incorrect. Let's double-check all the unit tests little by little.
It is very important that we can trust them
Make sure code is python 2/3 compatible. ETE uses the module six for keeping compatible syntax
In line number 156 the code is like this
if li yield line
which doesn't make any sense!
Caching is now compulsory. This behavior may not be the best option for many small trees, and may be intractable, in terms of memory consumption, for very large trees.
Having a nice real-case example using the command line tool would round off the project very nicely. I am assigning this to milestone 1.1, as it is not as urgent as docs and features, but I think it is something Alisha could use for a tutorial-like wiki page.
It does not fail for all trees, but, for this one in particular...
from treematcher import TreePattern
from ete3 import Tree
t = Tree('((aaaaaaaaad:1,(aaaaaaaaae:1,aaaaaaaaaf:1)1:1)1:1,((aaaaaaaaag:1,aaaaaaaaah:1)1:1,((aaaaaaaaai:1,aaaaaaaaaj:1)1:1,(aaaaaaaaaa:1,(aaaaaaaaab:1,aaaaaaaaac:1)1:1)1:1)1:1)1:1);')
print(t)
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
pt = '((aaaaaaaaaa)*,(aaaaaaaaaj)*)@;'
tpt = TreePattern(pt)
print(next(tpt.find_match(t)))
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
pt = '((aaaaNOTaaaaaa)*,(aaaaaaNOTaaaj)*)@;'
tpt = TreePattern(pt)
print(next(tpt.find_match(t)))
"""
      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
  |      \-aaaaaaaaaf
--|
  |      /-aaaaaaaaag
  |   /-|
  |  |   \-aaaaaaaaah
  |  |
   \-|      /-aaaaaaaaai
     |   /-|
     |  |   \-aaaaaaaaaj
      \-|
        |   /-aaaaaaaaaa
         \-|
           |   /-aaaaaaaaab
            \-|
               \-aaaaaaaaac
"""
it can probably be reviewed and used for the tutorial. The README and this repo will disappear once merged with ete3
Add a few lines of documentation to the README file.
place all tests and examples in test_treematcher.py, which should use the Python unittest module, as other ete3 modules do.
progressively expand test cases
Some queries are very common. We should provide shortcut functions. Some candidates:
contains_species(@, [a, b, c, ...])
contains(@, [a, b, c, ...])
number_of_species(@)
is_duplication(@) # checks node.evol_type inferred with PhyloTree.get_descendant_evol_events()
is_speciation(@) # checks
size(@)
New function TreePattern.has_match(): returns True if the target tree has at least one match
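A possible shape for has_match, assuming find_match is (or becomes) a generator. TinyPattern is a hypothetical stand-in that matches leaf names in a flat list; it is not the real TreePattern.

```python
class TinyPattern(object):
    """Hypothetical stand-in for TreePattern, matching names in a flat list."""

    def __init__(self, name):
        self.name = name

    def find_match(self, leaves):
        # Generator: nothing is evaluated until a caller asks for a hit
        return (leaf for leaf in leaves if leaf == self.name)

    def has_match(self, leaves):
        # any() stops at the first hit, so the search ends as early as possible
        return any(True for _ in self.find_match(leaves))


found = TinyPattern("A").has_match(["B", "A", "C"])
missing = TinyPattern("Z").has_match(["B", "A", "C"])
```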
The command line tool is now usable, but with very basic options.
Some more options should be included. For instance, options for reading target trees and pattern expressions from files, reporting matches visually or as newick, reporting just the number of hits, etc.
We need to decide which strategy is best (or maybe keep both), and organize the code in treematcher accordingly. Pros? Cons?
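As a starting point for discussion, the options mentioned for the command line tool could be declared with argparse roughly as follows. Every flag name here is hypothetical and chosen for illustration; none of them is taken from the existing tool.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical options for a tree-search tool."""
    parser = argparse.ArgumentParser(
        prog="ete-search-sketch",
        description="Search tree patterns in target trees (illustrative only).")
    # The pattern comes either inline or from a file, never both
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-p", "--pattern",
                        help="pattern expression given on the command line")
    source.add_argument("--pattern-file",
                        help="file containing pattern expressions")
    parser.add_argument("-t", "--trees", nargs="+", required=True,
                        help="files containing target trees in newick format")
    parser.add_argument("--output", choices=["ascii", "newick", "count"],
                        default="newick",
                        help="report matches visually, as newick, or as a hit count")
    parser.add_argument("--maxhits", type=int, default=None,
                        help="stop after this many matches per tree")
    return parser


args = build_parser().parse_args(
    ["-p", "(A, B);", "-t", "trees.nw", "--output", "count"])
```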
p = Tree('(A+,B);')
t = Tree('((A,B), C);')
treematcher.find_matches(t, p)
p = Tree('(A+,B);')
t = Tree('((A,B), C);')
p.find_matches(t)
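If both call styles are kept, one option is to put the logic in the module-level function and make the method a thin wrapper. The sketch below uses a hypothetical Pattern class and a flat list of leaf names as the "tree", purely to show the delegation.

```python
def find_matches(tree, pattern):
    """Functional style: the search logic lives here (toy: name equality)."""
    return [leaf for leaf in tree if leaf == pattern.name]


class Pattern(object):
    """Hypothetical pattern object holding a single leaf name."""

    def __init__(self, name):
        self.name = name

    def find_matches(self, tree):
        """Method style: delegates to the module-level function."""
        return find_matches(tree, self)


p = Pattern("A")
t = ["A", "B", "A"]
by_function = find_matches(t, p)
by_method = p.find_matches(t)
```

With this layout there is a single implementation to test, and the choice between the two styles becomes a matter of API taste rather than of code organization.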
I did not test the examples in the README after refactoring. Some will need small changes in the syntax I guess.
A small tutorial section on how to use the treematcher API is missing.
Use ete3/sdoc/tutorial/tutorial_ncbi* as example
Not sure the "regular-expression like queries" description still holds. We basically use Python syntax now (good because it gives us a lot of power, but technically we depart from the idea of a high-level syntax).
Desired functionality
Accept or Reject a relation or property
Type of connection
Type of connection, secondary ( variations of first two).
Properties
Sets
Reference
the directories and file names should match ete3 structure:
ete3/test/test_treematcher.py
ete3/treematcher/__init__.py
ete3/treematcher/treematcher.py
ete3/sdoc/tutorial/tutorial_treematcher.rst
ete3/tools/ete_search.py
there is no need for a loop here:
Line 368 in 870cfbe
and here:
Line 390 in 870cfbe
The functions should receive a target node and return true if it is a duplication ('target.etype=="D"') or a speciation ('S'). Actually, the purpose of this was more for things like contains_duplications(_target) or, even better, duplications_bellow(_target), which should read from the cached content. Something like:
def duplications_under(node):
    events = get_cached_attr(node, 'etype')
    return events.count('D')
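To make the proposal concrete, here is a minimal sketch of what get_cached_attr could do. It assumes the helper returns the attribute values of the node itself plus all its descendants, and it uses a hypothetical Node class and a module-level dict as the cache; the real cached content may be organized differently.

```python
class Node(object):
    """Hypothetical node: etype is 'D', 'S', or None for leaves."""

    def __init__(self, etype=None, children=None):
        self.etype = etype
        self.children = children or []


_CACHE = {}  # (node, attribute name) -> list of values in that subtree


def get_cached_attr(node, attr):
    """Return attr for node and every descendant, computed once per node."""
    key = (node, attr)
    if key not in _CACHE:
        values = [getattr(node, attr, None)]
        for child in node.children:
            values.extend(get_cached_attr(child, attr))
        _CACHE[key] = values
    return _CACHE[key]


def duplications_under(node):
    events = get_cached_attr(node, 'etype')
    return events.count('D')


dup = Node('D', [Node(), Node()])
root = Node('S', [dup, Node('D', [Node(), Node()])])
n_root = duplications_under(root)
n_dup = duplications_under(dup)
```

Note that under this assumption a duplication node counts itself; excluding the node itself would only require skipping the first element of the list.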
some proposals for the relaxed matching syntax (using Perl symbols):
Find nodes where (a,b) is connected to (c,d) by one or more intermediate parent nodes:
( ((a,b))+, (c,d));
Find nodes where (a,b) and (c,d) exist in the same tree connected by any number of nodes:
( ((a,b))*, ((c,d))*);
Ideally, this should also allow us to combine with basic node searches. For instance, find nodes where (a,b) is connected to (c,d), by one or more duplication parent nodes:
( ((a,b))'+is_duplication(@) == True', (c,d));
Syntax looks ugly, I know, but once the functionality is in place we could think of a more readable method