This is a research collaboration between Luna (Ecology Evolutionary Biology Lab, UTK) and Moon (Computational Biology Lab, UTK).
The goal is a database for managing and retrieving published phylogenetic trees. To this end, the work proceeds in several steps:
- Collect candidate publications (PDF) that contain phylogenetic tree figures.
- Identify all tree figures, crop them out of the original paper, and convert them into images in a unified format (PNG).
- Convert every tree image into a text encoding (Newick).
- Organize all of the above information into a database.
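
To make the target encoding concrete, here is a minimal sketch of reading a Newick string back into a tree structure. It assumes plain labels and optional branch lengths only (no quoted labels or comments); the function names `parse_newick` and `leaf_names` are illustrative and not part of the project code.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Handles labels and branch lengths only; quoted labels and
    bracketed comments are out of scope for this sketch.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional label, followed by an optional ':length'.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Return the leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
print(leaf_names(tree))
```

A parser like this is also handy later for checking that a predicted string is well-formed Newick before it goes into the database.
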
- synthesize data
- design model (attention mechanism with simple filters)
- test accuracy on synthesized data
(0. label data manually)
- test accuracy on real data
- analyze error case by case
- add variations to the model
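
As a rough illustration of the "synthesize data" step, the sketch below generates random binary tree topologies as Newick strings, which could then serve as labels for rendered training images. The joining strategy, the placeholder species names, and the function name `random_newick` are assumptions for illustration; rendering the trees to pictures is not covered here.

```python
import random


def random_newick(leaves, rng):
    """Build a random binary tree over `leaves` and return it as Newick."""
    nodes = list(leaves)
    if not nodes:
        raise ValueError("need at least one leaf")
    while len(nodes) > 1:
        # Join two randomly chosen subtrees under a new internal node.
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append(f"({a},{b})")
    return nodes[0] + ";"


rng = random.Random(0)  # fixed seed so the synthetic set is reproducible
species = ["sp1", "sp2", "sp3", "sp4", "sp5"]  # placeholder names
sample = random_newick(species, rng)
print(sample)
```

Because every label is generated, accuracy on synthesized data can be measured exactly before moving on to manually labeled real figures.
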
Prediction accuracy on unlabeled data also needs to be estimated, so that problematic pictures can be picked out and the database corrected manually.
- find the features that prediction performance is sensitive to (for example, when predicting a person's age, it is easier if the person is female rather than male; likewise, it is easier to distinguish Chinese from British nationality than Chinese from Japanese)
- design models specific to the above features
- softmax and cross-entropy may be helpful for evaluating prediction confidence
- comparing the unsupervised clustering results of the raw pictures with those of the extracted code or generated pictures may be helpful
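
One possible reading of the softmax and cross-entropy idea, sketched below: take the decoder's per-step logits, apply softmax, and average the negative log-probability of the tokens the model itself chose. Since unlabeled pictures have no ground truth, this treats the prediction as its own target, which is an assumption of this sketch. A high score flags a picture for manual review. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits):
    """Return (chosen token ids, mean negative log-probability).

    Lower scores mean the model was more certain about its own output.
    """
    chosen = []
    nll = 0.0
    for logits in step_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        chosen.append(best)
        nll -= math.log(probs[best])
    return chosen, nll / len(step_logits)


# Toy logits for two decoding steps over a three-token vocabulary.
confident = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
uncertain = [[1.0, 0.9, 0.8], [0.5, 0.6, 0.4]]

tokens_c, score_c = sequence_confidence(confident)
tokens_u, score_u = sequence_confidence(uncertain)
print(tokens_c, score_c)
print(tokens_u, score_u)
```

Ranking unlabeled pictures by this score gives a simple queue for manual correction; whether the score tracks real errors would have to be checked on the labeled set first.
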
The deep learning method for this project is mainly based on the image captioning architecture, a hybrid of a CNN (convolutional neural network) for image feature extraction and an RNN (recurrent neural network) for generating language, which here is Newick code.
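
A captioning-style decoder emits one token at a time, so the Newick string has to be turned into a token sequence before training. The sketch below shows one way to do that, assuming structural symbols and leaf labels each become a single token; the `<start>` and `<end>` markers and the function names are hypothetical choices, not the project's actual vocabulary.

```python
import re

STRUCTURAL = set("(),;")
# Either one structural symbol or a run of non-structural characters.
TOKEN_RE = re.compile(r"[(),;]|[^(),;]+")


def tokenize_newick(newick):
    """Split a Newick string into decoder tokens with start/end markers."""
    tokens = ["<start>"]
    for piece in TOKEN_RE.findall(newick.strip()):
        if piece not in STRUCTURAL:
            piece = piece.strip()
            if not piece:
                continue  # skip whitespace-only runs
        tokens.append(piece)
    tokens.append("<end>")
    return tokens


def build_vocab(sequences):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


tokens = tokenize_newick("((A,B),C);")
vocab = build_vocab([tokens])
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

The integer ids are what the RNN would be trained to predict, conditioned on the CNN image features.
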
(0. same labeled data)
- list of candidate papers
- number of phylogenetic tree figures in each candidate paper
- coordinates and size of each figure (x, y, h, w)
- list of extracted images
- preprocess data: convert to grayscale and extract species names with 100% accuracy
- build architecture
- design cost function
- train and fine-tune
- generate standard pictures from the predicted Newick code
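
The bookkeeping items listed above (candidate papers, figure counts, figure coordinates, extracted images) and the final Newick code could be stored roughly as follows. This is a minimal sketch using SQLite; the table names, column names, and file paths are assumptions for illustration only, and the number of figures per paper is derived with a query instead of being stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(
    """
    CREATE TABLE paper (
        id       INTEGER PRIMARY KEY,
        pdf_path TEXT NOT NULL
    );
    CREATE TABLE figure (
        id       INTEGER PRIMARY KEY,
        paper_id INTEGER NOT NULL REFERENCES paper(id),
        x INTEGER, y INTEGER, h INTEGER, w INTEGER,  -- crop box in the paper
        png_path TEXT,   -- extracted image in the unified PNG format
        newick   TEXT    -- predicted or manually corrected Newick code
    );
    """
)

# Hypothetical example rows.
paper_id = conn.execute(
    "INSERT INTO paper (pdf_path) VALUES (?)", ("papers/example.pdf",)
).lastrowid
conn.executemany(
    "INSERT INTO figure (paper_id, x, y, h, w, png_path, newick) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (paper_id, 40, 120, 300, 450, "figures/example_1.png", "((A,B),C);"),
        (paper_id, 60, 500, 280, 450, "figures/example_2.png", None),
    ],
)
conn.commit()

# Number of phylogenetic tree figures in each candidate paper.
counts = conn.execute(
    "SELECT p.pdf_path, COUNT(f.id) FROM paper p "
    "LEFT JOIN figure f ON f.paper_id = p.id GROUP BY p.id"
).fetchall()
print(counts)
```

Leaving `newick` empty until a prediction exists makes it easy to query for figures that still need processing or manual correction.
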