Pic2NewickTree

This is a research collaboration between Luna (Ecology Evolutionary Biology Lab, UTK) and Moon (Computational Biology Lab, UTK).

Final product

A database for managing and retrieving published phylogenetic trees. To this end, the following steps are planned:

Collect candidate publications (PDF) that contain phylogenetic tree figures.
Identify all figures, crop them out of the original paper, and convert them into pictures in a unified format (PNG).
Convert every tree image into a text encoding (Newick).
Organize all of the above information into a database.
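
For readers unfamiliar with the target encoding, Newick writes a tree as nested parentheses terminated by a semicolon. The sketch below serializes a small tree to that form. The nested-tuple representation and the function name `to_newick` are illustrative assumptions, not part of the project.

```python
def to_newick(node):
    """Serialize a tree to Newick, without the trailing semicolon.

    Assumed representation: a leaf is a string (the taxon name),
    an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


tree = (("Homo", "Pan"), "Gorilla")
newick = to_newick(tree) + ";"
print(newick)  # ((Homo,Pan),Gorilla);
```

Branch lengths and internal node labels, which Newick also supports, are left out to keep the example short.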

Approach A: Manually engineered steps for pictures with standard presentations

design phase

synthesize data
design model (attention mechanism with simple filters)
test accuracy on synthesized data
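
One way to approach the "synthesize data" step is to generate random tree topologies whose Newick label is known by construction, then render each one as a picture. The sketch below covers only the label generation; rendering is out of scope. The helper names (`random_tree`, `to_newick`), the taxon names, and the choice of strictly binary trees are assumptions made for illustration.

```python
import random


def random_tree(taxa, rng):
    """Build a random binary topology by repeatedly joining two random subtrees."""
    nodes = list(taxa)
    while len(nodes) > 1:
        left = nodes.pop(rng.randrange(len(nodes)))
        right = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((left, right))
    return nodes[0]


def to_newick(node):
    """Serialize nested tuples of taxon names to Newick (no trailing semicolon)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


rng = random.Random(0)  # fixed seed so the synthetic set is reproducible
taxa = ["sp%d" % i for i in range(1, 6)]  # hypothetical taxon names
label = to_newick(random_tree(taxa, rng)) + ";"
print(label)
```

Because the label is produced before any picture exists, accuracy on synthesized data can be measured without manual annotation.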

test phase

(0. label data manually)

test accuracy on real data
analyze error case by case
add variations to the model
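
Testing accuracy needs a rule for deciding when a predicted Newick string matches the labeled one, since the same tree can be written with its children in any order. The source does not specify a metric. The sketch below assumes exact agreement of rooted clade sets, and handles only Newick strings with leaf names (no branch lengths or internal labels). All function names are hypothetical.

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return parse_node()


def collect_clades(node, found):
    """Add the leaf set under every internal node to `found`; return this node's leaf set."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(leaves)
    return leaves


def clade_set(newick):
    found = set()
    collect_clades(parse_newick(newick), found)
    return found


def same_topology(predicted, truth):
    """True when both trees contain exactly the same rooted clades."""
    return clade_set(predicted) == clade_set(truth)


print(same_topology("((A,B),C);", "(C,(B,A));"))  # True: same clades, different order
print(same_topology("((A,B),C);", "((A,C),B);"))  # False
```

The size of the symmetric difference between the two clade sets would give a graded error instead of a yes/no answer, which may be more useful for the case-by-case error analysis.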

evaluate prediction confidence

An estimate of prediction accuracy on unlabeled data is needed so that we can pick out the problematic pictures and correct those database entries manually.

identify the features that prediction performance is sensitive to (for example, when predicting a person's age, it is easier if the person is female rather than male; likewise, it is easier to tell Chinese from British nationality than Chinese from Japanese)
design models specific to the above features
softmax and cross-entropy may be helpful for evaluating prediction confidence
comparing the unsupervised clustering results of the raw pictures with those of the extracted code or the generated pictures may also be helpful
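
As a sketch of how softmax and cross-entropy could yield a confidence score, assume the model emits a vector of raw scores (logits) for each output token. The mean log-probability of the chosen tokens is then the negative of the cross-entropy between the model's distribution and its own prediction. The function names and the toy logits below are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits, chosen):
    """Mean log-probability of the chosen token at each step.

    Values near 0 mean the model was confident; more negative means less so.
    """
    log_probs = [
        math.log(softmax(logits)[idx])
        for logits, idx in zip(step_logits, chosen)
    ]
    return sum(log_probs) / len(log_probs)


# Toy example: two decoding steps over a three-token vocabulary.
confident = sequence_confidence([[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]], [0, 1])
uncertain = sequence_confidence([[1.0, 0.9, 0.8], [0.5, 0.6, 0.4]], [0, 1])
print(confident > uncertain)  # True
```

Pictures whose score falls below a chosen threshold could be queued for manual review. The threshold itself would have to be calibrated on labeled data.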

Approach B: Deep learning model with standard presentations

The deep learning method for this project is based mainly on the image captioning architecture, a hybrid of a CNN (convolutional neural network) for image feature extraction and an RNN (recurrent neural network) for generating language, which here is Newick code.
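
In a captioning-style setup the RNN decoder emits one token at a time, so the Newick string has to be split into a token sequence first. The sketch below shows one possible tokenization and vocabulary. The token pattern, the `<start>`/`<end>` markers, and the choice to treat each taxon name as a single token are assumptions. A real pipeline might replace the names with placeholders, since species names are extracted separately during preprocessing.

```python
import re

# Structural symbols are single tokens; any other run of characters is a taxon name.
TOKEN_PATTERN = re.compile(r"[(),;]|[^(),;\s]+")


def tokenize(newick):
    """Split a Newick string into structural symbols and taxon names."""
    return TOKEN_PATTERN.findall(newick)


def build_vocab(sequences):
    """Map each distinct token to an integer id, reserving ids for start/end markers."""
    vocab = {"<start>": 0, "<end>": 1}
    for tokens in sequences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


tokens = tokenize("((Homo,Pan),Gorilla);")
vocab = build_vocab([tokens])
ids = [vocab["<start>"]] + [vocab[t] for t in tokens] + [vocab["<end>"]]
print(tokens)
print(ids)
```

The integer ids are what the decoder would be trained to predict, one per time step, conditioned on the CNN's image features.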

acquire data

(0. same labeled data)

list of candidate papers
number of phylogenetic tree figures in each candidate paper
coordinates and size of each figure (x, y, h, w)
list of extracted images
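
The items above could be stored as one record per figure. The sketch below is a minimal illustration: the field names `paper` and `page`, and the reading of (x, y) as the top-left corner in pixels, are assumptions not stated in the source.

```python
from dataclasses import dataclass


@dataclass
class FigureRecord:
    paper: str  # identifier of the candidate paper (hypothetical field)
    page: int   # page index within the PDF (hypothetical field)
    x: int      # left edge of the figure, assumed to be in pixels
    y: int      # top edge of the figure, assumed to be in pixels
    h: int      # figure height
    w: int      # figure width

    def crop_box(self):
        """Return (left, upper, right, lower), the box order used by Pillow's Image.crop."""
        return (self.x, self.y, self.x + self.w, self.y + self.h)


record = FigureRecord(paper="example.pdf", page=3, x=40, y=120, h=300, w=500)
print(record.crop_box())  # (40, 120, 540, 420)
```

Keeping the coordinates alongside the extracted image makes it possible to re-crop from the source page if the extraction settings change.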

model construction

preprocess data: convert to grayscale and extract species names with 100% accuracy
2.1 build architecture
2.2 design cost function
train and fine tune
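
For the grayscale part of preprocessing, a common choice is a weighted sum of the color channels. The sketch below uses the ITU-R BT.601 luma weights on a plain nested list. The function name and the list-of-tuples image representation are assumptions chosen to keep the example dependency-free; in practice an imaging library would do this step.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of luminance values in 0..255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


# A 2x2 toy image: white, black, pure red, pure blue.
image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(image)
print(gray)  # [[255, 0], [76, 29]]
```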

presentation

generate standard pictures with predicted newick code
