benkaehler / short-read-tax-assignment

This project forked from caporaso-lab/tax-credit

A repository for storing code and data related to a systematic comparison of short read taxonomy assignment tools

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 99.41% Python 0.59%

short-read-tax-assignment's People

Forkers

nbokulich

short-read-tax-assignment's Issues

Assess the feasibility of trying all the scikit-learn classifiers

Nick suggests a tiered approach. Ben will survey how many classifiers there are and whether trying all of them is feasible.

Do we need to be more careful about prior (or marginal) distributions?

I am concerned about classifier training priors. This came up in my conversation with Steve.

When we are training our classifier, either implicitly or explicitly we bias our classifier to be more likely to predict a taxon according to some prior distribution. This is natural and desirable in some machine learning contexts: if a classifier is uncertain about a prediction then it is going to do better if it guesses according to the unconditional distribution of classes. Some classifiers set the priors to be uniform, some set them according to the distribution of classes in the training sample.

In the normal machine learning context this is uncontroversial because we usually train and validate our classifier on samples that are presumably drawn from the same population.

In our case things are different. We are training our classifier on a reference set, then testing it on samples that in some cases have contrived and unnatural distributions of classes.

So in my mind there are two questions:

Is our training prior appropriate? Does the reference sample represent some sort of global prior that we would expect for taxa, or is the distribution of classes in the reference data set an artefact of the historical forces by which the reference set has been accumulated?
Are our test sample distributions appropriate for benchmarks? Should we be tuning our classifiers to do well on data sets with distributions of taxa that are unrealistic?
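
To make the concern concrete, here is a minimal sketch of how the choice of prior can flip a prediction when the evidence from the read itself is weak. It applies Bayes' rule directly; the taxon names, likelihoods, and priors are invented for illustration and do not come from any reference set.

```python
def posterior(likelihoods, priors):
    """Return P(taxon | read) from P(read | taxon) and P(taxon) via Bayes' rule."""
    unnormalised = {taxon: likelihoods[taxon] * priors[taxon] for taxon in likelihoods}
    total = sum(unnormalised.values())
    return {taxon: value / total for taxon, value in unnormalised.items()}


# Hypothetical likelihoods: the read is slightly better explained by taxon_b.
likelihoods = {"taxon_a": 0.45, "taxon_b": 0.55}

# Hypothetical priors: the reference set is dominated by taxon_a.
reference_prior = {"taxon_a": 0.9, "taxon_b": 0.1}
uniform_prior = {"taxon_a": 0.5, "taxon_b": 0.5}

with_reference = posterior(likelihoods, reference_prior)
with_uniform = posterior(likelihoods, uniform_prior)

best_with_reference = max(with_reference, key=with_reference.get)
best_with_uniform = max(with_uniform, key=with_uniform.get)

print("reference prior ->", best_with_reference, with_reference)
print("uniform prior   ->", best_with_uniform, with_uniform)
```

With the reference-derived prior the ambiguous read goes to the over-represented taxon, and with the uniform prior it goes to the taxon with the higher likelihood. In scikit-learn's naive Bayes classifiers this choice is exposed through the `fit_prior` and `class_prior` parameters.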

Migrate the Jupyter Notebooks to Python 3

short-read-tax-assignment is a mouthful

@GavinHuttley has suggested that we call it the SHARK.

@gregcaporaso seems naturally apprehensive.

How about the following backronym?

SHort read tAxonomy Rating for Classifiers?

So it would be SHARC. Close enough.

Add evaluation for the naïve Bayes classifier from BenKaehler/q2-feature-classifier

Implement test for classification of "novel" taxa

How do taxonomy classifiers perform when they encounter a query sequence that is not represented in the reference database? To what degree do they "overclassify"?

For this test, "novel taxa" are query sequences randomly drawn from a reference database (source). Taxonomy assignment is then performed against a modified reference database (ref), which consists of the source minus the novel taxa and all sequences with matching taxonomy annotations at the taxonomic level (L) being tested (species, genus, family, etc.). Taxa that do not have near neighbors at level L (e.g., other species in the same genus) are also removed from the ref.

Following this method:
Match: assignment == L - 1 (e.g., a novel species is assigned to the correct genus)
Overclassification: assignment == L (e.g., correct genus, but assigned to a near neighbor)
Misclassification: incorrect assignment at L - 1 (e.g., wrong genus-level assignment)

One question: is it worth also defining underclassification, i.e., assignment < L - 1 (e.g., correct family but no genus)? My gut feeling is NO, since this will complicate matters and we are left asking at which level L this becomes irrelevant (e.g., if species X is assigned to the correct phylum but the wrong class, is this still underclassification, and is that important?). I also question whether, unlike overclassification, this would yield a meaningful interpretation.
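
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: taxonomies are represented as tuples of rank labels, L is a 1-based depth into that tuple, the helper names `build_reference` and `score_assignment` are hypothetical, and the toy taxa are invented. The near-neighbor pruning step is not shown. An "underclassification" label is returned only so that the function covers every case; whether to report it is the open question raised above.

```python
def build_reference(source, query_ids, level):
    """Drop every source sequence that shares a level-`level` taxonomy with a query."""
    novel = {source[query_id][:level] for query_id in query_ids}
    return {seq_id: tax for seq_id, tax in source.items() if tax[:level] not in novel}


def score_assignment(assigned, true_taxonomy, level):
    """Score one assignment for a query whose taxon at `level` is absent from the ref."""
    expected = true_taxonomy[:level - 1]  # best achievable answer, e.g. the genus
    shared = min(len(assigned), len(expected))
    if assigned[:shared] != expected[:shared]:
        return "misclassification"    # wrong somewhere at or above L - 1
    if len(assigned) < len(expected):
        return "underclassification"  # correct so far, but stops short of L - 1
    if len(assigned) == len(expected):
        return "match"                # exactly the correct L - 1 assignment
    return "overclassification"       # correct to L - 1, then names a near neighbor


# Toy source database with three ranks (family, genus, species); L = 3 is species.
source = {
    "s1": ("f__X", "g__A", "s__a1"),
    "s2": ("f__X", "g__A", "s__a1"),
    "s3": ("f__X", "g__A", "s__a2"),
    "s4": ("f__X", "g__B", "s__b1"),
}
level = 3
ref = build_reference(source, ["s1"], level)
truth = source["s1"]

results = {
    "genus only": score_assignment(("f__X", "g__A"), truth, level),
    "near neighbor": score_assignment(("f__X", "g__A", "s__a2"), truth, level),
    "wrong genus": score_assignment(("f__X", "g__B"), truth, level),
    "family only": score_assignment(("f__X",), truth, level),
}
print(sorted(ref), results)
```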

Select and evaluate several alternative scikit-learn classifiers

scikit-learn offers several different classifiers in different categories.

A reasonable overview of what's available and appropriate is here.

I would suggest that we try at least BernoulliNB, SVCs, and at least one of Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression, or MLPClassifier.

Should also revisit feature selection. SelectPercentile doesn't seem to help for MultinomialNB, but it may work elsewhere and other feature selection techniques exist.

Also try TfidfTransformer for feature extraction.

All of this should be possible using the general fit-classifier method in feature-classifier, but may require some fixes to the code.
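
A rough sketch of what such a comparison could look like with plain scikit-learn is given below. It is not the feature-classifier code: the helper `make_kmer_pipeline`, the k-mer size, the percentile, and the toy sequences and labels are all assumptions chosen for illustration. It chains k-mer counting, `TfidfTransformer`, `SelectPercentile`, and an interchangeable classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def make_kmer_pipeline(classifier, k=4, percentile=50):
    """Build a k-mer -> tf-idf -> feature selection -> classifier pipeline."""
    return Pipeline([
        # Count overlapping k-mers by treating each sequence as a string of characters.
        ("kmers", CountVectorizer(analyzer="char", ngram_range=(k, k), lowercase=False)),
        ("tfidf", TfidfTransformer()),
        # Keep the top `percentile` percent of k-mers by chi-squared score.
        ("select", SelectPercentile(chi2, percentile=percentile)),
        ("clf", classifier),
    ])


# Toy training data (hypothetical sequences and genus labels).
sequences = [
    "ACGTACGTACGTAAAC",
    "ACGTACGTACGTAAAG",
    "TTGGCCAATTGGCCAA",
    "TTGGCCAATTGGCCAT",
]
labels = ["g__A", "g__A", "g__B", "g__B"]

candidates = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
}

predictions = {}
for name, classifier in candidates.items():
    pipeline = make_kmer_pipeline(classifier)
    pipeline.fit(sequences, labels)
    predictions[name] = list(pipeline.predict(sequences))
    print(name, predictions[name])
```

Predicting on the training sequences only shows that each pipeline runs end to end. A real comparison would need held-out data, and setting `percentile=100` disables the feature selection step for classifiers where it does not help.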
