
benkaehler / short-read-tax-assignment


This project is forked from caporaso-lab/tax-credit

Size: 3.36 GB

A repository for storing code and data related to a systematic comparison of short-read taxonomy assignment tools

License: BSD 3-Clause "New" or "Revised" License

Languages: Jupyter Notebook 99.41%, Python 0.59%

short-read-tax-assignment's People

Contributors

benkaehler, ebolyen, gregcaporaso, jairideout, nbokulich, zellett

Forkers

nbokulich

short-read-tax-assignment's Issues

Do we need to be more careful about prior (or marginal) distributions?

I am concerned about classifier training priors. This came up in my conversation with Steve.

When we train our classifier, we bias it, either implicitly or explicitly, toward predicting taxa according to some prior distribution. This is natural and desirable in some machine learning contexts: if a classifier is uncertain about a prediction, it will do better by guessing according to the unconditional distribution of classes. Some classifiers set the priors to be uniform; others set them according to the distribution of classes in the training sample, as in the sketch below.
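For concreteness, a minimal sketch of where this choice surfaces in scikit-learn, using MultinomialNB (the explicit prior values below are invented for illustration):

```python
# How scikit-learn's MultinomialNB exposes the training prior.
from sklearn.naive_bayes import MultinomialNB

# Default: learn the prior from class frequencies in the training sample.
nb_empirical = MultinomialNB(fit_prior=True)

# Alternative: force a uniform prior over classes.
nb_uniform = MultinomialNB(fit_prior=False)

# Alternative: impose an explicit prior, e.g. expected taxon frequencies.
# These weights are invented for illustration; supply one value per class.
nb_explicit = MultinomialNB(class_prior=[0.5, 0.3, 0.2])
```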

In the normal machine learning context this is uncontroversial because we usually train and validate our classifier on samples that are presumably drawn from the same population.

In our case things are different. We are training our classifier on a reference set, then testing it on samples that in some cases have contrived and unnatural distributions of classes.

So in my mind there are two questions:

  1. Is our training prior appropriate? Does the reference sample represent some sort of global prior that we would expect for taxa, or is the distribution of classes in the reference data set an artefact of the historical forces by which the reference set has been accumulated?
  2. Are our test sample distributions appropriate for benchmarks? Should we be tuning our classifiers to do well on data sets with distributions of taxa that are unrealistic?

Implement test for classification of "novel" taxa

How do taxonomy classifiers perform when they encounter a query sequence that is not represented in the reference database? To what degree do they "overclassify"?

For this test, "novel taxa" consist of query sequences randomly drawn from a reference database (the source). Taxonomy assignment is then performed using a modified reference database (the ref), which consists of the source minus the novel taxa AND all sequences with matching taxonomy annotations at the taxonomic level (L) being tested (species, genus, family, etc.). Any taxa that lack near neighbors at level L (e.g., other species in the same genus) are also removed from the ref, as in the sketch below.
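A minimal sketch of this pruning step, assuming taxonomies are stored as lists of ranks keyed by sequence ID (the function and variable names are illustrative, not the tax-credit code):

```python
import random

def make_novel_taxa_refs(source, level, n_novel=100):
    """Split a source database into novel query IDs and a pruned ref.

    source: dict mapping sequence ID -> taxonomy as a list of ranks,
            e.g. ['k__Bacteria', ..., 'g__Bacillus', 's__subtilis'].
    level:  index of the rank being tested (e.g. 6 for species); must be >= 1.
    """
    # Draw the "novel" taxa at random from the source.
    novel_ids = random.sample(list(source), n_novel)
    novel_taxa = {source[i][level] for i in novel_ids}

    # Remove the novel sequences AND every sequence whose annotation
    # matches a novel taxon at level L.
    ref = {i: t for i, t in source.items()
           if i not in novel_ids and t[level] not in novel_taxa}

    # Drop ref taxa with no near neighbor at level L, i.e. no other
    # level-L taxon sharing the same parent rank (e.g. no other
    # species in the same genus).
    children = {}
    for t in ref.values():
        children.setdefault(t[level - 1], set()).add(t[level])
    ref = {i: t for i, t in ref.items() if len(children[t[level - 1]]) > 1}

    return novel_ids, ref
```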

Following this method:

  • Match: assignment == L - 1 (e.g., a novel species is assigned the correct genus)
  • Overclassification: assignment == L (e.g., correct genus, but assigned to a near neighbor at the species level)
  • Misclassification: incorrect assignment at L - 1 (e.g., wrong genus-level assignment)
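A sketch of how a single assignment might be scored under these definitions, reusing the rank-list representation above (illustrative names, not the tax-credit API):

```python
def score_novel_assignment(expected, observed, level):
    """Classify one assignment of a novel taxon at rank index `level`.

    expected: the true taxonomy of the novel query (a list of ranks).
    observed: the taxonomy the classifier assigned (a list of ranks).
    """
    if len(observed) > level and observed[level]:
        # A taxon was assigned at level L, which cannot be correct by
        # design: the query's level-L taxon is absent from the ref.
        return 'overclassification'
    if observed[:level] == expected[:level]:
        # Correct through L - 1, nothing assigned at L: the desired outcome.
        return 'match'
    # Note: an assignment that stops short of L - 1 (underclassification,
    # if we chose to define it) would also land here.
    return 'misclassification'
```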

One question: is it worth also defining underclassification, i.e., assignment < L (e.g., correct family but no genus)? My gut feeling is NO, since this will complicate matters and leave us asking at which level L it becomes irrelevant (e.g., if species X is assigned to the correct phylum but the wrong class, is this still underclassification, and does it matter?). Unlike overclassification, I also doubt that this would yield a meaningful interpretation.

Select and evaluate several alternative scikit-learn classifiers

scikit-learn offers many classifiers across several categories.

A reasonable overview of what's available and appropriate is here.

I would suggest we try at least BernoulliNB and SVCs, plus at least one of Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression, or MLPClassifier; a comparison sketch follows.
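A minimal sketch of such a comparison in plain scikit-learn, using k-mer count features. The sequences, labels, and vectorizer settings are placeholders, not the tax-credit pipeline, and LinearSVC stands in for "SVCs":

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training data: reference sequences and taxonomy labels.
seqs = ['ACGTACGTAA', 'TTGGCCAATT', 'ACGTTTGGCA', 'TTGGCCAGTT'] * 5
labels = ['g__Bacillus', 'g__Clostridium',
          'g__Bacillus', 'g__Clostridium'] * 5

candidates = {
    'BernoulliNB': BernoulliNB(),
    'LinearSVC': LinearSVC(),
    'RandomForest': RandomForestClassifier(n_estimators=100),
    'MLP': MLPClassifier(max_iter=500),
}

for name, clf in candidates.items():
    # 8-mer character features are a common choice for 16S classifiers.
    pipe = make_pipeline(
        CountVectorizer(analyzer='char', ngram_range=(8, 8)), clf)
    scores = cross_val_score(pipe, seqs, labels, cv=3)
    print(f'{name}: {scores.mean():.3f}')
```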

We should also revisit feature selection. SelectPercentile doesn't seem to help for MultinomialNB, but it may work elsewhere, and other feature selection techniques exist.

Also try TfidfTransformer for feature extraction, as in the sketch below.
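A sketch of how SelectPercentile and TfidfTransformer could slot into the same kind of pipeline (placeholder settings again; chi2 and the 50th percentile are arbitrary choices for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    CountVectorizer(analyzer='char', ngram_range=(8, 8)),  # k-mer counts
    TfidfTransformer(),                     # reweight counts by tf-idf
    SelectPercentile(chi2, percentile=50),  # keep the top half of features
    MultinomialNB(),
)
# Usage: pipe.fit(seqs, labels); pipe.predict(new_seqs)
```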

All of this should be possible using the general fit-classifier method in feature-classifier, but may require some fixes to the code.
