caporaso-lab / tax-credit

A repository for storing code and data related to a systematic comparison of short read taxonomy assignment tools

License: BSD 3-Clause "New" or "Revised" License


tax-credit's Introduction

TAX CREdiT: TAXonomic ClassifieR Evaluation Tool


A standardized and extensible evaluation framework for taxonomic classifiers

To view static versions of the reports, start here.

Environment

This repository contains Python 3 code and Jupyter notebooks, but some taxonomy assignment methods (e.g., QIIME 1 legacy methods) may require different Python or software versions. Hence, we use parallel conda environments to support comparison of myriad methods in a single framework.

The first step is to install conda and then install QIIME 2, following the instructions provided here.

An example of how to load different environments to support other methods can be seen in the QIIME 1 taxonomy assignment notebook.

Setup and install

The library code and IPython Notebooks are then installed as follows:

git clone https://github.com/gregcaporaso/tax-credit.git
cd tax-credit/
pip install .

Finally, download and unzip the reference databases:

wget https://unite.ut.ee/sh_files/sh_qiime_release_20.11.2016.zip
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
unzip sh_qiime_release_20.11.2016.zip
tar -xzf gg_13_8_otus.tar.gz

Equipment

The analyses included here can all be run on a standard, modern laptop, provided you don't mind waiting a few hours on the most memory-intensive step (taxonomy classification of millions of sequences). With the exception of the q2-feature-classifier naive-bayes* classifier sweeps, which were run on a high-performance cluster, all analyses presented in tax-credit were run in a single day using a MacBook Pro with the following specifications:

  • OS: OS X 10.11.6 "El Capitan"
  • Processor: 2.3 GHz Intel Core i7
  • Memory: 8 GB 1600 MHz DDR3

If you intend to perform extensive parameter sweeps on a classifier (e.g., several hundred or more parameter combinations), you may want to consider running these analyses using cluster resources, if available.

Using the Jupyter Notebooks included in this repository

To view and interact with the Jupyter notebooks, change into the tax-credit/ipynb directory and start Jupyter from the terminal with the command:

jupyter notebook index.ipynb

The notebooks menu should open in your browser. From the main index, you can follow the menus to browse different analyses, or use File --> Open from the notebook toolbar to access the full file tree.

Citing

A publication is on its way! For now, if you use any of the data or code included in this repository, please cite https://github.com/caporaso-lab/tax-credit

tax-credit's People

Contributors

benkaehler, ebolyen, gregcaporaso, jairideout, nbokulich, zellett


tax-credit's Issues

generate_taxa_compare_table.py ignores usearch output

Working on adding evaluations for a usearch-based taxonomy assigner, but generate_taxa_compare_table.py is ignoring the files. This looks like an easy fix for this specific case, but we'll need to generalize to support arbitrary taxonomic assignment methods.

multiple_assign_taxonomy.py RTAX regex options

multiple_assign_taxonomy.py needs RTAX regex flexibility similar to assign_taxonomy.py to accommodate sequences from different Illumina platforms/software... not sure which, but there are two main variations.

VAR 1 (compatible with RTAX default):
HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252/1

VAR 2 (incompatible without header regex setting):
HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252 1:N:0:0

The issue is that the header/read_1_id in VAR 1 is captured with the default regex, (\S+)\s(\S+?)/, whereas VAR 2 requires the regex to be altered to (\S+)\s(\S+?)\s

assign_taxonomy.py uses the options --amplicon_id_regex and --header_id_regex to set these manually. Either these should be added to multiple_assign_taxonomy.py, or else the default should be flexible enough to recognize both variants. If possible, both would be ideal: novice users would get hung up figuring out regex, but the setting should stay very flexible for non-Illumina platforms that would have different headers.
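To make the difference concrete, here is a minimal Python sketch (the two headers are copied from above, and the regexes are the ones quoted in this issue) showing that the default pattern matches VAR 1 but not VAR 2:

import re

# Example headers from this issue: VAR 1 (compatible with the RTAX default)
# and VAR 2 (incompatible without a header regex setting).
var1 = "HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252/1"
var2 = "HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252 1:N:0:0"

default_regex = re.compile(r"(\S+)\s(\S+?)/")   # RTAX default
altered_regex = re.compile(r"(\S+)\s(\S+?)\s")  # needed for VAR 2

print(default_regex.match(var1).groups())  # captures the header and read_1_id
print(default_regex.match(var2))           # None -- there is no trailing '/'
print(altered_regex.match(var2).groups())  # captures both fields again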

update install instructions

Trying to install this, but I am running into the issue where the project itself requires biom 1.2.0 (not on PyPI, and 1.3.1 doesn't seem to work), while the IPython notebooks require biom 2.0.0.

re-run uclust taxonomy assigner, sort out scoring bug

There are a lot of "None" assignments in some of the results, including those from the id0.97000_ma3_c_0.5 run on the Broad-2 dataset, and other top uclust runs.

The questions are: are the None assignments correct? If not, why are they being assigned None? Regardless, why are these results scoring so high? There seems to be a bug in the calculation.

implement custom workflow for assigning taxonomy

A custom workflow needs to be implemented to easily allow us to assign taxonomy using different taxonomy assigners and/or parameters, as well as easily support/scale to more taxonomy assigners and Illumina runs besides the ones we currently have.

eval framework whole body results only has 46 samples

The expected Bray-Curtis distance matrix in the eval framework's whole body results (data/eval-subject-results/natural/study_449/gg_13_5_otus/expected/bray_curtis_dm.txt) only has 46 samples. Shouldn't this dataset be much larger?

update eval framework notebook to make better use of multi-level comparisons

  • evaluation 3 currently only works at a single level
  • update evaluations 1 and 2 (including underlying code) to support running at multiple levels in a single call.

What would be really cool would be tiled heat maps for the final figure 1 that include levels 2-6 as the rows and precision, recall, F-measure, and correlation as the columns, though that may get a little crazy. @ebolyen, what do you think about that? (Related to #73.)

clean up method heatmaps

Heatmaps are going to be an effective way to illustrate performance by combination of method and dataset. The current heatmaps can be found here and the to-do items are:

  • clarify labels
  • maybe do hierarchical clustering (it'd be good to see how it looks, and it could be useful for filtering out some of the lowest performers so the labels work better); if we do this, we probably want to cluster by one metric (e.g. F-measure) and then use that ordering for all of the plots, as it's important that the layout be the same when comparing plots (see the sketch below)
  • maybe re-do with bokeh for mouseover information

Our bokeh/QIIME example will be useful for this. It illustrates heatmap plotting and hierarchical clustering with DataFrames.
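As a rough sketch of the "cluster on one metric, reuse the ordering everywhere" idea (the random scores and the method/dataset names below are hypothetical placeholders, not the framework's actual tables):

import numpy as np
import pandas as pd
import seaborn as sns
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
methods = ["method-%d" % i for i in range(6)]
datasets = ["dataset-%d" % j for j in range(4)]

# Hypothetical per-method, per-dataset scores; in practice these would come
# from the evaluation results.
scores = {name: pd.DataFrame(rng.random((6, 4)), index=methods, columns=datasets)
          for name in ("F-measure", "Precision", "Recall")}

# Cluster the rows on one metric (F-measure) and reuse that ordering for
# every heatmap so the plots stay directly comparable.
order = leaves_list(linkage(scores["F-measure"].values, method="average"))
row_order = scores["F-measure"].index[order]

for name, df in scores.items():
    ax = sns.heatmap(df.loc[row_order], cmap="viridis")
    ax.set_title(name)
    ax.figure.savefig(name + ".png")
    ax.figure.clf()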

@ebolyen, would you be able to work on this?

make repo public

Creating this issue so that we remember to make the repo public when the paper is submitted.

same dataset/parameter combo listed twice in eval framework output

While testing out the eval framework notebook, the table produced by generate_pearson_spearman_table contained multiple lines for a single dataset/classifier/parameter combination.

For example:

Data set  |      r |    rho | Method | Parameters
...
ITS2-SAG1 |  0.632 |  0.096 | mothur | 0.7
...
ITS2-SAG1 |  0.493 | -0.531 | mothur | 0.7
...

@gregcaporaso I'm guessing that each of these lines corresponds to a different database, since there are two ITS databases in the precomputed results. Is that right? If so, we'll need the results split by database, or another field with the database name.

Normalize expected community taxonomy strings

The expected mock community taxa summary files have varying numbers of taxonomic levels (i.e. there are some with 6, 7, and 8 levels) and they are not in the standard greengenes format (e.g. no p__ prefixes, etc.).

As is, we cannot use @kpatnode's generate_taxa_compare_table.py script, since it uses QIIME's compare_taxa_summaries.py script under the hood. compare_taxa_summaries.py does not know how to resolve these discrepancies between expected and observed taxonomy strings, and we get correlation coefficients that are essentially meaningless.
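A minimal sketch of one way to normalize an expected taxonomy string to seven greengenes-style levels (the padding/prefixing strategy here is an illustration, not the agreed-upon fix):

GG_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]

def normalize_taxonomy(taxon, n_levels=7):
    """Pad or trim a ;-separated lineage to n_levels and add greengenes prefixes."""
    levels = [level.strip() for level in taxon.split(";")]
    levels = (levels + [""] * n_levels)[:n_levels]
    normalized = []
    for prefix, level in zip(GG_PREFIXES, levels):
        # Avoid double-prefixing names that are already in greengenes format.
        normalized.append(level if level.startswith(prefix) else prefix + level)
    return "; ".join(normalized)

print(normalize_taxonomy("Bacteria; Firmicutes; Bacilli; Lactobacillales"))
# k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__; g__; s__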

Assigning to @gregcaporaso, though we should both get together sometime soon to discuss how to resolve this.

drop rtax from the analysis

It's sort of a metaclassifier strategy, as it could really be implemented with any of the methods. Because it's different, it adds a lot of complexity. I would rather explore different sequence lengths and the different ends independently, and suggest rtax as a method for making use of longer reads. Thoughts on this?

there is a discrepancy between summarize_taxa.py observation ids, and biom 1.x.x observation ids

Illustrated in this gist. Thanks @ekopylova for catching this! This seems to be at the core of some of the differences we noticed recently when comparing results generated by @nbokulich and myself.

I'm going to be updating some features of the eval framework for the paper submission, so I will figure this out at that point (this will also likely involve a switch to biom 2.0 for the notebooks in this repo).

compare taxa summaries to expected mock communities

Once taxonomy has been assigned in a variety of ways, the resulting taxa summaries need to be compared to the expected (i.e. known) mock community compositions located under data/mock-community-compositions/.

add option to compare based on subset of precomputed results

When testing out the eval framework notebooks with the usearch taxonomy assigner, I only ran usearch using the gg_13_5 database and a couple of 16S datasets. In the summarized results tables, the highest ranking classifiers are reported for datasets that I didn't run usearch against (e.g., the ITS datasets). I think it'd be helpful to have the eval framework only compare the query results to the subject results that match (e.g., databases and datasets).

There are a couple of use-cases:

  • quick testing runtimes (basically what I'm doing right now)
  • maybe a developer creates a novel classifier that is really good at ITS sequence classification and only wants to perform those comparisons

@gregcaporaso what do you think? This almost makes sense as the default behavior, with an option to specify that all comparisons should be performed, as the framework does now.

add remaining natural communities to eval framework

The eval framework only includes the whole body natural community (study 449). We'll need the precomputed data for the other (8?) natural communities to be added into the correct spots so that users can compare their results against all of the datasets found in the manuscript. The IPython notebooks should also be updated when this happens.

compare taxa summaries based on different assigners/parameters

Once we have the taxa summaries, we need to compare taxonomic community compositions that were generated using different taxonomy assigners/parameters. For example, we need to compare taxa summary files (from the same dataset) that were created using RDP and RTAX to see whether the results are correlated or not.

expand eval framework analyses to include all paper analyses

To-do items from discussion with @nbokulich:

  • add plotting of heatmaps of taxa that are detected at significantly different abundances between configurations
  • add comparisons of runtimes (maybe if provided by the user - this will be dependent on where the analysis is run, so maybe not something we want to include in the eval framework itself)
  • add evaluations on simulated communities (@nbokulich, can you follow up on where these are in the repo? If we can add these in the same way as the other data sets, i.e. in the data directory, it would be very easy to include them in the analyses)

compare_taxa_summaries.py - new stats features

compare_taxa_summaries.py (in the development version of QIIME) needs the following new features added, per the discussion between Greg and Jai:

  1. Needs a "tail type" option to specify the type of test to perform (i.e. the alternative hypothesis). Right now, a two-tailed test is performed, and this should remain the default.
  2. Needs to perform a nonparametric (i.e. permutation-based) test of significance for both Pearson and Spearman, in addition to the current parametric version using the t-distribution. An option should be added to specify the number of permutations (default should be 999).
  3. Needs to compute the confidence intervals for both Pearson and Spearman using Fisher's transformation. Needs an option to specify the confidence level (e.g. 99%, 95%, etc.). (A rough sketch of items 2 and 3 is shown after this list.)
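A rough sketch of items 2 and 3 (a standalone illustration using scipy, not the compare_taxa_summaries.py implementation; only the two-tailed case is shown):

import numpy as np
from scipy import stats

def permutation_pvalue(x, y, n_permutations=999, corr=stats.pearsonr, seed=0):
    """Two-tailed nonparametric p-value for a correlation via permutation."""
    rng = np.random.default_rng(seed)
    observed = corr(x, y)[0]
    permuted = np.array([corr(x, rng.permutation(y))[0]
                         for _ in range(n_permutations)])
    # Count permuted statistics at least as extreme as the observed one.
    return (np.sum(np.abs(permuted) >= abs(observed)) + 1) / (n_permutations + 1)

def fisher_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for a correlation using Fisher's z-transformation."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    half_width = stats.norm.ppf(1 - (1 - confidence) / 2) * se
    return np.tanh(z - half_width), np.tanh(z + half_width)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.5, 5.8])
r = stats.pearsonr(x, y)[0]
print(r, permutation_pvalue(x, y), fisher_confidence_interval(r, len(x)))
# Use corr=stats.spearmanr for the Spearman versions.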

multiple_assign_taxonomy.py doesn't support rtax correctly

Currently, you can only specify a single dataset's paired end reads to multiple_assign_taxonomy.py (via --read_1_seqs_fp and --read_2_seqs_fp). The script interface needs to be modified such that rtax can be run on multiple datasets.

BIOM 2.0 updates to eval framework

  • replace all calls to Table.collapseObservationsByMetadata with calls to Table.collapse(..., axis='observation', min_group_size=1) (see #61 and the sketch after this list)
  • replace parse_biom_table with load_table
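A minimal sketch of the biom 2.x equivalents (the file name is a placeholder, and the collapse function assumes a 'taxonomy' list in the observation metadata):

from biom import load_table

# biom 2.x replacement for parse_biom_table
table = load_table("table.biom")  # placeholder path

# biom 2.x replacement for collapseObservationsByMetadata: collapse
# observations by a taxonomy level stored in the observation metadata.
def by_phylum(observation_id, metadata):
    return metadata["taxonomy"][1]  # e.g. 'p__Firmicutes'

collapsed = table.collapse(by_phylum, axis="observation", min_group_size=1)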

create mock community table describing datasets

create mock community sample table that includes: dataset/sample id, target gene, number of taxa, sequencing technology, median read length, forward primer, reverse primer, paired-end (yes/no), original citation, db accession number, download URL, what else? This could be generated as an excel spreadsheet, and Greg can convert to a pandas DataFrame for integration in the analysis notebook.
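For the conversion step, a one-liner along these lines would do (the file and column names are hypothetical, following the fields listed above):

import pandas as pd

# Hypothetical spreadsheet; columns follow the fields listed in this issue.
mock_datasets = pd.read_excel("mock_community_datasets.xlsx",
                              index_col="dataset/sample id")
print(mock_datasets[["target gene", "median read length", "paired-end (yes/no)"]])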

remove QIIME dependency from eval framework

There are a couple of imports from QIIME in the eval framework code (PCoA and Procrustes functionality). It'd be nice (though certainly not high priority) to remove QIIME as a dependency since it is kind of a heavy-weight one.
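One possible route, sketched under the assumption that scikit-bio and scipy provide acceptable replacements (this is not what the framework currently does):

import numpy as np
from scipy.spatial import procrustes
from skbio import DistanceMatrix
from skbio.stats.ordination import pcoa

# PCoA on a distance matrix (currently done via the QIIME import).
dm = DistanceMatrix.read("bray_curtis_dm.txt")  # placeholder path
ordination = pcoa(dm)

# Procrustes analysis of two sets of coordinates (currently via QIIME);
# the second matrix here is just jittered to keep the example runnable.
coords_a = ordination.samples.values[:, :2]
coords_b = coords_a + np.random.default_rng(0).normal(scale=0.01, size=coords_a.shape)
m1, m2, disparity = procrustes(coords_a, coords_b)
print(disparity)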

create natural community data set description table

create natural community dataset table that includes: dataset id, target gene, sequencing technology, median read length, forward primer, reverse primer, paired-end (yes/no), original citation, db accession number, download URL, what else? This could be generated as an excel spreadsheet, and Greg can convert to a pandas DataFrame for integration in the analysis notebook.

clean up mock community definitions

In investigating precision issues (#65) I'm noticing some minor inconsistencies in how the taxa are named in the expected mock community compositions versus how they're named in the 13_8 reference database.

@nbokulich, do you have some bandwidth to clean those up?

I posted a few examples in #65, but really what we want to do is go through all of the taxa listed in the nine *_key.txt files in this directory and confirm that the names are listed as they are in gg_13_8_otus/taxonomy/97_otu_taxonomy.txt.
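A minimal sketch of that check, assuming each *_key.txt file lists one taxonomy string per line (the key-file format is an assumption here):

from glob import glob

# Build the set of taxonomy strings used in the reference database.
reference_taxa = set()
with open("gg_13_8_otus/taxonomy/97_otu_taxonomy.txt") as fh:
    for line in fh:
        otu_id, taxonomy = line.rstrip("\n").split("\t")
        reference_taxa.add(taxonomy.strip())

# Flag expected mock community taxa that don't match the reference exactly.
for key_file in glob("*_key.txt"):
    with open(key_file) as fh:
        for taxon in (line.strip() for line in fh if line.strip()):
            if taxon not in reference_taxa:
                print("%s: no exact match for %s" % (key_file, taxon))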
