caporaso-lab / tax-credit

A repository for storing code and data related to a systematic comparison of short read taxonomy assignment tools

License: BSD 3-Clause "New" or "Revised" License


tax-credit's Introduction

TAX CREdiT: TAXonomic ClassifieR Evaluation Tool


A standardized and extensible evaluation framework for taxonomic classifiers

To view static versions of the reports, start here.

Environment

This repository contains Python 3 code and Jupyter notebooks, but some taxonomy assignment methods (e.g., QIIME 1 legacy methods) may require different Python or software versions. Hence, we use parallel conda environments to support comparison of myriad methods in a single framework.

The first step is to install conda and then install QIIME 2, following the instructions provided here.

An example of how to load different environments to support other methods can be seen in the QIIME 1 taxonomy assignment notebook.

Setup and install

The library code and IPython Notebooks are then installed as follows:

git clone https://github.com/gregcaporaso/tax-credit.git
cd tax-credit/
pip install .

Finally, download and unzip the reference databases:

wget https://unite.ut.ee/sh_files/sh_qiime_release_20.11.2016.zip
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
unzip sh_qiime_release_20.11.2016.zip
tar -xzf gg_13_8_otus.tar.gz

Equipment

The analyses included here can all be run on a standard, modern laptop, provided you don't mind waiting a few hours on the most memory-intensive step (taxonomy classification of millions of sequences). With the exception of the q2-feature-classifier naive-bayes* classifier sweeps, which were run on a high-performance cluster, all analyses presented in tax-credit were run in a single day using a MacBook Pro with the following specifications:

  • OS: OS X 10.11.6 "El Capitan"
  • Processor: 2.3 GHz Intel Core i7
  • Memory: 8 GB 1600 MHz DDR3

If you intend to perform extensive parameter sweeps on a classifier (e.g., several hundred or more parameter combinations), you may want to consider running these analyses using cluster resources, if available.

Using the Jupyter Notebooks included in this repository

To view and interact with the Jupyter notebooks, change into the tax-credit/ipynb directory and start Jupyter from the terminal with the command:

jupyter notebook index.ipynb

The notebooks menu should open in your browser. From the main index, you can follow the menus to browse different analyses, or use File --> Open from the notebook toolbar to access the full file tree.

Citing

A publication is on its way! For now, if you use any of the data or code included in this repository, please cite https://github.com/caporaso-lab/tax-credit

tax-credit's People

Contributors

benkaehler, ebolyen, gregcaporaso, jairideout, nbokulich, zellett


tax-credit's Issues

generate_taxa_compare_table.py ignores usearch output

Working on adding evaluations for a usearch-based taxonomy assigner, but generate_taxa_compare_table.py is ignoring the files. This looks like an easy fix for this specific case, but we'll need to generalize to support arbitrary taxonomic assignment methods.

multiple_assign_taxonomy.py RTAX regex options

multiple_assign_taxonomy.py needs RTAX regex flexibility similar to assign_taxonomy.py to accommodate sequences from different Illumina platforms/software... not sure which, but there are two main variations.

VAR 1 (compatible with RTAX default):
HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252/1

VAR 2 (incompatible without header regex setting):
HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252 1:N:0:0

The issue is that the header/read_1_id in VAR 1 is captured with the default regex, (\S+)\s(\S+?)/, whereas VAR 2 requires the regex to be altered to (\S+)\s(\S+?)\s

assign_taxonomy.py uses the options --amplicon_id_regex and --header_id_regex to set these manually. Either these should be added to multiple_assign_taxonomy.py, or else the default should be flexible enough to recognize both variants. If possible, both would be ideal: novice users would get hung up figuring out regex, but the setting should stay very flexible for non-Illumina platforms that would have different headers.
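To make the difference concrete, here is a minimal Python sketch (the two headers are copied from above, and the regexes are the ones quoted in this issue) showing that the default pattern matches VAR 1 but not VAR 2:

import re

# Example headers from this issue: VAR 1 (compatible with the RTAX default)
# and VAR 2 (incompatible without a header regex setting).
var1 = "HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252/1"
var2 = "HMPMockV1.2.Staggered2_7619 A0A3V120410:1:000000000-A0R10:1:2:9402:7252 1:N:0:0"

default_regex = re.compile(r"(\S+)\s(\S+?)/")   # RTAX default
altered_regex = re.compile(r"(\S+)\s(\S+?)\s")  # needed for VAR 2

print(default_regex.match(var1).groups())  # captures the header and read_1_id
print(default_regex.match(var2))           # None -- there is no trailing '/'
print(altered_regex.match(var2).groups())  # captures both fields again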

update install instructions

Trying to install this, but I am running into the issue where the project itself requires biom 1.2.0 (not on PyPI, and 1.3.1 doesn't seem to work), while the IPython notebooks require biom 2.0.0.

re-run uclust taxonomy assigner, sort out scoring bug

There are a lot of "None" assignments in some of the results, including those from the id0.97000_ma3_c_0.5 run on the Broad-2 dataset, and other top uclust runs.

The questions are: are the None assignments correct? If not, why are they being assigned None? Regardless, why are these results scoring so high? There seems to be a bug in the calculation.

implement custom workflow for assigning taxonomy

A custom workflow needs to be implemented to easily allow us to assign taxonomy using different taxonomy assigners and/or parameters, as well as easily support/scale to more taxonomy assigners and Illumina runs besides the ones we currently have.

eval framework whole body results only has 46 samples

The expected Bray-Curtis distance matrix in the eval framework's whole body results (data/eval-subject-results/natural/study_449/gg_13_5_otus/expected/bray_curtis_dm.txt) only has 46 samples. Shouldn't this dataset be much larger?

update eval framework notebook to make better use of multi-level comparisons

  • evaluation 3 currently only works at a single level
  • update evaluations 1 and 2 (including underlying code) to support running at multiple levels in a single call.

What would be really cool would be tiled heat maps for the final figure 1 that include levels 2-6 as the rows and precision, recall, F-measure, and correlation as the columns, though that may get a little crazy. @ebolyen, what do you think about that? (Related to #73.)

clean up method heatmaps

Heatmaps are going to be an effective way to illustrate performance by combination of method and dataset. The current heatmaps can be found here and the to-do items are:

  • clarify labels
  • maybe do hierarchical clustering (it'd be good to see how it looks, and it could be useful for filtering out some of the lowest performers so the labels work better); if we do this, we probably want to cluster by one metric (e.g. F-measure) and then use that ordering for all of the plots, as it's important that the layout be the same when comparing plots (see the sketch below)
  • maybe re-do with bokeh for mouseover information

Our bokeh/QIIME example will be useful for this. It illustrates heatmap plotting and hierarchical clustering with DataFrames.
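As a rough sketch of the "cluster on one metric, reuse the ordering everywhere" idea (the random scores and the method/dataset names below are hypothetical placeholders, not the framework's actual tables):

import numpy as np
import pandas as pd
import seaborn as sns
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
methods = ["method-%d" % i for i in range(6)]
datasets = ["dataset-%d" % j for j in range(4)]

# Hypothetical per-method, per-dataset scores; in practice these would come
# from the evaluation results.
scores = {name: pd.DataFrame(rng.random((6, 4)), index=methods, columns=datasets)
          for name in ("F-measure", "Precision", "Recall")}

# Cluster the rows on one metric (F-measure) and reuse that ordering for
# every heatmap so the plots stay directly comparable.
order = leaves_list(linkage(scores["F-measure"].values, method="average"))
row_order = scores["F-measure"].index[order]

for name, df in scores.items():
    ax = sns.heatmap(df.loc[row_order], cmap="viridis")
    ax.set_title(name)
    ax.figure.savefig(name + ".png")
    ax.figure.clf()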

@ebolyen, would you be able to work on this?

make repo public

Creating this issue so that we remember to make the repo public when the paper is submitted.

same dataset/parameter combo listed twice in eval framework output

While testing out the eval framework notebook, the table produced by generate_pearson_spearman_table contained multiple lines for a single dataset/classifier/parameter combination.

For example:

Data set  |      r |    rho | Method | Parameters
...
ITS2-SAG1 |  0.632 |  0.096 | mothur | 0.7
...
ITS2-SAG1 |  0.493 | -0.531 | mothur | 0.7
...

@gregcaporaso I'm guessing that each of these lines corresponds to a different database, since there are two ITS databases in the precomputed results. Is that right? If so, we'll need the results split by database, or another field with the database name.

Normalize expected community taxonomy strings

The expected mock community taxa summary files have varying numbers of taxonomic levels (i.e. there are some with 6, 7, and 8 levels) and they are not in the standard greengenes format (e.g. no p__ prefixes, etc.).

As is, we cannot use @kpatnode's generate_taxa_compare_table.py script, since it uses QIIME's compare_taxa_summaries.py script under the hood. compare_taxa_summaries.py does not know how to resolve these discrepancies between expected and observed taxonomy strings, and we get correlation coefficients that are essentially meaningless.
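A minimal sketch of one way to normalize an expected taxonomy string to seven greengenes-style levels (the padding/prefixing strategy here is an illustration, not the agreed-upon fix):

GG_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]

def normalize_taxonomy(taxon, n_levels=7):
    """Pad or trim a ;-separated lineage to n_levels and add greengenes prefixes."""
    levels = [level.strip() for level in taxon.split(";")]
    levels = (levels + [""] * n_levels)[:n_levels]
    normalized = []
    for prefix, level in zip(GG_PREFIXES, levels):
        # Avoid double-prefixing names that are already in greengenes format.
        normalized.append(level if level.startswith(prefix) else prefix + level)
    return "; ".join(normalized)

print(normalize_taxonomy("Bacteria; Firmicutes; Bacilli; Lactobacillales"))
# k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__; g__; s__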

Assigning to @gregcaporaso, though we should both get together sometime soon to discuss how to resolve this.

drop rtax from the analysis

It's sort of a metaclassifier strategy, as it could really be implemented with any of the methods. Because it's different, it adds a lot of complexity. I would rather explore different sequence lengths and the different ends independently, and suggest rtax as a method for making use of longer reads. Thoughts on this?

there is a discrepancy between summarize_taxa.py observation ids, and biom 1.x.x observation ids

Illustrated in this gist. Thanks @ekopylova for catching this! This seems to be at the core of some of the differences we noticed recently when comparing results generated by @nbokulich and myself.

I'm going to be updating some features of the eval framework for the paper submission, so I will figure this out at that point (this will also likely involve a switch to biom 2.0 for the notebooks in this repo).

compare taxa summaries to expected mock communities

Once taxonomy has been assigned in a variety of ways, the resulting taxa summaries need to be compared to the expected (i.e. known) mock community compositions located under data/mock-community-compositions/.

add option to compare based on subset of precomputed results

When testing out the eval framework notebooks with the usearch taxonomy assigner, I only ran usearch using the gg_13_5 database and a couple of 16S datasets. In the summarized results tables, the highest ranking classifiers are reported for datasets that I didn't run usearch against (e.g., the ITS datasets). I think it'd be helpful to have the eval framework only compare the query results to the subject results that match (e.g., databases and datasets).

There are a couple of use-cases:

  • quick testing runtimes (basically what I'm doing right now)
  • maybe a developer creates a novel classifier that is really good at ITS sequence classification and only wants to perform those comparisons

@gregcaporaso what do you think? This almost makes sense as the default behavior, with an option to specify that all comparisons should be performed, as the framework does now.

add remaining natural communities to eval framework

The eval framework only includes the whole body natural community (study 449). We'll need the precomputed data for the other (8?) natural communities to be added into the correct spots so that users can compare their results against all of the datasets found in the manuscript. The IPython notebooks should also be updated when this happens.

compare taxa summaries based on different assigners/parameters

Once we have the taxa summaries, we need to compare taxonomic community compositions that were generated using different taxonomy assigners/parameters. For example, we need to compare taxa summary files (from the same dataset) that were created using RDP and RTAX to see whether the results are correlated or not.

expand eval framework analyses to include all paper analyses

To-do items from discussion with @nbokulich:

  • add plotting of heatmaps of taxa that are detected at significantly different abundances between configurations
  • add comparisons of runtimes (maybe if provided by the user - this will be dependent on where the analysis is run, so maybe not something we want to include in the eval framework itself)
  • add evaluations on simulated communities (@nbokulich, can you follow up on where these are in the repo? If we can add these in the same way as the other data sets, i.e. in the data directory, it would be very easy to include them in the analyses)

compare_taxa_summaries.py - new stats features

compare_taxa_summaries.py (in the development version of QIIME) needs the following new features added, per the discussion between Greg and Jai:

  1. Needs a "tail type" option to specify the type of test to perform (i.e. the alternative hypothesis). Right now, a two-tailed test is performed, and this should remain the default.
  2. Needs to perform a nonparametric (i.e. permutation-based) test of significance for both Pearson and Spearman, in addition to the current parametric version using the t-distribution. An option should be added to specify the number of permutations (default should be 999).
  3. Needs to compute the confidence intervals for both Pearson and Spearman using Fisher's transformation. Needs an option to specify the confidence level (e.g. 99%, 95%, etc.). (A rough sketch of items 2 and 3 is shown after this list.)
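A rough sketch of items 2 and 3 (a standalone illustration using scipy, not the compare_taxa_summaries.py implementation; only the two-tailed case is shown):

import numpy as np
from scipy import stats

def permutation_pvalue(x, y, n_permutations=999, corr=stats.pearsonr, seed=0):
    """Two-tailed nonparametric p-value for a correlation via permutation."""
    rng = np.random.default_rng(seed)
    observed = corr(x, y)[0]
    permuted = np.array([corr(x, rng.permutation(y))[0]
                         for _ in range(n_permutations)])
    # Count permuted statistics at least as extreme as the observed one.
    return (np.sum(np.abs(permuted) >= abs(observed)) + 1) / (n_permutations + 1)

def fisher_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for a correlation using Fisher's z-transformation."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    half_width = stats.norm.ppf(1 - (1 - confidence) / 2) * se
    return np.tanh(z - half_width), np.tanh(z + half_width)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.5, 5.8])
r = stats.pearsonr(x, y)[0]
print(r, permutation_pvalue(x, y), fisher_confidence_interval(r, len(x)))
# Use corr=stats.spearmanr for the Spearman versions.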

multiple_assign_taxonomy.py doesn't support rtax correctly

Currently, you can only specify a single dataset's paired end reads to multiple_assign_taxonomy.py (via --read_1_seqs_fp and --read_2_seqs_fp). The script interface needs to be modified such that rtax can be run on multiple datasets.

BIOM 2.0 updates to eval framework

  • replace all calls to Table.collapseObservationsByMetadata with calls to Table.collapse(..., axis='observation', min_group_size=1) (see #61 and the sketch after this list)
  • replace parse_biom_table with load_table
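A minimal sketch of the biom 2.x equivalents (the file name is a placeholder, and the collapse function assumes a 'taxonomy' list in the observation metadata):

from biom import load_table

# biom 2.x replacement for parse_biom_table
table = load_table("table.biom")  # placeholder path

# biom 2.x replacement for collapseObservationsByMetadata: collapse
# observations by a taxonomy level stored in the observation metadata.
def by_phylum(observation_id, metadata):
    return metadata["taxonomy"][1]  # e.g. 'p__Firmicutes'

collapsed = table.collapse(by_phylum, axis="observation", min_group_size=1)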

create mock community table describing datasets

create mock community sample table that includes: dataset/sample id, target gene, number of taxa, sequencing technology, median read length, forward primer, reverse primer, paired-end (yes/no), original citation, db accession number, download URL, what else? This could be generated as an excel spreadsheet, and Greg can convert to a pandas DataFrame for integration in the analysis notebook.
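For the conversion step, a one-liner along these lines would do (the file and column names are hypothetical, following the fields listed above):

import pandas as pd

# Hypothetical spreadsheet; columns follow the fields listed in this issue.
mock_datasets = pd.read_excel("mock_community_datasets.xlsx",
                              index_col="dataset/sample id")
print(mock_datasets[["target gene", "median read length", "paired-end (yes/no)"]])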

remove QIIME dependency from eval framework

There are a couple of imports from QIIME in the eval framework code (PCoA and Procrustes functionality). It'd be nice (though certainly not high priority) to remove QIIME as a dependency since it is kind of a heavy-weight one.
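One possible route, sketched under the assumption that scikit-bio and scipy provide acceptable replacements (this is not what the framework currently does):

import numpy as np
from scipy.spatial import procrustes
from skbio import DistanceMatrix
from skbio.stats.ordination import pcoa

# PCoA on a distance matrix (currently done via the QIIME import).
dm = DistanceMatrix.read("bray_curtis_dm.txt")  # placeholder path
ordination = pcoa(dm)

# Procrustes analysis of two sets of coordinates (currently via QIIME);
# the second matrix here is just jittered to keep the example runnable.
coords_a = ordination.samples.values[:, :2]
coords_b = coords_a + np.random.default_rng(0).normal(scale=0.01, size=coords_a.shape)
m1, m2, disparity = procrustes(coords_a, coords_b)
print(disparity)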

create natural community data set description table

create natural community dataset table that includes: dataset id, target gene, sequencing technology, median read length, forward primer, reverse primer, paired-end (yes/no), original citation, db accession number, download URL, what else? This could be generated as an excel spreadsheet, and Greg can convert to a pandas DataFrame for integration in the analysis notebook.

clean up mock community definitions

In investigating precision issues (#65) I'm noticing some minor inconsistencies in how the taxa are named in the expected mock community compositions versus how they're named in the 13_8 reference database.

@nbokulich, do you have some bandwidth to clean those up?

I posted a few examples in #65, but really what we want to do is go through all of the taxa listed in the nine *_key.txt files in this directory and confirm that the names are listed as they are in gg_13_8_otus/taxonomy/97_otu_taxonomy.txt.
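A minimal sketch of that check, assuming each *_key.txt file lists one taxonomy string per line (the key-file format is an assumption here):

from glob import glob

# Build the set of taxonomy strings used in the reference database.
reference_taxa = set()
with open("gg_13_8_otus/taxonomy/97_otu_taxonomy.txt") as fh:
    for line in fh:
        otu_id, taxonomy = line.rstrip("\n").split("\t")
        reference_taxa.add(taxonomy.strip())

# Flag expected mock community taxa that don't match the reference exactly.
for key_file in glob("*_key.txt"):
    with open(key_file) as fh:
        for taxon in (line.strip() for line in fh if line.strip()):
            if taxon not in reference_taxa:
                print("%s: no exact match for %s" % (key_file, taxon))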
