Giter Site home page Giter Site logo

q2-brocc's Introduction

q2-brocc

This is a QIIME 2 plugin to support the BROCC software for taxonomic assignment. BROCC was originally built to generate taxonomic assignments of marker gene sequences from fungi. However, we've found BROCC to be useful with many other types of data.

The strategy applied by BROCC is to search against a large reference database, the nt database from NCBI, using BLAST. Then, BROCC applies a voting process to generate taxonomic assignments. The main factors that differentiate BROCC from other software in this area are the level of control in the voting, and the efforts to remove undesirable or uninformative votes from the results.

More details about BROCC can be found in our publication (PMID 22759449) and on the github page for the software (https://github.com/kylebittinger/brocc).

Installing the plugin

QIIME 2 must be installed for the plugin to work. Instructions for installing QIIME 2 can be found at https://qiime2.org. If you'd like to use BROCC outside of QIIME 2, that's totally possible. The standalone software can be installed by running pip install brocc.

To install the BROCC plugin for QIIME 2, run

 pip install q2-brocc

Once installed, we should check that QIIME 2 can pick up on our new plugin. If you run

 qiime --help

you should see brocc listed under the available commands. Furthermore, if you run

 qiime brocc --help

you should see some help documentation for the plugin.

To run this plugin, you'll need to have the nt database from NCBI downloaded and configured. It's about 60G and it takes a while to download. You can download over FTP using a command

 wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz" 

NCBI has some tips on downloading large data sets at https://www.ncbi.nlm.nih.gov/home/download/. Once the nt database is downloaded, you'll need to make sure that BLAST can find the database files. You do this by setting an environment variable, BLASTDB, to the directory where the nt database files are stored. Do this by running:

 export BLASTDB=/path/to/directory/containing/nt-database-files

While you're at it, copy/paste this line into your ~/.bashrc file so the environment variable is set the next time you log in.

Let's test that the nt database is configured correctly by BLASTing a couple of fungal sequences. Download the sequences with

 wget https://raw.githubusercontent.com/kylebittinger/q2-brocc/master/q2_brocc/tests/data/query.fasta

and then search against the nt database with

 blastn -query query.fasta -db nt -outfmt 7

If successful, you should see a table of search results printed to your screen.

We have one more thing to do before we start assigning sequences to taxa. To improve performance, we'll need to create a local database of the NCBI taxonomy. Run the following command to download and process the taxonomy files:

 create_local_taxonomy_db.py

Once downloaded and configured, you should see a file at ~/.brocc/taxonomy.db, about 5G in size. The BROCC software should work without this database, but it will be unbearably slow to look up the taxa one-by-one over a network connection.

Making some taxonomic assignments

Let's download a couple of fungal sequences and try out the assigner. We'll be using the plugin on QIIME 2 artifacts, not raw sequence files. The fungal sequences above are available as a QIIME 2 artifact by running:

wget https://github.com/kylebittinger/q2-brocc/blob/master/q2_brocc/tests/data/query.qza?raw=true

We'll run the BROCC plugin using the command:

qiime brocc classify-brocc --i-query query.qza --o-classification query-brocc.qza

Once the classifier has finished running, you should see a QIIME 2 artifact file, query-brocc.qza. We can create a visualization for this artifact by running:

qiime metadata tabulate --m-input-file query-brocc.qza --o-visualization query-brocc.qzv

You can check out the visualization by dragging the file into your web browser at https://view.qiime2.org/. Your results should match those in our test files: the sequence "pichia1" should be assigned to "Eukaryota; Fungi; Ascomycota; Saccharomycetes; Saccharomycetales; Phaffomycetaceae; Cyberlindnera; Cyberlindnera jadinii", and the sequence named "trichosporon1" should be assigned to "Eukaryota; Fungi; Basidiomycota; Tremellomycetes; Trichosporonales; Trichosporonaceae; Trichosporon; Trichosporon asahii".

The BROCC algorithm

The goal of BROCC is to generate reliable taxonomic assignments from a set of BLAST results. Because we often work with the nt database from NCBI, the software has a number of features to help with quality control.

The quality control portion of BROCC's workflow is illustrated below. If certain requirements aren't satisfied, a message is printed instead of a taxonomic assignment.

The voting stage of BROCC's algorithm includes some machinery to handle uninformative or generic taxa (e.g. "environmental samples"). These are removed before the voting starts at each rank. Also, notice at the end of the process that in addition to reaching a consensus, the leading candidate taxon must surpass a minimum absolute number of votes, typically 3. You probably wouldn't want to base your assignment on a single reference sequence in nt; the whole idea behind BROCC is to look for a consensus among multiple hits.

The similarity threshold and consensus thresholds change with each rank. Generally, the similarity threshold should be very high at the species level (average nucleotide identity within species is around 95%). At the phylum level, however, we can be much more loose in considering hits.

Conversely, our requirements for a consensus are lowest at the species level (winner needs 60% of votes). This is because we may legitimately collect hits to a neighboring species. However, the consensus threshold is raised as we move to higher and higher ranks in the taxonomy, up to 90% at the phylum level. For example, if the leading phylum represented 70% of the total vote, BROCC would not assign to that phylum.

q2-brocc's People

Contributors

kylebittinger avatar

Stargazers

 avatar  avatar  avatar

Watchers

James Cloos avatar  avatar

q2-brocc's Issues

DB path issue

I just followed the "Installing the plugin" step by step.

$pip install q2-brocc
$qiime brocc --help
$wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz"

I could get 54 nt.xx.tar.gz files in my DB directory (/My/path/NCBI_DB/)
and I copy the path to ~/.bashrc

$source ~/.bashrc

#download test file
$wget https://raw.githubusercontent.com/kylebittinger/q2-brocc/master/q2_brocc/tests/data/query.fasta

$blastn -query query.fasta -db nt -outfmt 7

HOWEVER, I get the message
"BLAST Database error: No alias or index file found for nucleotide database [nt] in search path [/My/Working/Directory/Brocc:/My/path/NCBI_DB:]"

Should I download any other files? or decompress the nt.??.tar.gz files?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.