jfmrod / mapseq Goto Github PK

highthroughput rRNA sequence classification and OTU mapping for metagenomic samples

License: Other

Makefile 0.01% Shell 0.02% M4 0.29% C 98.34% C++ 1.30% TeX 0.03%

mapseq's Introduction

MAPseq v1.2.6 (24 Mar 2020)

by Joao F. Matias Rodrigues, Thomas S.B. Schmidt, Janko Tackmann, and Christian von Mering
Institute of Molecular Life Sciences, University of Zurich, Switzerland

Reference: Matias Rodrigues JF, Schmidt TSB, Tackmann J & von Mering C (2017) MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis. Bioinformatics. http://doi.org/10.1093/bioinformatics/btx517

Installation
MAPseq usage instructions a. Default reference
b. Custom user-provided reference
c. Single sample counts summar
d. OTU count table for multiple samples
File output
History

MAPseq is a set of fast and accurate sequence read classification tools designed to assign taxonomy and OTU classifications to ribosomal RNA sequences. This is done by using a reference set of full-length ribosomal RNA sequences for which known taxonomies are known, and for which a set of high quality OTU clusters has been previously generated. For each read, the best guess and correspoding confidence in the assignment is shown at each taxonomic and OTU level.

For bugs and more information contact: Joao F. Matias Rodrigues [email protected]

1. INSTALLATION

You can get the source code on github or binary packages at:

git clone https://github.com/jfmrod/MAPseq.git

The binary packages can be found in the "Releases" page on GitHub: https://github.com/jfmrod/MAPseq/releases

i) Installing the binary package

To install the binary package simply unpack the contents of the mapseq tar.gz file, e.g.:

tar -xvzf mapseq-1.2-linux.tar.gz # for the linux version
or
tar -xvzf mapseq-1.2-macosx.tar.gz # for the MacOSX version

The mapseq binary will be located in the created directory. You may move the whole directory to another location. Moving only the binary elsewhere will break the installation though, as the data files are searched for in relation to the binary's path.

ii) Installing from source

To compile from the github source you will need:

svn
autotools/autoconf
wget
git
libncurses5-dev
libtool

On Ubuntu systems you can install these with the command:

sudo apt-get install build-essential wget subversion git libncurses5-dev libtool

You can then clone the mapseq repository with:

git clone https://github.com/jfmrod/MAPseq.git

Once you have cloned the repository you can type the following commands in the MAPseq directory

./setup.sh # downloads eutils and the reference data files
./bootstrap # this step is only needed if you cloned the repository, you will also need to install autotools/autoconf packages
./configure
make
make install

If you want to install MAPseq to your home directory instead of the default system wide /usr/local/ directory, you can change the ./configure command to:

./configure --prefix=$HOME/usr

This would install the program binaries to a directory usr/bin inside your home directory (i.e.: $HOME/usr/bin/mapseq), after you type the command "make install".

2. MAPseq usage instructions

a) Default reference

MAPseq takes as input a fasta file with raw sequence data which should have been previously demultiplexed and quality filtered usually from a fastq file produced by the sequencing machine.

If the input sequences can be found in the file "rawseqs.fa". Then to classify the reads one simply has to run the following command:

mapseq rawseqs.fa > rawseqs.fa.mseq

This will classify all the sequences found in rawseqs against the standard reference dataset provided with MAPseq.

You can change the number of threads that MAPseq uses with the -nthreads <no_threads> argument.

b) Custom user-provided reference

You can use mapseq with your own fasta reference and taxonomy files with the following command:

mapseq rawseqs.fa <customref.fasta> <customref.tax> [customref.tax2 ...] > rawseqs.fa.mseq

Where customref.fasta is a nucleotide fasta file with your reference set and customref.tax, customref.tax2 are one or more taxonomic assignments for each sequence in the reference.

The taxonomy file should have a header (preceeded with the # character) with the identity cutoff parameters and description of the taxonomy followed by two tab-separated columns composed of the accession id and the taxonomy. For example:

#cutoff: 0.00:0.08 0.70:0.35 0.70:0.35 0.70:0.35 0.80:0.25 0.92:0.08 0.95:0.05
#name: NCBI
#levels: Kingdom Phylum Class Order Family Genus Species
HE801216:78..1345 Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales;Methylococcaceae;Methylomonas;Methylomonas paludis
HE802067:76740..77993 Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium;Corynebacterium glutamicum
HE804045:1012175..1013425 Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Saccharothrix;Saccharothrix espanaensis

c) Single sample counts summary

mapseq -otucounts <sample1.mseq>

Provides you with summary counts of the MSEQ output files for each taxonomy and level. Example:

#sample.mseq 102301
Taxonomy TaxonomyLevel Label Counts
0 0 Bacteria 102301
0 1 Bacteria;Bacteroidetes 7586
0 1 Bacteria;Firmicutes 2150
0 1 Bacteria;Proteobacteria 112
0 1 Bacteria;PHY_Coriobacteriia 7
0 1 Bacteria;Actinobacteria 4
0 1 Bacteria;Fusobacteria 1
0 2 Bacteria;Bacteroidetes;Bacteroidia 7585
0 2 Bacteria;Firmicutes;Clostridia 1330

d) OTU count table for multiple samples

mapseq -otutable <sample1.mseq> <sample2.mseq> ...

Generates a tab separated value (tsv) file with the counts for each sample (column wise) and OTU or taxonomic labels (row wise). Which taxomy (OTU or NCBI taxonomy) and levels in the taxonomy can be specified using the -ti and -tl, respectively.

The generated table can be imported into R with the following R command:

myotutable <- read.table("map.otutable",sep="\t",header=TRUE)

3. MSEQ FILE OUTPUT

In the results output, each line indicates a classification of the read. Two output formats can be chosen ("simple" or "confidences") using the --outfmt option.

The "confidences" format outputs the confidence values for each of the taxonomic levels. For example: SRR044946.347 GQ156763:1..1446 548 0.91985428 505 22 22 1 540 263 800 0.99072355 20 Bacteria 1 1 Firmicutes 0.55452305 1 Clostridia 0.55452305 1 Clostridiales 0.55452305 1 Ruminococcaceae 0.31190208 0.3119020760059357 Ruminococcus 0 0.2104288786649704 Ruminococcus gnavus 0 0.0604640431702137 Bacteria 0.58272612 1 F6159 0.22964814 1 G35588 0 1 S61033 0 0.7381679934055649 SS52094 0 0.2980887881680916

Each field is tab separated and indicates the following:
Field
1 Query sequence id
2 Reference sequence id (highest alignment score)
3 Alignment score
4 Pairwise identity
5 Matches
6 Mismatches
7 Gaps
8 Query start pos
9 Query end pos
10 Reference start pos
11 Reference end pos
12 Strand (+/-)
13 [empty]

After the first empty field the taxonomy classifications and confidences are shown, every taxonomy classification is separated by an empty field. Although different fasta reference and taxonomy databases can be specified by the user, by default mapseq maps reads to the NCBI taxonomy and to OTU taxonomies

The combined confidence is computed based on a score confidence, used to control misclassification errors, and a identity cutoff confidence, used to ensure that the query isnt misclassified due to the inexistence of a sequence representative in the database of the true classification. The score confidence is calculated by comparing the identity of the assigned taxonomy to the identity of the first sequence not matching the assigned taxonomy. The identity cutoff confidence uses preoptimized cutoffs at each taxonomic level to calculate the confidence that the query is not too divergent from the assigned taxonomy.

We recommend using a combined confidence cutoff of 0.4, or 0.5 as this value yielded the highest F1/2-score for MAPseq in our benchmarks. Please see our article for further information.

The "simple" format gives the alignment information plus the taxonomy assignment for which the combined confidence at least 0.5. For example:
query1 FJ560320:1..876 301 0.7369985 301 0 0 0 301 305 606 + Archaea Archaea;F94;G275

4. HISTORY

1.2.6 (24 Mar 2020)

Fixed "-otutable" option using only counts from first sample.

1.2.5 (12 Jul 2019)

Added "-otucounts" and "-otutable" options to generate count summary for single mapseq (.mseq) files or an otu/taxa table for multiple .mseq files

1.2.4 (27 May 2019)

Added "-ignoreEmptyTax" option. Default is off until thorough benchmarks are performed. Prevents second hits with missing taxonomic labels (uncertain annotation) from decreasing the confidence of the top hit assignment.

1.2.3 (2 Oct 2018)

Fixed missing newline causing last sequence to be missed, added assert on empty sequences
Fixed double hits reported when classifying long queries (>1200bp)
Updated to mapref-2.2b: Removed low quality reference sequences that would cause issues when classifying low quality query sequences.

1.2.2 (30 Oct 2017)

Fixed multithreaded race condition causing issues on some systems.

1.2.1 (23 Oct 2017)

Updated mapref to v2.2. Fixed several issues with v2.0.
Dropped LTP taxonomy due to low coverage.

1.2 (16 July 2017)

Updated mapref to v2.0, now includes 1.5 million sequences.
Added assert checks.

1.1 (24 April 2017)

Several improvements and bug fixes, updated to latest NCBI taxonomy.

1.0 (14 October 2016)

First release of MAPseq.

mapseq's People

Contributors

Stargazers

Watchers

Forkers

aniag mouradbioinfo taodfang cschu alessiomilanese paulzierep meringlab

mapseq's Issues

Is v 1.2.5 going to be released as package?

Hello, I am trying to install and use MAPSeq for a project and I necessarily need v 1.2.5 to make use of the new -otutable option.

I have tried to install from source, but I am having troubles compiling due apparent lack of zlib headers.
./libs/eutils/efile.h:80:10: fatal error: zlib.h: No such file or directory

I was hoping to use a pre-compiled binary, but the latest release is v 1.2.3 which lacks the option I need.

Are there any plans to release the newest version in binary format?

Mac executable

Hey I wanted to inform you that the link to the mac executable is dead.

I wanted also to ask if it would be possible to make mapseq installable via bioconda. I can help you, but for this I would need a way to download a binary package for mac or build it on mac.

Installation problems

Greetings
was installing MAPseq in ubuntu on windows wsl and when running make I got this error:

etable.cpp: In function ‘estrarrayof<etable> tablesLoad(const estr&, const evarhash&)’:
etable.cpp:242:3: error: ‘egzfile’ was not declared in this scope
   egzfile f;
   ^~~~~~~
etable.cpp:242:3: note: suggested alternative: ‘edcfile’
   egzfile f;
   ^~~~~~~
   edcfile
etable.cpp:260:8: error: ‘f’ was not declared in this scope
   if (!f.open(filename,"r")) { lerror(filename); return(ts); }
        ^
etable.cpp:271:15: error: ‘f’ was not declared in this scope
   if (header) f.readarr(line,headerarr,sep);
               ^
etable.cpp:273:10: error: ‘f’ was not declared in this scope
   while (f.readarr(line,arr,sep)){
          ^
etable.cpp:315:3: error: ‘f’ was not declared in this scope
   f.close();
   ^
etable.cpp: In function ‘etable etableLoad(const estr&, const evarhash&)’:
etable.cpp:362:3: error: ‘egzfile’ was not declared in this scope
   egzfile f;
   ^~~~~~~
etable.cpp:362:3: note: suggested alternative: ‘edcfile’
   egzfile f;
   ^~~~~~~
   edcfile
etable.cpp:372:8: error: ‘f’ was not declared in this scope
   if (!f.open(filename,"r")) { lerror(filename); return(t); }
        ^
etable.cpp:376:15: error: ‘f’ was not declared in this scope
   if (header) f.readarr(line,headerarr,sep);
               ^
etable.cpp:380:10: error: ‘f’ was not declared in this scope
   while (f.readarr(line,arr,sep)){
          ^
Makefile:1105: recipe for target 'etable.lo' failed
make[3]: *** [etable.lo] Error 1
make[3]: Leaving directory '/home/jcarrico/NGStools/MAPseq/libs/eutils'
Makefile:766: recipe for target 'all' failed
make[2]: *** [all] Error 2
make[2]: Leaving directory '/home/jcarrico/NGStools/MAPseq/libs/eutils'
Makefile:683: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/jcarrico/NGStools/MAPseq'
Makefile:461: recipe for target 'all' failed
make: *** [all] Error 2
(base)

Any idea how can this be solved?

Installation error

Hi, I tried installing the latest version from the repo using the instructions in your README file but i am getting a compilation error... After installing dependencies I run:
git clone https://github.com/jfmrod/MAPseq.git
cd MAPseq
./setup.sh
./bootstrap
./configure
make
make install

This runs ok up to ./configure. When running make I get this error:
libtool: compile: g++ -DHAVE_CONFIG_H -I. -DINCLUDEPATH="/usr/local/lib/eutils/include" -DMODULEPATH="/usr/local/lib/eutils" -DWEBHTMLPATH="/usr/local/share/eutils/webhtml" -O3 -pthread -D_GNU_SOURCE -D_DEFAULT_SOURCE -MT ernd.lo -MD -MP -MF .deps/ernd.Tpo -c ernd.cpp -fPIC -DPIC -o .libs/ernd.o
ernd.cpp: In destructor ‘ernd::~ernd()’:
ernd.cpp:164:32: error: ‘hCryptAlg’ was not declared in this scope
BCryptCloseAlgorithmProvider(hCryptAlg, 0);
^~~~~~~~~
ernd.cpp:164:3: error: ‘BCryptCloseAlgorithmProvider’ was not declared in this scope
BCryptCloseAlgorithmProvider(hCryptAlg, 0);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
Makefile:1102: recipe for target 'ernd.lo' failed
I would try and debug a bit more but my knowledge of C++ is very limited.
Many thanks.

MapSeq takes too long in file clustering

Hi João,

I tried using MAPseq to analyze my zymo mock dataset using Greengenes2 database (downloaded here: https://greengenes2.ucsd.edu/) as a reference. After editing the database as instructed, I ran the following code:
./mapseq -fastq [my forward reads.fa] [my reverse reads.fa] [edited GG2 taxonomy database] > [result.fa.mseq] -nthreads 32
It could detect my database but then it proceed to do clustering, which took too long. The program ran for almost a week, then it was finally terminated by my HPC admin and it is not even done yet.

The following screenshot was taken after more than 48 hours running. I didn't take any screenshot after that.

Is there anything to do to resolve this issue? Or maybe I made a mistake?
Thanks in advance

Missing db fasta file

Hi Joao

I just the alpha release of mapseq 2 a try. And running mapseq out of the box fails

 ./mapseq test.fasta
# mapseq v2.0alpha (Feb 26 2020)
E! Fri Jun 12 11:54:08 2020 [] mapseq.cpp:3143 void loadSequences(eseqdb&, int): !error!: fasta db not found: share/mapseq/mapref-2.2.fna

The db in the archive is named differently. This command works

./mapseq test.fasta share/mapseq/mapref-2.2b.fna

Is this the correct database?

README update

Hi there,

I think the README is a bit outdated. Some people installing the software through git clone might have the problem with the eutils if the package subversion is not installed. So an update of README with additional notes would be helpful.

Regards,

Tien Du

Handle non-overlapping paired-end reads

Installation Issue #2

Getting an error following installation protocol:

In file included from ./eutils/emain.h:4:0,
from mapseq.cpp:1:
./eutils/eutils.h:6:28: fatal error: eutils-config.h: No such file or directory
#include "eutils-config.h"
^
compilation terminated.

otutable limited to taxonomy level 3

Any reason why otutable limits the taxonomy level to 3?

Segmentation fault when using '-otutable'

Whenever I use the argument '-otutable' it throws me "Segmentation Fault" error. Can you please help me with this?

Different outputs

Hi João,

I noticed running mapseq with the same config and same input files multiple times. Yet it delivered slightly different outputs.

For instance, using the command: ./../mapseq-1.2.3-linux/mapseq -tophits 80 -topotus 40 -outfmt simple sequences.fasta LSU.fasta slv_lsu_filtered2.txt > output1

on this small sequences file:

one time the output was:

second time:

Is this behavior considered normal? why does it happen?

Best,
Rand

Long queries giving more than one hit per query

Queries larger than 1500bp can give more than one hit per query

Missing newline on EOF causes last sequence to be ignored

Using:
mapseq bad.fasta bad.fasta bad.tax
produces:

# mapseq v1.0 (Jan  1 1970)
# loaded 1 sequences
# loading taxonomy file: bad.tax
# taxonomies: 1
# tax levels: 1
# tax: 0 level: 0 (1)
# fcount: 0 otukmercount: 0
#query  dbhit   bitscore        identity        matches mismatches      gaps    query_start     query_end       dbhit_start     dbhit_end       strand        bad.tax:taxlevel0        combined_cf     score_cf
# processing input... 1
# done processing 1 seqs (0.023071s)

While:
mapseq good.fasta good.fasta good.tax
produces:

# mapseq v1.0 (Jan  1 1970)
# loaded 1 sequences
# loading taxonomy file: good.tax
# taxonomies: 1
# tax levels: 1
# tax: 0 level: 0 (1)
# fcount: 0 otukmercount: 291
#query  dbhit   bitscore        identity        matches mismatches      gaps    query_start     query_end       dbhit_start     dbhit_end       strand        good.tax:taxlevel0       combined_cf     score_cf
# processing input... 1
N1      N1      1256    1       1256    0       0       0       1256    0       1256    +               Escherichia coli        1       1
# done processing 1 seqs (0.022354s)

The only difference between good.tax, good.fasta and bad.tax, bad.fasta is that the bad files are missing a newline character on the last line.

The binary was compiled from:

% svn info                   
Path: .
Working Copy Root Path: /software/mapseq/svn
URL: http://www.matiasrodrigues.com/svn/mapseq
Relative URL: ^/mapseq
Repository Root: http://www.matiasrodrigues.com/svn
Repository UUID: 64c77371-564d-4174-b641-e67ee90e9faa
Revision: 1100
Node Kind: directory
Schedule: normal
Last Changed Author: jfmrod
Last Changed Rev: 1099
Last Changed Date: 2017-03-31 23:53:52 +0200 (Fri, 31 Mar 2017)

Didn't find binary package file and libeutills-1.0.so error

Hi! I use your software and like it very much.
Just tried to install in my machine a few times with no success.
First I tried the the binary package, but didn't find the package file from the git clone download.
Then I installed from source, playing those commands (ubuntu 19.04):

sudo apt install build-essential wget subversion git libncurses5-dev libtool
sudo ./setup.sh 
sudo ./bootstrap 
sudo apt install autotools-dev
sudo apt install autoconf
sudo ./configure
sudo make
sudo make install

But when I run mapseq, the screen print this out:
mapseq: error while loading shared libraries: libeutils-1.0.so: cannot open shared object file: No such file or directory.
Can you help me?

Thanks in advance,
Rondon Neto

Visualization with Krona

Is there a way to visualize the output with krona?

How to define the cutoff values in the taxonomy file?

I'm kinda confused about how the cutoffs must be defined in the taxonomy files. What do the numbers on either side of the double colons mean?

no -outfmt "simple" in v2.0.1alpha

Hi,

I am trying to run mapseq v2.0.1alpha using the old flag -outfmt simple
However, I realized that the output seems to be the one from -outfmt confidences.

I tried again using version 1.2.6, and the -outfmt simple instead works.
I wonder whether there is an issues with the new version or something on my system is not right.
It runs fine but before start processing I get this error:
E! Tue Dec 14 13:36:05 2021 [] ./libs/eutils/ehashmap.h:430 T& ehashmap<K, T, hashfunc>::values(const K&) [with K = estr; T = eclassBase; size_t (* hashfunc)(const K&) = hash_lookup3_estr]: !error!: ehashmap: key not found E! Tue Dec 14 13:36:05 2021 [] eparserinterpreter.cpp:2078 virtual evar ecodeAtomSingle::interpret(estrhashof<evar>&, bool&, int&): !error!: Exception error

I would appreciate any suggestions. Thank you

Certain taxa clutters strict column counts in output

Hi,

I'm making some scripts to convert the output (with confidences) to different visualization tools, such as KronaTools++
Certain taxonomy classifications (B16S, ...) are written with a prepended "\t" in the output, which breaks the assumption that each classification strictly consists of 3 columns pr. level after column 13 [empty]. I have made a quick fix for this for my self, but just thought you should know in case this is a bug that should be fixed in the output.

Improve taxonomy information loading times

Run with third-party taxonomy db

Hi Joao,

Is that possible to run MapSeq with green genes, rdp or silva taxonomy databases?
It seems like they have a different format.

If so, could you please update the readme file as well?

Best
Alex

Improve confidence estimation

Including at least 1 reference sequences from different taxonomic labels at the alignment phase.
This will avoid overconfident estimates in some cases

Some sequences without taxonomy assignment output

Dear Author

I extracted V3V4 from Silva database and use that as the reference database in MAPseq. I removed those sequences with length over 600 bps in the database.

After I annotate my representative sequences with MAPseq, I found that a portion of the sequences were not annotated at all (200 sequences in 4100 were not annotated).

What does this mean? Does this mean that these sequences were unknown? Why MAPseq does not output the results for those sequences?

Add MAPseq to bioconda

Being able to install MAPseq as a bioconda package would be awesome! 🐍 📦

(Vsearch is on available on bioconda, and this has aided it's adoption. )

14922 sequences not found in sequence database