Set of programs to perform taxonomic binning.

License: GNU General Public License v3.0

metabinkit

Set of programs to perform taxonomic binning.

Overview
Conda
Docker
Manual installation
Metabinkit programs
FAQs

Overview

From metagenomic or metabarcoding data, it is often necessary to assign taxonomy to DNA sequences. This is generally performed by aligning sequences to a reference database, usually resulting in multiple database alignments for each query sequence. Using these alignment results, metabinkit assigns a single taxon to each query sequence, based on user-defined percentage identity thresholds. In essence, for each query, the alignments are filtered based on the percentage identity thresholds and the lowest common ancestor for all alignments passing the filters is determined. The metabin program is not limited to BLAST alignments, and can accept alignment results produced using any program, provided the input format is correct. However, functionality is also available to create BLAST databases and to perform BLAST alignments, which can be passed directly to metabin.

Conda

Metabinkit is available as a conda package in Bioconda. Run the following commands to install it:

 conda install -c bioconda metabinkit
 conda activate base

If you encounter problems with the commands above, try installing into a dedicated environment instead:

 conda create -n your_env_name -c bioconda -c conda-forge metabinkit
 conda activate your_env_name

Docker

A docker image with metabinkit is available at DockerHub (https://hub.docker.com/r/envmetagen/metabinkit/tags/). This facilitates the setup and installation of metabinkit, makes it easy to track all software versions used in the analyses, and ensures that only dependency versions compatible with metabinkit are used. See the Docker userguide for more details.

Alternatively, you may install the software from source by following the instructions below. A 64-bit computer with an up-to-date Linux OS is required.

Manual installation

Supported OS

metabinkit is developed and tested on multiple distributions of Linux (e.g. Fedora, Ubuntu). Consider the Docker container if you use a non-supported operating system or operating system version.

Getting sources

Option 1: download the latest source release tarball from https://github.com/envmetagen/metabinkit/releases, and then from your download directory type:

tar xzf metabinkit-x.x.x.tar.gz
cd metabinkit-x.x.x

Option 2: to use git to download the repository with the entire code history, type:

git clone https://github.com/envmetagen/metabinkit.git
cd metabinkit

Installing metabinkit and dependencies

A full installation of metabinkit requires third-party components. A script (install.sh) is provided to facilitate the installation of metabinkit and some of its dependencies; others (R 3.6.0 or above) need to be already installed on the system.

To install metabinkit to the home folder, type:

./install.sh  -i $HOME

A file metabinkit_env.sh will be created in the top-level installation folder ($HOME in the above example) with the configuration setup for the shell. To enable the configuration, load it with the source command, e.g.,

source $HOME/metabinkit_env.sh

This needs to be run each time a terminal is opened; alternatively, add the above line to the $HOME/.bash_profile file.
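
If the line is added to a profile file, it helps to guard against adding it twice. The following is a minimal sketch; the function name add_metabinkit_env is hypothetical and not part of metabinkit, and the default profile path is the one mentioned above.

```shell
# Hypothetical helper (not part of metabinkit): append the source line to a
# profile file only if that exact line is not already present.
add_metabinkit_env() {
  env_file="$1"                        # e.g. $HOME/metabinkit_env.sh
  profile="${2:-$HOME/.bash_profile}"  # profile file to update
  line="source $env_file"
  touch "$profile"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

# Usage: add_metabinkit_env "$HOME/metabinkit_env.sh"
```

Calling the function repeatedly leaves a single copy of the line in the profile.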

To install only certain programs/dependencies use the -x argument, e.g.

./install.sh -i $HOME -x taxonkit

Available options for -x are: taxonkit, blast, metabinkit, R_packages, taxonomy_db

Programs

metabin

Usage: metabin -i filename -o outfile [other options]

run metabin -h for a list of all options and defaults

Expected file formats and contents

The minimum required input for metabin is: -i, --input: a tab-separated file with three compulsory columns: qseqid, pident, and taxids, plus, optionally, seven more columns: K,P,C,O,F,G,S

qseqid: id of the query sequence
pident: the percentage identity of the alignment
taxids: NCBI taxid of the database subject sequence (staxid or staxids are also accepted)
K,P,C,O,F,G,S: kingdom, phylum, class, order, family, genus, species of the database subject sequence

Other columns may be present and will be ignored, unless specified by the --FilterCol argument (see How it Works)
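
Before running metabin on a new table, the header can be checked against the column names listed above. This is a sketch only; check_metabin_header is a hypothetical helper, not a metabinkit program, and it only inspects the first line of the file.

```shell
# Hypothetical pre-flight check (not part of metabinkit): verify that the
# header of a tab-separated file contains the columns metabin requires.
check_metabin_header() {
  head -n 1 "$1" | awk -F'\t' '
    {
      for (i = 1; i <= NF; i++) seen[$i] = 1
      if (!("qseqid" in seen)) missing = missing " qseqid"
      if (!("pident" in seen)) missing = missing " pident"
      # the taxid column may be named taxids, staxid or staxids
      if (!(("taxids" in seen) || ("staxid" in seen) || ("staxids" in seen)))
        missing = missing " taxids"
      if (missing != "") { print "missing:" missing; exit 1 }
      print "ok"
    }'
}

# Usage: check_metabin_header metabinkit/tests/test_files/in0.blast.short.tsv
```

The function prints ok and returns 0 when all compulsory columns are present; otherwise it lists the missing ones and returns 1.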

How it works

[Figure: diagram of the metabin workflow]

Examples

Example 1. Default settings

Input:

$ head metabinkit/tests/test_files/in0.blast.short.tsv 
taxids	qseqid	pident
41217	query1	69.565
148819	query1	73.442
148819	query2	65.775
148819	query3	73.243
148819	query4	69.211
52396	query5	70.629
55837	query5	84.722
55837	query5	84.722
96912	query5	66.897

run metabin

$ metabin -i metabinkit/tests/test_files/in0.blast.short.tsv -o out0.short.bins

Explanation: Do not filter any alignments based on Accession Number or blacklisted taxa. Do not apply any "Top.." thresholds. Attempt to bin alignments with the default %identity thresholds: species-99%, genus-97%, family-95%, above family-90%. Use taxids column to retrieve taxonomy. Output three files: Main results - out0.short.bins.tsv, Information statistics - out0.short.bins.info.tsv, version info - out0.short.bins.versions.txt

screen output (stderr)

metabinkit version: 0.1.8
[info] Starting Binning
[info] Read 12259 entries from in0.blast.short.tsv
 WARNING! missing columns in input table with taxonomic information:K,P,C,O,F,G,S
[info]  Trying to get taxonomic information from the database in /home/tutorial/TOOLS/metabinkit.install/exe/../db/ ...
[info]  taxonomic information retrieval complete.
[info] binning at species level
[info] excluding 11279 entries with pident below 99
[info] applying top threshold of 100
[info] binned 72 sequences at species level
[info] binning at genus level
[info] excluding 8918 entries with pident below 97
[info] applying top threshold of 100
[info] binned 24 sequences at genus level
[info] binning at family level
[info] excluding 8187 entries with pident below 95
[info] applying top threshold of 100
[info] binned 75 sequences at family level
[info] binning at higher-than-family level
[info] excluding 5937 entries with pident below 90
[info] applying top threshold of 100
[info] binned 119 sequences at higher than family level
[info] Total number of binned 290 sequences
[info] not binned 1211 sequences
[info] Complete. 12259 hits from 1501 queries processed in 1.69 mins.
[info] 
Note: By default, if a taxon cannot be assigned at a given taxonomic level the following codes are used to explain the motive:
- mbk:bl-S,mbk:bl-G,mbk:bl-F - taxid blacklisted at species, genus or family (respectively)
- mbk:nb-thr - pident was below the threshold
- mbk:nb-lca - the lowest common ancestor was above this taxonomic level
- mbk:tnf - the taxid was not found in the taxonomy database
If --no_mbk option was used the codes will be NA

[info] binned table written to out0.short.bins.tsv
[info] information stats written to out0.short.bins.info.tsv
[info] Versions info written to out0.short.bins.versions.txt
[info] Binning complete in 1.72 min

view results

$ head -n 4 out0.short.bins.tsv 
qseqid	pident	min_pident	K	P	C	O	F	G	S
query663	100	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
query1227	100	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
query1482	99.265	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana

some other selected results

qseqid	pident	min_pident	K	P	C	O	F	G	S
query283	100	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Unio	Unio elongatulus
query900	99.242	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Anodonta	Anodonta exulcerata
query163	99.265	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Anodonta	Anodonta exulcerata
query487	99.265	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Anodonta	Anodonta exulcerata
query305	99.153	0	Eukaryota	Mollusca	Bivalvia	Veneroida	Corbiculidae	Corbicula	mbk:nb-lca
query592	98.276	0	Eukaryota	Mollusca	Bivalvia	Veneroida	Corbiculidae	Corbicula	mbk:nb-thr
query1494	99.153	0	Eukaryota	Mollusca	Bivalvia	Veneroida	Corbiculidae	Corbicula	mbk:nb-lca
query589	97.842	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	mbk:nb-lca	mbk:nb-thr
query557	99.275	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	mbk:nb-lca	mbk:nb-lca
query762	80.986	NA	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr
query560	71.942	NA	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr	mbk:nb-thr
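
To get a quick overview of a results table such as the one above, the values in the species column can be tallied. A minimal sketch, assuming only the tab-separated layout with an S column shown above; summarise_species_column is a hypothetical helper name.

```shell
# Hypothetical helper: count how the species (S) column was resolved in a
# metabin results table. Taxon names are grouped as "assigned"; mbk:* codes
# and NA are reported as they appear.
summarise_species_column() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "S") col = i; next }
    col {
      key = ($col ~ /^mbk:/ || $col == "NA") ? $col : "assigned"
      n[key]++
    }
    END { for (k in n) print k "\t" n[k] }
  ' "$1" | LC_ALL=C sort
}

# Usage: summarise_species_column out0.short.bins.tsv
```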

Example 2. Custom settings

Inputs:

$ head metabinkit/tests/test_files/in1.blast.short.tsv 
taxids	qseqid	pident	qcovs	saccver	staxid	ssciname	old_taxids	K	P	C	O	F	G	S
41217	query1	69.565	99	KY081324.1	41217	Barbatia lima	41217	Eukaryota	Mollusca	Bivalvia	Arcoida	Arcidae	Barbatia	Barbatia lima
148819	query1	73.442	99	AF305058.1	148819	Scapharca broughtonii	148819	Eukaryota	Mollusca	Bivalvia	Arcoida	Arcidae	Scapharca	Scapharca broughtonii
148819	query2	65.775	99	AF305058.1	148819	Scapharca broughtonii	148819	Eukaryota	Mollusca	Bivalvia	Arcoida	Arcidae	Scapharca	Scapharca broughtonii
148819	query3	73.243	99	AF305058.1	148819	Scapharca broughtonii	148819	Eukaryota	Mollusca	Bivalvia	Arcoida	Arcidae	Scapharca	Scapharca broughtonii
148819	query4	69.211	99	AF305058.1	148819	Scapharca broughtonii	148819	Eukaryota	Mollusca	Bivalvia	Arcoida	Arcidae	Scapharca	Scapharca broughtonii
52396	query5	70.629	100	MF326974.1	52396	Lampsilis siliquoidea	52396	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Lampsilis	Lampsilis siliquoidea
55837	query5	84.722	100	MH349358.1	55837	Unio pictorum	55837	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Unio	Unio pictorum
55837	query5	84.722	100	MH349357.1	55837	Unio pictorum	55837	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Unio	Unio pictorum
96912	query5	66.897	100	AY498702.1	96912	Fusconaia flava	96912	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Fusconaia	Fusconaia flava

From previous experience we have identified entries in GenBank that appear erroneous, so we provide a list of those in a file. In this example we are using GenBank entries flagged by Mioduchowska et al. 2018.

$ head -n 4 metabinkit/tests/test_files/Mioduchowska2018_flaggedAccessions.txt 
KX531007.1
KC706821.1
KJ950123.1
JQ798675.1
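
Conceptually, a filter file combined with a filter column removes every alignment whose value in that column appears in the file. The awk sketch below illustrates the idea outside metabin; it is not metabin's implementation, and filter_by_column is a hypothetical name.

```shell
# Illustration only: drop rows whose value in a named column appears in a
# list file (one value per line).
# Arguments: list file, column name, tab-separated table with a header.
filter_by_column() {
  awk -F'\t' -v colname="$2" '
    FILENAME == ARGV[1] { drop[$1] = 1; next }   # first file: values to exclude
    !header_done {                               # header line of the table
      for (i = 1; i <= NF; i++) if ($i == colname) col = i
      header_done = 1; print; next
    }
    !(col && ($col in drop))                     # print rows not in the list
  ' "$1" "$3"
}

# Usage:
# filter_by_column Mioduchowska2018_flaggedAccessions.txt saccver in1.blast.short.tsv
```

If the named column is absent from the header, the sketch keeps all rows.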

For the purposes of this example, we are certain that Mizuhopecten yessoensis cannot be in our DNA samples. Note that blacklisting should be used with caution. An example of where it could be justified to blacklist a taxon is: 1) the taxon is only known from a distant country, with very little or no chance that it is present in the sampled environment, even as a recent invasive; and 2) the taxon has not been worked on in the laboratory that processed the samples. Note also that if, for example, a file is provided to the --FamilyBL argument, all taxa under each taxid provided will be blacklisted when binning at family level.

$ head metabinkit/tests/test_files/testspecies2exclude.txt 
6573

run metabin

$ metabin -i in1.blast.short.tsv -o out1.blast.short.bins -S 98 -G 94 -F 93 -A 88 --SpeciesBL testspecies2exclude.txt --FilterFile Mioduchowska2018_flaggedAccessions.txt --FilterCol saccver --TopSpecies 2 --TopGenus 2 --TopFamily 2 --TopAF 2 --sp_discard_sp --sp_discard_mt2w --sp_discard_num

Explanation: First remove any alignments that have one of the flagged Accession Numbers in the saccver column. During species-level binning, first remove the species that we have blacklisted. Note that as we only provided a --SpeciesBL, these taxa will still be considered during binning at other levels. Furthermore, during species-level binning do not consider species with "sp.", more than two words, or numbers in their names. Apply a "Top.." threshold of 2 for all taxonomic levels. Attempt to bin alignments with the following %identity thresholds: species-98%, genus-94%, family-93%, above family-88%. Use the K,P,C,O,F,G,S columns as the taxonomy.
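
The three species-name filters can be pictured with a small awk sketch. It only approximates --sp_discard_sp, --sp_discard_mt2w and --sp_discard_num; the matching rules used by metabin itself may differ, and keep_clean_species is a hypothetical name.

```shell
# Illustration only: read species names on stdin (one per line) and print
# those that survive all three filters.
keep_clean_species() {
  awk '
    / sp\./  { next }   # name contains "sp."
    NF > 2   { next }   # name has more than two words
    /[0-9]/  { next }   # name contains a digit
    { print }
  '
}

# Usage: cut -f 15 in1.blast.short.tsv | keep_clean_species | sort -u
```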

screen output (stderr)

metabinkit version: 0.1.8
[1] TRUE
[info] Starting Binning
[info] Read 12259 entries from in1.blast.short.tsv
[info] Filtering table (12259) using saccver column.
[info] Filtered table (12259) using saccver column.
16:29:14.561 [WARN] taxid 2746931 not found
[info] Maximum # Taxa disabled at species level:1
[info] binning at species level
[info] Not considering species with 'sp.'
[info] Not considering species with more than two words
[info] Not considering species with numbers
[info] excluding 10453 entries with pident below 98
[info] applying top threshold of 2
[info] binned 50 sequences at species level
[info] binning at genus level
[info] excluding 8367 entries with pident below 94
[info] applying top threshold of 2
[info] binned 49 sequences at genus level
[info] binning at family level
[info] excluding 7469 entries with pident below 93
[info] applying top threshold of 2
[info] binned 139 sequences at family level
[info] binning at higher-than-family level
[info] excluding 4298 entries with pident below 88
[info] applying top threshold of 2
[info] binned 67 sequences at higher than family level
[info] Total number of binned 305 sequences
[info] not binned 1196 sequences
[info] Complete. 12259 hits from 1501 queries processed in 0.46 mins.
[info] 
Note: By default, if a taxon cannot be assigned at a given taxonomic level the following codes are used to explain the motive:
- mbk:bl-S,mbk:bl-G,mbk:bl-F - taxid blacklisted at species, genus or family (respectively)
- mbk:nb-thr - pident was below the threshold
- mbk:nb-lca - the lowest common ancestor was above this taxonomic level
- mbk:tnf - the taxid was not found in the taxonomy database
If --no_mbk option was used the codes will be NA

[info] binned table written to out1.blast.short.bins.tsv
[info] information stats written to out1.blast.short.bins.info.tsv
[info] Versions info written to out1.blast.short.bins.versions.txt
[info] Binning complete in 0.49 min

view results

$ head out1.blast.short.bins.tsv
qseqid	pident	min_pident	K	P	C	O	F	G	S
query16	98.551	96.551	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
query31	98.529	96.529	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Unio	Unio elongatulus
query122	98.561	96.561	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
query142	100	98	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
query163	99.265	97.265	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Anodonta	Anodonta exulcerata
query251	100	98	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
query263	98.529	96.529	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Unio	Unio elongatulus
query270	98.54	97.27	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Unio	Unio elongatulus
query277	100	98	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana

metabinkit_blast

A wrapper for running BLAST. Minimum input is a fasta file and a BLAST-formatted database.

Usage: metabinkit_blast -f fasta file -D reference_DB -o outfile [options]

run metabinkit_blast -h for a list of all options and defaults

run blastn -help for more details about specific options

Note that the defaults run a "thorough" BLAST. That is: it uses the blastn task and a small word size to increase sensitivity; it uses relaxed gap settings and a high query cover percentage to tell BLAST to only report full-length (or almost full-length) alignments; it uses a low minimum percentage identity; it keeps 100 hits per query. These defaults are targeted towards metabarcoding purposes, i.e. we really only care about close to full length alignments (anything less can be misleading) with many alignments (keeping only a few can be misleading), ranging anywhere from 100-50%, which will help to provide accurate taxonomic binning. The relaxed gap settings mean that almost all queries will result in close to full alignments, otherwise we would get many queries without hits. Note that these defaults are at the expense of requiring more CPU time.

The BLAST can be limited to certain sections of the BLAST-formatted database with the -N and -P options.

metabinkit_blastgendb

A wrapper for generating a BLAST-formatted database. Minimum input is a fasta file.

Usage: metabinkit_blastgendb -f fasta file -t taxid_map -o db [options]

run metabinkit_blastgendb -h for a list of all options and defaults

To allow full functionality of a BLAST database, taxonomic information is required. This can be provided by specifying a file with -T that maps each sequence id to its NCBI taxid (tab-separated). If no mapping file is found, the program will look for taxid=xxxx; in the fasta header after the first space, and consider the word up to the first space or | as the sequence id.
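
The header convention described above can be checked on your own fasta file before building a database. The sketch below extracts a sequence id and taxid from each header in the way the paragraph describes; it is an illustration written for this guide, not code taken from metabinkit_blastgendb, and fasta_taxid_map is a hypothetical name.

```shell
# Illustration only: build a "sequence id <TAB> taxid" map from FASTA headers
# such as ">seq1|extra some description taxid=9606; more text".
fasta_taxid_map() {
  awk '
    /^>/ {
      header = substr($0, 2)
      id = header
      sub(/[ |].*/, "", id)                 # id ends at the first space or "|"
      sp = index(header, " ")
      if (sp == 0) next                     # no description, so no taxid
      desc = substr(header, sp + 1)         # text after the first space
      if (match(desc, /taxid=[0-9]+;/))
        print id "\t" substr(desc, RSTART + 6, RLENGTH - 7)
    }
  ' "$1"
}

# Usage: fasta_taxid_map my_sequences.fasta > taxid_map.tsv
```

Headers without a taxid=xxxx; entry produce no line, which makes missing annotations easy to spot by comparing line counts.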

Checks on the created database can be included by using the -c option. As well as checking that the taxonomic information was correctly added, this performs a small BLAST and checks the results.

FAQs

How do "Top.." thresholds work and what are their effects?

The main %identity thresholds (-S, --Species, -G, --Genus, -F, --Family, -A, --AboveF) are absolute minimum thresholds. In contrast, the "Top.." %identity thresholds (--TopSpecies, --TopGenus, --TopFamily, --TopAF) are relative minimum thresholds, applied after the main %identity thresholds. For each query, the "Top.." cutoff is the %identity of the best hit minus the "Top.." value. In the example below, the best hit is 99.8, so a "Top.." of 2 corresponds to a cutoff of 97.8 and alignments below this are discarded prior to binning. A "Top.." of 5 corresponds to 94.8, so alignments below this are discarded.

qseqid taxids pident
query1 1234 99.8
query1 1234 99.6
query1 12345 97.7
query1 12345 97.6
query1 12345 97.6
query1 123456 94.8
query1 123456 94.8
query1 123456 93.6
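
The arithmetic can be reproduced with a short awk sketch that keeps, per query, only the alignments within the "Top.." window of the best hit. This is an illustration of the rule as described here, not metabin's code; apply_top is a hypothetical name and the input layout is the one used in the example table.

```shell
# Illustration only: keep, for each query, the alignments whose pident lies
# within "top" percentage points of that query's best hit.
# Input: a header line, then whitespace-separated qseqid, taxids, pident.
apply_top() {
  awk -v top="$2" '
    FNR == 1 { next }                        # skip the header in both passes
    NR == FNR {                              # pass 1: best pident per query
      if (!($1 in best) || $3 + 0 > best[$1]) best[$1] = $3 + 0
      next
    }
    $3 + 1e-9 >= best[$1] - top              # pass 2: print rows in the window
  ' "$1" "$1"
}

example=$(mktemp)
cat > "$example" <<'EOF'
qseqid taxids pident
query1 1234 99.8
query1 1234 99.6
query1 12345 97.7
query1 12345 97.6
query1 12345 97.6
query1 123456 94.8
query1 123456 94.8
query1 123456 93.6
EOF
apply_top "$example" 2   # cutoff 97.8: keeps the 99.8 and 99.6 alignments
apply_top "$example" 5   # cutoff 94.8: drops only the 93.6 alignment
rm -f "$example"
```

The small 1e-9 tolerance keeps alignments that sit exactly on the cutoff despite floating-point rounding.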

"Top.." will mainly affect the resolution of the results. The lower the "Top.." value, the greater the number of alignments discarded. As is also required for the main %identity thresholds, "Top.." thresholds should be identified empirically. Below is an illustration of how "Top.." can affect results, when using an identical main %identity.

#Query1
P	C	O	F	pident
phy1	cla1	ord1	fam1	85 
phy1	cla1	ord1	fam1	84
phy1	cla1	ord1	fam1	84
phy1	cla1	ord1	fam1	83
phy1	cla1	ord1	fam2	79
phy1	cla1	ord1	fam2	78
phy1	cla1	ord2	fam3	74
phy1	cla1	ord2	fam3	70
phy1	cla2	ord3	fam4	60

settings	bin	reason
--TopFamily=1,--Family=70	fam1	alignments below 70 are removed, additionally alignments below 84 are removed
--TopFamily=2,--Family=70	fam1	alignments below 70 are removed, additionally alignments below 83 are removed
--TopFamily=5,--Family=70	fam1	alignments below 70 are removed, additionally alignments below 80 are removed
--TopFamily=8,--Family=70	ord1	alignments below 70 are removed, additionally alignments below 77 are removed
--TopFamily=10,--Family=70	ord1	alignments below 70 are removed, additionally alignments below 75 are removed
--TopFamily=15,--Family=70	cla1	alignments below 70 are removed, additionally alignments below 70 are removed
--TopFamily=30,--Family=70	cla1	alignments below 70 are removed, additionally alignments below 55 are removed
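
The way the bin moves up the tree as "Top.." grows can be sketched in awk: apply the main threshold and the "Top.." window, then report the lowest rank on which all remaining alignments agree. This is a simplified illustration for a single query with P, C, O, F columns, not metabin's implementation; lca_bin is a hypothetical name.

```shell
# Illustration only: apply a minimum pident and a "Top" window to one query,
# then print the lowest rank (F, then O, C, P) shared by all kept alignments.
# Input: a header line, then tab-separated columns P, C, O, F, pident.
# Arguments: file, main threshold, "Top" value.
lca_bin() {
  awk -F'\t' -v min="$2" -v top="$3" '
    FNR == 1 { next }                                     # skip the header
    NR == FNR { if ($5 + 0 > best) best = $5 + 0; next }  # pass 1: best hit
    $5 + 0 >= min && $5 + 1e-9 >= best - top {            # pass 2: both filters
      kept++
      for (r = 1; r <= 4; r++) {
        if (kept == 1) first[r] = $r
        else if ($r != first[r]) differs[r] = 1
      }
    }
    END {
      if (!kept) { print "not binned"; exit }
      for (r = 4; r >= 1; r--) {             # from family up to phylum
        ok = 1
        for (k = 1; k <= r; k++) if (differs[k]) ok = 0
        if (ok) { print first[r]; exit }
      }
      print "no common ancestor"
    }
  ' "$1" "$1"
}

# Usage: lca_bin query1.tsv 70 8
```

With the Query1 table above saved as a tab-separated file, a "Top" of 2 gives fam1, 8 gives ord1 and 15 gives cla1.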

Using a very low "Top.." threshold, e.g. 0, may lead to over-classifying the sequence to the incorrect taxonomy, as many very similar alignments will be discarded. Nevertheless, this is still more conservative than, for example, taking only the best alignment, as all alignments that have an identical % identity to the top alignment will be kept. Using a Top threshold is particularly relevant when, for example, a query has a best alignment of 85 % identity and the family level threshold (-F, --Family) is low e.g. 70%; in such a case it is reasonable to apply a --TopFamily threshold to only consider alignments within a certain range of the best alignment, increasing the likelihood of binning at family level.

Why is only the classical seven-rank taxonomy considered?

This is the usual format used in this field of research, and can be extracted from most databases
Version 2 will be extended to include subspecies
Catering for all potential ranks would produce different outputs which would complicate downstream analyses

Why are binning thresholds specifically implemented at species, genus and family ranks, but for above family are combined?

metabin will report the final bins obtained for all ranks, even if they could not be assigned at family rank.
In the classical seven rank taxonomy, the NCBI taxonomy almost always has information at the species, genus and family ranks, but is often missing this information for phylum, class and order rank, making it difficult to apply thresholds at every level. For example, the NCBI taxid 570251, a species of Platyhelminthes, Catenula turgida, has the taxonomy Eukaryota, Platyhelminthes, Catenulida, unknown, Catenulidae, Catenula, Catenula turgida
The --TopAF argument is effectively an order-level threshold, and metabin will assign at order rank where possible (i.e. the lowest common ancestor is at the order rank and this order is not "unknown"). Where order-level assignation fails it will report the lowest common ancestor regardless of the rank.

I have performed alignments, but do not have NCBI taxids, how can I use metabin?

Providing the K,P,C,O,F,G,S columns in the -i, --input file will avoid using the NCBI taxonomy
If you have neither the NCBI taxids nor the K,P,C,O,F,G,S columns and only have taxon names, NCBI taxids can be generated from these using the NCBI TaxIdentifier. Be careful to double check the results make sense, and understand the error codes (e.g. duplicates, not found etc.). Or consider using taxonkit
Consider using metabinkit_blast to align sequences to your reference database. This will output the taxids of the reported alignments.

How do I choose thresholds?

metabin is not a classifier, in that it does not attempt to find optimal binning thresholds. All settings are user-defined. The thresholds should be based on an analysis of the target DNA region. It is possible that a future version of metabinkit will include classifier functionality. For further reading consider exploring:

metabinkit's Issues

metabinkit_blastgendb not creating db

Hi Nuno,

probably a blast issue(?) but I get this error when trying to make a blast database on my laptop. I have seen it before on my laptop, but if you know what the issue might be it would be great to solve it!

I just tried on emg1 and it is fine! (fresh metabinkit installs on both machines)

$metabinkit_blastgendb -f SANGER_PLUS_GENBANK_CUT.fasta -o SANGER_PLUS_GENBANK_CUT_BLASTDB
metabinkit 0.1.7


Building a new DB, current time: 06/12/2020 12:49:44
New DB name:   /media/sf_Documents/WORK/CIBIO/AA_PROJECTS/IRANVERTS/SHEEP/SANGER_PLUS_GENBANK_CUT_BLASTDB
New DB title:  SANGER_PLUS_GENBANK_CUT.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B

No volumes were created.

Error: mdb_env_open: Invalid argument

installing

Hi Nuno, tried to run installer. Worked for a while then..

[INFO] Installing blast...done.
[INFO] Installing metabinkit...
'R/lca.R' -> '/home/tutorial/R/lca.R'
'R/metabinkit.R' -> '/home/tutorial/R/metabinkit.R'
'exe' -> '/home/tutorial/exe'
'exe/metabin' -> '/home/tutorial/exe/metabin'
'exe/taxonkit_children.sh' -> '/home/tutorial/exe/taxonkit_children.sh'
[INFO] Creating /home/tutorial/metabinkit_env.sh...
You may want to consider adding the following line to your .bash_profile file.

source /home/tutorial/metabinkit_env.sh
[INFO] Creating /home/tutorial/metabinkit_env.sh...done.
[INFO] Installing metabinkit...done.
./install.sh: line 247: install_r_packages: command not found

Install error with R>=4

Describe the bug

The install.sh checks for the R version and fails with R version 4.1.2.

To Reproduce
Steps to reproduce the behavior:

Run 'install.sh' with an R with version greater or equal to 4
See error

Expected behavior

Install without errors.

Desktop (please complete the following information):

OS:Linux
Version:latest
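
How install.sh performs its check is not shown in this report. As a sketch of one check that accepts both 3.6.x and 4.x, the version strings can be compared component by component as numbers; version_ge is a hypothetical helper, not taken from install.sh.

```shell
# Hypothetical helper (not from install.sh): succeed if version $1 is greater
# than or equal to version $2, comparing dot-separated components numerically.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "[.]"); nb = split(b, y, "[.]")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0     # missing components count as 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0                             # all components equal
  }'
}

# Usage (the third field of the first line of "R --version" is the version):
# r_version=$(R --version | awk 'NR == 1 { print $3 }')
# version_ge "$r_version" 3.6.0 || echo "R 3.6.0 or above is required" >&2
```

Comparing numbers rather than strings avoids treating 4.1.2 or 3.10.0 as older than 3.6.0.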


unknown vs NA

I am not sure if the distinction is clear, but originally it was like this

No hits passing binning filters = "NA;NA;NA;NA;NA;NA;NA"
Hits passing filter, but common ancestor above kingdom = "unknown;unknown;unknown;unknown;NA;NA;NA" #i.e. "unknown" can never occur at family, genus or species level
Hits passing filter, but common ancestor is class level = "kingdom1;phylum1;class1;unknown;NA;NA;NA"

This is because I used NA to be an indicator that the binning thresholds were not passed. I used unknown only for above family (htf), to indicate that while a common ancestor may have been found it could not be assigned at e.g. order level (=unknown). Comparing metabin vs bin3 we get different outputs in this regard. It appears that this may only be happening at genus level - metabin is giving it "unknown"...in some cases...

example:

qseqid	BIN3	METABIN	IDENTICAL
011afebc-d053-40ed-aa39-764d06884887_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7	NANANANANANANA	NANANANANANANA	IDENTICAL
0122ab94-162e-4198-b044-782b51ae0acd_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD23-A-UNIO-RUN7	NANANANANANANA	NANANANANAunknownNA	NOT
023be541-e63a-44ca-ad19-375672394340_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD24-A-UNIO-RUN7	NANANANANANANA	NANANANANANANA	IDENTICAL
023c6ba8-769d-4ecb-a10e-6ff95081e199_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7	NANANANANANANA	unknownunknownunknownunknownNAunknownNA	NOT
0dd84164-c359-4b25-9a3e-bfeb39ac293b_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD26-A-UNIO-RUN7	BacteriaunknownunknownunknownNANANA	BacteriaunknownunknownunknownNANANA	IDENTICAL

code used

#bin3
bin.blast3(filtered_blastfile = "2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt",
           ncbiTaxDir = "//home/tutorial/TOOLS/metabinkit.install/db/",
             out = "/home/tutorial/TOOLS/metabinkit/tests/test_files/2019_UNIO.bin3.txt",spident = 98,gpident = 95,fpident = 92,abspident = 80)

#metabin
metabin -i 2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt -o 2019_UNIO.metabin.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE

messages

WARNING! missing columns in input table with taxonomic information:K,P,C,O,F,G,S
Seems like error to user. Consider adding "Using taxid to retrieve taxonomy"

Note: If none of the hits for a BLAST query pass the binning thesholds, the results will be NA for all levels.
                 If the LCA for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
                 Queries that had no BLAST hits, or did not pass the filter.blast step will not appear in results.

Consider revising to

Note: If none of the alignments for a query passed the binning thresholds, the results will be NA for all levels. 
If the lowest common ancestor for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels. 
Queries that do not appear in the input table (for example, they had no significant alignments) will not appear in results.

Consider an additional "rm.unclassified" filter

This was part of my filter.blast3 function. It is similar to the consider_sp. option. We often get species-level taxids in results, but the species equate to things like "uncultured nematode" or "environmental eukaryote". These can cause bins to be pushed far back up the tree, so I suggest an option to remove them. The terms used here are just those I noted as I came across them.

#remove low-quality hits whose species name contains an uninformative term
  if(rm.unclassified==T){
    #terms to search for in btab$S (note the trailing space in "fungal ")
    terms<-c("uncultured","environmental","unclassified","unidentified","fungal ","eukaryote")
    message("Removing species containing the terms: uncultured, environmental, 
            unidentified,fungal, eukaryote or unclassified")
    #one case-insensitive pattern covering all terms; grepl() returns a logical
    #vector, so no rows are lost when nothing matches
    btab<-btab[!grepl(paste(terms,collapse = "|"),btab$S,ignore.case = T),,drop=FALSE]
  }

-taxidlist option

Perhaps a personal wish, but I often use the -taxidlist option to limit the blast to a certain taxonomical section of the database (and we used it for mussels). I see you have the negative taxids option, but not the "positive" option. Suggest adding this functionality.

example of previous blast function, in case it helps

> blast.min.bas2
function(infasta,refdb,blast_exec="blastn",wait=T,taxidlimit=NULL,inverse=F,ncbiTaxDir=NULL,overWrite=F,out=NULL,
                         opts=c("-task","blastn","-outfmt", "6 qseqid evalue pident qcovs saccver staxid ssciname sseq","-num_threads", 64,
                                "-max_target_seqs", 100, "-max_hsps",1,"-word_size", 11, "-perc_identity", 50,
                                "-qcov_hsp_perc", 98, "-gapopen", 0, "-gapextend", 2, "-reward", 1, "-penalty", -1)){
  
  t1<-Sys.time()
  
  require(processx)
  
  if(!is.null(taxidlimit)) if(is.null(ncbiTaxDir)) stop("to use taxidlimit, ncbiTaxDir must be supplied")
  if(is.null(out)) out<-paste0(gsub(".fasta", ".blast.txt",infasta))
  
  #refuse to overwrite an existing output file unless overWrite=T
  if(overWrite==F) if(file.exists(out)) stop("The following file already exists: ", out, ". Use overWrite=T to overwrite")
  
  if(!is.null(taxidlimit)){
    
    h<-list()
    
    #generate children of taxids and store in file
    taxid.list<-list()
    
    taxids_fileA<-paste0("taxids",as.numeric(Sys.time()),".txt")
    
    for(i in 1:length(taxidlimit)){
      system2(command = "taxonkit",args = c("list", "--ids", taxidlimit[i], "--indent", '""',"--data-dir",ncbiTaxDir)
              ,wait=T,stdout = taxids_fileA)
      taxid.list[i]<-read.table(taxids_fileA)
    }
    
    write.table(unlist(taxid.list),taxids_fileA,row.names = F,quote = F,col.names = F)
    
    #change options to include taxidlimit
    opts<-c(opts,"-taxidlist",taxids_fileA)
    
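    #the option added above is "-taxidlist"; its negative counterpart for a file of taxids is "-negative_taxidlist"
    if(inverse) opts<-gsub("-taxidlist","-negative_taxidlist",opts,fixed = T)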
  }
  
  #run BLAST  
  
  error.log.file<-paste0("blast.error.temp.processx.file",as.numeric(Sys.time()),".txt")
  
  h<-process$new(command = blast_exec, args=c("-query", infasta, "-db",refdb,opts, "-out", out),echo_cmd = T,
                 stderr = error.log.file)
  
  Sys.sleep(time = 2)
  
  #report PID
  message(paste("PID:",h$get_pid()))
  
  #check immediate exit status
  exits<-h$get_exit_status()
  
  if(1 %in% exits){
    message("************
             There was a problem with ", infasta[match(1,exits)], ", aborting blast
             ************")
    print(grep("Error",readLines(error.log.file),value = T))
    
    h$kill()
  }
  
  if(wait==T){
    h$wait()
    message(readLines(error.log.file))
    message("exit_status=",exits)
    file.remove(error.log.file)
    if(!is.null(taxidlimit)) file.remove(taxids_fileA)
  }
  
  headers<-paste0(paste("'1i",paste(unlist(strsplit(opts[match("-outfmt",opts)+1]," "))[-1],collapse = "\t"),collapse = "\t"),"'")
  
  system2("sed",c("-i", headers, out),wait = T)
  
  t2<-Sys.time()
  t3<-round(difftime(t2,t1,units = "mins"),digits = 2)
  
  message(c("All blasts complete in ",t3," mins."))
  
  return(h)
}

blacklisting should be done separately for each binning level

If I am seeing it properly, it appears that all blacklisting is done on the btab object before binning, rather than for each binning round as before.

I see this as an issue, for example, in the following case:

Two species belong to GenusA
On site, we know we have SpeciesA
On site, we know we do not have SpeciesB, and so it is in our species.blacklist
In our DB, we do not have SpeciesA
In our DB, we have SpeciesB

Thus, in our alignment results, we only have alignments to SpeciesB.

bin3 would do:

create a temporary copy of the table (btab), remove SpeciesB and attempt binning at species level
create a new temporary copy of the table (btab), do not remove GenusA and attempt binning at genus level
Result would be GenusA:NA

metabin would do:

remove SpeciesB from btab
attempt binning at species level
attempt binning at genus level
Result would be NA:NA

By removing all entries matching SpeciesB, metabin is mistakenly removing it from genus level binning.
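
The difference between the two approaches can be sketched in R on a toy table. This is only an illustration of the scenario above, not the actual bin3 or metabin code: `bin_at_rank`, the blacklist vectors and the toy `btab` are all hypothetical.

```r
# Toy alignment table: the only hits for query q1 are to SpeciesB (GenusA).
btab <- data.frame(
  qseqid = c("q1", "q1"),
  G = c("GenusA", "GenusA"),
  S = c("SpeciesB", "SpeciesB"),
  stringsAsFactors = FALSE
)
species_blacklist <- "SpeciesB"   # hypothetical blacklist contents
genus_blacklist <- character(0)   # nothing blacklisted at genus level

# Hypothetical helper: returns the single taxon shared by all remaining hits
# at the given rank, or NA if no hits remain or the hits disagree.
bin_at_rank <- function(tab, rank) {
  taxa <- unique(tab[[rank]])
  if (length(taxa) == 1) taxa else NA_character_
}

# Per-level blacklisting: each level filters its own temporary copy of btab.
per_level <- c(
  G = bin_at_rank(btab[!btab$G %in% genus_blacklist, , drop = FALSE], "G"),
  S = bin_at_rank(btab[!btab$S %in% species_blacklist, , drop = FALSE], "S")
)

# Global blacklisting: btab is filtered once, then every level is binned.
filtered <- btab[!btab$S %in% species_blacklist &
                 !btab$G %in% genus_blacklist, , drop = FALSE]
global <- c(G = bin_at_rank(filtered, "G"), S = bin_at_rank(filtered, "S"))

print(per_level)  # genus is kept, species is NA
print(global)     # both NA
```

With per-level filtering the species blacklist only affects the species round, so the genus assignment survives.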

-s is an available flag, but I think is not catered for in code

double check the family level unknown selection

## Bastian: not sure if the following conditions are correct
    btab.f<-btab[btab$F!="unknown" || btab$G!="unknown" || btab$S!="unknown",,drop=FALSE]
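
One thing worth checking in that line: `||` is R's scalar operator. In older R versions it silently uses only the first element of each vector, and from R 4.3.0 it is an error for vectors longer than one. If the intent is "keep rows where at least one of F, G, S is known" (an assumption, the intended condition is exactly what is in question here), an element-wise version on a toy table would look like this:

```r
# Toy table, values are for illustration only.
btab <- data.frame(
  F = c("Leporidae", "unknown", "unknown"),
  G = c("Lepus", "unknown", "Gazella"),
  S = c("Lepus tibetanus", "unknown", "unknown"),
  stringsAsFactors = FALSE
)

# Element-wise OR: one TRUE/FALSE per row, TRUE if any of F, G, S is known.
keep <- btab$F != "unknown" | btab$G != "unknown" | btab$S != "unknown"
btab.f <- btab[keep, , drop = FALSE]

print(btab.f)  # rows 1 and 3 remain, the all-unknown row is dropped
```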

When only one hit exists OR only one hit passes -S filter, metabin is not assigning at species level

(the good news: metabin is installing and running well, and is much faster than bin.blast3!)

Example of issue:

Input: tests/test_files/2020_01.blast.filt.txt
metabin output: tests/test_files/2020_01.metabins.txt.tsv
bin.blast3 output: tests/test_files/2020_01.bins3.txt

This OTU (AllMergedReads.Uniq.1004;size=8) has only one hit in the blast.filt.txt file.

taxids	qseqid	evalue	staxid	pident	qcovs	min_pident	old_taxids	K	P	C	O	F	G	S
1701145	AllMergedReads.Uniq.1004;size=8	1.24E-45	1701145	99.057	100	98.06643	1701145	Eukaryota	Chordata	Mammalia	Lagomorpha	Leporidae	Lepus	Lepus tibetanus

bin.blast3 bins it as

Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus

metabin bins it as

Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA

bin.blast3 command (top=default of 1 for all levels)

bin.blast3(filtered_blastfile = "2020_01.blast.filt.txt",
           ncbiTaxDir = "/home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/May2020/",
           out = "2020_01.bins3.txt",spident = 98,gpident = 95,fpident = 92,abspident = 80)

metabin command:
metabin -i 2020_01.blast.filt.txt -o 2020_01.metabins.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE

I started going through a few more differences, but so far all have the same likely reason. Note this also occurs when there are originally multiple hits, but only one hit survives -S filter:

qseqids that have different bin outcomes	possible reason	METABIN	BIN.BLAST3
AllMergedReads.Uniq.1004;size=8	one hit only	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus
AllMergedReads.Uniq.1009;size=8	one hit only	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus
AllMergedReads.Uniq.1015;size=8	only one hit remaining after S filter	Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Gazella;NA	Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Gazella;Gazella bennettii
AllMergedReads.Uniq.1032;size=8	one hit only	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus
AllMergedReads.Uniq.1034;size=8	only one hit remaining after S filter	Eukaryota;Chordata;Mammalia;Artiodactyla;Camelidae;Camelus;NA	Eukaryota;Chordata;Mammalia;Artiodactyla;Camelidae;Camelus;Camelus dromedarius
AllMergedReads.Uniq.1041;size=7	only one hit remaining after S filter	Eukaryota;Chordata;Mammalia;Carnivora;Felidae;NA;NA	Eukaryota;Chordata;Mammalia;Carnivora;Felidae;Panthera;Panthera tigris
AllMergedReads.Uniq.1046;size=7	one hit only	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;NA	Eukaryota;Chordata;Mammalia;Lagomorpha;Leporidae;Lepus;Lepus tibetanus

Metabin info

metabinkit v0.0.1
[info] Starting Binning
[info] binning at species level
[info] excluding 32253 entries with pident below 98
[info] Not considering species with 'sp.', numbers or more than one space
[info] applying species top threshold of 1
[info] binned 0 sequences at species level
[info] binning at genus level
[info] excluding 8720 entries with pident below 95
[info] applying genus top threshold of 1
[info] binned 1529 sequences at genus level
[info] binning at family level
[info] excluding 1138 entries with pident below 92
[info] applying family top threshold of 1
[info] binned 826 sequences at family level
[info] binning at higher-than-family level
[info] excluding 36 entries with pident below 80
[info] applying htf top threshold of 1
[info] binned 473 sequences at higher than family level
[info] Total number of binned 2828 sequences
[info] not binned 0 sequences
[info] Complete. 134002 hits from 2828 queries processed in 1.54 mins.
[info] Note: If none of the hits for a BLAST query pass the binning thesholds, the results will be NA for all levels.
                 If the LCA for a query is above kingdom, e.g. cellular organisms or root, the results will be 'unknown' for all levels.
                 Queries that had no BLAST hits, or did not pass the filter.blast step will not appear in results.  
[info] binned table written to 2020_01.metabins.txt.tsv
$total_hits
[1] 134002

$total_queries
[1] 2828

$binned.species.level
[1] 0

$binned.genus.level
[1] 1529

$binned.family.level
[1] 826

$binned.htf.level
[1] 473

$not.binned
[1] 0

[info] Binning complete in 1.57 min
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=pt_PT.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_PT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_PT.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_PT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8 optparse_1.6.6   

loaded via a namespace (and not attached):
[1] compiler_3.6.0 magrittr_1.5   tools_3.6.0    getopt_1.20.3  stringi_1.4.6 
[6] stringr_1.4.0

ensure blast 2.9.0 or above is being used

Sorry if I missed this, but all the taxid options in blastn are only available from 2.9.0 onwards, so perhaps this should be a requirement. Also, these options will only work with version 5 databases (which is the default for makeblastdb in 2.10.0; not sure about 2.9.0).
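
A version check along these lines could be done in R before running blastn. This is a rough sketch: `parse_blast_version` and `blast_supports_taxids` are hypothetical helpers, and it assumes the first line of `blastn -version` looks like `blastn: 2.9.0+`.

```r
# Hypothetical helper: pull the x.y.z version out of a line such as
# "blastn: 2.9.0+" and return it as a comparable numeric_version.
parse_blast_version <- function(version_line) {
  m <- regmatches(version_line,
                  regexpr("[0-9]+\\.[0-9]+\\.[0-9]+", version_line))
  if (length(m) == 0) stop("Could not parse BLAST version from: ", version_line)
  numeric_version(m)
}

# Hypothetical helper: TRUE if the reported version is at least `minimum`.
blast_supports_taxids <- function(version_line, minimum = "2.9.0") {
  parse_blast_version(version_line) >= numeric_version(minimum)
}

# In practice the line could come from the installed program, e.g.
# version_line <- system2("blastn", "-version", stdout = TRUE)[1]
print(blast_supports_taxids("blastn: 2.9.0+"))
print(blast_supports_taxids("blastn: 2.6.0+"))
```

`numeric_version` compares component by component, so 2.10.0 is correctly treated as newer than 2.9.0, which a plain string comparison would get wrong.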

Odd warning about taxid missing even though taxdb up to date

See the readme: Example 2. Custom settings. The error is in the output:

16:29:14.561 [WARN] taxid 2746931 not found

The file passed to --SpeciesBL (testspecies2exclude.txt) consists of

$ cat testspecies2exclude.txt
6573

full command (using the files in metabinkit/tests/test_files)
$ metabin -i in1.blast.short.tsv -o out1.blast.short.bins -S 98 -G 94 -F 93 -A 88 --SpeciesBL testspecies2exclude.txt --FilterFile Mioduchowska2018_flaggedAccessions.txt --FilterCol saccver --TopSpecies 2 --TopGenus 2 --TopFamily 2 --TopAF 2 --sp_discard_sp --sp_discard_mt2w --sp_discard_num

Line 69 in exe/metabinkit_blast

presume typo:

metabibkit_blast

should be

metabinkit_blast

(didn't make any changes)

using top of 0.1

when applying top of 0.1 the message is

[info] applying top threshold of 0

Is it just an issue with the message or is the threshold being rounded down?

NCBI: sometimes there are spaces in order, family, genus etc.

Hi Nuno,
I was just looking at mussels results and noticed that sometimes bacteria get weird designations

e.g. taxid 1884634 has the 7-level taxonomy of:

K | P | C | O | F | G | S
Bacteria | Actinobacteria | Actinobacteria | Candidatus Nanopelagicales | Candidatus Nanopelagicaceae | Candidatus Nanopelagicus | Candidatus Nanopelagicus limnes

Note the spaces at order, family and genus level (and two spaces at species level). This definitely annoys my other scripts. Honestly, I don't know how it will affect things here in the end, but I wanted to have it here as a note.
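
For reference, counting the spaces per name in R is straightforward. The sketch below uses the lineage above. The metabin log mentions not considering species with "more than one space", so the last lines flag which names would exceed that limit if the same rule were applied to them (whether and how metabin applies it to this taxon is not something I have checked).

```r
# Names taken from the taxid 1884634 lineage quoted above.
lineage <- c(
  O = "Candidatus Nanopelagicales",
  F = "Candidatus Nanopelagicaceae",
  G = "Candidatus Nanopelagicus",
  S = "Candidatus Nanopelagicus limnes"
)

# Number of literal spaces in each name.
n_spaces <- lengths(regmatches(lineage, gregexpr(" ", lineage, fixed = TRUE)))
print(unname(n_spaces))

# Names with more than one space.
more_than_one_space <- unname(lineage[n_spaces > 1])
print(more_than_one_space)
```

When reading such tables it also helps to set the separator explicitly to tab, so that spaces inside names are never treated as field breaks.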

makeblastdb checks

This may be going one step too far, but I like to run a test blast as part of the makeblastdb command, i.e. take the first sequence of the fasta and blast it against the newly created db to ensure it was created correctly. I also run the blastdbcheck command to check the blastdb and to further ensure all seqs have taxids.

from my own makeblastdb function:

if(do.checks){
  message("Running test blast")
  phylotools::dat2fasta(head(tempfasta,n=1),gsub(".fasta",".blastdbformatted.test.fasta",infasta))
  h<-blast.min.bas(gsub(".fasta",".blastdbformatted.test.fasta",infasta),refdb = gsub(".fasta","",infasta))
  message("Does new db have taxids - column V3?")
  print(data.table::fread(gsub(".fasta",".blastdbformatted.test.blast.txt",infasta)))
  system2(command = "blastdbcheck",args = c("-must_have_taxids","-db",gsub(".fasta","",infasta)))
  }

excluding specific hits

Often, after binning, mis-classified entries in GenBank become evident. It would be nice to be able to exclude these, i.e. if the user keeps track of accession numbers found to be mis-classified, they can permanently use that list for their binning, and the list can grow. This would require sseqid to be a required column.
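
Something like the following is what I have in mind, as a minimal sketch. `exclude_flagged` is a hypothetical helper and the identifiers in the toy table are made up.

```r
# Toy alignment table; sseqid values are invented for illustration.
btab <- data.frame(
  qseqid = c("q1", "q1", "q2"),
  sseqid = c("AB000001.1", "AB000002.1", "AB000003.1"),
  pident = c(99.1, 98.7, 97.5),
  stringsAsFactors = FALSE
)

# The user's growing list of flagged accessions; in practice this could be
# read from a one-column text file with readLines().
flagged <- c("AB000002.1")

# Hypothetical helper: drop every hit whose identifier is in the flagged list.
exclude_flagged <- function(tab, flagged, col = "sseqid") {
  if (!col %in% colnames(tab)) stop("Column '", col, "' not found in table")
  tab[!tab[[col]] %in% flagged, , drop = FALSE]
}

btab.clean <- exclude_flagged(btab, flagged)
print(btab.clean)
```

Keeping the column name as an argument means the same list could be matched against another identifier column if sseqid is not available.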

README and documentation

test_files

Added test files to a new test_files folder. Includes one UNIO set (blast.filt+bins) and one VENE set. Also added disabled_taxa.txt (They are all tsv format)

blacklisting if NCBI taxids not being used

The current "get children from taxids" approach will not work if people are using their own taxonomies (by providing the KPCOFGS columns).

But, by the time metabin reaches the blacklisting step, it will always have the KPCOFGS columns.

Perhaps it would be better to use names rather than taxids for this blacklisting (and they are possibly easier for user to provide). Downside: ambiguities with user-defined taxon names vs NCBI taxon names. Very difficult. So perhaps we need both options...

install problem? metabinkit_blast and metabinkit_blastgendb

With metabinkit installed and sourced, I try:

$ metabinkit_blast 
bash: /home/tutorial/TOOLS/metabinkit.install/exe/metabinkit_blast: /bin/env: bad interpreter: No such file or directory

similar for

$ metabinkit_blastgendb -h
bash: /home/tutorial/TOOLS/metabinkit.install/exe/metabinkit_blastgendb: /bin/env: bad interpreter: No such file or directory

"tabbing" metabin I get

$ metabin
metabin                metabinkit_blast       metabinkit_blastgendb

contents of install folder/exe are:

tutorial@tutorial-VirtualBox:~/TOOLS/metabinkit.install/exe$ ls
metabin  metabinkit_blast  metabinkit_blastgendb  metabinkit_shared.sh  taxonkit_children.sh

new error

Hi, it installed OK now, but I got this:

tutorial@tutorial-VirtualBox:~$ metabin -h
metabinkit v0.0.1
Error in library(optparse) : there is no package called ‘optparse’
Execution halted

change default tops to be "off"

I think it is a little dangerous to have a strict top (e.g. 1) as default. Actually would prefer 100, so that this is effectively not used, unless specified by user.
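
To illustrate why 100 would effectively switch it off, here is a sketch of a top filter. It assumes that "top" means: for each query, keep only hits whose pident is within `top` percentage points of that query's best hit. That reading, the helper `apply_top` and the toy values are my assumptions, not taken from the metabin code.

```r
# Toy hits for a single query.
btab <- data.frame(
  qseqid = c("q1", "q1", "q1"),
  pident = c(99.5, 99.0, 95.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep hits within `top` points of each query's best pident.
apply_top <- function(tab, top) {
  best <- ave(tab$pident, tab$qseqid, FUN = max)  # best pident per query
  tab[tab$pident >= best - top, , drop = FALSE]
}

print(nrow(apply_top(btab, 1)))    # strict: only hits at or above 98.5 remain
print(nrow(apply_top(btab, 100)))  # effectively off: every hit remains
```

Under this reading, a strict default silently discards lower-identity hits before the lowest common ancestor is computed, which is why I would rather have it off unless the user asks for it.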

min_pident seems wrong (negative numbers and 0s)

See the README.md for example. Here is output again:

$ head -4 out0.bins.tsv 
qseqid	pident	min_pident	K	P	C	O	F	G	S
6fcff7c8-2031-4e3a-a8f0-72dc2da71c79_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD11-A-UNIO-RUN7	97.015	-2.985	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
d36ef3ba-f3d5-4952-b683-301f1a959cfa_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD11-A-UNIO-RUN7	100	0	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
9ef96e73-a5b6-4c4f-bc59-2b8238281d77_runid=407cb32920f83b2252d840c6a949244d8c2a3bb9_ss_sample_id=Mussels-ITD24-A-UNIO-RUN7	97.059	-2.941	Eukaryota	Mollusca	Bivalvia	Unionida	Unionidae	Sinanodonta	Sinanodonta woodiana
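
A quick check in R on the three rows above shows that each reported min_pident equals pident minus 100. This is only an observation from these rows and says nothing certain about where in the code the value goes wrong.

```r
# pident and min_pident copied from the three output rows above.
out <- data.frame(
  pident = c(97.015, 100, 97.059),
  min_pident = c(-2.985, 0, -2.941)
)

# Compare with a numeric tolerance rather than exact equality.
matches_offset <- isTRUE(all.equal(out$pident - 100, out$min_pident))
print(matches_offset)
```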

ERROR checking new db using metabinkit_blastgendb

Hi,

Everything seems to work and pass the checks, but the command reports an error.

SANGER_PLUS_GENBANK_CUT.zip

$ metabinkit_blastgendb -f SANGER_PLUS_GENBANK_CUT.fasta -o /home/tutorial/testgenDB -c
metabinkit 0.1.7


Building a new DB, current time: 06/16/2020 20:13:25
New DB name:   /home/tutorial/testgenDB
New DB title:  SANGER_PLUS_GENBANK_CUT.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 23 sequences in 0.0220871 seconds.


Checking database...
Writing messages to <stdout> at verbosity (Summary)
ISAM testing is ENABLED.
Legacy testing is DISABLED.
TaxID testing is ENABLED.
By default, testing 200 randomly sampled OIDs.

Testing 1 volume(s).
 Result=SUCCESS. No errors reported for 1 volume(s).
Testing 0 alias(es).
 Result=SUCCESS. No errors reported for 0 alias(es).
metabinkit version: 0.1.7
Warning: [blastn] Examining 5 or more matches is recommended
INFO: blast finished.
ERROR: query newly created database with /tmp/tmp.Uh0BHtj3NF did not work as expected
qseqid	evalue	pident	qcovs	saccver	staxids	ssciname	sseqid
WC_KR059210.1	4.41e-27	100.000	100	WC_KR059222.1	9922	Capra	WC_KR059222.1

Add option to remove in-silico generated NCBI accessions

Certain codes are used by NCBI to distinguish PREDICTED in-silico sequences. I wonder if we could add an option to remove these? The codes are XM_, XR_, XP_ so the option could be -rm_predicted [colname] something like

 if(length(grep("XM_.*",btab$colname))>0) btab<-btab[-grep("XM_.*",btab$colname),] 
  if(length(grep("XR_.*",btab$colname))>0) btab<-btab[-grep("XR_.*",btab$colname),] 
  if(length(grep("XP_.*",btab$colname))>0) btab<-btab[-grep("XP_.*",btab$colname),]
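
The three lines could also be collapsed into one anchored pattern with logical indexing. A sketch, where `remove_predicted` is a hypothetical name and the toy accessions are made up (apart from the prefixes):

```r
# Hypothetical helper: drop rows whose accession starts with XM_, XR_ or XP_.
remove_predicted <- function(tab, colname) {
  predicted <- grepl("^(XM|XR|XP)_", tab[[colname]])
  tab[!predicted, , drop = FALSE]
}

# Toy table for illustration.
btab <- data.frame(
  saccver = c("XM_000001.1", "KR059210.1", "XR_000002.1", "NM_000003.1"),
  pident = c(99, 98, 97, 96),
  stringsAsFactors = FALSE
)

btab.kept <- remove_predicted(btab, "saccver")
print(btab.kept$saccver)
```

Logical indexing returns the table unchanged when nothing matches, so the `length(grep(...))>0` guards are no longer needed, and the `^` anchor avoids matching "XM_" in the middle of an identifier.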

provide error when taxids not found

When the taxonomy dump used is older than the BLAST run (or whatever was used to get the taxids), some taxids will often not be found, leading to NAs.

When this happens, I think the program should STOP (or at least provide obvious warning) and report an error like:

"Some taxids were not found in the taxonomy database, consider updating NCBI taxonomy database by running ./install -i your_metabinkit_install_directory -x taxonomy_db"

example in R

a<-data.table::fread("2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.txt",data.table = F)
b<-add.lineage.df(a,ncbiTaxDir = "/home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/") #an old taxonomy folder

#some stderr output
11:39:50.515 [WARN] taxid 1823760 was deleted
11:39:50.540 [WARN] taxid 1936990 was deleted
11:39:50.591 [WARN] taxid 2563896 was deleted
11:39:50.641 [WARN] taxid 2714934 not found
11:39:50.642 [WARN] taxid 2715212 not found
11:39:50.642 [WARN] taxid 2715678 not found
11:39:50.643 [WARN] taxid 2715735 not found

Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = "unknown") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = "unknown") :
  invalid factor level, NA generated

In metabin

metabin -i 2019_August_002.UNIO.lenFilt.trimmed.ids.SC4.pol.blast.filt.nopaths.csv -o 2019_UNIO.metabins.new.nopath.txt -S 98 -G 95 -F 92 -A 80 --discard_sp TRUE -D /home/tutorial/TOOLS/DBS/ncbi_taxonomy/taxdump/


#some output

11:55:47.603 [WARN] taxid 2721245 not found
11:55:47.603 [WARN] taxid 2721246 not found
11:55:47.603 [WARN] taxid 2722751 not found
11:55:47.604 [WARN] taxid 2724150 not found
11:55:47.604 [WARN] taxid 2724191 not found
11:55:47.604 [WARN] taxid 2724192 not found

#but program completes
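
A check of this kind could be sketched as follows. It assumes that an unresolved taxid ends up with NA in every rank column after the lineage lookup. `check_taxids` is a hypothetical helper, and the message text is the one suggested above.

```r
# Hypothetical helper: stop (or warn) if any hit has no lineage at all.
check_taxids <- function(tab, rank_cols = c("K", "P", "C", "O", "F", "G", "S"),
                         on_missing = c("stop", "warning")) {
  on_missing <- match.arg(on_missing)
  unresolved <- rowSums(!is.na(tab[, rank_cols, drop = FALSE])) == 0
  if (any(unresolved)) {
    msg <- paste0(
      "Some taxids were not found in the taxonomy database (",
      paste(unique(tab$taxids[unresolved]), collapse = ", "),
      "), consider updating NCBI taxonomy database"
    )
    if (on_missing == "stop") stop(msg) else warning(msg)
  }
  invisible(tab)
}

# Toy table: the second taxid is one of those reported as not found above.
tab <- data.frame(
  taxids = c(1701145, 2714934),
  K = c("Eukaryota", NA), P = c("Chordata", NA), C = c("Mammalia", NA),
  O = c("Lagomorpha", NA), F = c("Leporidae", NA), G = c("Lepus", NA),
  S = c("Lepus tibetanus", NA),
  stringsAsFactors = FALSE
)

# Catch the error here only so the example keeps running.
result <- tryCatch(check_taxids(tab), error = function(e) conditionMessage(e))
print(result)
```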
