
derrickwood / kraken2


The second version of the Kraken taxonomic sequence classification system

License: MIT License

Shell 4.71% Perl 11.76% Makefile 0.51% C++ 61.04% CMake 0.43% C 0.21% Python 21.35%

kraken2's Introduction


Kraken 2

The second version of the Kraken taxonomic sequence classification system

Please refer to the Operating Manual (in docs/MANUAL.html) for details on how to use Kraken 2.

Publications

For additional implementation details and guidance on using Kraken 2, you can also refer to:

Wood, D.E., Lu, J. and Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019).

kraken2's People

Contributors

benlangmead, bgruening, ch4rr0, derrickwood, dfornika, essut, jenniferlu717, martin-steinegger, miguelpmachado, phhere


kraken2's Issues

no mpa-report with 16S databases

Hey,

I classified my 16S reads with "kraken2 --db $DB_Silva --threads 5 --output $out/20170519_Silva.kraken --report $out/20170519_Silva.mpareport --use-mpa-style $infile1". Although I got a normal kraken output file, I only obtained an empty mpa-report; the same happened with the other 16S databases (RDP and Greengenes). I tried the same command with a custom database that I use for my shotgun samples and got a normal mpa-report, so I would guess that the 16S databases are the problem here. I noticed that the taxonomy directory of the 16S databases contains only two files (names.dmp and nodes.dmp), in contrast to the custom database, which has 22 files in that folder. Do you have an idea what could cause the problem here?

Thank you
Josephine

Availability via Conda

As far as I can tell, kraken2 isn't available from conda, although kraken1 is. I'm mainly interested in a conda package because my research group uses conda for installing and managing (almost) all of our bioinformatics software, and I would prefer to use the latest version of Kraken. Are you planning (or currently working on) a submission to conda?

Thanks.

feature: accession blacklist

I have a list of 303 accession IDs in protozoa that apparently consist entirely of sequences that produce hits from other domains (and so would give false positives when using protozoa). I would now like a feature where kraken, when building a database, looks up IDs in my blacklist so that those accessions are no longer included in any new database build. Does that sound reasonable?
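
A possible interim workaround, until such a feature exists, is to filter the library FASTA against the blacklist before adding it. The sketch below is a hypothetical helper (not part of kraken2-build) and assumes the blacklist holds one accession ID per line:

#!/usr/bin/env perl
# filter_blacklist.pl <blacklist.txt> <library.fna>  (hypothetical helper)
# Drops FASTA records whose accession appears in the blacklist.
use strict;
use warnings;

my ($blacklist_file, $fasta_file) = @ARGV;

# Load blacklisted accession IDs (one per line), ignoring version suffixes
my %blacklist;
open my $bl, "<", $blacklist_file or die "can't open $blacklist_file: $!";
while (<$bl>) {
  chomp;
  s/\.\d+$//;
  $blacklist{$_} = 1 if length;
}
close $bl;

# Stream the FASTA and skip records whose accession is blacklisted
my $skip = 0;
open my $fa, "<", $fasta_file or die "can't open $fasta_file: $!";
while (<$fa>) {
  if (/^>(\S+)/) {
    (my $acc = $1) =~ s/\.\d+$//;
    $skip = $blacklist{$acc} ? 1 : 0;
  }
  print unless $skip;
}
close $fa;

The filtered output could then be added with kraken2-build --add-to-library as usual.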

--special silva : No such file ‘tax_slv_ssu_nr_132.acc_taxid’

The first file downloaded fine, but I don't think the second one exists.

This looks like it might be the right file, without the 'nr':

ftp://ftp.arb-silva.de//release_132/Exports/taxonomy/tax_slv_ssu_132.acc_taxid

kraken2-build  --db silva --threads 36 --special silva

--2018-06-28 14:48:38--  ftp://ftp.arb-silva.de//release_132/Exports/SILVA_132_SSURef_Nr99_tax_silva.fasta.gz
           => ‘SILVA_132_SSURef_Nr99_tax_silva.fasta.gz’
SILVA_132_SSURef_Nr99_ta 100%[=================================>] 229.21M   102KB/s    in 36m 10s
2018-06-28 15:24:52 (108 KB/s) - ‘SILVA_132_SSURef_Nr99_tax_silva.fasta.gz’ saved [240343558]

--2018-06-28 15:24:59--  ftp://ftp.arb-silva.de//release_132/Exports/taxonomy/tax_slv_ssu_nr_132.acc_taxid
           => ‘tax_slv_ssu_nr_132.acc_taxid’
==> PASV ... done.    ==> RETR tax_slv_ssu_nr_132.acc_taxid ...
No such file ‘tax_slv_ssu_nr_132.acc_taxid’.

Support KRAKEN2_CMDLINE_OPTIONS

Some examples:

export KRAKEN2_CMDLINE_OPTIONS="--memory-mapping"

export KRAKEN2_CMDLINE_OPTIONS="--output - --use-names --gzip-compressed"

I'm not sure how to handle clashes with the other $KRAKEN2_* environment variables, etc.
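
For illustration only, a minimal sketch of how the Perl wrapper could honor such a variable (this is not current kraken2 behavior; it assumes the extra options are simply prepended to @ARGV before normal option parsing, so flags given explicitly on the command line are parsed later and typically win):

# Hypothetical snippet near the top of the kraken2 wrapper, before option parsing
if (defined $ENV{"KRAKEN2_CMDLINE_OPTIONS"} && $ENV{"KRAKEN2_CMDLINE_OPTIONS"} =~ /\S/) {
  unshift @ARGV, split " ", $ENV{"KRAKEN2_CMDLINE_OPTIONS"};
}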

How can I extract the reads from a specific taxonomic group?

Dear Developer,

Kraken 2 gives the classification status and taxonomic ID for each sequence, and a report with aggregate counts per clade is also generated. But how can I extract the classified reads of a specific taxonomic group for the subsequent assembly? Is there any script included with kraken2 that can be used for this?
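
For what it's worth, a minimal standalone sketch of such an extraction (a hypothetical helper, not a bundled script; it assumes the standard Kraken 2 output format, i.e. tab-separated columns starting with C/U, read ID, taxid, and a plain uncompressed FASTQ):

#!/usr/bin/env perl
# extract_reads_by_taxid.pl <taxid> <kraken2_output> <reads.fastq>  (hypothetical helper)
# Prints the FASTQ records whose reads were assigned exactly <taxid>.
# Extracting a whole clade would additionally require walking nodes.dmp
# to collect every descendant taxid.
use strict;
use warnings;

my ($want_taxid, $kraken_out, $fastq) = @ARGV;

# Collect the IDs of reads classified to the requested taxid
my %keep;
open my $k, "<", $kraken_out or die "can't open $kraken_out: $!";
while (<$k>) {
  my ($status, $read_id, $taxid) = split /\t/;
  $keep{$read_id} = 1 if $status eq "C" && $taxid == $want_taxid;
}
close $k;

# Stream the FASTQ and print matching 4-line records
open my $fq, "<", $fastq or die "can't open $fastq: $!";
while (my $hdr = <$fq>) {
  my @rec = ($hdr, scalar <$fq>, scalar <$fq>, scalar <$fq>);
  my ($id) = $hdr =~ /^@(\S+)/;
  print @rec if defined $id && $keep{$id};
}
close $fq;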

kraken2-build --add-to-library too restrictive if sequences without acc.no present

Given a file containing thousands of sequences with accession numbers plus one sequence without an accession number, kraken2-build --add-to-library will refuse to add the file. This is not necessary, especially because scan_fasta_file.pl takes a --lenient switch that causes it to ignore sequences without accession numbers. All it takes is adding that switch to the invocation of scan_fasta_file.pl in add_to_library.sh:

diff --git a/scripts/add_to_library.sh b/scripts/add_to_library.sh
index e4987e0..876d79b 100755
--- a/scripts/add_to_library.sh
+++ b/scripts/add_to_library.sh
@@ -25,7 +25,7 @@ fi
 add_dir="$LIBRARY_DIR/added"
 mkdir -p "$add_dir"
 prelim_map=$(cp_into_tempfile.pl -t "prelim_map_XXXXXXXXXX" -d "$add_dir" -s txt /dev/null)
-scan_fasta_file.pl "$1" > "$prelim_map"
+scan_fasta_file.pl --lenient "$1" > "$prelim_map"
 
 filename=$(cp_into_tempfile.pl -t "XXXXXXXXXX" -d "$add_dir" -s fna "$1")
 

rsync problem with NCBI

Similar to the issue with Kraken v1 (DerrickWood/kraken#114), in Kraken 2 downloading and updating the RefSeq databases is not working. I get the following error:

kate@.../software/kraken2-master$ ./kraken2-build --standard --threads 50 --db Aug2018_RefSeq
Downloading nucleotide est accession to taxon map...
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (165.112.9.229): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nlm.nih.gov (2607:f220:41e:250::7): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.1]
There is a python workaround using wget instead of rsync ( https://github.com/sejmodha/MiscScripts/blob/master/UpdateKrakenDatabases.py ) but that has some issues too. Any chance you could fix this? Thanks.

#!/usr/bin/perl => #!/usr/bin/env perl

Can you please change #!/usr/bin/perl to #!/usr/bin/env perl so it uses the Perl in the PATH rather than forcing system perl?

I know most people will have system perl, and you only use core modules, but it causes problems for packaging in bioconda and brew.

Percent (%) symbol in kraken2 but not in kraken1 reports?

Was the new '%' sign deliberate?

(It causes parsing failures in some downstream tools; a small normalization workaround is sketched after the v1/v2 examples below.)

v1

86.44  476324  476324  U       0       unclassified
 13.56  74700   1274    -       1       root
 13.33  73425   244     -       131567    cellular organisms
 13.27  73116   2776    D       2           Bacteria

v2

 41.51% 228715  228715  U       0       unclassified
 58.49% 322309  82      R       1       root
 58.31% 321318  405     R1      131567    cellular organisms
 57.91% 319096  4873    D       2           Bacteria
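
A small normalization sketch for downstream tools that expect the v1 layout, assuming the only difference that matters to them is the trailing '%' in the first column:

perl -pe 's/^(\s*[0-9.]+)%/$1/' kraken2.report > kraken2.report.v1style

This leaves every other column untouched; parsers that also depend on the v1 rank codes would still need adjusting.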

How to join custom databases?

Hi, Dear Kraken Team.

I downloaded the four libraries I'm interested in: archaea, bacteria, fungi and viral.

My question is:
Can I combine the four libraries to build a single database, or do I have to build each one separately?

I want to:
$ kraken2-build --build --threads 56 --db 'this_folder_has_four_libraries_joined/'

Thanks for your help.

Maryo.

to implement kraken2-mpa-report and kraken2-report?

Hi there, is it possible to implement scripts like kraken-mpa-report and kraken-report? If we need both the report and the mpa report, or want to try a different confidence value, currently the only option seems to be to rerun the whole Kraken 2 classification.

Is it normal for the progress of database building?

Dear developer,
The build has been stuck at the step "Step 1/2: Performing rsync file transfer of requested files" for more than 24 hours. Is this normal? The command is as follows:

./kraken2-build --standard --threads 40 --db /opt/kraken2/database

--2018-07-08 15:35:38-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_est.accession2taxid.gz
=> “nucl_est.accession2taxid.gz”
........................................
(The output is omitted)
2018-07-08 17:59:27 (0.00 B/s) - “taxdump.tar.gz” saved [43921242]

Downloaded taxonomy tree data
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
Step 1/2: Performing rsync file transfer of requested files
Rsync file transfer complete.
Step 2/2: Assigning taxonomic IDs to sequences
Processed 272 projects (408 sequences, 687.34 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library...done.
Step 1/2: Performing rsync file transfer of requested files

Rsync errors and possible solution

Hi,
I had the same or a similar problem as described in #38 and DerrickWood/kraken#114. For me, however, rsync only failed once in a while: the taxonomy download via download_taxonomy.sh mostly ran through, but the syncing of the genomic DNA files sometimes died. I'm not sure whether it is a problem/timeout on the NCBI rsync server or something client-side on our cluster.

However, a potential workaround is to add a check in the rsync system call in rsync_from_ncbi.pl to retry syncing in case rsync returns anything but 0:

my $rc = 1;
while($rc){
   print STDOUT "\nTrying rsync\n";
   $rc = system("rsync --no-motd --files-from=manifest.txt rsync://ftp.ncbi.nlm.nih.gov/genomes/ .");
}

This is a potential infinite loop, of course. It could be adjusted to retry only when the error message is Network is unreachable (101), or to use a retry counter, as sketched below.
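
A bounded variant of the same idea (sketch only; the retry limit and back-off are arbitrary):

my $max_tries = 5;
my $rc;
for my $try (1 .. $max_tries) {
  print STDOUT "\nTrying rsync (attempt $try of $max_tries)\n";
  $rc = system("rsync --no-motd --files-from=manifest.txt rsync://ftp.ncbi.nlm.nih.gov/genomes/ .");
  last if $rc == 0;
  sleep 30 if $try < $max_tries;   # brief back-off before retrying
}
die "rsync failed after $max_tries attempts\n" if $rc != 0;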

Typo in documentation

In Standard Kraken Output Format, the line:

For example, "562:15 561:4 A:31 0:1 562:3" would indicate that:

should be:

For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:

The HTML file has the same typo, since it was generated from the same Markdown file.

kraken2 --classified-out and --unclassified-out option are automatically adding .fq extension to the given filenames

Hi,

I started using Kraken2 and I tried to classified some reads and separate classified and unclassified reads with --classified-out and --unclassified-out options. Here is the command I launched:

kraken2 --db ${KRAKEN2_DB} reads.fastq --output reads.classification --classified-out reads.classified.fastq --unclassified-out reads.unclassified.fastq

And I obtained the following files:

  • reads.classified.fastq.fq
  • reads.unclassified.fastq.fq

It added a .fq extension even though I gave filenames with a .fastq extension.

Support taxid patching for bad NCBI records

Derrick,

There is a feature we would really like to add to Kraken, but before filing a pull request etc we wanted to discuss the idea.

Basically, as you know, NCBI has many taxonomic errors - either from sloppy submission, lack of knowledge at the time of submission, or just ignorance. There is lots of domain expertise out in the world, especially in public health labs where species ID is our bread and butter.

Our colleague @cgorrie here has curated the whole Klebsiella complex, and is putting together a table as follows:

ACCESSION   OLD_TAXID   NEW_TAXID
...

I was hoping kraken2-build could support a --patch-taxids <file> option which would remap anything it finds in the above table.

My hope would be that other groups, e.g. @happykhan, @lskatz, @andersgs and @schultzm, also have similar knowledge that could be contributed to make Kraken 2 databases more reliable.

What are your thoughts?
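
Purely for discussion, a rough sketch of what the remapping step could look like as post-processing over a sequence-ID-to-taxid map (the file layout assumed here, SEQID<TAB>TAXID with the accession embedded in the sequence ID, is an assumption rather than a statement about kraken2-build internals):

#!/usr/bin/env perl
# patch_taxids.pl <patch_table> <seqid2taxid.map>  (hypothetical helper)
# patch_table lines: ACCESSION <tab> OLD_TAXID <tab> NEW_TAXID
use strict;
use warnings;

my ($patch_file, $map_file) = @ARGV;

# Load the curated patch table
my %patch;   # accession -> [old_taxid, new_taxid]
open my $p, "<", $patch_file or die "can't open $patch_file: $!";
while (<$p>) {
  next if /^#/ || !/\S/;
  chomp;
  my ($acc, $old, $new) = split /\t/;
  $patch{$acc} = [$old, $new];
}
close $p;

# Rewrite map entries whose accession and old taxid both match
open my $m, "<", $map_file or die "can't open $map_file: $!";
while (<$m>) {
  chomp;
  my ($seqid, $taxid) = split /\t/;
  (my $acc = $seqid) =~ s/\.\d+$//;   # tolerate version suffixes
  if (exists $patch{$acc} && $patch{$acc}[0] == $taxid) {
    $taxid = $patch{$acc}[1];
  }
  print "$seqid\t$taxid\n";
}
close $m;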

add genomes to existing database

Hi, I have already built the standard database using kraken2-build. How can I add more genomes to the bacterial library in the database?

problems with classification using nt database

Hi Derrick,

I tried classifying my sample reads with a Kraken 2 database containing only "nt" sequences and found that all my reads were classified at the root. Here is the output from the first few lines of the report file:

100.00% 14552 14535 R 1 root
0.12% 17 0 R1 131567 cellular organisms
0.12% 17 17 D 2 Bacteria

I did the same classification using the "standard database" and I got results similar to those I found in Kraken 1:

100.00% 14552 0 R 1 root
100.00% 14552 0 R1 131567 cellular organisms
100.00% 14552 0 D 2 Bacteria
44.38% 6458 0 D1 1783270 FCB group
41.62% 6057 0 P 1224 Proteobacteria
14.00% 2037 7 D1 1783272 Terrabacteria group

I wonder whether some k-mers found in certain bacterial species are also annotated in odd categories placed near the root of the taxonomy tree, specifically in the case of the "nt" database?

'SIZE_MAX' was not declared in this scope & 'UINT64_MAX' was not declared in this scope

My Linux distribution is CentOS release 6.7 (Final); I compiled kraken2 with gcc 4.9.3.

g++ -fopenmp -Wall -std=c++11 -O3 -DLINEAR_PROBING   -c -o mmscanner.o mmscanner.cc
In file included from mmscanner.cc:7:0:
mmscanner.h:34:23: error: 'SIZE_MAX' was not declared in this scope
       size_t finish = SIZE_MAX);
                       ^
mmscanner.cc: In constructor 'kraken2::MinimizerScanner::MinimizerScanner(ssize_t, ssize_t, uint64_t, bool, uint64_t)':
mmscanner.cc:32:18: error: 'SIZE_MAX' was not declared in this scope
   if (finish_ == SIZE_MAX)
                  ^
mmscanner.cc: In member function 'void kraken2::MinimizerScanner::LoadSequence(std::string&, size_t, size_t)':
mmscanner.cc:43:18: error: 'SIZE_MAX' was not declared in this scope
   if (finish_ == SIZE_MAX)
                  ^
make: *** [mmscanner.o] Error 1

I fixed it based on the suggestion in the following post:
https://stackoverflow.com/a/42097570

That led to the next issue:

g++ -fopenmp -Wall -std=c++11 -O3 -DLINEAR_PROBING    classify.cc reports.o mmap_file.o compact_hash.o taxonomy.o seqreader.o mmscanner.o omp_hack.o aa_translate.o   -o classify
classify.cc: In function 'taxid_t ClassifySequence(kraken2::Sequence&, kraken2::Sequence&, std::ostringstream&, kraken2::KeyValueStore*, kraken2::Taxonomy&, IndexOptions&, Options&, ClassificationStats&, kraken2::MinimizerScanner&, std::vector<long unsigned int>&, taxon_counts_t&, std::vector<std::basic_string<char> >&)':
classify.cc:505:33: error: 'UINT64_MAX' was not declared in this scope
       uint64_t last_minimizer = UINT64_MAX;
                                 ^
make: *** [classify] Error 1

I fixed it based on the suggestion in the following post:
https://stackoverflow.com/a/3233069

Posting the patch here as a solution for anyone facing this issue:
fix_install_kraken2.patch.txt

Incorrect taxids in classified-out

I haven't taken the time to track down the source, but I have noticed that the annotated fastq files that are output by '--classified-out' have incorrect taxids.

For example, I have a certain taxid in the kraken2 output file and report file, and regex-ing the classified-out I can't even find that taxid. If I go find the read name in the output file and pull it by name out of the classified-out fastq, the read is what I think it is, but the "taxid|XXXX" annotation is wrong. I can't see a logical reason for the taxid that is printed. However, the number printed in the fastq is consistently wrong, so taxid 9606 is represented as 19742 every time.

--report not working, output is the same as if you don't use it

Hello,

I've been trying to generate a report with the formatting for --use-mpa-style, and my code is just giving the exact same output as when I don't use the options --report --use-mpa-style. Here is what I am inputting exactly:

kraken2 --db /projects/b1052/krakenDB/ /projects/b1052/Wells_b1042/Morgan/assembled/idba/S_AS1_contig.fa --report --use-mpa-style > /projects/b1052/Wells_b1042/Morgan/Kraken2/Output/S_AS1_Kraken_Output_Report.txt

But the output still looks like this, which is identical to the output without the extra report options specified:

C contig-101_31860 990316 600 990316:566

Can you please let me know why this isn't working?

Use rsync instead of wget for taxonomy files

FTP struggles in countries like AU where there are >25 hops to NCBI and ENA.

The way you've implemented rsync for the genome libraries has been awesome and makes it work within hours rather than days.

Can you do the same for the taxonomy files?
It makes it 50x faster for us. Yes fifty.

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

154 KB/sec
rsync --progress rsync://rsync.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz .

7.86 MB/sec

16S SILVA database now free of licencing restrictions

I noticed on the website for Kraken2 under 'specialist databases':

"Note that these databases may have licensing restrictions regarding their data, and it is your responsibility to ensure you are in compliance with those restrictions; please visit the databases' websites for further details."

This was presumably a reference to the dual licensing of SILVA, which was previously not available for commercial use without a licence. This restriction is now gone; SILVA is part of ELIXIR and is a 'Core Data Resource'.

Scientific names of taxids

Hi!

I'm running Kraken 2 to do the taxonomic classification of some microbiome data (an assembly of PE reads) and I'm using the Greengenes database. I want to get the full taxonomic classification for each taxid (all the levels), just like the output of kraken-translate, but I only get one level of taxonomy. I ran the following commands:

kraken2-build --db ./Kraken/ggdatabase --special greengenes
(to build the database)

kraken2 --db ./Kraken/ggdatabase exported/dna-sequences.fasta > output_kraken.txt
(to do the classification)

kraken2 --db ./Kraken/ggdatabase --use-names exported/dna-sequences.fasta > output_kraken_names.txt
(To get the scientific names)

Any ideas on how to deal with this would be very much appreciated!

Camila

relative abundance

Hi
I wonder how to calculate relative abundance from the mpa-style report table in order to compare different samples.
Any ideas?
hu
thanks
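
One common approach is to divide each taxon's count by the sample's total classified reads, so that samples of different sequencing depth become comparable. A minimal sketch, assuming an mpa-style file of taxon_string<TAB>read_count lines where top-level entries (those without a '|') are the domain-level totals:

#!/usr/bin/env perl
# mpa_relative_abundance.pl <sample.mpa>  (hypothetical helper)
# Prints each taxon with its count expressed as a percentage of all classified reads.
use strict;
use warnings;

my @rows;
my $total = 0;
while (<>) {
  chomp;
  my ($taxon, $count) = split /\t/;
  push @rows, [$taxon, $count];
  $total += $count if $taxon !~ /\|/;   # sum only top-level clades to avoid double counting
}

die "no classified reads found\n" unless $total;
printf "%s\t%.4f\n", $_->[0], 100 * $_->[1] / $total for @rows;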

Collapsing to species level?

For Kraken 1 "classic" I asked whether collapsing taxids to the species level would save space (like Centrifuge?). The answer was no. Is this still true for Kraken 2?

If not, is there any existing code to collapse a --report to the S (species) level?
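
A minimal sketch of such a collapse over a standard kraken2 --report, assuming the usual six tab-separated columns (percentage, clade reads, directly assigned reads, rank code, taxid, indented name); the clade-read column already aggregates everything below each species, so filtering on rank code S is enough:

#!/usr/bin/env perl
# collapse_report_to_species.pl <kraken2.report>  (hypothetical helper)
# Prints taxid, clade read count and name for species-level rows only.
use strict;
use warnings;

while (<>) {
  chomp;
  my ($pct, $clade_reads, $direct_reads, $rank, $taxid, $name) = split /\t/;
  next unless defined $rank && $rank eq "S";
  $name =~ s/^\s+//;   # strip the indentation used for the tree layout
  print join("\t", $taxid, $clade_reads, $name), "\n";
}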

Custom databases: lookup_accession_numbers.pl very slow

I'm currently trying to build a custom database but the lookup_accession_numbers.pl script seems to take a very long time to retrieve taxids.

I looked at the lookup_accession_numbers.pl code and it seems the problem is similar to the one I had with the lookup_accession_numbers.pl script in Kraken #94

Currently you are:
1) Adding each accession number from the lookup list to a Perl hash
2) Reading each *.accession2taxid NCBI file line by line and extracting taxids if the corresponding accession numbers are in the hash built in 1)

Do you think doing the opposite can be faster?
1) Creating a Perl hash based on each *.accession2taxid NCBI file
2) Reading the accession number lookup list line by line and extracting the corresponding taxids from the hash built in 1)

There are currently 674,511,357 entries in the *.accession2taxid NCBI files (downloaded with kraken2-build --taxonomy --db krakendb today), and I guess the lookup list will always be smaller, right?

kraken2 for short 16S

Hello,

I have older 1.5 Illumina 16S data. I am wondering if kraken2 can use it?
Also, how do I get the 16S databases? Just format from the fasta? Can I add custom sequences?

Cheers
Rick

terminate called after throwing an instance of 'std::bad_alloc'

I'm trying to build a database with refseq/nt and refseq/env_nt on RHEL 7, on a 24-core machine with 160 GB RAM. I have pulled a fresh copy from the master branch and compiled it according to the README. I have successfully built the standard database, but when I try the nt and env_nt databases it fails. I'm going to rerun it and collect memory usage statistics.

$ kraken2-build --db nt_nt-env/ --build --threads 24
Creating sequence ID to taxonomy ID map (step 1)...
Found 175778622/175780306 targets, searched through 681337449 accession IDs, search complete.
lookup_accession_numbers: 1684/175780306 accession numbers remain unmapped, see unmapped.txt in DB directory
Sequence ID to taxonomy ID map complete. [2h19m9.840s]
Estimating required capacity (step 2)...
Estimated capacity requirement: 198587649460 bytes
Capacity estimation complete. [49m13.776s]
Building database files (step 3)...
Taxonomy parsed and converted.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
xargs: cat: terminated by signal 13
/syn-bio/var/opt/kraken2/build_kraken2_db.sh: line 119:  3554 Done                    list_sequence_files
      3555 Exit 125                | xargs -0 cat
      3556 Aborted                 (core dumped) | build_db -k $KRAKEN2_KMER_LEN -l $KRAKEN2_MINIMIZER_LEN -S $KRAKEN2_SEED_TEMPLATE $KRAKEN2XFLAG -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy/ -m $seqid2taxid_map_file -c $required_capacity -p $KRAKEN2_THREAD_CT

building special silva : No such file ‘tax_slv_ssu_nr_132.acc_taxid’

Hi,
This could be related to #8. When applying the suggested change to the 16S_silva_installation.sh script, the kraken2-build command works:

bin/kraken2-build --special silva --db $PWD/silva --threads $(nproc)

However, when I run kraken2-inspect with the database that was created in the previous step the only named node I get is root. All other nodes are not named.

bin/kraken2-inspect --db /db/silva/ | head -100
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111111111110011001100110011001100
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 13528
# Table size: 26477341
# Table capacity: 37889828
100.00% 26477341        13956   R       1       root
 73.25% 19395500        1121078 R1      3
 18.61% 4928033 76036   R2      2375
  9.73% 2575341 146135  R3      3303
  2.42% 640458  33886   R4      26341
  1.43% 377766  42528   R5      26352
  0.16% 43095   43095   R6      26368
  0.07% 18838   18838   R6      26414
  0.06% 16298   16298   R6      26384
  0.06% 16233   16233   R6      26471
  0.06% 15816   15816   R6      26355
  0.06% 14829   14829   R6      26361
  0.04% 11854   11854   R6      26358
  

Consequently (and this may be an unrelated issue), when I try to classify sequences with kraken2, I get results with taxids but not the names of the taxa, e.g.:

bin/kraken2 --db /db/silva --report $PWD/report.txt --classified-out $PWD/clas.txt --output $PWD/out.txt  --use-names $PWD/00A00044_Ext2_Rep2_S172_L001_R1_001.fastq.gz
118679 sequences (35.72 Mbp) processed in 3.534s (2014.7 Kseq/m, 606.43 Mbp/m).
  118140 sequences classified (99.55%)
  539 sequences unclassified (0.45%)
kmavrommatis@ip-10-112-17-141: Tue Aug 21 22:09:02 microbiome (0)
$more out.txt
C	M04141:136:000000000-B96YR:1:1101:22984:1662	918	301	912:1 3:3 912:1 3:36 913:5 918:1 3:1 912:3 918:9 912:5 3:40 918:2 913:5 918:19 912:5 918:2 3:3 918:3 3:49 913:2 912:5 3:1 912:1 918:25 913:1 3:2 918:2 3:1 0:5 3:1 918:5 3:2 913:7 918:1 0:3 913:5 0:5
C	M04141:136:000000000-B96YR:1:1101:19609:1681	1863	301	3:43 1863:27 3:17 1863:5 3:15 1863:28 3:1 1863:8 3:64 1863:11 3:38 1:5 0:5
C	M04141:136:000000000-B96YR:1:1101:14361:1737	1863	301	3:26 1863:1 3:16 1863:44 3:4 1863:50 3:5 1:1 3:42 1:1 3:72 1:2 0:3
C	M04141:136:000000000-B96YR:1:1101:22479:1741	918	301	3:51 913:5 3:1 918:3 912:5 3:36 918:34 3:3 918:8 3:54 918:2 912:5 918:16 3:10 918:5 3:1 918:5 3:2 913:5 0:6 913:5 1780:2 0:3
C	M04141:136:000000000-B96YR:1:1101:9321:1752	24660	301	3:131 913:10 3:6 913:4 3:23 913:27 912:1 24660:22 913:3 3:35 0:5
C	M04141:136:000000000-B96YR:1:1101:11659:1762	11328	301	3:59 1863:1 3:5 1863:16 3:10 1863:2 3:4 1863:10 1987:5 1863:4 1987:15 3:83 1863:5 1987:3 1863:7 1672:3 1863:9 3:5 1:7 3:1 1:3 11328:5 3:2 0:3
C	M04141:136:000000000-B96YR:1:1101:9249:1764	1987	301	3:59 1863:1 3:5 1863:16 3:10 1863:2 3:4 1863:10 1987:5 1863:4 1987:15 1672:5 3:3 1672:2 3:105 1:7 0:1 1:3 0:10
C	M04141:136:000000000-B96YR:1:1101:10307:1811	1863	301	3:46 1863:36 3:9 1863:55 1:1 3:42 1:1 3:63 0:1 3:3 0:5 2588:2 0:3

For comparison, when the same commands are run on a kraken2 database created from refseq all results include the taxa names.

Any advice?

Thanks

build feature: set default taxid for sequences without acc.no

At the moment, sequences without an accession number in the build template either cause kraken2-build to bail out or are simply ignored. I propose an option --default-taxid which, when set, assigns the specified taxid to sequences that have no accession number. This way one can train on sequences that are not in NCBI's taxonomy; at the moment these are ignored and will give no hit. If this gives too much freedom, the default taxid could be fixed to an "unknown organism" node. That would still be better than learning and recognizing nothing.
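
Until something along those lines exists, one workaround sketch is to pre-tag the headers of a file that is known to contain only sequences lacking accession numbers, using the kraken:taxid| header convention that kraken2-build already accepts (the default taxid value is whatever node you choose; the helper name is made up):

#!/usr/bin/env perl
# add_default_taxid.pl <taxid> <in.fasta>  (hypothetical helper)
# Tags every header that lacks a kraken:taxid field with the given taxid.
use strict;
use warnings;

my ($default_taxid, $fasta) = @ARGV;
open my $fa, "<", $fasta or die "can't open $fasta: $!";
while (<$fa>) {
  if (/^>/ && !/kraken:taxid\|\d+/) {
    s/^>(\S+)/>$1|kraken:taxid|$default_taxid/;
  }
  print;
}
close $fa;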

Some sequences labeled as classified but no taxid assigned

See DerrickWood/kraken#127 for original submission. Example output line (with initial "C" but taxid 0) is below:

C M00963:162:000000000-AHPVV:1:1101:14752:1397 0 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

@stuber Can you give me some more information to help me track this down? What kind of Kraken 2 database was being run against here? What was the full command (starting with kraken2) that would replicate this output?

Problem with 'kraken2-build --add-to-library' in parallel

Hi @DerrickWood

I'm trying to build a custom database and I have a lot of sequences to add, so I tried to use the kraken2-build --add-to-library command in parallel. I used the command from the MANUAL with the -P option for xargs, but it seems to produce some errors. Here is the command I used:

find ${GENOMESDIR} -name *'.fna' -print0 | xargs -I{} -0 -n1 -P${CPU} kraken2-build --add-to-library {} --db ${DBNAME} &>> log.add_to_library

Where:

  • ${GENOMESDIR} is the path to the directory containing your FASTA files
  • ${CPU} is the number of processors you want to use
  • ${DBNAME} the name of your Kraken2 database

The problem seems to come from the add_to_library.sh script. When run in parallel, the cp and rm commands at the end do not always work because the temp_map.txt file does not exist. I get error messages like:

cat: ${DBNAME}/kraken2/library/added/temp_map.txt: No such file or directory and rm: cannot remove ‘${DBNAME}/kraken2/library/added/temp_map.txt’

It also seems to mess up the resulting prelim_map.txt file.

Don't re-download taxonomy files if already there

I was doing kraken-build --db XXX --download-taxonomy and the 4th FTP transfer stalled, so I restarted it, and it started re-downloading everything from scratch (it takes 10 hours to get these files from NCBI over FTP to my university; they throttle FTP).

https://github.com/DerrickWood/kraken2/blob/master/scripts/download_taxonomy.sh#L24-L27

  1. Could you use the wget option below to avoid re-downloading?
     Or only download when the file is missing, e.g. if [ ! -r $FILE ]; then wget ... ; fi ?
        --continue
            Continue getting a partially-downloaded file.  This is useful when you want to finish up a
            download started by a previous instance of Wget, or by another program.
  2. Do you know the Aspera URL for these files? And could ascp be used if it is installed?

lookup unmapped accession numbers through NCBI eutils

Hi. I have encountered hundreds of unmapped accessions, for example NZ_LS483329, when building a database from virus, bacteria, archaea, fungi, protozoa and human reference genomes with the est, gb, gss and wgs accession2taxid files. In fact these accessions exist. Would it be possible to implement extra accession mapping through NCBI eutils in lookup_accession_numbers.pl? I have tried that, and it works.
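
For reference, a sketch of the kind of eutils fallback meant here (a hypothetical standalone helper, assuming the esummary XML for a nuccore record exposes a TaxId item; NCBI rate limits apply, so this only makes sense for a few hundred leftover accessions):

#!/usr/bin/env perl
# eutils_taxid.pl ACC [ACC ...]  (hypothetical helper)
# Prints "ACCESSION<TAB>TAXID" for each accession, looked up via NCBI esummary.
use strict;
use warnings;
use LWP::Simple qw(get);

my $base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi";
for my $acc (@ARGV) {
  my $xml = get("$base?db=nuccore&id=$acc");
  if (defined $xml && $xml =~ /<Item Name="TaxId"[^>]*>(\d+)<\/Item>/) {
    print "$acc\t$1\n";
  } else {
    warn "no taxid found for $acc\n";
  }
  sleep 1;   # stay well under NCBI's request-rate limit
}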

masking of nt database is very slow

Hi,

I finished downloading the "nt" database and I am now stuck at the step of masking low-complexity sequences. The dustmasker script has been running for almost a day, and I see that it is not multi-threaded. Is it typical for this step to take such a long time? Would it be possible to replace "dustmasker" with a more efficient tool such as "repeatmasker"?

Ajay.

Unknown library type "protozoa"

The protozoa) case is missing from the download script:

https://github.com/DerrickWood/kraken2/blob/master/scripts/download_genomic_library.sh

kraken2-build --db mydb --download-library protozoa

Unknown library type "protozoa"

Usage: kraken2-build [task option] [options]

Task options (exactly one must be selected):
  --download-taxonomy        Download NCBI taxonomic information
  --download-library TYPE    Download partial library
                             (TYPE = one of "archaea", "bacteria", "plasmid",
                             "viral", "human", "fungi", "plant", "protozoa",
                             "nr", "nt", "env_nr", "env_nt", "UniVec",
                             "UniVec_Core")

No output while waiting to load DB ?

I managed to build the standard DB, and it worked flawlessly - nice work!
The rsync is a great addition to the process.
The DB hash file was ~32GB.

When I ran the first kraken2 test, it took about 30 seconds before it printed anything to the screen, and I wasn't sure whether it was working or not (top showed classify running at only 40% of 1 CPU).

Could you add a --verbose option, or just print something to STDERR at the start?
To pacify premature CTRL-C people like me?
