smdabdoub / kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).

License: MIT License

metagenomics taxonomy taxonomic-classification bioinformatics kraken biom-format

kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/).

Installation

From PyPI:

$ pip install kraken-biom

From GitHub:

$ pip install git+http://github.com/smdabdoub/kraken-biom.git

From source:

$ python setup.py install

From docker:

$ git clone https://github.com/smdabdoub/kraken-biom.git && cd kraken-biom
$ docker build . -t kraken_biom
$ docker run -it --rm -v $(pwd):/data kraken_biom

Citation

kraken-biom does not yet have a published article, but it can be cited as:

Dabdoub, SM (2016). kraken-biom: Enabling interoperative format conversion for Kraken results (Version 1.2) [Software]. Available at https://github.com/smdabdoub/kraken-biom.

Requirements

biom-format >= 2.1.5

Documentation

The program takes as input one or more files output by the kraken-report tool. Each file is parsed and the counts for each OTU (operational taxonomic unit) are recorded, along with the database ID (e.g. NCBI) and lineage. The extracted data are then stored in a BIOM table where each count is linked to the Sample and OTU it belongs to. Sample IDs are extracted from the input filenames (everything up to the '.').
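As a rough illustration of the parsing step described above, the following Python sketch splits one kraken-report line into named fields and derives a sample ID from a filename. It is not kraken-biom's own code: `ReportRow`, `sample_id_from_path` and `parse_report_line` are hypothetical names, the six tab-separated columns follow the standard Kraken report layout, and cutting the filename at the first '.' is an assumption about what "everything up to the '.'" means.

```python
import os
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of reads in the clade rooted at this taxon
    clade_reads: int    # reads covered by the clade rooted at this taxon
    direct_reads: int   # reads assigned directly to this taxon
    rank: str           # rank code, e.g. D, P, C, O, F, G, S or "-"
    tax_id: str         # database (e.g. NCBI) taxonomy ID
    name: str           # scientific name with indentation stripped


def sample_id_from_path(path: str) -> str:
    """Assumption: the sample ID is the basename up to the first '.'."""
    return os.path.basename(path).split(".", 1)[0]


def parse_report_line(line: str) -> ReportRow:
    """Split one tab-separated kraken-report line into named fields."""
    pct, clade, direct, rank, tax_id, name = line.rstrip("\n").split("\t")
    return ReportRow(float(pct), int(clade), int(direct),
                     rank.strip(), tax_id.strip(), name.strip())


row = parse_report_line("  0.06\t55\t0\tG\t97050\t          Ruegeria")
print(sample_id_from_path("/data/S1.txt"), row.rank, row.tax_id, row.clade_reads)
```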

OTUs are defined by the --max and --min arguments. By default these are set to Order and Species respectively. This means that counts assigned directly to an Order, Family, or Genus are recorded under the associated OTU ID, and counts assigned at or below the Species level are assigned to the OTU ID for the species. Setting a minimum rank below Species is not yet available.
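The effect of --max and --min can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the program's implementation; `RANKS` and `otu_rank` are hypothetical names, and reads classified below Species are assumed to have been mapped to "S" by the caller.

```python
RANKS = ["D", "P", "C", "O", "F", "G", "S"]  # most general to most specific


def otu_rank(assigned_rank, max_rank="O", min_rank="S"):
    """Return the rank whose OTU receives a read, or None if it is dropped.

    assigned_rank is the standard rank at which the read was classified.
    Assumption: ranks below Species are passed in as "S".
    """
    pos = RANKS.index(assigned_rank)
    if pos < RANKS.index(max_rank):
        return None              # above max rank: not recorded
    if pos >= RANKS.index(min_rank):
        return min_rank          # at or below min rank: collapsed to min
    return assigned_rank         # between max and min: kept at its own rank


# Defaults (Order to Species), then the Class-to-Genus range.
print([otu_rank(r) for r in RANKS])
print([otu_rank(r, "C", "G") for r in RANKS])
```

With the defaults, reads assigned at Domain, Phylum or Class are dropped, while Order, Family and Genus keep their own OTU and Species reads go to the species OTU.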

The BIOM format currently has two major versions. Version 1.0 is based on JSON (JavaScript Object Notation). Version 2.x is based on HDF5 (Hierarchical Data Format v5). The output format can be specified with the --fmt option. A tab-separated (TSV) output format is also available; the resulting TSV file omits most of the metadata but can be opened in spreadsheet programs.

Version 2 of the BIOM format is used by default for output, but requires the Python library 'h5py'. If the library is not installed, kraken-biom will automatically switch to using version 1.0. Note that the output can optionally be compressed with gzip (--gzip) for version 1.0 and TSV files. Version 2 files are automatically compressed.
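The fallback and compression rules above amount to two small decisions, sketched here in Python. The function names `resolve_format` and `needs_gzip` are hypothetical and the code is an assumption about how such a check could look, not a copy of kraken-biom's logic.

```python
import importlib.util
from typing import Optional


def resolve_format(requested: str = "hdf5",
                   h5py_available: Optional[bool] = None) -> str:
    """Pick the output format, falling back to JSON when h5py is missing."""
    if h5py_available is None:
        # Probe for h5py without importing it.
        h5py_available = importlib.util.find_spec("h5py") is not None
    if requested == "hdf5" and not h5py_available:
        return "json"
    return requested


def needs_gzip(fmt: str, gzip_requested: bool) -> bool:
    """HDF5 output is compressed internally, so gzip only applies otherwise."""
    return gzip_requested and fmt != "hdf5"


print(resolve_format("hdf5", h5py_available=False))  # falls back to "json"
print(needs_gzip("tsv", True), needs_gzip("hdf5", True))
```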

Currently the taxonomy for each OTU ID is stored as row metadata in the BIOM table using the standard seven-level QIIME format: k__K; p__P; ... s__S. If you would like another format supported, please file an issue or send a pull request (note the contribution guidelines).
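For readers unfamiliar with the seven-level QIIME string, here is a minimal Python sketch of how one could be assembled from a mapping of Kraken rank codes to names. `qiime_lineage` is a hypothetical helper; leaving a bare prefix (e.g. `p__`) for a missing rank and stripping the genus from the species name are assumptions made for illustration.

```python
RANK_CODES = ["D", "P", "C", "O", "F", "G", "S"]
PREFIXES = ["k", "p", "c", "o", "f", "g", "s"]


def qiime_lineage(names_by_rank):
    """Format {rank code: name} as 'k__K; p__P; ... s__S'."""
    genus = names_by_rank.get("G", "")
    parts = []
    for code, prefix in zip(RANK_CODES, PREFIXES):
        name = names_by_rank.get(code, "")
        # Assumption: the species level carries only the epithet,
        # so "Trypanosoma cruzi" becomes "cruzi".
        if code == "S" and genus and name.startswith(genus + " "):
            name = name[len(genus) + 1:]
        parts.append(f"{prefix}__{name}")
    return "; ".join(parts)


print(qiime_lineage({
    "D": "Eukaryota", "O": "Kinetoplastida", "F": "Trypanosomatidae",
    "G": "Trypanosoma", "S": "Trypanosoma cruzi",
}))
```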

usage: kraken-biom [-h] [--max {D,P,C,O,F,G,S}] [--min {D,P,C,O,F,G,S}]
                   [-o OUTPUT_FP] [-m METADATA] [--fmt {hdf5,json,tsv}]
                   [--gzip] [--version] [-v]
                   kraken_reports [kraken_reports ...]

Usage examples

Basic usage with default parameters:
```
$ kraken-biom S1.txt S2.txt
```

This produces a compressed BIOM 2.1 file: table.biom

BIOM v1.0 output:
```
$ kraken-biom S1.txt S2.txt --fmt json
```

Produces a BIOM 1.0 file: table.biom

Compressed TSV output:

$ kraken-biom S1.txt S2.txt --fmt tsv --gzip -o table.tsv

Produces a TSV file: table.tsv.gz

Change the max and min OTU levels to Class and Genus:
```
$ kraken-biom S1.txt S2.txt --max C --min G
```

Basic usage with default parameters and metadata:
```
$ kraken-biom S1.txt S2.txt -m metadata.tsv
```

This produces a compressed BIOM 2.1 file: table.biom
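A metadata file for the example above could be generated with a short Python script. This is a sketch under assumptions: the `#SampleID` header and the `group` column are hypothetical (the documentation only states that the file is TSV with the sample ID in the first column), and `metadata_tsv` is not part of kraken-biom.

```python
import csv
import io
import os


def metadata_tsv(report_paths, extra_columns):
    """Build TSV text with one row per report; first column is the sample ID."""
    columns = sorted({key for values in extra_columns.values() for key in values})
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["#SampleID"] + columns)   # header name is an assumption
    for path in report_paths:
        # Sample ID taken as the filename up to the first '.' (assumption).
        sample_id = os.path.basename(path).split(".", 1)[0]
        row = extra_columns.get(sample_id, {})
        writer.writerow([sample_id] + [row.get(col, "") for col in columns])
    return buffer.getvalue()


text = metadata_tsv(["S1.txt", "S2.txt"],
                    {"S1": {"group": "control"}, "S2": {"group": "treated"}})
print(text, end="")
```

Writing `text` to metadata.tsv gives a file whose first column matches the sample IDs derived from S1.txt and S2.txt.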

Program arguments

positional arguments:

kraken_reports        Results files from the kraken-report tool.

optional arguments:

 -h, --help            show this help message and exit
 --max {D,P,C,O,F,G,S}
                       Assigned reads will be recorded only if they are at or
                       below max rank. Default: O.
 --min {D,P,C,O,F,G,S}
                       Reads assigned at and below min rank will be recorded
                       as being assigned to the min rank level. Default: S.
 -o OUTPUT_FP, --output_fp OUTPUT_FP
                       Path to the BIOM-format file. By default, the table
                       will be in the HDF5 BIOM 2.x format. Users can output
                       to a different format using the --fmt option. The
                       output can also be gzipped using the --gzip option.
                       Default path is: ./table.biom
 -m METADATA, --metadata METADATA
                       Path to the sample metadata file. This should be in
                       TSV format. The first column should be the sample id.
                       This is the same name as the input files. If no
                       metadata is given, basic metadata is added to help
                       when importing the biom file on sites like phinch
                       (http://phinch.org/index.html).
 --fmt {hdf5,json,tsv}
                       Set the output format of the BIOM table. Default is
                       HDF5.
 --gzip                Compress the output BIOM table with gzip. HDF5 BIOM
                       (v2.x) files are internally compressed by default, so
                       this option is not needed when specifying --fmt hdf5.
 --version             show program's version number and exit
 -v, --verbose         Prints status messages during program execution.

kraken-biom's Issues

Taxonomic hierarchy incorrect for certain cases

We've found that in some edge cases, the taxonomic hierarchy kraken-biom assigns to a given ID is incorrect. It looks like the assumption that a given rank in the Kraken report always falls under the most recent higher rank (for example, for a given "S" entry for species, the closest "P" for phylum in the previous lines in the file) is not always true.

This is with kraken-biom 1.0.1 under Python 3.5 (in Anaconda) on Linux.

Here's a chunk from an example Kraken report, where I'm specifically looking at 5693 (Trypanosoma cruzi) near the bottom.

...
  0.00  1       0       P       3041            Chlorophyta
  0.00  1       1       C       75966             Trebouxiophyceae
  0.00  2       0       -       556282        Jakobida
  0.00  2       0       G       221723          Seculamonas
  0.00  2       2       S       221724            Seculamonas ecuadoriensis
  0.00  1       0       -       33682         Euglenozoa
  0.00  1       0       O       5653            Kinetoplastida
  0.00  1       0       F       5654              Trypanosomatidae
  0.00  1       0       G       5690                Trypanosoma
  0.00  1       0       -       47570                 Schizotrypanum
  0.00  1       0       S       5693                    Trypanosoma cruzi
  0.00  1       1       -       353153                    Trypanosoma cruzi strain CL Brener
...

Looking from ID 5693 on up, in terms of indentation in the last column: Kraken shows taxa up through "O", then an un-ranked taxon (Euglenozoa), and then nothing for a very long time until Eukaryota many lines above (not shown). The phylum Chlorophyta and class Trebouxiophyceae do not actually contain Trypanosoma cruzi; they're just the closest previous phylum and class shown above that species in the file. But kraken-biom's output gives this Consensus Lineage for ID 5693:

k__Eukaryota; p__Chlorophyta; c__Trebouxiophyceae; o__Kinetoplastida; f__Trypanosomatidae; g__Trypanosoma; s__cruzi

The NCBI Taxonomy Browser seems to match what I saw in Kraken, with Kingdom=Eukaryota; Unranked=Euglenozoa; Order=Kinetoplastida; Family=Trypanosomatidae; etc. (No explicit phylum or class listed.)

I can't say for sure because the Kraken documentation doesn't go into detail, but it looks to me like it's the indentation for the scientific name that corresponds to the hierarchy and to what rank sits above a given entry, and not necessarily the rank of the previous taxa listed. So in my case even though the previous "P" in the report file is Chlorophyta, that group doesn't actually include ID 5693, so we shouldn't have a phylum or class assigned.
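To make the indentation idea concrete, here is a small Python sketch that rebuilds each taxon's lineage from the indentation of the name column rather than from the most recent line of each rank. It is an illustration only, not a patch for kraken-biom: `lineages_by_indent` is a hypothetical function, two spaces per tree level is an assumption about the report layout, and the sample rows reuse the taxa above with an invented Eukaryota row and simplified depths.

```python
RANK_CODES = {"D", "P", "C", "O", "F", "G", "S"}


def lineages_by_indent(lines):
    """Map tax ID -> {rank code: name}, using name indentation as tree depth."""
    stack = []    # (depth, rank, name) entries on the current root-to-node path
    result = {}
    for line in lines:
        _pct, _clade, _direct, rank, tax_id, name = line.rstrip("\n").split("\t")
        # Assumption: each tree level adds two leading spaces to the name.
        depth = (len(name) - len(name.lstrip(" "))) // 2
        # Siblings and deeper nodes are not ancestors of this row: drop them.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, rank.strip(), name.strip()))
        result[tax_id.strip()] = {r: n for _, r, n in stack if r in RANK_CODES}
    return result


# Illustrative rows only: counts, depths and the Eukaryota row are made up.
report = [
    "0.01\t4\t0\tD\t2759\t  Eukaryota",
    "0.00\t1\t0\tP\t3041\t    Chlorophyta",
    "0.00\t1\t1\tC\t75966\t      Trebouxiophyceae",
    "0.00\t1\t0\t-\t33682\t    Euglenozoa",
    "0.00\t1\t0\tO\t5653\t      Kinetoplastida",
    "0.00\t1\t0\tF\t5654\t        Trypanosomatidae",
    "0.00\t1\t0\tG\t5690\t          Trypanosoma",
    "0.00\t1\t0\t-\t47570\t            Schizotrypanum",
    "0.00\t1\t0\tS\t5693\t              Trypanosoma cruzi",
]
print(lineages_by_indent(report)["5693"])
```

In this sketch the lineage for 5693 contains no phylum or class, because Chlorophyta and Trebouxiophyceae are popped from the stack when the less-indented Euglenozoa row is read.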

Errors while using kraken-biom

Hi, I used pip install git+http://github.com/smdabdoub/kraken-biom.git
Following which I tried kraken-biom on 2 samples for trial:
kraken-biom C1C_S92_krak_out.txt C1T1_S39_krak_out.txt and also
kraken-biom C1C_S92_krakenreport.txt C1T1_S39_krakenreport.txt

In both cases I ended up getting the following error:
TypeError: expected string or bytes-like object

My kraken report file is a mpa style file while the output is standard format.

Any help would be appreciated. Thank you.
DP

New release needed

Hi @smdabdoub,

for a bioconda and Galaxy integration we would need a new release tarball.

Thanks a lot,
Engy and Bjoern

Question Regarding Read Assignments in Biom Files

Dear author,

Firstly, I would like to express my appreciation for your excellent work on the kraken-biom software. It has been incredibly useful for my research.

However, I have come across a small issue that I'm hoping you can clarify. When I open a biom file (created by kraken-biom) in R, I noticed that reads assigned to a genus are counted under "Number of reads assigned directly to this taxon". Meanwhile, reads assigned to species are counted under "Number of reads covered by the clade rooted at this taxon".

However, in the output file of the kreport2mpa.py script, all the reads are counted under "Number of reads covered by the clade rooted at this taxon". I'm wondering why there's a discrepancy here.

I would be grateful if you could help me understand the reason behind this difference.

Thank you very much in advance for your help.

Best regards,
Tonnyz

it will be great to share the testing examples

Hi, thanks for your wonderful tool. It would be even better if you could share the S1.txt and S2.txt files mentioned in the tutorial. The output format from Kraken may vary, so users may be confused about the exact input file format. Thank you!

Best,
Cheng

It seems that unclassified reads are ignored

Thanks a lot for this very handy tool! It makes it very convenient to create a biom file and to then use this in R, especially in combination with the phyloseq and ampvis packages.

I observed an unexpected behavior in that the biom file seems to ignore the Unclassified reads information from the Kraken2 report.

Is this intended or am I missing something?

Thank you very much again!

Best wishes and stay safe,

Cedric

kraken2?

Does it work with kraken2 outputs?

Cheers and many thanks
Rick

Setting --max and --min to the same level causes an error.

Users should be able to extract reads assigned to a single taxonomic level, e.g. Species. With Bracken now available this is especially useful (see issue #2).

However, setting both to the same level triggers the sanity checking mechanism (making sure users don't accidentally invert the order of --max and --min) because the check uses a >= instead of just a >.
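The difference between the two comparisons can be shown with a short Python sketch. This is not the code at the linked line; `RANKS` and both function names are hypothetical and only illustrate why `>=` rejects equal ranks while `>` does not.

```python
RANKS = ["D", "P", "C", "O", "F", "G", "S"]  # most general to most specific


def inverted_with_equal(max_rank, min_rank):
    """Check using >=: equal ranks are flagged as an error too."""
    return RANKS.index(max_rank) >= RANKS.index(min_rank)


def inverted_strict(max_rank, min_rank):
    """Check using >: only a truly inverted pair is flagged."""
    return RANKS.index(max_rank) > RANKS.index(min_rank)


# --max S --min S: flagged by the >= check, accepted by the > check.
print(inverted_with_equal("S", "S"), inverted_strict("S", "S"))
# --max S --min O: genuinely inverted, flagged by both.
print(inverted_with_equal("S", "O"), inverted_strict("S", "O"))
```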

The offending line: https://github.com/smdabdoub/kraken-biom/blob/master/kraken_biom.py#L366

empty .biom files from normal-looking bracken reports?

I'm trying to make .biom files from three sets of reports, 1 from kraken2 and 2 from bracken.
The kraken2 and first bracken sets work fine (they're both down to species level, and I use --fmt json and give a -o FILE NAME). The second bracken collection (I set the level to genus when running bracken here) gives an empty .biom file.

to show what is going on, this is a "normal" .biom file:

head brackenTry_S.biom
{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","generated_by": "kraken-biom v1.0.1 (http://github.com/smdabdoub/kraken-biom)","date": "2022-06-20T16:05:23.802520","matrix_element_type": "float","shape": [187, 5],"type": "OTU table","matrix_type": "sparse","data": [[0,0,18139.0],[0,1,152.0],[0,2,14961.0],[0,3,11830.0],[0,4,2683.0],[1,0,9957.0],[1,1,112.0],[1,2,6612.0],[1,3,17086.0],[1,4,12027.0],[2,0,8659.0],[2,1,169.0],[2,2,96.0],[2,3,3599.0],[2,4,119.0],[3,0,891.0],[3,1,177.0],[3,2,2879.0],[3,3,1999.0],[3,4,2521.0],[4,0,746.0],[4,1,858.0],[4,2,477.0],[4,3,228.0],[4,4,219.0],[5,0,386.0],[5,1,96.0],[5,2,257.0],[5,3,1494.0],[5,4,211.0],[6,0,59.0],[6,1,13.0],[6,2,180.0],[6,3,170.0],[6,4,63.0],[7,0,27.0],[7,1,38.0],[7,2,346.0],[7,3,230.0],[8,0,195.0],[8,1,41.0],[8,2,1028.0],[8,3,393.0],[8,4,586.0],[9,0,115.0],[9,2,231.0],[9,3,98.0],[9,4,83.0],[10,0,40.0],[10,2,100.0],

this is the "empty" .biom file:

head brackenTry_G.biom
{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","generated_by": "kraken-biom v1.0.1 (http://github.com/smdabdoub/kraken-biom)","date": "2022-06-21T08:48:03.724794","matrix_element_type": "int","shape": [0, 0],"type": "OTU table","matrix_type": "sparse","data": [],"rows": [],"columns": []}

also, when trying to import the "empty" .biom file in phyloseq (R-studio):

> brackenGenuses <- import_biom("brackenTry_G.biom")
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'otu_table': 'names' attribute [3] must be the same length as the vector [0]

Not sure what happens here. I've inspected all the original kraken2 and bracken report files and there seem to be no issue here, can post examples if needed.
Any ideas?

many thanks
Martin

Bracken input error: "biom.exception.TableException: Cannot delimit self if I don't have data..."

Hello,

I have been trying to run kraken-biom on some bracken files I have, but there seems to be a compatibility issue.

I have been running the command:
kraken-biom /PATH/*.bracken --fmt tsv --gzip -o table.tsv

However, I get the following error:

Traceback (most recent call last):
  File "/home/people/tatfeu/Software/kraken-biom/bin/kraken-biom", line 11, in <module>
    sys.exit(main())
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/kraken_biom.py", line 379, in main
    out_fp = write_biom(biomT, args.output_fp, args.fmt, args.gzip)
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/kraken_biom.py", line 228, in write_biom
    biom_f.write(biomT.to_tsv())
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/biom/table.py", line 5096, in to_tsv
    direct_io=direct_io)
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/biom/table.py", line 1583, in delimited_self
    raise TableException("Cannot delimit self if I don't have data...")
biom.exception.TableException: Cannot delimit self if I don't have data...

I have looked and the bracken files are all accounted for and contain tab delimited data. I have tried running with json output as well and although I don't receive an error the output only contains column names and no data.

Can bracken files be used with kraken-biom? If they can, do you know what the root of this error might be?

Thanks

Update documentation to reflect executable name change

The command is now kraken-biom instead of kraken-biom.py. Update the README and program help text to reflect that.

Instructions

I have kraken2 outputs in the form of *.report and *.kraken

I am trying to make a biom file for phyloseq and would like some elaboration on the kraken_reports parameter? what is it for and how do you use it?

I have generated a biom table from just the reports - however when reading the table into R I am missing taxa names..

Any help appreciated!

Thanks

Hi, is there a bracken to biom tool. I have seen this issue requested before but with no conclusion. Thx m

Updating conda and PyPI

Hello! We use your software regularly and it has allowed us to integrate Kraken results seamlessly with our downstream analysis scripts.
It would be a great help if you could publish a new stable release on conda and PyPI though. The published version 1.0.1 still has bugs that have been fixed in the current GitHub version. We have been using the old version without realizing and it's not doing your code justice!
Thanks!

Can you add a new version tag to cite properly?

As mentioned above, and thank you very much for this utility.

Compatibility with krakenhll

Hi,
Thanks a lot for your useful script.
I would like to use kraken-biom in order to process krakenhll output (krakenhll adds some additional functionality to kraken to decrease false positive detection rate).

The kraken-report is a bit different and I guess that is why I got the following error running kraken-biom from report generated using krakenhll:

Traceback (most recent call last):
  File "/usr/local/bioinfo/kraken-biom/1.0.1a/venv/bin/kraken-biom", line 11, in <module>
    sys.exit(main())
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/kraken_biom.py", line 377, in main
    biomT = create_biom_table(sample_counts, taxa)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/kraken_biom.py", line 196, in create_biom_table
    generated_by=gen_str, input_is_dense=True)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/biom/table.py", line 397, in __init__
    errcheck(self)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/biom/err.py", line 472, in errcheck
    raise ret
biom.exception.TableException: Number of sample IDs differs from matrix size!

e.g.:
kraken report:
98.27 98274 98274 U 0 unclassified
1.73 1726 74 - 1 root
1.60 1601 9 - 131567 cellular organisms
1.56 1560 142 D 2 Bacteria
1.06 1056 77 P 1224 Proteobacteria
0.61 615 62 C 28211 Alphaproteobacteria
0.35 351 3 O 204455 Rhodobacterales
0.34 336 66 F 31989 Rhodobacteraceae
0.06 55 0 G 97050 Ruegeria
0.03 34 0 S 89184 Ruegeria pomeroyi
0.03 34 34 - 246200 Ruegeria pomeroyi DSS-3
0.02 21 21 S 292414 Ruegeria sp. TM1040
0.04 41 11 G 302485 Phaeobacter
0.03 27 0 S 60890 Phaeobacter gallaeciensis
0.02 17 17 - 1423144 Phaeobacter gallaeciensis DSM 26640
0.01 10 10 - 383629 Phaeobacter gallaeciensis 2.10
0.00 3 0 S 221822 Phaeobacter inhibens
0.00 3 3 - 391619 Phaeobacter inhibens DSM 17395
0.03 31 0 G 1060 Rhodobacter

krakenhll
% reads taxReads kmers dup cov taxID rank taxName
99.12 991219 991219 349731445 1.17 NA 0 no rank unclassified
0.8781 8781 0 43875 1.98 4.21e-05 1 no rank root
0.8781 8781 0 43875 1.98 4.21e-05 131567 no rank cellular organisms
0.8781 8781 55 43875 1.98 4.21e-05 2157 superkingdom Archaea
0.8388 8388 101 41001 1.71 4.384e-05 28890 phylum Euryarchaeota
0.657 6570 747 30799 1.43 5.622e-05 183963 class Halobacteria
0.2435 2435 85 10206 1.32 4.847e-05 1644055 order Haloferacales
0.149 1490 68 5681 1.39 5.224e-05 1963271 family Halorubraceae
0.0896 896 371 3346 1.39 4.358e-05 56688 genus Halorubrum
0.0037 37 37 99 1.42 4.197e-05 1419722 species Halorubrum sp. SD626R
0.0034 34 34 144 1.16 6.174e-05 1765655 species Halorubrum tropicale

Do you have any idea how to use kraken-biom with this different format?
Many thanks

Include taxonomic levels below species

Dear @smdabdoub ,

Thank you for creating this tool!
I noticed that there is an option to export kraken report to a level below species level. Kraken reports the levels species (S), followed by levels S1, S2 and S3. I however cannot get this option to work below species level and receive an empty biom-file.

kraken-biom file.kreport --fmt json --min SS -o file.biom (version 1.2.0)

Using the option on species level (--min S) does generate a complete biom file.
Ideally I would like to generate a biom-file and include everything up until S3 level. Is this possible with the current version and/or could you send/publish the code that would allow for this?

Cheers,
Paul

compatibility with bracken

Will this tool still be compatible with the kreport output from bracken ?

Add sample_data info to biom file

Hello,
I am using the kraken-biom script to convert kraken2 report files to a biom file to use with phyloseq in R. I managed to produce a single biom file from 90 kraken reports, but after using import_biom from the phyloseq package I get a phyloseq-class object with only otu_table and tax_table, and no sample_table.

How can we add the sample_table to the biom files? I also tried biom add-metadata with a text file containing IDs and some group info, but it doesn't seem to work.

Thanks in advance for the help

regards

Nicolas

Abundance calculation question

Hello, I used kraken-biom to perform an abundance calculation from kraken2 taxonomic classification output.

My question concerns classifications made from shotgun metagenomic reads. kraken2 can assign several reads (say 4, for this example) that come from the same cell to the same species. When calculating the relative abundance of that species, this can suggest that the species is present 4 times in the sample, but in reality those 4 reads came from the same cell, so the real abundance is 1.

How does kraken-biom handle this? Or is it handled during kraken2 classification?

biom file not working with Phinch

Hi,

I have analysed some metagenomic samples using Kraken2 and I would like to visualise my taxonomic profiles using Phinch (http://phinch.org).
I used kraken-biom to transform the kraken output to biom format, but when I upload the files to Phinch the biom files are not accepted by the software.

Have you ever faced this issue before?
I would appreciate any suggestions.

Thank you in advance.

Best
Leonardos