smdabdoub / kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).

License: MIT License

Python 99.80% Dockerfile 0.20%
metagenomics taxonomy taxonomic-classification bioinformatics kraken biom-format

kraken-biom's People

Contributors

casperp, eclarke, maxibor, ressy, shaunchuah, smdabdoub

kraken-biom's Issues

Taxonomic hierarchy incorrect for certain cases

We've found that in some edge cases, the taxonomic hierarchy kraken-biom assigns to a given ID is incorrect. It looks like the assumption that a given rank in the Kraken report always falls under the most recent higher rank (for example, for a given "S" entry for species, the closest "P" for phylum in the previous lines in the file) is not always true.

This is with kraken-biom 1.0.1 under Python 3.5 (in Anaconda) on Linux.

Here's a chunk from an example Kraken report, where I'm specifically looking at 5693 (Trypanosoma cruzi) near the bottom.

...
  0.00  1       0       P       3041            Chlorophyta
  0.00  1       1       C       75966             Trebouxiophyceae
  0.00  2       0       -       556282        Jakobida
  0.00  2       0       G       221723          Seculamonas
  0.00  2       2       S       221724            Seculamonas ecuadoriensis
  0.00  1       0       -       33682         Euglenozoa
  0.00  1       0       O       5653            Kinetoplastida
  0.00  1       0       F       5654              Trypanosomatidae
  0.00  1       0       G       5690                Trypanosoma
  0.00  1       0       -       47570                 Schizotrypanum
  0.00  1       0       S       5693                    Trypanosoma cruzi
  0.00  1       1       -       353153                    Trypanosoma cruzi strain CL Brener
...

Looking from ID 5693 on up, in terms of indentation in the last column: Kraken shows taxa up through "O", then an un-ranked taxon (Euglenozoa), and then nothing for a very long time until Eukaryota many lines above (not shown). The phylum Chlorophyta and class Trebouxiophyceae do not actually contain Trypanosoma cruzi; they're just the closest previous phylum and class shown above that species in the file. But kraken-biom's output gives this Consensus Lineage for ID 5693:

k__Eukaryota; p__Chlorophyta; c__Trebouxiophyceae; o__Kinetoplastida; f__Trypanosomatidae; g__Trypanosoma; s__cruzi

The NCBI Taxonomy Browser seems to match what I saw in Kraken, with Kingdom=Eukaryota; Unranked=Euglenozoa; Order=Kinetoplastida; Family=Trypanosomatidae; etc. (No explicit phylum or class listed.)

I can't say for sure because the Kraken documentation doesn't go into detail, but it looks to me like it's the indentation for the scientific name that corresponds to the hierarchy and to what rank sits above a given entry, and not necessarily the rank of the previous taxa listed. So in my case even though the previous "P" in the report file is Chlorophyta, that group doesn't actually include ID 5693, so we shouldn't have a phylum or class assigned.
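
A minimal sketch of the indentation-based reading described above, assuming the report is tab-delimited with the scientific name in the sixth column and two spaces of indentation per level. The names `lineages_by_indent`, `row` and `indent_width` are hypothetical, and the sample rows (including the Eukaryota line, its placeholder counts, and the exact depths) are illustrative rather than copied from a real report. This is not kraken-biom's own code.

```python
def lineages_by_indent(report_lines, indent_width=2):
    """Map each taxon ID to its (rank, name) path, derived from name indentation."""
    stack = []      # (depth, rank, name) entries forming the current path
    lineages = {}
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip lines that are not report rows
        rank, taxid, raw_name = fields[3], fields[4].strip(), fields[5]
        depth = (len(raw_name) - len(raw_name.lstrip(" "))) // indent_width
        # Entries at the same depth or deeper cannot be ancestors of this row.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, rank, raw_name.strip()))
        lineages[taxid] = [(r, n) for _, r, n in stack]
    return lineages


def row(pct, clade, direct, rank, taxid, depth, name, indent_width=2):
    """Build one tab-delimited report line (helper for the example only)."""
    indented = " " * (depth * indent_width) + name
    return "\t".join([pct, str(clade), str(direct), rank, taxid, indented])


sample = [
    row("0.00", 0, 0, "D", "2759", 0, "Eukaryota"),  # placeholder counts
    row("0.00", 1, 0, "P", "3041", 2, "Chlorophyta"),
    row("0.00", 1, 1, "C", "75966", 3, "Trebouxiophyceae"),
    row("0.00", 1, 0, "-", "33682", 1, "Euglenozoa"),
    row("0.00", 1, 0, "O", "5653", 2, "Kinetoplastida"),
    row("0.00", 1, 0, "F", "5654", 3, "Trypanosomatidae"),
    row("0.00", 1, 0, "G", "5690", 4, "Trypanosoma"),
    row("0.00", 1, 0, "-", "47570", 5, "Schizotrypanum"),
    row("0.00", 1, 0, "S", "5693", 6, "Trypanosoma cruzi"),
]

lineages = lineages_by_indent(sample)
for rank, name in lineages["5693"]:
    print(rank, name)
```

With this approach the path for 5693 passes through Euglenozoa and Kinetoplastida and never picks up Chlorophyta or Trebouxiophyceae, because those rows are popped off the stack as soon as a shallower row appears.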

empty .biom files from normal-looking bracken reports?

Hi,

I'm trying to make .biom files from three sets of reports, one from kraken2 and two from bracken.
The kraken2 set and the first bracken set work fine (they're both down to species level, and I use --fmt json and give a -o FILE NAME). The second bracken collection (I set the level to genus when running bracken here) gives an empty .biom file.

to show what is going on, this is a "normal" .biom file:

head brackenTry_S.biom
{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","generated_by": "kraken-biom v1.0.1 (http://github.com/smdabdoub/kraken-biom)","date": "2022-06-20T16:05:23.802520","matrix_element_type": "float","shape": [187, 5],"type": "OTU table","matrix_type": "sparse","data": [[0,0,18139.0],[0,1,152.0],[0,2,14961.0],[0,3,11830.0],[0,4,2683.0],[1,0,9957.0],[1,1,112.0],[1,2,6612.0],[1,3,17086.0],[1,4,12027.0],[2,0,8659.0],[2,1,169.0],[2,2,96.0],[2,3,3599.0],[2,4,119.0],[3,0,891.0],[3,1,177.0],[3,2,2879.0],[3,3,1999.0],[3,4,2521.0],[4,0,746.0],[4,1,858.0],[4,2,477.0],[4,3,228.0],[4,4,219.0],[5,0,386.0],[5,1,96.0],[5,2,257.0],[5,3,1494.0],[5,4,211.0],[6,0,59.0],[6,1,13.0],[6,2,180.0],[6,3,170.0],[6,4,63.0],[7,0,27.0],[7,1,38.0],[7,2,346.0],[7,3,230.0],[8,0,195.0],[8,1,41.0],[8,2,1028.0],[8,3,393.0],[8,4,586.0],[9,0,115.0],[9,2,231.0],[9,3,98.0],[9,4,83.0],[10,0,40.0],[10,2,100.0],

this is the "empty" .biom file:

head brackenTry_G.biom
{"id": "None","format": "Biological Observation Matrix 1.0.0","format_url": "http://biom-format.org","generated_by": "kraken-biom v1.0.1 (http://github.com/smdabdoub/kraken-biom)","date": "2022-06-21T08:48:03.724794","matrix_element_type": "int","shape": [0, 0],"type": "OTU table","matrix_type": "sparse","data": [],"rows": [],"columns": []}

also, when trying to import the "empty" .biom file in phyloseq (R-studio):

> brackenGenuses <- import_biom("brackenTry_G.biom")
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'otu_table': 'names' attribute [3] must be the same length as the vector [0]

Not sure what happens here. I've inspected all the original kraken2 and bracken report files and there seems to be no issue with them; I can post examples if needed.
Any ideas?

many thanks
Martin
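
One way to catch an empty table before handing it to phyloseq is to look at the `shape` field of the JSON output. The sketch below uses only the Python standard library. `biom_json_is_empty` is a hypothetical helper name, and the two sample strings are trimmed-down stand-ins for the headers shown above, not complete BIOM files.

```python
import json


def biom_json_is_empty(text):
    """Return True when a BIOM 1.0 JSON document has no rows or no columns."""
    table = json.loads(text)
    n_rows, n_cols = table["shape"]
    return n_rows == 0 or n_cols == 0


# Trimmed stand-ins for the two headers quoted in this issue.
empty_header = '{"shape": [0, 0], "data": [], "rows": [], "columns": []}'
normal_header = '{"shape": [187, 5]}'

print(biom_json_is_empty(empty_header))
print(biom_json_is_empty(normal_header))
```

For a file on disk, the same check would be applied to the file's contents, for example `biom_json_is_empty(open("brackenTry_G.biom").read())`.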

It would be great to share the testing examples

Hi, thanks for your wonderful tool. It would be even better if you could share the S1.txt and S2.txt files mentioned in the tutorial. The output format from Kraken may vary, so users may be confused about the exact input file format. Thank you!

Best,
Cheng

Compatibility with krakenhll

Hi,
Thanks a lot for your useful script.
I would like to use kraken-biom in order to process krakenhll output (krakenhll adds some additional functionality to kraken to decrease false positive detection rate).

The krakenhll report is a bit different from the kraken report, and I guess that is why I got the following error when running kraken-biom on a report generated with krakenhll:

Traceback (most recent call last):
  File "/usr/local/bioinfo/kraken-biom/1.0.1a/venv/bin/kraken-biom", line 11, in
    sys.exit(main())
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/kraken_biom.py", line 377, in main
    biomT = create_biom_table(sample_counts, taxa)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/kraken_biom.py", line 196, in create_biom_table
    generated_by=gen_str, input_is_dense=True)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/biom/table.py", line 397, in __init__
    errcheck(self)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/biom/err.py", line 472, in errcheck
    raise ret
biom.exception.TableException: Number of sample IDs differs from matrix size!

e.g.:
kraken report:
98.27 98274 98274 U 0 unclassified
1.73 1726 74 - 1 root
1.60 1601 9 - 131567 cellular organisms
1.56 1560 142 D 2 Bacteria
1.06 1056 77 P 1224 Proteobacteria
0.61 615 62 C 28211 Alphaproteobacteria
0.35 351 3 O 204455 Rhodobacterales
0.34 336 66 F 31989 Rhodobacteraceae
0.06 55 0 G 97050 Ruegeria
0.03 34 0 S 89184 Ruegeria pomeroyi
0.03 34 34 - 246200 Ruegeria pomeroyi DSS-3
0.02 21 21 S 292414 Ruegeria sp. TM1040
0.04 41 11 G 302485 Phaeobacter
0.03 27 0 S 60890 Phaeobacter gallaeciensis
0.02 17 17 - 1423144 Phaeobacter gallaeciensis DSM 26640
0.01 10 10 - 383629 Phaeobacter gallaeciensis 2.10
0.00 3 0 S 221822 Phaeobacter inhibens
0.00 3 3 - 391619 Phaeobacter inhibens DSM 17395
0.03 31 0 G 1060 Rhodobacter

krakenhll
% reads taxReads kmers dup cov taxID rank taxName
99.12 991219 991219 349731445 1.17 NA 0 no rank unclassified
0.8781 8781 0 43875 1.98 4.21e-05 1 no rank root
0.8781 8781 0 43875 1.98 4.21e-05 131567 no rank cellular organisms
0.8781 8781 55 43875 1.98 4.21e-05 2157 superkingdom Archaea
0.8388 8388 101 41001 1.71 4.384e-05 28890 phylum Euryarchaeota
0.657 6570 747 30799 1.43 5.622e-05 183963 class Halobacteria
0.2435 2435 85 10206 1.32 4.847e-05 1644055 order Haloferacales
0.149 1490 68 5681 1.39 5.224e-05 1963271 family Halorubraceae
0.0896 896 371 3346 1.39 4.358e-05 56688 genus Halorubrum
0.0037 37 37 99 1.42 4.197e-05 1419722 species Halorubrum sp. SD626R
0.0034 34 34 144 1.16 6.174e-05 1765655 species Halorubrum tropicale

Do you have any idea how to use kraken-biom with this different format?
Many thanks

Question Regarding Read Assignments in Biom Files

Dear author,

Firstly, I would like to express my appreciation for your excellent work on the kraken-biom software. It has been incredibly useful for my research.

However, I have come across a small issue that I'm hoping you can clarify. When I open a biom file (created by kraken-biom) in R, I notice that reads assigned to a genus are counted under "Number of reads assigned directly to this taxon". Meanwhile, reads assigned to species are counted under "Number of reads covered by the clade rooted at this taxon".

However, in the output file of the kreport2mpa.py script, all the reads are counted under "Number of reads covered by the clade rooted at this taxon". I'm wondering why there's a discrepancy here.

I would be grateful if you could help me understand the reason behind this difference.

Thank you very much in advance for your help.

Best regards,
Tonnyz
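
To make the two report columns in this question concrete, here is a small sketch that rebuilds the clade totals ("reads covered by the clade rooted at this taxon") from the direct counts ("reads assigned directly to this taxon"). It uses the Ruegeria rows from the Kraken report quoted in the krakenhll issue above. The parent links are an assumption inferred from the taxon names, since the indentation is not preserved in that excerpt, and `clade_counts` is a hypothetical helper. It does not show what kraken-biom or kreport2mpa.py do internally.

```python
# (taxid, parent_taxid, direct_reads); parent links are assumed from the names.
rows = [
    ("97050", None, 0),        # G  Ruegeria
    ("89184", "97050", 0),     # S  Ruegeria pomeroyi
    ("246200", "89184", 34),   # -  Ruegeria pomeroyi DSS-3
    ("292414", "97050", 21),   # S  Ruegeria sp. TM1040
]


def clade_counts(rows):
    """Sum each taxon's direct reads into itself and every ancestor."""
    parent = {taxid: p for taxid, p, _ in rows}
    totals = {taxid: direct for taxid, _, direct in rows}
    for taxid, _, direct in rows:
        ancestor = parent[taxid]
        while ancestor is not None:
            totals[ancestor] += direct
            ancestor = parent[ancestor]
    return totals


totals = clade_counts(rows)
print(totals)
```

The totals computed this way agree with the second column of the quoted report (55 for Ruegeria, 34 for Ruegeria pomeroyi), while the direct counts for both of those taxa are 0. A table built from one column can therefore differ substantially from a table built from the other.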

Updating conda and PyPI

Hello! We use your software regularly and it has allowed us to integrate Kraken results seamlessly with our downstream analysis scripts.
It would be a great help if you could publish a new stable release on conda and PyPI though. The published version 1.0.1 still has bugs that have been fixed in the current GitHub version. We have been using the old version without realizing and it's not doing your code justice!
Thanks!

Include taxonomic levels below species

Dear @smdabdoub ,

Thank you for creating this tool!
I noticed that there is an option to export kraken report to a level below species level. Kraken reports the levels species (S), followed by levels S1, S2 and S3. I however cannot get this option to work below species level and receive an empty biom-file.

kraken-biom file.kreport --fmt json --min SS -o file.biom (version 1.2.0)

Using the option on species level (--min S) does generate a complete biom file.
Ideally I would like to generate a biom-file and include everything up until S3 level. Is this possible with the current version and/or could you send/publish the code that would allow for this?

Cheers,
Paul

biom file not working with Phinch

Hi,

I have analysed some metagenomic samples using Kraken2 and I would like to visualise my taxonomic profiles using Phinch (http://phinch.org).
I used kraken-biom to transform the Kraken output to biom format, but when I upload the files to Phinch the biom files are not accepted by the software.

Have you ever faced this issue before?
I would appreciate any suggestions.

Thank you in advance.

Best
Leonardos

Instructions

I have kraken2 outputs in the form of *.report and *.kraken files.

I am trying to make a biom file for phyloseq and would like some elaboration on the kraken_reports parameter. What is it for and how do you use it?

I have generated a biom table from just the reports. However, when reading the table into R I am missing taxa names.

Any help appreciated!

Thanks

It seems that unclassified reads are ignored

Thanks a lot for this very handy tool! It makes it very convenient to create a biom file and to then use this in R, especially in combination with the phyloseq and ampvis packages.

I observed an unexpected behavior in that the biom file seems to ignore the Unclassified reads information from the Kraken2 report.

Is this intended or am I missing something?

Thank you very much again!

Best wishes and stay safe,

Cedric

Abundance calculation question

Hello, I used kraken-biom to perform an abundance calculation from kraken2 taxonomic classification output.

My question concerns a classification made from shotgun metagenomic reads. kraken2 can assign several reads that come from the same cell (say 4, for this example) to the same species. When calculating the relative abundance of that species, this can suggest that the species is present 4 times in the sample, when in reality those 4 reads came from a single cell and the real abundance is 1.

How does kraken-biom handle this? Or is it handled during kraken2 classification?

Add sample_data info to biom file

Hello,
I am using the kraken-biom script to convert kraken2 report files to a biom file to use with phyloseq in R. I managed to produce a single biom file from 90 kraken reports, but after using import_biom from the phyloseq package I get a phyloseq-class object with only otu_table and tax_table, and no sample_table.

How can we add the sample_table to the biom file? I also tried using biom add-metadata with a text file containing IDs and some group info, but it doesn't seem to work.

Thanks in advance for the help

regards

Nicolas

Errors while using kraken-biom

Hi, I used pip install git+http://github.com/smdabdoub/kraken-biom.git
Following which I tried kraken-biom on 2 samples for trial:
kraken-biom C1C_S92_krak_out.txt C1T1_S39_krak_out.txt and also
kraken-biom C1C_S92_krakenreport.txt C1T1_S39_krakenreport.txt

In both cases I ended up getting the following error:
TypeError: expected string or bytes-like object

My kraken report file is an mpa-style file, while the output file is in the standard format.

Any help would be appreciated. Thank you.
DP

kraken2?

Does it work with kraken2 outputs?

Cheers and many thanks
Rick

New release needed

Hi @smdabdoub,

for a bioconda and Galaxy integration we would need a new release tarball.

Thanks a lot,
Engy and Bjoern

Bracken input error: "biom.exception.TableException: Cannot delimit self if I don't have data..."

Hello,

I have been trying to run kraken-biom on some bracken files I have, but there seems to be a compatibility issue.

I have been running the command:
kraken-biom /PATH/*.bracken --fmt tsv --gzip -o table.tsv

However, I get the following error:

Traceback (most recent call last):
  File "/home/people/tatfeu/Software/kraken-biom/bin/kraken-biom", line 11, in
    sys.exit(main())
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/kraken_biom.py", line 379, in main
    out_fp = write_biom(biomT, args.output_fp, args.fmt, args.gzip)
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/kraken_biom.py", line 228, in write_biom
    biom_f.write(biomT.to_tsv())
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/biom/table.py", line 5096, in to_tsv
    direct_io=direct_io)
  File "/home/people/tatfeu/Software/kraken-biom/lib64/python3.6/site-packages/biom/table.py", line 1583, in delimited_self
    raise TableException("Cannot delimit self if I don't have data...")
biom.exception.TableException: Cannot delimit self if I don't have data...

I have looked and the bracken files are all accounted for and contain tab delimited data. I have tried running with json output as well and although I don't receive an error the output only contains column names and no data.

Can bracken files be used with kraken-biom? If they can, do you know what the root of this error might be?

Thanks

Setting --max and --min to the same level causes an error.

Users should be able to extract reads assigned to a single taxonomic level, e.g. Species. With Bracken now available this is especially useful (see issue #2).

However, setting both to the same level triggers the sanity checking mechanism (making sure users don't accidentally invert the order of --max and --min) because the check uses a >= instead of just a >.

The offending line: https://github.com/smdabdoub/kraken-biom/blob/master/kraken_biom.py#L366
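
A sketch of the comparison described above, assuming the ranks are kept in a list ordered from domain to species. The names `RANKS` and `rank_order_is_valid` are hypothetical, and the code is not copied from the linked line. Rejecting only a strictly inverted order lets --max and --min name the same level.

```python
RANKS = ["D", "P", "C", "O", "F", "G", "S"]  # highest to lowest (assumed order)


def rank_order_is_valid(max_rank, min_rank):
    """Reject only a truly inverted order; equal ranks are allowed."""
    # A check written with >= would also reject max_rank == min_rank.
    return not (RANKS.index(max_rank) > RANKS.index(min_rank))


print(rank_order_is_valid("S", "S"))
print(rank_order_is_valid("D", "S"))
print(rank_order_is_valid("S", "D"))
```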
