ivarz / conifer

Calculate confidence scores from Kraken2 output

License: BSD 2-Clause "Simplified" License

C 97.94% Makefile 2.06%

kraken2 confidence-scores metagenomic-analysis

conifer's Introduction

Dependencies

gcc

zlib

cmake if building tests

Building

git clone https://github.com/Ivarz/Conifer && cd Conifer
git submodule update --init --recursive
make

To build the tests, use:

make tests

Building docker image

To build a Docker image, follow the instructions at conifer-docker (thanks to @Midnighter).

Basic usage

To use this tool you need the standard output file from Kraken2 and the taxonomy database file (taxo.k2d). The following command calculates a confidence score for each classified read. Note that this output does not include a header line. For paired-end reads, the confidence score of each read and the average of the two are reported. Only classified reads are reported by default.

./conifer -i test_files/example.out.txt -d test_files/taxo.k2d

Kraken standard output	read1 confidence score	read2 confidence score	average confidence score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12	0.1515	0.3939	0.2727
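For downstream filtering, the output line above can be split on tabs. The following is a minimal Python sketch, not part of conifer itself. It assumes paired-end output in which the last three tab-separated columns are the read1, read2 and average scores, and that the leading columns follow the Kraken2 standard layout (status, read ID, taxid, ...). The function names `parse_paired_line` and `filter_by_average` are hypothetical.

```python
def parse_paired_line(line):
    """Split one line of paired-end conifer output into named values."""
    fields = line.rstrip("\n").split("\t")
    # Kraken2 standard output starts with: status, read ID, taxid.
    status, read_id, taxid = fields[0], fields[1], fields[2]
    # Assumption: the three trailing columns are read1, read2 and average score.
    read1, read2, average = (float(value) for value in fields[-3:])
    return {
        "status": status,
        "read_id": read_id,
        "taxid": taxid,
        "read1": read1,
        "read2": read2,
        "average": average,
    }


def filter_by_average(lines, threshold):
    """Yield parsed records whose average score is at least `threshold`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_paired_line(line)
        if record["average"] >= threshold:
            yield record


# Example line; the k-mer mapping is shortened here purely for illustration.
example = "\t".join([
    "C", "V100006960L1C001R001000420", "853", "100|100",
    "0:16 853:8 |:| 748224:7 0:2",
    "0.1515", "0.3939", "0.2727",
])
kept = list(filter_by_average([example], threshold=0.25))
print(kept[0]["taxid"], kept[0]["average"])
```

The same parser yields the taxid next to the scores, which may be enough when a per-read taxid/score table is wanted.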

Use the --rtl option to obtain RTL scores:

./conifer --rtl -i test_files/example.out.txt -d test_files/taxo.k2d

Kraken standard output	read1 RTL score	read2 RTL score	average RTL score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12	0.3636	0.6364	0.5000

Use the --both_scores option to obtain confidence and RTL scores simultaneously:

./conifer --both_scores -i test_files/example.out.txt -d test_files/taxo.k2d

Kraken standard output	read1 confidence score	read2 confidence score	average confidence score	read1 RTL score	read2 RTL score	average RTL score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12	0.1515	0.3939	0.2727	0.3636	0.6364	0.5000

To calculate the 25th, 50th and 75th percentiles of the confidence score for each assigned taxon, use the -s option. For paired-end reads, the average score of each pair is summarized. For brevity, only the first 5 lines of the summary are shown.

./conifer -s -i test_files/example.out.txt -d test_files/taxo.k2d

taxon_name	taxid	reads	P25	P50	P75
Faecalibacterium prausnitzii	853	3	0.2200	0.2730	0.4320
Anaerobutyricum hallii	39488	1	0.5000	0.5000	0.5000
Lachnospiraceae	186803	1	0.5000	0.5000	0.5000
Clostridiales	186802	3	0.4920	0.7200	1.0000
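The per-taxon summary can be approximated outside of conifer. The sketch below groups average scores by taxid and computes quartiles with Python's `statistics.quantiles`. The interpolation method conifer uses is not stated here, so the `"inclusive"` method is an assumption and results may differ slightly from conifer's own report. The input scores are made up for illustration and the name `summarize` is hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def summarize(rows):
    """Map taxid -> (read count, P25, P50, P75) from (taxid, score) pairs."""
    by_taxid = defaultdict(list)
    for taxid, score in rows:
        by_taxid[taxid].append(score)

    summary = {}
    for taxid, scores in by_taxid.items():
        if len(scores) == 1:
            # quantiles() needs at least two points; a single read is its own quartile.
            p25 = p50 = p75 = scores[0]
        else:
            # Assumption: linear interpolation including both end points.
            p25, p50, p75 = quantiles(scores, n=4, method="inclusive")
        summary[taxid] = (len(scores), p25, p50, p75)
    return summary


# Illustrative, made-up average scores (not taken from the example files).
rows = [(853, 0.2), (853, 0.4), (853, 0.6), (39488, 0.5)]
summary = summarize(rows)
for taxid, (reads, p25, p50, p75) in summary.items():
    print(f"{taxid}\t{reads}\t{p25:.4f}\t{p50:.4f}\t{p75:.4f}")
```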

A similar report can be generated for RTL scores:

./conifer --rtl -s -i test_files/example.out.txt -d test_files/taxo.k2d

taxon_name	taxid	reads	P25	P50	P75
Faecalibacterium prausnitzii	853	3	0.3480	0.3480	0.4320
Anaerobutyricum hallii	39488	1	0.5000	0.5000	0.5000
Lachnospiraceae	186803	1	0.5000	0.5000	0.5000
Clostridiales	186802	3	0.7200	1.0000	1.0000

and simultaneous reporting of both scores:

./conifer --both_scores -s -i test_files/example.out.txt -d test_files/taxo.k2d

taxon_name	taxid	reads	P25_conf	P50_conf	P75_conf	P25_rtl	P50_rtl	P75_rtl
Faecalibacterium prausnitzii	853	3	0.2200	0.2730	0.4320	0.3480	0.3480	0.4320
Anaerobutyricum hallii	39488	1	0.5000	0.5000	0.5000	0.5000	0.5000	0.5000
Lachnospiraceae	186803	1	0.5000	0.5000	0.5000	0.5000	0.5000	0.5000
Clostridiales	186802	3	0.4920	0.7200	1.0000	0.7200	1.0000	1.0000

Note on score calculation

Schematic representation of confidence and RTL score calculation from the classification tree. White nodes represent the final assigned taxonomy. Numbers indicate the read k-mer count assigned to a particular taxonomy. The confidence score is calculated as the fraction of k-mers assigned to the final taxonomy and its descendants, as denoted by the blue rectangle (left); the RTL score is calculated from the descendants and ancestors of the final taxonomy (right).
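The confidence score described above can be reproduced by hand for the example read pair. The Python sketch below is an illustration, not conifer's implementation. It assumes that taxid 748224 is a descendant of the assigned taxid 853 (which is what the reported scores imply), that unclassified k-mers (taxid 0) count toward the denominator, and that ambiguous k-mers (`A:n` tokens) are excluded, following Kraken2's convention. The RTL score is not covered because it needs the full lineage from the taxonomy file.

```python
def parse_kmer_field(field):
    """Parse a Kraken2 k-mer string such as '0:16 853:8' into (taxid, count) pairs."""
    pairs = []
    for token in field.split():
        taxid, count = token.split(":")
        if taxid == "A":
            continue  # ambiguous k-mers are left out of the calculation
        pairs.append((int(taxid), int(count)))
    return pairs


def confidence(pairs, clade):
    """Fraction of k-mers assigned to any taxid in `clade`."""
    total = sum(count for _, count in pairs)
    in_clade = sum(count for taxid, count in pairs if taxid in clade)
    return in_clade / total if total else 0.0


# Assumption: the clade of 853 seen in this read pair is {853, 748224}.
CLADE_853 = {853, 748224}

read1 = parse_kmer_field(
    "0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18"
)
read2 = parse_kmer_field(
    "748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12"
)

c1 = confidence(read1, CLADE_853)  # 10 of 66 k-mers
c2 = confidence(read2, CLADE_853)  # 26 of 66 k-mers
print(f"{c1:.4f} {c2:.4f} {(c1 + c2) / 2:.4f}")  # 0.1515 0.3939 0.2727
```

Under these assumptions the three values agree with the scores conifer reports for this read pair in the basic usage example.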

conifer's Issues

length and confidence length differ

Hi, Ivarz:
Thanks for your convenient tool.
I am trying to calculate confidence scores from Kraken2 results. Why does the sum of the k-mer counts not equal the read length of 100?

C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12
read1 : 16+8+2+2+2+5+6+2+5+18=66,
read2: 7+2+5+21+4+7+5+3+12=66.
Thanks!
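
A likely explanation, assuming the database was built with Kraken2's default k-mer length of 35: the numbers after each colon count k-mers, not bases, and a read of length L contains L - k + 1 k-mers. For a 100 bp read that is 100 - 35 + 1 = 66, which matches both sums. A short Python check of this arithmetic (the value of `K` is the assumption here):

```python
K = 35  # assumption: Kraken2's default k-mer length
READ_LENGTH = 100

read1_counts = [16, 8, 2, 2, 2, 5, 6, 2, 5, 18]
read2_counts = [7, 2, 5, 21, 4, 7, 5, 3, 12]

# A read of length L contains L - k + 1 overlapping k-mers.
expected_kmers = READ_LENGTH - K + 1

print(sum(read1_counts), sum(read2_counts), expected_kmers)  # 66 66 66
```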

Bioconda package

Hi @Ivarz,

I recently created a bioconda package for conifer. Hope it's useful for you or someone else.

https://anaconda.org/bioconda/conifer

Report the taxid and/or name

Hello again,

I've been using Conifer for a bit now and I find it very useful. Thank you for that. At the moment, Conifer in its simplest form reports

kraken output	read1 confidence	read2 confidence	average

Since Conifer can obviously do this, as seen for the summary report, I would love to get the output as

taxid	name (optional)	read1 confidence	read2 confidence	average

and simply have additional rows for the same taxid. Does this make sense? Would you consider adding this output option? Or maybe there is a different simple way to map the kraken output to the taxid that I am missing right now.

New release

Would it be possible to make a new release after adding the --help message/option? That way the bioconda recipe will trigger a new release too and the bioconda install of conifer will then have access to that option.

conifer output not to specific readIDs

Hello,

Thanks for developing this tool! I have recently come across it and thought it would help me to fine tune the accuracy of my Kraken2 results. I read in the readme file that it generates the confidence scores for each readID. However, in my conifer output file, I see the confidence score for each taxid/taxname, as opposed to readID. Is there anything I did wrong?

Thanks,
Elly

Docker image

Hi @Ivarz,

I recently created a Docker image for conifer. I thought, I'd leave it here in case it's useful for you or someone else.

# Copyright (c) 2020, Moritz E. Beber.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM bitnami/minideb:buster AS builder

RUN set -eux \
    && install_packages \
        build-essential \
        ca-certificates \
        git \
        libz-dev

WORKDIR /opt

RUN set -eux \
    && git clone https://github.com/Ivarz/Conifer.git \
    && cd Conifer \
    && git submodule update --init --recursive \
    && gcc -static -std=c99 -Wall -Wextra -O3 -D_POSIX_C_SOURCE=200809L -I third_party/uthash/src -I . src/utils.c src/kraken_stats.c src/kraken_taxo.c src/main.c -o conifer -l:libm.a -l:libz.a

FROM busybox:glibc

COPY --from=builder /opt/Conifer/conifer /

ENTRYPOINT ["/conifer"]

I'll track progress of this file over here.

Downstream analysis with dask

I wanted to let you know that I've created a Python package that uses dask for distributed analysis of conifer output files. You can find the repo on GH and install the Python package from PyPI. It's quite minimal so far but feel free to create issues or contribute other kinds of analyses.

read length percentile

Hi Ivar,

Thank you for making this useful tool! I was wondering if it is also possible to get the read length percentiles for each taxon assignment.

Thanks!
Hena

Missing output

I wanted to try out your tool as you recommended in my issue on kraken. I started it with:

./conifer --both_scores -s -i kraken.out.txt -d /scratch/databases/Standard_v2/taxo.k2d

then saw output

1000000 lines processed...
2000000 lines processed...
3000000 lines processed...
4000000 lines processed...
5000000 lines processed...
6000000 lines processed...
7000000 lines processed...
8000000 lines processed...
9000000 lines processed...
10000000 lines processed...
11000000 lines processed...
12000000 lines processed...
13000000 lines processed...
14000000 lines processed...
15000000 lines processed...
16000000 lines processed...
17000000 lines processed...
18000000 lines processed...
19000000 lines processed...
20000000 lines processed...
21000000 lines processed...
22000000 lines processed...
23000000 lines processed...
24000000 lines processed...
25000000 lines processed...
26000000 lines processed...
27000000 lines processed...
28000000 lines processed...
29000000 lines processed...
30000000 lines processed...
31000000 lines processed...
32000000 lines processed...
33000000 lines processed...
34000000 lines processed...
35000000 lines processed...
36000000 lines processed...
37000000 lines processed...
38000000 lines processed...
39000000 lines processed...
40000000 lines processed...
41000000 lines processed...
42000000 lines processed...
taxon_name      taxid   reads   P25_conf        P50_conf        P75_conf        P25_rtl P50_rtl P75_rtl

I expected to see more in the table. Any ideas what could cause this?