conifer's Introduction

Dependencies

gcc

zlib

cmake if building tests

Building

git clone https://github.com/Ivarz/Conifer && cd Conifer
git submodule update --init --recursive
make

To build the tests, run:

make tests

Building a Docker image

To build a Docker image, follow the instructions at conifer-docker (thanks to @Midnighter).

Basic usage

To use this tool you need the standard output file from kraken2 and the taxonomy database file (taxo.k2d). The following command calculates a confidence score for each classified read. Note that the actual output does not include a header; the column labels in the example are shown only for orientation. For paired-end reads, the confidence score of each read and the average of the two are reported. Only classified reads are reported by default.

./conifer -i test_files/example.out.txt -d test_files/taxo.k2d
Kraken standard output read1 confidence score read2 confidence score average confidence score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12 0.1515 0.3939 0.2727
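
For downstream filtering it can help to load each scored line into a record. The sketch below is a minimal illustration, not part of Conifer: it assumes the three score columns are appended as tab-separated fields to the five tab-separated columns of Kraken2's standard output, and the names `ScoredRead` and `parse_scored_line` are hypothetical.

```python
from typing import NamedTuple


class ScoredRead(NamedTuple):
    status: str         # "C" for classified reads
    read_id: str
    taxid: int          # taxon assigned by Kraken2
    lengths: str        # e.g. "100|100" for a read pair
    lca_hits: str       # space-separated "taxid:count" tokens
    read1_score: float
    read2_score: float
    avg_score: float


def parse_scored_line(line: str) -> ScoredRead:
    """Parse one paired-end line: 5 Kraken2 columns + 3 scores (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    status, read_id, taxid, lengths, lca_hits = fields[:5]
    read1, read2, avg = (float(value) for value in fields[5:])
    return ScoredRead(status, read_id, int(taxid), lengths, lca_hits, read1, read2, avg)


# The example read from above, rebuilt with tabs between columns.
example = "\t".join([
    "C", "V100006960L1C001R001000420", "853", "100|100",
    "0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18"
    " |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12",
    "0.1515", "0.3939", "0.2727",
])
record = parse_scored_line(example)
print(record.taxid, record.avg_score)
```

A record like this makes it straightforward to keep only reads whose average score exceeds a chosen cutoff.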

Use the --rtl option to obtain RTL scores:

./conifer --rtl -i test_files/example.out.txt -d test_files/taxo.k2d
Kraken standard output read1 RTL score read2 RTL score average RTL score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12 0.3636 0.6364 0.5000

Use the --both_scores option to obtain confidence and RTL scores simultaneously:

./conifer --both_scores -i test_files/example.out.txt -d test_files/taxo.k2d
Kraken standard output read1 confidence score read2 confidence score average confidence score read1 RTL score read2 RTL score average RTL score
C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12 0.1515 0.3939 0.2727 0.3636 0.6364 0.5000

To calculate the 25th, 50th and 75th percentiles of the confidence score for each assigned taxon, use the -s option. For paired-end reads, the average score of each pair is summarized. For brevity, only the first 5 lines of the summary are shown.

./conifer -s -i test_files/example.out.txt -d test_files/taxo.k2d
taxon_name taxid reads P25 P50 P75
Faecalibacterium prausnitzii 853 3 0.2200 0.2730 0.4320
Anaerobutyricum hallii 39488 1 0.5000 0.5000 0.5000
Lachnospiraceae 186803 1 0.5000 0.5000 0.5000
Clostridiales 186802 3 0.4920 0.7200 1.0000
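
A rough sketch of this kind of per-taxon summary follows. It groups scores by taxid and computes percentiles by linear interpolation between sorted values. The interpolation rule Conifer itself uses is not stated here, so the numbers need not match its output. The function names and the input scores are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(sorted_values: List[float], q: float) -> float:
    """Percentile q (0-100) of an ascending list, by linear interpolation."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize(scores: Iterable[Tuple[int, float]]) -> Dict[int, Tuple[int, float, float, float]]:
    """Map each taxid to (read count, P25, P50, P75) of its scores."""
    by_taxid: Dict[int, List[float]] = defaultdict(list)
    for taxid, score in scores:
        by_taxid[taxid].append(score)
    summary = {}
    for taxid, values in by_taxid.items():
        values.sort()
        summary[taxid] = (
            len(values),
            percentile(values, 25),
            percentile(values, 50),
            percentile(values, 75),
        )
    return summary


# Hypothetical (taxid, average score) pairs, one per read pair.
summary = summarize([(853, 0.2), (853, 0.3), (853, 0.5), (39488, 0.5)])
for taxid, (reads, p25, p50, p75) in summary.items():
    print(f"{taxid}\t{reads}\t{p25:.4f}\t{p50:.4f}\t{p75:.4f}")
```

With a single read for a taxon, all three percentiles collapse to that read's score, which is consistent with the single-read rows in the report above.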

A similar report can be generated for RTL scores:

./conifer --rtl -s -i test_files/example.out.txt -d test_files/taxo.k2d
taxon_name taxid reads P25 P50 P75
Faecalibacterium prausnitzii 853 3 0.3480 0.3480 0.4320
Anaerobutyricum hallii 39488 1 0.5000 0.5000 0.5000
Lachnospiraceae 186803 1 0.5000 0.5000 0.5000
Clostridiales 186802 3 0.7200 1.0000 1.0000

Both scores can also be summarized simultaneously:

./conifer --both_scores -s -i test_files/example.out.txt -d test_files/taxo.k2d
taxon_name taxid reads P25_conf P50_conf P75_conf P25_rtl P50_rtl P75_rtl
Faecalibacterium prausnitzii 853 3 0.2200 0.2730 0.4320 0.3480 0.3480 0.4320
Anaerobutyricum hallii 39488 1 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
Lachnospiraceae 186803 1 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
Clostridiales 186802 3 0.4920 0.7200 1.0000 0.7200 1.0000 1.0000

Note on score calculation

Schematic representation of how the confidence and RTL scores are calculated from the classification tree. White nodes represent the final assigned taxon. Numbers indicate the count of read k-mers assigned to a particular taxon. The confidence score is the fraction of k-mers assigned to the final taxon and its descendants, as denoted by the blue rectangle (left); the RTL score is calculated from both the descendants and the ancestors of the final taxon (right).
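
The confidence score described above can be sketched in a few lines of Python. This is a minimal illustration, not Conifer's actual implementation (which is written in C). The helper names and the `parent` map are hypothetical, and the single parent link 748224 -> 853 is an assumption, chosen because it is consistent with the scores reported for the example read under Basic usage. Skipping tokens with a non-numeric taxid is likewise an assumption. Only the confidence score is covered.

```python
from typing import Dict, List, Tuple


def parse_hits(hit_string: str) -> List[Tuple[int, int]]:
    """Turn '853:8 0:16' into [(853, 8), (0, 16)]; taxid 0 means unassigned."""
    hits = []
    for token in hit_string.split():
        taxid, count = token.split(":")
        if not taxid.isdigit():
            continue  # assumption: skip non-numeric markers such as ambiguous k-mers
        hits.append((int(taxid), int(count)))
    return hits


def in_clade(taxid: int, root: int, parent: Dict[int, int]) -> bool:
    """True if taxid is root or has root among its ancestors."""
    while taxid != root:
        next_taxid = parent.get(taxid)
        if next_taxid is None or next_taxid == taxid:
            return False  # reached the top of the known tree
        taxid = next_taxid
    return True


def confidence(hit_string: str, assigned: int, parent: Dict[int, int]) -> float:
    """Fraction of k-mers assigned to the final taxon or its descendants."""
    hits = parse_hits(hit_string)
    total = sum(count for _, count in hits)
    clade = sum(count for taxid, count in hits if in_clade(taxid, assigned, parent))
    return clade / total if total else 0.0


# Hypothetical child -> parent links; only the one needed for this read.
parent = {748224: 853}
read1 = "0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18"
read2 = "748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12"

score1 = confidence(read1, 853, parent)  # 10 of 66 k-mers
score2 = confidence(read2, 853, parent)  # 26 of 66 k-mers
average = (score1 + score2) / 2
print(f"{score1:.4f} {score2:.4f} {average:.4f}")  # 0.1515 0.3939 0.2727
```

A real run would need the full child-to-parent map from the taxonomy database rather than a single hand-written link.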

conifer's People

Contributors

ivarz, mbhall88

conifer's Issues

length and confidence length differ

Hi, Ivarz:
Thanks for your convenient tool.
I am trying to calculate the confidence score from a kraken2 result. I am wondering why the summed length is not equal to 100.

C V100006960L1C001R001000420 853 100|100 0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18 |:| 748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12
read1 : 16+8+2+2+2+5+6+2+5+18=66,
read2: 7+2+5+21+4+7+5+3+12=66.
Thanks!
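
The counts in the hit list are k-mers, not bases, so for each read they sum to the read length minus k plus 1. A minimal check, assuming the database was built with Kraken2's default k-mer length of 35 (the issue does not state k, and the function names are illustrative):

```python
def kmer_count(read_length: int, k: int) -> int:
    """Number of overlapping k-mers in a read of the given length."""
    return max(read_length - k + 1, 0)


def total_kmers(hit_string: str) -> int:
    """Sum the counts in a Kraken-style 'taxid:count' hit list."""
    return sum(int(token.rsplit(":", 1)[1]) for token in hit_string.split())


read1 = "0:16 853:8 1783272:2 748224:2 1783272:2 168384:5 186801:6 0:2 168384:5 0:18"
read2 = "748224:7 0:2 748224:5 0:21 853:4 748224:7 0:5 748224:3 0:12"

# Both mates are 100 bases long; k = 35 is an assumed default.
print(total_kmers(read1), total_kmers(read2), kmer_count(100, 35))  # 66 66 66
```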

Report the taxid and/or name

Hello again,

I've been using Conifer for a bit now and I find it very useful. Thank you for that. At the moment, Conifer in its simplest form reports

kraken output read1 confidence read2 confidence average

Since Conifer can obviously do this, as seen for the summary report, I would love to get the output as

taxid name (optional) read1 confidence read2 confidence average

and simply have additional rows for the same taxid. Does this make sense? Would you consider adding this output option? Or maybe there is a different simple way to map the kraken output to the taxid that I am missing right now.

New release

Would it be possible to make a new release after adding the --help message/option? That way the bioconda recipe will trigger a new release too and the bioconda install of conifer will then have access to that option.

conifer output not to specific readIDs

Hello,

Thanks for developing this tool! I have recently come across it and thought it would help me to fine tune the accuracy of my Kraken2 results. I read in the readme file that it generates the confidence scores for each readID. However, in my conifer output file, I see the confidence score for each taxid/taxname, as opposed to readID. Is there anything I did wrong?

Thanks,
Elly

Docker image

Hi @Ivarz,

I recently created a Docker image for conifer. I thought I'd leave it here in case it's useful for you or someone else.

# Copyright (c) 2020, Moritz E. Beber.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM bitnami/minideb:buster AS builder

RUN set -eux \
    && install_packages \
        build-essential \
        ca-certificates \
        git \
        libz-dev

WORKDIR /opt

RUN set -eux \
    && git clone https://github.com/Ivarz/Conifer.git \
    && cd Conifer \
    && git submodule update --init --recursive \
    && gcc -static -std=c99 -Wall -Wextra -O3 -D_POSIX_C_SOURCE=200809L -I third_party/uthash/src -I . src/utils.c src/kraken_stats.c src/kraken_taxo.c src/main.c -o conifer -l:libm.a -l:libz.a

FROM busybox:glibc

COPY --from=builder /opt/Conifer/conifer /

ENTRYPOINT ["/conifer"]

I'll track progress of this file over here.

read length percentile

Hi Ivar,

Thank you for making this useful tool! I was wondering if it is also possible to get the read length percentiles for each taxon assignment.

Thanks!
Hena

Missing output

I wanted to try out your tool as you recommended in my issue on kraken. I started it with:

./conifer --both_scores -s -i kraken.out.txt -d /scratch/databases/Standard_v2/taxo.k2d

then saw output

1000000 lines processed...
2000000 lines processed...
3000000 lines processed...
4000000 lines processed...
5000000 lines processed...
6000000 lines processed...
7000000 lines processed...
8000000 lines processed...
9000000 lines processed...
10000000 lines processed...
11000000 lines processed...
12000000 lines processed...
13000000 lines processed...
14000000 lines processed...
15000000 lines processed...
16000000 lines processed...
17000000 lines processed...
18000000 lines processed...
19000000 lines processed...
20000000 lines processed...
21000000 lines processed...
22000000 lines processed...
23000000 lines processed...
24000000 lines processed...
25000000 lines processed...
26000000 lines processed...
27000000 lines processed...
28000000 lines processed...
29000000 lines processed...
30000000 lines processed...
31000000 lines processed...
32000000 lines processed...
33000000 lines processed...
34000000 lines processed...
35000000 lines processed...
36000000 lines processed...
37000000 lines processed...
38000000 lines processed...
39000000 lines processed...
40000000 lines processed...
41000000 lines processed...
42000000 lines processed...
taxon_name      taxid   reads   P25_conf        P50_conf        P75_conf        P25_rtl P50_rtl P75_rtl

I expected to see more in the table. Any ideas what could cause this?
