Giter Site home page Giter Site logo

cramino's Introduction

CRAMINO

A tool for quick quality assessment of cram and bam files, intended for long read sequencing.

Installation

Preferably, for most users, download a ready-to-use binary for your system to add directory on your $PATH from the releases.
You may have to change the file permissions to execute it with chmod +x cramino

Alternatively, use conda to install
conda install -c bioconda cramino

Or for Rust developers, build cramino with cargo:
cargo install cramino

Usage

cramino [OPTIONS] <INPUT>

Arguments:
  [INPUT]  cram or bam file to check [default: -]

Options:
  -t, --threads <THREADS>            Number of parallel decompression threads to use [default: 4]
      --reference <REFERENCE>        reference for decompressing cram
  -m, --min-read-len <MIN_READ_LEN>  Minimal length of read to be considered [default: 0]
      --hist                         If histograms have to be generated
      --checksum                     If a checksum has to be calculated
      --arrow <ARROW>                Write data to an arrow format file
      --karyotype                    Provide normalized number of reads per chromosome
      --phased                       Calculate metrics for phased reads
      --spliced                      Provide metrics for spliced data
      --ubam                         Provide metrics for unaligned reads
  -h, --help                         Print help
  -V, --version                      Print version

Example output

File name       example.cram
Number of reads 14108020
% from total reads  83.45
Yield [Gb]      139.91
N50     17447
Median length   6743.00
Mean length     9917
Median identity 94.27
Mean identity   92.53
Path    alignment/example.cram
Creation time   09/09/2022 10:53:36

A 140Gbase bam file is processed in 12 minutes, using <1Gbyte of memory. Note that the identity score above is defined as the gap-compressed identity. The --ubam flag will provide metrics for all reads in the file, regardless of whether they are aligned or not. The % from total reads output field contains the percentage of reads used for this report, depending on the --min-read-len and --ubam settings. Without both of those, this indicates the % of reads that are mapped, primary or supplementary.

Optional output

  • a checksum to check if files were updated/changed or corrupted. (--checksum)
  • an arrow file for use within NanoPlot and NanoComp (--arrow <filename>)
  • calculating a normalised number of reads per chromosome, e.g. to determine the sex or aneuploidies (--karyotype)
  • information about the phase blocks. (--phased)
  • information about number of splice sites. (--spliced)
  • histograms of read lengths and read identities, as below. (--hist). With --phased, also a histogram of phase block lengths. Please let me know if the histograms look inappropriately scaled for your data.
# Histogram for read lengths:
     0-2000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  2000-4000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  4000-6000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  6000-8000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 8000-10000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
10000-12000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
12000-14000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
14000-16000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
16000-18000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
18000-20000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
20000-22000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
22000-24000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
24000-26000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
26000-28000 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
28000-30000 ∎∎∎∎∎∎∎∎∎∎∎∎
30000-32000 ∎∎∎∎∎∎∎∎∎
32000-34000 ∎∎∎∎∎∎
34000-36000 ∎∎∎∎
36000-38000 ∎∎
38000-40000 ∎
40000-42000 ∎
42000-44000 ∎
44000-46000 
46000-48000 
48000-50000 
50000-52000 
52000-54000 
54000-56000 
56000-58000 
58000-60000 
     60000+ 


# Histogram for Phred-scaled accuracies:
  Q0-1 
  Q1-2 
  Q2-3 
  Q3-4 
  Q4-5 
  Q5-6 ∎∎∎
  Q6-7 ∎∎∎∎∎∎∎∎∎∎∎∎
  Q7-8 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  Q8-9 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 Q9-10 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q10-11 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q11-12 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q12-13 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q13-14 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q14-15 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q15-16 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q16-17 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q17-18 ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Q18-19 ∎∎∎∎
Q19-20 ∎
Q20-21 
Q21-22 
Q22-23 
Q23-24 
Q24-25 
Q25-26 
Q26-27 
Q27-28 
Q28-29 
Q29-30 
Q30-31 
Q31-32 
Q32-33 
Q33-34 
Q34-35 
Q35-36 
Q36-37 
Q37-38 
Q38-39 
Q39-40 
  Q40+ 

CITATION

If you use this tool, please consider citing our publication.

cramino's People

Contributors

adoni5 avatar asleonard avatar fellen31 avatar mbhall88 avatar wdecoster avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

cramino's Issues

Reads length within specific regions

Hi, could you consider adding an option to calculate statistics from one or more specific regions of a long-read sequencing dataset? Thank you in advance.

Option to exclude contigs in the reference

Hi! Thanks for a great tool. I was wondering if you could implement an option to remove a contig or a list of contigs when calculating the stats. Sometimes I have bacterial contaminations, or the mitochondrial contig has crazy coverage compared to the chromosomes, so I fear those values might inflate upwards the average depth of coverage in the report. Rather than re-map the reads to a reduced reference genome, I think it would be great if Cramino could consider it if asked.

Thanks!

Should --ubam really include secondary alignments?

The --ubam flag will provide metrics for all reads in the file, regardless of whether they are aligned or not

When running with --ubam on a file with aligned reads, secondary alignments will get a length of 0 and be included in the median read length calculations, arrow output etc.

test-data/small-test-phased.bam

3fda06e9-62ef-4448-9993-b90124a793d5    0       chr7    152743763       60      52S53M1I40M2I9M3I654M1D304M2I53>
19d9337f-4fb6-46e5-b484-14d05f562506    16      chr7    152743766       60      99M1D599M6D86M1D93M2D4M1D5M1I79>
35febf09-dcbc-424c-987e-9f3f80fe73a5    16      chr7    152748627       60      7S304M2I78M1D306M1D440M3I350M1I>
34884519-a6b4-4cf9-9c07-0822ab6d199d    16      chr7    152755375       60      393M1I105M1I7M2I766M2D145M1I860>
13dbd4b1-d897-42ac-9a9c-a575ddbb70d5    272     chr7    152762381       0       1503S363M1I3M1D566M13D97M1I509M>
3597be79-ad8a-462b-aea9-234c35cad0c8    16      chr7    152763666       60      14190S651M2D325M2D359M2D410M1I2>
d49d98b0-50ce-46a6-a316-e122906e4582    16      chr7    152766418       60      301M1I3M1D371M1D319M1D2M1D1205M>
896f0cb5-0863-4180-818c-c1d2ef49beec    256     chr7    152768160       0       37S175M1I69M1I2M2I28M1I9M1I6M2I

out.arrow

 lengths
     <int>
 1   46025
 2   46029
 3   33090
 4   39496
 5       0
 6   45333
 7   37036
 8       0

If not also running with --min-read-len 1.

 lengths
     <int>
 1   46025
 2   46029
 3   33090
 4   39496
 5   45333
 6   37036
 7   18865
 8   28644

However, running with --ubam will no longer removed soft-clipped bases from the primary alignments.
Compare the output to running without --ubam.

 identities lengths
        <dbl>   <int>
 1       99.2   45973
 2       99.1   45992
 3       99.3   33043
 4       99.1   39464
 5       99.0   16780
 6       99.2   36999
 7       99.3   18802
 8       99.1   15550

If secondary alignments are filtered out by default, should they also not be filtered out by default when running with --ubam?
And should soft-clipped bases then not be removed from primary alignments when running with --ubam?

thread 'main' panicked at 'index out of bounds: when running with `--karyotype`

Command: cramino -t 8 --karyotype --checksum /bams/NA24143.sorted.bam - using latest v0.13.0 release

Output:

File name       NA24143.sorted.bam
Number of alignments    4363
% from total reads      99.73
Yield [Gb]      0.01
Mean coverage   0.00
Yield [Gb] (>25kb)      0.00
N50     7449
N75     3574
Median length   1570.00
Mean length     3267
Median identity 98.05
Mean identity   97.07
[test_bam.zip](https://github.com/wdecoster/cramino/files/13039744/test_bam.zip)

Path    /bams/NA24143.sorted.bam
Creation time   NA
Checksum        6F2F7CAE821EC794587D8E0AAD1AC578
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 18446744073709551615', src/calculations.rs:38:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Test file for reproduction:
test_bam.zip

I've attached the BAM file and index. It's from mapping two FASTQ files from a human reference cell line, to HG38, with NO alt chromosomes, i.e chr1-22, X,Y Mt only.

The number 18446744073709551615 suggests some kind of unsigned overflow type issue, as it's the max u64 value, but is also equal to i64 -1.

https://github.com/wdecoster/cramino/blob/b054f4dec3ff2d732d100a449b3a6310bf5fc85a/src/calculations.rs#L38C29-L38C29

Looking at the line, I suspect when you calculate ind_left and ind_right, you are getting a usize, which when divided by 2 is 0, and minus 1 is overflowing back to the max value.

I will have a quick bash now and doing a checked divide, and if it fixes it will open a PR!
Cheers,
Rory

Feature: Compute & export data for bivariate plots

Thanks for developing cramino. It's very fast! The bivariate plots generated by Nanoplot are very useful but the arrow/feather file computed by cramino doesn't have that info. Could it possible to compute, or add, if already available in cramino, the bivariate data to the arrow/feather file? Nanoplot could then be used for plotting only.

Help understanding Normalized read count per chromosome across techinical replicates

Dear Wouter,

Thanks again for another super fast processing tool for long-read data.

I am comparing some technical replicates across two different sequencing centers and used Cramino to evaluate the Normalized read count per chromosome, only focusing on the main Chromosomes of the human genome (entries with chr).
After some plotting, I noticed a discrepancy from what I expected: the files whose name start with Novo_ have less output, and therefore, it is expected to have a reduced number of Normalized reads count per chromosome. However, I noticed the contrary.

Could you please help me to understand this behavior?
I attached one sample case. Please notice that the Novo_ file has almost half of the reads than the other file (~3millon reads less).

As an extra question, below the chromosomes is a metric with the Media/mean number of exons. Are those numbers also Coverage values for exons? Eg. Novo file Median = 2. Then, normalized read count of 2 for each exon across the genome?

Thanks and all the best,
Niko
115_T00_16_FC2_F.cramino.txt
Novo_115_T00_FC515_F.cramino.txt

Multiple file inputs

It would be useful to be able to put in multiple files as inputs and iterate through them to populate the table.

Mean identity calculated as more than 100%

I'm seeing several cases where mean identity is calculated as over 100%. For example:

Yield [Gb]	39.86
Mean coverage	12.86
N50	21602
Median length	1600.00
Mean length	5308
Median identity	97.98
Mean identity	103.01

I wouldn't expect this to be possible, correct?

Does cramino support unmapped bam?

It works with the cram file I have but crashes when I give it an unmapped bam:

~/utils/cramino -t 4 ABC.ubam --arrow ABC.unmapped.feather
File name ABC.ubam
[2023-07-21T04:19:31Z ERROR cramino] Not enough reads to calculate metrics!
thread 'main' panicked at 'explicit panic', src/main.rs:155:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

By the way, is it possible to also output number of mapped bases and unmapped bases for aligned cram/bam?

Dorado BAM output and stats

Hi Wouter,

Have you run this across an aligned BAM file as output by dorado? It seerms to me that the calculated data is reflecting alignments and not the underlying read data - for example Yield and N50 calculations are incorrect as is Mean coverage etc.

This might be an issue with the dorado BAM file but I think it may also be that you are really looking at alignments?

Cheers

Matt

Not including chromosome with `_` in the name

Hi @wdecoster - this is a continuation of #18!

if !chrom.contains('_') {

This line was the guilty party as to why I had empty normalised alignment counts per contig. The problem is the HG38 reference has _ in the contig names (example : NC_000024.10), and I'm not sure why this check was included. What was it's purpose originally? Could it be removed?

Thanks
Rory

--spliced not supported? (Question)

Is the --spliced supported?, I tried to use but I keep getting this error

error: Found argument '--spliced' which wasn't expected, or isn't valid in this context
If you tried to supply --spliced as a value rather than a flag, use -- --spliced

Plus, it no longer shows in the command line help (-h) message.

thread 'main' panicked at 'Unexpected type of Aux for phaseset

Hi,

I'm trying to run cramino for a phased .bam that was generated from PacBio Revio data using PacBio HiPhase. When I used the --phased option I get the following error

thread 'main' panicked at 'Unexpected type of Aux for phaseset: I32(11863)', src/extract_from_bam.rs:171:18
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
   2: cramino::extract_from_bam::extract
   3: cramino::main

If I remove --phased, cramino runs. I'm wondering if somehow the phaseset ID is inconsistent with what cramino is expecting. When I look at representative phase set IDs in .bam it looks something like this: PS:i:11863. Could this be an issue with integer type I32 vs U32? Seems to think this is I32 however looking in your code looks like its expected U32, which I thought "11863" would be.

fn get_phaseset(record: &bam::Record) -> Option<u32> {
    match record.aux(b"PS") {
        Ok(value) => match value {
            Aux::U8(v) => Some(u32::from(v)),
            Aux::U16(v) => Some(u32::from(v)),
            Aux::U32(v) => Some(v),
            _ => panic!("Unexpected type of Aux for phaseset: {:?}", value),
        },
        Err(_e) => None,
    }
}

Thank you !

thread 'main' panicked at 'Unexpected type of Aux I32(8) error

Hi,

I just wanted to try out Cramino and I installed it using conda. However, on testing it on one of the bam files produced using pbmm2, I am getting this error-

thread 'main' panicked at 'Unexpected type of Aux I32(8)', src/extract_from_bam.rs:164:18 note: run with 'RUST_BACKTRACE=1' environment variable to display a backtrace

This is the cramino command I am using-

time cramino -t 8 --hist --karyotype second_hg38.movie.bam

Any workaround? Is it a bug in the code by any chance? I haven't installed Rust yet, but I guess that it should be installed separately?

Regards,
Prasun

PS-Just to add, I installed Rust (v1.70.0), but the error still persists. I am using cramino v0.9.7 which I installed using conda

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.