Giter Site home page Giter Site logo

bacpop / ska.rust Goto Github PK

View Code? Open in Web Editor NEW
55.0 4.0 4.0 545 KB

Split k-mer analysis – version 2

Home Page: https://docs.rs/ska/latest/ska/

License: Apache License 2.0

Rust 95.38% Shell 0.67% Python 3.95%
bioinformatics sequence ska rust alignment k-mer

ska.rust's Introduction

Split K-mer Analysis (version 2)

Cargo Build & Test docs.rs Clippy check codecov Crates.io GitHub release (latest SemVer)

Description

This is a reimplementation of the SKA package in the rust language, by Johanna von Wachsmann, Simon Harris and John Lees. We are also grateful to have received user contributions from:

  • Romain Derelle
  • Tommi Maklin
  • Joel Hellewell
  • Timothy Russell
  • Nicholas Croucher
  • Dan Lu

Split k-mer analysis (version 2) uses exact matching of split k-mer sequences to align closely related sequences, typically small haploid genomes such as bacteria and viruses.

SKA can only align SNPs further than the k-mer length apart, and does not use a gap penalty approach or give alignment scores. But the advantages are speed and flexibility, particularly the ability to run on a reference-free manner (i.e. including accessory genome variation) on both assemblies and reads.

Documentation

Can be found at https://docs.rs/ska. We also have some tutorials available:

Installation

Choose from:

  1. Download a binary from the releases.
  2. Use cargo install ska or cargo add ska.
  3. Use conda install -c bioconda ska2 (note the two!).
  4. Build from source

For 2) or 4) you must have the rust toolchain installed.

OS X users

If you have an M1/M2 (arm64) Mac, we aren't currently automatically building binaries, so would recommend either option 2) or 4) for best performance.

If you get a message saying the binary isn't signed by Apple and can't be run, use the following command to bypass this:

xattr -d "com.apple.quarantine" ./ska

Build from source

  1. Clone the repository with git clone.
  2. Run cargo install --path . or RUSTFLAGS="-C target-cpu=native" cargo install --path . to optimise for your machine.

Differences from SKA1

Optimisations include:

  • Integer DNA encoding, optimised parsing from FASTA/FASTQ.
  • Faster dictionaries.
  • Full parallelisation of build phase.
  • Smaller, standardised input/output files. Faster to save/load.
  • Reduced memory footprint and increased speed with read filtering.

And other improvements:

  • IUPAC uncertainty codes for multiple copy split k-mers.
  • Uncertainty with self-reverse-complement split k-mers (palindromes).
  • Fully dynamic files (merge, delete samples).
  • Native VCF output for map.
  • Support for known strand sequence (e.g. RNA viruses).
  • Stream to STDOUT, or file with -o.
  • Simpler command line combining ska fasta, ska fastq, ska alleles and ska merge into the new ska build.
  • Option for single commands to run ska align or ska map.
  • New coverage model for filtering FASTQ files with ska cov.
  • Logging.
  • CI testing.

All of which make ska.rust run faster and with smaller file size and memory footprint than the original.

Planned features

  • Sparse data structure which will reduce space and make parallelisation more efficient. Issue #47.
  • 'fastcall' mode. Issue #52.

Feature ideas (not definitely planned)

  • Add support for ambiguity in VCF output (ska map). Issue #5.
  • Non-serial loading of .skf files (for when they are very large). Issue #22.
  • Alternative mixture models for read error correction. Issue #50.

Things you can no longer do

  • Use k > 63 (shouldn't be necessary? Let us know if you need this and why).
  • ska annotate (use bedtools).
  • ska compare, ska humanise, ska info or ska summary (replaced by ska nk --full-info).
  • ska unique (you can parse ska nk --full-info if you want this functionality, but we didn't think it's used much).
  • ska type (use PopPUNK instead of MLST 🙂)
  • Ns are always skipped, and will not be found in any split k-mers.
  • .skf files are not backwards compatible with version 1.

ska.rust's People

Contributors

apollis44 avatar johnlees avatar tmaklin avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

ska.rust's Issues

Degraded performance of k-mer filter from FASTQ files

Reported by @rderelle

I generated 2 simulated set of 60x reads from 2 genomes. When analysed using ska 0.3.0, I obtained a total of 4.348.071 kmers. But with the 'old' ska 0.2.1, I obtained 4.361.745 kmers. It seems to me that, between versions 0.2.1 and 0.3.0, perhaps too many kmers are filtered out. I then tried different version of ska:

v0.3.0 4.348.071 kmers
v0.2.4 4.348.071 kmers
v0.2.3 4.348.071 kmers
v0.2.2 4.361.745 kmers
v0.2.1 4.361.745 kmers
v0.2.0 4.361.745 kmers

For all these versions, my command lines were:
ska build --threads 2 -k 41 --min-count 5 -f list_files.txt -o out
ska nk --full-info out.skf > out.txt

the potential issue seems to be related to the bloom filter implemented in "count_min_filter.rs".

I could increase the number of kmers observed by ska by increasing the size of the bloom filter** in v0.2.3.

** I'm not familiar with bloom filters, so I increased everything:

const BLOOM_WIDTH: usize = 1 << 28; (previously 1 << 27)
const BITS_PER_ENTRY: usize = 14; (previously 12)
const CM_WIDTH: usize = 1 << 28; (previously 1 << 24)

Suggestion: consistent output suffixes in ska weed

Not an issue per se.

I noticed that the commands 'build' and 'merge' seem to systematically add a '.skf' extension to the output name file while the command 'weed' does not.

For instance, the command 'ska build -o name_file.skf' generates a file 'name_file.skf.skf' while the command 'ska weed -o name_file.skf' generates a file 'name_file.skf'.

Do you have an estimated RAM usage per sample?

I'm really excited to see the Rust version. Thank you!!

With the original SKA, the memory was a limiting factor on how many samples we could process in 1 go (see here simonrharris/SKA#26). Do you have any thoughts on the relationship between number of samples and memory usage for the new implementation?

Thanks!

Possible changes to fastq input

Two possible additions

An equivalent to the -m option in ska1:

Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option.

We can filter on ambiguous, but not frequency of ambig.

Early stop/fast mode

This could probably made to go much faster by:

  1. Stopping reading the input after n reads (set by e.g. 15x coverage)
  2. For 'online' or ref-based analyses, only including k-mers in the ref and turning off filters (may need minimiser based bins to improve caching).

For now, the first of these could be a useful addition

Reference bias in map with repeats

Proposed fix here: master...nickjcroucher:ska.rust:master

Reference copy can overwrite ambiguous bases in some cases:

There is a reference sequence
Query A contains the sequences: TTTTATCTTGATTTTCT
Query B contains the sequences:

TTTTATCTTGATTTTCT
TTTATCTTAATTTTCTT

Hence Query A contains the split k-mer: AAGAAAAT AAGATAAA C
And Query B contains the split k-mer: AAGAAAAT AAGATAAA Y
However, when mapped, a SNP is called distinguishing the sequences:

>Query B
TTTTAGTTTTATCTTAATTTTCTTA
>Query A
-------TTTATCTTGATTTTCTT

The expected Y is converted to A because A is in the reference and matches the split k-mer arm.

`ska map` fails with some fragmented reference sequences

  • ska version: v0.3.2
  • installed with cargo

I ran into an issue with ska map where it fails with some fragmented reference sequences (but not all, I also had some cases with fragmented references that ran just fine). This seems to have something to do with the reference assembly being fragmented because manually concatenating the contigs with something like

cat reference.fa | grep -v "^>" | tr -d '\n' | awk '{print ">concatenated_ref\n" $0}' > concatenated.fa

produces a reference assembly that is okay. I've included instructions and data for how to reproduce this on my x86_64 Linux laptop below.

Reproducing

Data that fails: https://urly.fi/3fm0

Commands that fail:

ska build -f SKA_input_list.tsv -o assemblies -k 31 --threads 1
ska map -o assemblies.aln --ambig-mask --threads 1 assemblies/LA069.fa.gz assemblies.skf

Commands that work (reference concatenated):

ska build -f SKA_input_list.tsv -o assemblies -k 31 --threads 1
gzip -dc assemblies/LA069.fa.gz | grep -v "^>" | tr -d '\n' | awk '{print ">LA069\n" $0}' > ref.fa
ska map -o assemblies.aln --ambig-mask --threads 1 ref.fa assemblies.skf
Errors from failed commands with `RUST_BACKTRACE=1` (click to expand):
SKA: Split K-mer Analysis (the alignment-free aligner)
2023-10-11T18:44:07.178Z WARN  [ska::ska_ref] Reference contained multiple contigs, in the output they will be concatenated
thread '<unnamed>' panicked at 'range end index 1988 out of range for slice of length 1283', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: rayon_core::join::join_context::{{closure}}
  13: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  14: rayon_core::registry::WorkerThread::wait_until_cold
  15: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 12660 out of range for slice of length 4873', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::join::join_recover_from_panic
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: rayon_core::join::join_context::{{closure}}
  19: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  20: rayon_core::registry::WorkerThread::wait_until_cold
  21: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 17466 out of range for slice of length 4873', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  10: rayon_core::registry::WorkerThread::wait_until_cold
  11: rayon_core::join::join_recover_from_panic
  12: rayon_core::join::join_context::{{closure}}
  13: rayon_core::registry::in_worker
  14: rayon::iter::plumbing::bridge_producer_consumer::helper
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  19: rayon_core::registry::WorkerThread::wait_until_cold
  20: rayon_core::join::join_recover_from_panic
  21: rayon_core::join::join_context::{{closure}}
  22: rayon_core::registry::in_worker
  23: rayon::iter::plumbing::bridge_producer_consumer::helper
  24: rayon_core::join::join_context::{{closure}}
  25: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  26: rayon_core::registry::WorkerThread::wait_until_cold
  27: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2974 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  10: rayon_core::registry::WorkerThread::wait_until_cold
  11: rayon_core::join::join_recover_from_panic
  12: rayon_core::join::join_context::{{closure}}
  13: rayon_core::registry::in_worker
  14: rayon::iter::plumbing::bridge_producer_consumer::helper
  15: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  16: rayon_core::registry::WorkerThread::wait_until_cold
  17: rayon_core::join::join_recover_from_panic
  18: rayon_core::join::join_context::{{closure}}
  19: rayon_core::registry::in_worker
  20: rayon::iter::plumbing::bridge_producer_consumer::helper
  21: rayon_core::join::join_context::{{closure}}
  22: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  23: rayon_core::registry::WorkerThread::wait_until_cold
  24: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 12706 out of range for slice of length 4873', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::job::StackJob<L,F,R>::run_inline
   7: rayon_core::join::join_context::{{closure}}
   8: rayon_core::registry::in_worker
   9: rayon::iter::plumbing::bridge_producer_consumer::helper
  10: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  11: rayon_core::registry::WorkerThread::wait_until_cold
  12: rayon_core::join::join_recover_from_panic
  13: rayon_core::join::join_context::{{closure}}
  14: rayon_core::registry::in_worker
  15: rayon::iter::plumbing::bridge_producer_consumer::helper
  16: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  17: rayon_core::registry::WorkerThread::wait_until_cold
  18: rayon_core::join::join_recover_from_panic
  19: rayon_core::join::join_context::{{closure}}
  20: rayon_core::registry::in_worker
  21: rayon::iter::plumbing::bridge_producer_consumer::helper
  22: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  23: rayon_core::registry::WorkerThread::wait_until_cold
  24: rayon_core::join::join_recover_from_panic
  25: rayon_core::join::join_context::{{closure}}
  26: rayon_core::registry::in_worker
  27: rayon::iter::plumbing::bridge_producer_consumer::helper
  28: rayon_core::join::join_context::{{closure}}
  29: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  30: rayon_core::registry::WorkerThread::wait_until_cold
  31: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 17647 out of range for slice of length 4873', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::join::join_recover_from_panic
  15: rayon_core::join::join_context::{{closure}}
  16: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  17: rayon_core::registry::WorkerThread::wait_until_cold
  18: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::job::StackJob<L,F,R>::run_inline
  10: rayon_core::join::join_context::{{closure}}
  11: rayon_core::registry::in_worker
  12: rayon::iter::plumbing::bridge_producer_consumer::helper
  13: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  14: rayon_core::registry::WorkerThread::wait_until_cold
  15: rayon_core::join::join_recover_from_panic
  16: rayon_core::join::join_context::{{closure}}
  17: rayon_core::registry::in_worker
  18: rayon::iter::plumbing::bridge_producer_consumer::helper
  19: rayon_core::join::join_context::{{closure}}
  20: rayon_core::registry::in_worker
  21: rayon::iter::plumbing::bridge_producer_consumer::helper
  22: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  23: rayon_core::registry::WorkerThread::wait_until_cold
  24: rayon_core::join::join_recover_from_panic
  25: rayon_core::join::join_context::{{closure}}
  26: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  27: rayon_core::registry::WorkerThread::wait_until_cold
  28: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   7: rayon_core::registry::WorkerThread::wait_until_cold
   8: rayon_core::join::join_recover_from_panic
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: rayon_core::job::StackJob<L,F,R>::run_inline
  13: rayon_core::join::join_context::{{closure}}
  14: rayon_core::registry::in_worker
  15: rayon::iter::plumbing::bridge_producer_consumer::helper
  16: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  17: rayon_core::registry::WorkerThread::wait_until_cold
  18: rayon_core::join::join_recover_from_panic
  19: rayon_core::join::join_context::{{closure}}
  20: rayon_core::registry::in_worker
  21: rayon::iter::plumbing::bridge_producer_consumer::helper
  22: rayon_core::join::join_context::{{closure}}
  23: rayon_core::registry::in_worker
  24: rayon::iter::plumbing::bridge_producer_consumer::helper
  25: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  26: rayon_core::registry::WorkerThread::wait_until_cold
  27: rayon_core::join::join_recover_from_panic
  28: rayon_core::join::join_context::{{closure}}
  29: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  30: rayon_core::registry::WorkerThread::wait_until_cold
  31: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::join::join_recover_from_panic
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  19: rayon_core::registry::WorkerThread::wait_until_cold
  20: rayon_core::join::join_recover_from_panic
  21: rayon_core::join::join_context::{{closure}}
  22: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  23: rayon_core::registry::WorkerThread::wait_until_cold
  24: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  10: rayon_core::registry::WorkerThread::wait_until_cold
  11: rayon_core::join::join_recover_from_panic
  12: rayon_core::join::join_context::{{closure}}
  13: rayon_core::registry::in_worker
  14: rayon::iter::plumbing::bridge_producer_consumer::helper
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  19: rayon_core::registry::WorkerThread::wait_until_cold
  20: rayon_core::join::join_recover_from_panic
  21: rayon_core::join::join_context::{{closure}}
  22: rayon_core::registry::in_worker
  23: rayon::iter::plumbing::bridge_producer_consumer::helper
  24: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  25: rayon_core::registry::WorkerThread::wait_until_cold
  26: rayon_core::join::join_recover_from_panic
  27: rayon_core::join::join_context::{{closure}}
  28: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  29: rayon_core::registry::WorkerThread::wait_until_cold
  30: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 1988 out of range for slice of length 1283', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   7: rayon_core::registry::WorkerThread::wait_until_cold
   8: rayon_core::join::join_recover_from_panic
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::join::join_recover_from_panic
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: rayon_core::join::join_context::{{closure}}
  19: rayon_core::registry::in_worker
  20: rayon::iter::plumbing::bridge_producer_consumer::helper
  21: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  22: rayon_core::registry::WorkerThread::wait_until_cold
  23: rayon_core::join::join_recover_from_panic
  24: rayon_core::join::join_context::{{closure}}
  25: rayon_core::registry::in_worker
  26: rayon::iter::plumbing::bridge_producer_consumer::helper
  27: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  28: rayon_core::registry::WorkerThread::wait_until_cold
  29: rayon_core::join::join_recover_from_panic
  30: rayon_core::join::join_context::{{closure}}
  31: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  32: rayon_core::registry::WorkerThread::wait_until_cold
  33: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::join::join_recover_from_panic
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  19: rayon_core::registry::WorkerThread::wait_until_cold
  20: rayon_core::join::join_recover_from_panic
  21: rayon_core::join::join_context::{{closure}}
  22: rayon_core::registry::in_worker
  23: rayon::iter::plumbing::bridge_producer_consumer::helper
  24: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  25: rayon_core::registry::WorkerThread::wait_until_cold
  26: rayon_core::join::join_recover_from_panic
  27: rayon_core::join::join_context::{{closure}}
  28: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  29: rayon_core::registry::WorkerThread::wait_until_cold
  30: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   7: rayon_core::registry::WorkerThread::wait_until_cold
   8: rayon_core::join::join_recover_from_panic
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: rayon_core::join::join_context::{{closure}}
  13: rayon_core::registry::in_worker
  14: rayon::iter::plumbing::bridge_producer_consumer::helper
  15: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  16: rayon_core::registry::WorkerThread::wait_until_cold
  17: rayon_core::join::join_recover_from_panic
  18: rayon_core::join::join_context::{{closure}}
  19: rayon_core::registry::in_worker
  20: rayon::iter::plumbing::bridge_producer_consumer::helper
  21: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  22: rayon_core::registry::WorkerThread::wait_until_cold
  23: rayon_core::join::join_recover_from_panic
  24: rayon_core::join::join_context::{{closure}}
  25: rayon_core::registry::in_worker
  26: rayon::iter::plumbing::bridge_producer_consumer::helper
  27: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  28: rayon_core::registry::WorkerThread::wait_until_cold
  29: rayon_core::join::join_recover_from_panic
  30: rayon_core::join::join_context::{{closure}}
  31: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  32: rayon_core::registry::WorkerThread::wait_until_cold
  33: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: rayon_core::join::join_context::{{closure}}
   7: rayon_core::registry::in_worker
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  10: rayon_core::registry::WorkerThread::wait_until_cold
  11: rayon_core::join::join_recover_from_panic
  12: rayon_core::join::join_context::{{closure}}
  13: rayon_core::registry::in_worker
  14: rayon::iter::plumbing::bridge_producer_consumer::helper
  15: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  16: rayon_core::registry::WorkerThread::wait_until_cold
  17: rayon_core::join::join_recover_from_panic
  18: rayon_core::join::join_context::{{closure}}
  19: rayon_core::registry::in_worker
  20: rayon::iter::plumbing::bridge_producer_consumer::helper
  21: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  22: rayon_core::registry::WorkerThread::wait_until_cold
  23: rayon_core::join::join_recover_from_panic
  24: rayon_core::join::join_context::{{closure}}
  25: rayon_core::registry::in_worker
  26: rayon::iter::plumbing::bridge_producer_consumer::helper
  27: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  28: rayon_core::registry::WorkerThread::wait_until_cold
  29: rayon_core::join::join_recover_from_panic
  30: rayon_core::join::join_context::{{closure}}
  31: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  32: rayon_core::registry::WorkerThread::wait_until_cold
  33: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread '<unnamed>' panicked at 'range end index 2955 out of range for slice of length 595', src/ska_ref/aln_writer.rs:148:35
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_end_index_len_fail
   3: ska::ska_ref::aln_writer::AlnWriter::write_split_kmer
   4: rayon::iter::plumbing::Producer::fold_with
   5: rayon::iter::plumbing::bridge_producer_consumer::helper
   6: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
   7: rayon_core::registry::WorkerThread::wait_until_cold
   8: rayon_core::join::join_recover_from_panic
   9: rayon_core::join::join_context::{{closure}}
  10: rayon_core::registry::in_worker
  11: rayon::iter::plumbing::bridge_producer_consumer::helper
  12: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  13: rayon_core::registry::WorkerThread::wait_until_cold
  14: rayon_core::join::join_recover_from_panic
  15: rayon_core::join::join_context::{{closure}}
  16: rayon_core::registry::in_worker
  17: rayon::iter::plumbing::bridge_producer_consumer::helper
  18: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  19: rayon_core::registry::WorkerThread::wait_until_cold
  20: rayon_core::join::join_recover_from_panic
  21: rayon_core::join::join_context::{{closure}}
  22: rayon_core::registry::in_worker
  23: rayon::iter::plumbing::bridge_producer_consumer::helper
  24: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  25: rayon_core::registry::WorkerThread::wait_until_cold
  26: rayon_core::join::join_recover_from_panic
  27: rayon_core::join::join_context::{{closure}}
  28: rayon_core::registry::in_worker
  29: rayon::iter::plumbing::bridge_producer_consumer::helper
  30: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  31: rayon_core::registry::WorkerThread::wait_until_cold
  32: rayon_core::join::join_recover_from_panic
  33: rayon_core::join::join_context::{{closure}}
  34: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  35: rayon_core::registry::WorkerThread::wait_until_cold
  36: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Add option for repeat mapping

In ska map, having an option to automatically mask any repeats in the reference would be useful. I think converting bases to 'N', if mapped (leaving as gapped otherwise)

different results from ska2 0.3.2 and 0.3.6

I was trying to run the latest version 0.3.6 (from conda install -c bioconda ska2 today) with our data and the results are different from 0.3.2. The command for all the following analysis are the same:

ska build -o seqs_ska2_strict --min-count 4 --min-qual 20 --threads 4 -k 31 --qual-filter strict -f ska2_input.tsv
ska distance --filter-ambiguous seqs_ska2_strict.skf > distances_ska2_strict.txt

We have 40 samples:
0.3.2:
image

0.3.6:
image

To help debug, I took 5 samples from the cluster in bottom right corner and subsampled to 1/5 of the read counts so the file is smaller. You can find them here https://github.com/danrlu/debug_data/tree/main/ska:

With 5 samples (see ska2_input_more.tsv)
0.3.2:

Sample1	Sample2	Distance	Mismatches
pt59	pt60	6.00	0.19603
pt59	pt61	5.00	0.21242
pt59	pt74	7.00	0.21048
pt59	pt75	7.00	0.19414
pt60	pt61	7.00	0.21246
pt60	pt74	7.00	0.21003
pt60	pt75	7.00	0.19471
pt61	pt74	2.00	0.22424
pt61	pt75	6.00	0.21081
pt74	pt75	7.00	0.21033

and 0.3.6

Sample1	Sample2	Distance	Mismatches
pt59	pt60	20.00	0.19603
pt59	pt61	21.17	0.21242
pt59	pt74	23.17	0.21048
pt59	pt75	24.00	0.19414
pt60	pt61	21.50	0.21246
pt60	pt74	20.00	0.21003
pt60	pt75	19.50	0.19471
pt61	pt74	16.17	0.22424
pt61	pt75	20.67	0.21081
pt74	pt75	19.67	0.21033

With 2 samples (see ska2_input.tsv), both 0.3.2 and 0.3.6 gave the same results:

Sample1	Sample2	Distance	Mismatches
pt60	pt61	12.00	0.21246

I checked the documentation and didn't see changes of setting for the options in the command. Let me know what else I should try~~ Thanks!!

'Palindrome middle base not W/S: 78' in ska build

Hi,

I've downloaded Gubbins into a conda environment and I've tried to run it with some different datasets, but all of them result in the same issue when trying to run generate_ska_alignment.py:

SKA: Split K-mer Analysis (the alignment-free aligner)
████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2/12thread 'main' panicked at 'Palindrome middle base not W/S: 78', src/ska_dict.rs:87:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/jvfe/miniconda3/envs/recombination/bin/generate_ska_alignment.py", line 97, in <module>
    subprocess.check_output('ska build -o ' + args.out + ' -k ' + str(args.k) + \
  File "/home/jvfe/miniconda3/envs/recombination/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/jvfe/miniconda3/envs/recombination/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'ska build -o out.aln -k 17 -f gubbins_samplesheet.txt --threads 1' returned non-zero exit status 101.

The error is the same when using only ska build (ska build -o out.aln -k 17 -f gubbins_samplesheet.txt --threads 1:

SKA: Split K-mer Analysis (the alignment-free aligner)
██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 2/12thread 'main' panicked at 'Palindrome middle base not W/S: 78', src/ska_dict.rs:87:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is this something related to the input dataset or something else?
The input data consists of open assemblies (.fna) assembled through Unicycler.

Gubbins: v3.3.0
ska: v0.3.0

Add ska distance

See older implementation here:
https://github.com/simonrharris/SKA/wiki/ska-distance

Should be easy enough to do with XX^T, though another option would be to convert to sparse ACGT vecs.

  • Matches: ACGT equal
  • SNPs: ACGT but not equal
  • Mismatches: '-' and any non-'-'

What about ambiguous bases? Could just ignore, or could add 1/2 or 1/3 for 'partial' match (i.e. multiply probability vectors)

Add ambiguous genotypes in VCF output

Currently any non-ACGT IUPAC codes are written as an 'N'. It would be possible to keep the information by writing out multiple output genotypes for a sample, with equal probabilities.

Not yet sure whether this will be a much used feature, and would take a little effort to implement.

nanopore data

Hi,
quick question, would this also work on nanopore data (10.4 flow cells)? what would be the optimal settings to run ska2 on?

thanks!
Pieter

ska merge and delete not working

Using ska v0.3.2, I'm getting an error message when trying to merge 2 skf files together or to delete a sample from a skf file. Here is the error message with 'ska delete file.skf sample' (I checked with 'ska nk' that the sample I was trying to remove was present in the skf file):

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ShapeError/IncompatibleShape: incompatible shapes', src/merge_ska_array.rs:112:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I could post more information if needed.

The distance result is very different from the old version

Thanks so much for the Rust implementation and various improvements! Very exciting!

I tried out ska2 v0.3.0 distance and the result is a bit unexpected so trying to understand the difference.
Fresh install:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cargo install ska

The version shows ska 0.3.0

Prepare ska_input.tsv

A	A_R1.fq.gz	A_R2.fq.gz
B	B_R1.fq.gz	B_R2.fq.gz

Then ran ( --threads 4 b/c there were 45 samples)

ska build -o seqs --threads 4 -f ska_input.tsv
ska distance seqs.skf > distances.txt

In the distances.txt file

Sample1	Sample2	Distance	Mismatches
A	B	12465.53	0.02555

If I run the same sample with the old ska installed with conda, whose version is Version: 1.0

ska fastq -o A A_R1.fq.gz A_R2.fq.gz
ska fastq -o B B_R1.fq.gz B_R2.fq.gz
ska distance -s 25 -i 0.95 A.skf B.skf > distance.tsv

In the distances.distances.tsv

Sample 1	Sample 2	Matches	Mismatches	Jaccard Index	Mash-like distance	SNPs	SNP distance
A	B	2700348	30201	0.98894	0.000179886	7	2.59226e-06

So these 2 samples have 7 SNPs difference which is plausible given them came from a suspected outbreak, and none of the numbers match SKA2.
Among the 45 samples, SKA picked up a nice cluster but SKA2 gave at least a few thousands in the distance column and the numbers are not integers.
Reporting in case helpful. Thanks again!

Documentation improvements

As suggested in #38

  • Make the fastq input clearer in the docs, see if I can add type to the verbose output
  • Make k-mer length definition change versus the old ska clearer in the docs

Maybe also add a tips and tricks section (including running ska map serially)

Defaults and filtering tweaks

Two tasks:

  • Change min count default to 5 (better with strict mode as default)
  • Add mode to consider ambiguous bases as missing in freq filtering

Consider palindromic self-rc k-mers

As reported by @nickjcroucher. We can't determine the direction so the middle base may be reverse complemented. Might actually be best to put as ambiguous if own RC present.

e.g. GTGGAAG[X]CTTCCAC
seaview

So here we would store a W as the middle base

Alternatives for .skf format

The current .skf format has the following issues:

  • It can become very large, especially for diverse datasets. As it is all deserialised (see #22 for attempts at altering this) this means any operation, including looking at metadata is potentially slow.
  • The entire object needs to fit in main memory.

This would be a large and breaking change. I think if we do this, it would make sense to have a second file format (.ska/.sklarge/.h5?) which can be selected by the user for these cases, and continue to allow the previous .skf format to work as the default.

File format alternatives

For the file format, HDF5 comes to mind (which uses btrees too), but the rust bindings require the library to be installed, which is the main disadvantage (but there is good support for building it). Range queries/slices of datasets can be returned, and it's easy to add attributes to the same file. So it definitely fits the bill here.

An alternative would be Apache Parquet which has a native rust implementation, and snap compression. This would be suitable for the kmers and variants array, but it would make more sense to store the other fields (version, k size etc) with serde as before. To keep these together as a single file, could we just use tar? Starting to feel too complex imo.

Streaming from disk

For the second issue, blocks of btrees by range, and careful merging, could allow streaming of the relevant parts during build & align phases. For example see https://github.com/wspeirs/btree

Both file formats above would work here. Arrow can read and write blocks of rows. HDF5 can take a slice/range query.

Other notes

I feel like serde should be capable of at least some of this, see e.g. https://serde.rs/stream-array.html and https://serde.rs/ignored-any.html. But intial attempts with the current format weren't working well and I'm not sure why, so if it needs to be changed anyway introducing a format more designed for streaming operations might be sensible.

add back ska distance and ska annotate?

ska distance (from version 1) was helpful in clustering and created a cluster. tsv and a .dot file for uploading to Microreact, which was quite handy for visualisation. Would you consider adding that function? (correct me if I need to be corrected, becos pp-sketch can only calculate the distance with fasta/fastq, but not with skf?!)

Also, would like to ask if we use bedtools to replace ska annotate. If I'm using a reference-free approach, how am I supposed to transfer the file formats from skf to bedtools-compatible formats?

Appreciate your reply and help.

Support k up to 63

Use the u128 type, but will be better to implement with generics so we can keep u64 for k <= 31.
(countmin stays as u64)

How do I cite this package?

We're writing up the paper where we used this package for analysis! How do I best cite it? Also should I call it ska2 or SKA2 or ska.rust?

Thanks!

Add kmer count per sample to ska nk output

There is a really useful QC step with SKA which is to look at the total kmers for each sample. The total kmer numbers should be close to the genome size of the organism to indicate (1) good genome coverage (2) no contamination which will show more kmers, so it's always the first thing we check.

It would be helpful to have that metric for each sample in the current ska nk summary output.

Thanks for the new and more powerful implementation!

Feature requests: different mixture model; combined ska build & cov

Hi John,

Thanks a lot for implementing the calculation of the optimal kmer count threshold (ska cov).
I tested it using ska 0.3.1 and a paired-end sample (fastq files available here: https://drive.google.com/drive/folders/1HVO-6mOd7bh7CPOjXhA3lWAT0GA_8SC8?usp=sharing). My command line was:

./ska_0.3.1 cov reads_kraken/ERR8158003_1.fastq.gz reads_kraken/ERR8158003_2.fastq.gz -k 31 > cov.txt

Given the distribution of kmer counts, the cutoff should be around 4-6, but ska outputs a value of 17. I think this high treeshold is related to the "Error" message below obtained for low kmer sizes during the model fittng.

This is the beginning of the file 'cov.txt':

Count K_mers Mixture_density Component
1 25838523 3.2408096206707826e-1 Error
2 1993278 1.620404810335391e-1 Error
3 129884 5.4013493677846365e-2 Error
4 97141 1.3503373419461602e-2 Error
5 28053 2.7006746838923196e-3 Error
6 35628 4.5011244731538646e-4 Error
7 24702 6.430177818791246e-5 Error
8 35486 8.03772227348904e-6 Error
9 30571 8.93080252609934e-7 Error
10 39312 8.930802526126602e-8 Error
11 40110 8.118911389065957e-9 Error
12 44332 6.765759585478362e-10 Error
13 43658 5.20443537193471e-11 Error
14 49357 3.7176916177683685e-12 Error
15 47325 2.4891833425171915e-13 Error
16 51653 2.009020801325787e-14 Error
17 52514 1.921693705685228e-14 Coverage
18 53472 6.883935794424811e-14 Coverage
19 52873 2.4488922538072723e-13 Coverage
20 53588 8.282019181080119e-13 Coverage
21 54322 2.667583813189892e-12 Coverage
22 54876 8.201562459958256e-12 Coverage
23 55197 2.411959247393554e-11 Coverage
24 54703 6.797667675764525e-11 Coverage
25 54701 1.8391668285808207e-10 Coverage
26 54189 4.784636868075234e-10 Coverage
27 52711 1.1986335327845915e-9 Coverage
28 52430 2.895540187815414e-9 Coverage
29 51131 6.753560645868026e-9 Coverage
30 51195 1.522694413626287e-8 Coverage
31 49749 3.322402658582442e-8 Coverage
32 48342 7.02268990941827e-8 Coverage
33 48280 1.4394306882794722e-7 Coverage
34 47080 2.863604561107574e-7 Coverage
35 46010 5.534089852758182e-7 Coverage
36 45616 1.039788261971188e-6 Coverage
37 45274 1.9008348747272694e-6 Coverage
38 44701 3.383467426815247e-6 Coverage
39 44463 5.868114750185855e-6 Coverage
40 43372 9.922927345843403e-6 Coverage
41 43026 1.637031965870808e-5 Coverage

Also, would it be possible to perform the calculation of the estimated cutoff 'on the fly' while extracting the kmers from fastq files? As it is, we have to run ska twice (first to estimate the cutoff and then to build the kmer dictionary), which kind of defeat the purpose (twice the runtimes).

Thanks and best,
Romain

request: Iterator trait for MergeSkaArray

Not urgent.

It would be great to add the Iterator trait to MergeSkaArray. It would allow users to iterate over split-kmers + middle bases after loading the skf file.

strange behaviour for min-count (ska build)

Using ska2 v0.3.1, I analysed a large dataset of 330 Mtb samples at a kmer size of 51, and tested 2 different values of min-count (all samples have been filtered to have at least a coverage of 30x):
_ min-count = 10 (default) -> alignment of 247 SNPs
_ min-count = 6 -> alignment of 201 SNPs

Shouldn't we expect more SNPs using min-count = 6 since we are incorporating more kmers to the analysis?
I obtained similar results for the detection of indels with these 2 min-count values using skalo.

Thanks!

Partial loading of skf

If the skf is very large, loading and decompressing the whole thing is slow. It might be possible just to define and get a header struct, which would be useful for ska nk.

Suggestion: ska map to only report positions of variable split-kmers

Not issue, a suggestion:

ska map generates a VCF or alignment file corresponding to all positions of the reference genome.

Would it be interesting to have an option to restrict the mapping only to variable split-kmers? This could generate smaller output files and would avoid to manually filter the output if only variable sites (SNPs) are of interest.

Feature request: filter out missing bases in ska align

The command 'ska align --filter no-ambig-or-const' seems to output all positions with at least one missing data ('-'), resulting in large alignments of constant positions.
nb: same behaviour observed with the command 'ska align --filter no-const'

Using the following command lines, and a dataset of 67 TB samples, I obtained an alignment of 1Mb (100-200 nucleotides expected; mostly because the dataset contains 3 samples with very low coverage and a lot of missing data (see pictures below)):

./ska_0.3.1 build --threads 1 --min-count 5 -f list_ska_GC.txt -o GC -k 31
./ska_0.3.1 nk --full-info GC.skf > GC.txt
./ska_0.3.1 align --filter no-ambig-or-const GC.skf > GC.fas

Screenshot 2023-07-20 at 10 23 58 Screenshot 2023-07-20 at 10 24 09

Thanks
Romain

Non-nucleotide bases produced within ska map alignment

Version: 0.3.0
Command run:

cat ${FILE_LIST} | parallel -j 8 "NGSID=\$(basename {} | sed 's/_a.*fasta/_temp_skf/g');
 ska build -o \$NGSID -k 31 {} ;
MAPOUT=\$(basename {} | sed 's/_a.*fasta/_aln/g');
 ska map $REFERENCE_LOC \${NGSID}.skf -o \${MAPOUT}.aln"

Example output:

head -n 2 test_aln.aln | tail -n 1 | sed 's/./&\n/g' | sort | uniq -ic
      1 
 128836 -
 574319 A
     39 B
 380006 C
     63 D
 382082 G
     56 H
    154 K
    176 M
    800 R
     35 S
 569372 T
     37 V
    126 W
    766 Y

Context

Hi guys

Just been running into some issues with the mapped alignments for my large collection of pneumo isolates. The code above has run fine, but when I've put the resultant large alignment, created from cat *.aln > tot_alignment.aln , through Gubbins it highlighted odd non-nucleotide characters in the sequence.

I've double checked through my assemblies used to create these alignments, and these characters listed above aren't present in there. An example in the alignment in the seaview screenshot here:

ambiguous_bases_example

There is no N count in the example output above, maybe these are misassigned Ns?
Let me know what you think is going on and if you need some more detail.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.