Fasten toolkit, for streaming operations on fastq files

License: MIT License

Rust 77.40% Shell 20.45% Dockerfile 0.57% TeX 0.96% Makefile 0.62%
bioinformatics fastq-files rust

fasten's Introduction

Fasten

A powerful manipulation suite for interleaved fastq files. Executables can read/write to stdin and stdout, and they are compatible with the interleaved fastq format. This makes it much easier to perform streaming operations using unix pipes.

Synopsis

read metrics

$ cat testdata/R1.fastq testdata/R2.fastq | \
    fasten_shuffle | fasten_metrics | column -t
totalLength  numReads  avgReadLength  avgQual
800          8         100            19.53875
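The four columns above can be reproduced in a few lines of Rust. This is only a sketch of the idea, not Fasten's actual implementation: the `Metrics` struct and `read_metrics` function are hypothetical names, and the code assumes strict four-line records with Phred+33 quality encoding.

```rust
use std::io::{self, BufRead};

/// Summary columns modeled on the fasten_metrics output (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct Metrics {
    pub total_length: usize,
    pub num_reads: usize,
    pub avg_read_length: f64,
    pub avg_qual: f64,
}

/// Accumulates metrics from strict four-line FASTQ records.
/// Assumption: quality characters are Phred+33 encoded.
pub fn read_metrics<R: BufRead>(reader: R) -> io::Result<Metrics> {
    let mut total_length = 0usize;
    let mut num_reads = 0usize;
    let mut qual_sum = 0u64;
    let mut qual_count = 0u64;

    for (i, line) in reader.lines().enumerate() {
        let line = line?;
        let line = line.trim_end();
        match i % 4 {
            // Second line of each record: the sequence.
            1 => {
                total_length += line.len();
                num_reads += 1;
            }
            // Fourth line of each record: the quality string.
            3 => {
                for b in line.bytes() {
                    qual_sum += u64::from(b.saturating_sub(33));
                    qual_count += 1;
                }
            }
            _ => {}
        }
    }

    let avg_read_length = if num_reads == 0 {
        0.0
    } else {
        total_length as f64 / num_reads as f64
    };
    let avg_qual = if qual_count == 0 {
        0.0
    } else {
        qual_sum as f64 / qual_count as f64
    };
    Ok(Metrics { total_length, num_reads, avg_read_length, avg_qual })
}
```

In a real executable, `read_metrics(std::io::stdin().lock())` would feed the function from a pipe.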

read cleaning

$ cat testdata/R1.fastq testdata/R2.fastq | \
    fasten_shuffle | \
    fasten_clean --paired-end --min-length 2 | \
    gzip -c > cleaned.shuffled.fastq.gz

$ zcat cleaned.shuffled.fastq.gz | fasten_metrics | column -t
totalLength  numReads  avgReadLength  avgQual
800          8         100            19.53875
# No reads were actually filtered with cleaning, with --min-length=2

Installation

Installation from source

Fasten is programmed in the Rust programming language. More information about Rust, including installation and the executable cargo, can be found at rust-lang.org.

After downloading, use the Rust executable cargo like so:

cd fasten
cargo build --release
export PATH=$PATH:$(pwd)/target/release

All executables will be in the directory fasten/target/release.

Note: the Makefile provides some helper targets, including

  • make all runs all of the following
    • make release installs fast (optimized) executables
    • make debug installs executables quickly (although the executables will not be optimized)
    • make fasten/doc compiles the latest documentation
  • make clean uninstalls local binaries

Installation without git

You can also install Fasten straight from https://crates.io using the following command:

cargo install fasten

Detailed information on how this works can be found in the cargo handbook at https://doc.rust-lang.org/cargo/commands/cargo-install.html.

General usage

All scripts accept the following parameters, read uncompressed fastq format from stdin, and print uncompressed fastq format to stdout. All paired-end fastq files must be in interleaved format, and they are written in interleaved format, except when deshuffling with fasten_shuffle.

  • --help
  • --numcpus Not all scripts will take advantage of numcpus. (not currently implemented)
  • --paired-end Input reads are interleaved paired end
  • --verbose Print more status messages
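To make the interleaved layout concrete, here is a minimal sketch of shuffling and deshuffling. It is an illustration, not Fasten's code: `Record`, `shuffle`, and `deshuffle` are hypothetical names, and the input is assumed to be all R1 reads followed by all R2 reads, as in the `cat R1.fastq R2.fastq` synopsis.

```rust
/// One FASTQ record (hypothetical struct for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: String,
    pub seq: String,
    pub qual: String,
}

/// Interleaves reads so that each R1 read is followed by its R2 mate.
/// Assumption: the first half of `reads` is R1 and the second half is R2.
/// Returns None when the read count is odd, since the halves cannot pair up.
pub fn shuffle(mut reads: Vec<Record>) -> Option<Vec<Record>> {
    if reads.len() % 2 != 0 {
        return None;
    }
    let half = reads.len() / 2;
    let r2 = reads.split_off(half);
    let mut out = Vec::with_capacity(half * 2);
    for (fwd, rev) in reads.into_iter().zip(r2) {
        out.push(fwd);
        out.push(rev);
    }
    Some(out)
}

/// Splits interleaved reads back into (R1, R2).
pub fn deshuffle(reads: Vec<Record>) -> Option<(Vec<Record>, Vec<Record>)> {
    if reads.len() % 2 != 0 {
        return None;
    }
    let mut r1 = Vec::with_capacity(reads.len() / 2);
    let mut r2 = Vec::with_capacity(reads.len() / 2);
    for (i, rec) in reads.into_iter().enumerate() {
        if i % 2 == 0 {
            r1.push(rec);
        } else {
            r2.push(rec);
        }
    }
    Some((r1, r2))
}
```

Keeping mates adjacent is what lets a downstream tool process a pair without buffering the whole stream.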

Documentation

Please see the inline documentation at https://lskatz.github.io/fasten/fasten

This documentation was built with cargo doc --no-deps

Other documentation

  • Some wrapper scripts are noted in the scripts page.

Contributing

Instructions for how to contribute can be found in CONTRIBUTING.md.

Fasten script descriptions

All executables read and write in the fastq format except fasten_convert.

executable             Description
fasten_clean           Trims and cleans a fastq file.
fasten_convert         Converts between different sequence formats like fastq, sam, fasta.
fasten_straighten      Converts any fastq file to a standard four-line-per-entry format.
fasten_metrics         Prints basic read metrics.
fasten_pe              Determines paired-endedness based on read IDs.
fasten_randomize       Randomizes reads from input.
fasten_combine         Combines identical reads and updates quality scores.
fasten_kmer            Kmer counting.
fasten_normalize       Normalizes read depth by using kmer counting.
fasten_sample          Downsamples reads.
fasten_shuffle         Shuffles or deshuffles paired end reads.
fasten_validate        Validates your reads (deprecated in favor of fasten_inspect and fasten_repair).
fasten_inspect         Adds information to read IDs such as seqlength.
fasten_repair          Repairs corrupted reads.
fasten_quality_filter  Transforms nucleotides to "N" if the quality is low.
fasten_trim            Blunt-end trims reads.
fasten_replace         Finds and replaces using regex.
fasten_mutate          Introduces random mutations.
fasten_regex           Filters for reads using regex.
fasten_progress        Adds progress to any place in the pipeline.
fasten_sort            Sorts fastq entries.

Etymology

Many of these scripts were inspired by the fastx toolkit. I wanted to name this project fasty, but that was already the name of a bioinformatics program. Therefore I cycled through other letters of the alphabet and came across "N." So it is possible to pronounce this project like "Fast-N" or in a way that indicates that you are securing your analysis by "fasten"ing it (with a silent T).

Citation

To cite, please refer to Katz et al., (2024). Fasten: a toolkit for streaming operations on fastq files. Journal of Open Source Software, 9(94), 6030, https://doi.org/10.21105/joss.06030

fasten's People

Contributors

bovee, jhphan, lskatz, telatin

fasten's Issues

Feature Request: handle seqlen != quallen

Can you explicitly state (perhaps in readme or help menu) which tests are used for FastQ validation in fasten_validate? As far as I can tell in the src the seqlen and quallen aren't compared, or am I wrong?

I've only ever seen 2 options to handle this issue when Field 2 and Field 4 lengths aren't the same:

  1. repair: trim both fields to the shorter length
  2. clean/remove: discard the read entirely

Some might find both options useful in something like --seq-and-qual-len-diff [repair,remove], but I think remove is the safer, more ideal option for quality concerns if only one can be implemented. However, remove creates a broken sister read pair, so it might be tougher to implement.
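The two options could look roughly like the sketch below. Everything here is hypothetical (`LenMismatchPolicy`, `handle_len_mismatch`, and `keep_pair` are illustrative names, not part of Fasten), and the slicing assumes ASCII sequence and quality strings.

```rust
/// What to do when sequence and quality lengths differ (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LenMismatchPolicy {
    /// Trim both fields to the shorter length.
    Repair,
    /// Discard the read entirely.
    Remove,
}

/// Returns Some((seq, qual)) for a read to keep, or None for a read to drop.
/// Assumption: both strings are ASCII, so byte slicing is safe.
pub fn handle_len_mismatch(
    seq: &str,
    qual: &str,
    policy: LenMismatchPolicy,
) -> Option<(String, String)> {
    if seq.len() == qual.len() {
        return Some((seq.to_string(), qual.to_string()));
    }
    match policy {
        LenMismatchPolicy::Repair => {
            let n = seq.len().min(qual.len());
            Some((seq[..n].to_string(), qual[..n].to_string()))
        }
        LenMismatchPolicy::Remove => None,
    }
}

/// For paired-end input, keeps a pair only if both mates survived,
/// which avoids leaving a broken sister read behind.
pub fn keep_pair<T>(r1: Option<T>, r2: Option<T>) -> Option<(T, T)> {
    r1.zip(r2)
}
```

Dropping both mates whenever one is removed is one simple way to keep interleaved output consistent, at the cost of losing an otherwise valid read.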

JOSS Review

Hello pips,
I open this issue to keep track of the review.

Benchmark: version of the tools
I started reproducing the benchmark and wanted to know if you could share the exact version of the programs used in your benchmark. I made an environment with the latest version of Fasten, SeqTK, Seqkit, SeqFu, fastx but maybe it's better to synchronise this (and to mention the versions used in the paper)


openjournals/joss-reviews#6030

Consider std::io::Stdin instead of File('/dev/stdin')

I'm not sure if there's a performance difference (in which case please feel free to add a comment and disregard this), but I think opening /dev/stdin instead of using std::io::Stdin might limit the platforms fasten works on to only POSIX ones (i.e. probably not Windows). Not a high priority, but an easy search/change and probably a good beginner bug.
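A minimal sketch of the suggested change, assuming the reading logic is written against a generic `BufRead` so that it does not care where the bytes come from. The function names `pass_through` and `run_on_stdio` are hypothetical; `std::io::stdin()`, `std::io::stdout()`, and their `lock()` methods are standard library calls.

```rust
use std::io::{self, BufRead, Write};

/// Copies every line from `reader` to `writer` and returns the line count.
/// Any BufRead works here: a locked stdin, a file, or an in-memory buffer.
pub fn pass_through<R: BufRead, W: Write>(reader: R, mut writer: W) -> io::Result<usize> {
    let mut num_lines = 0usize;
    for line in reader.lines() {
        writeln!(writer, "{}", line?)?;
        num_lines += 1;
    }
    Ok(num_lines)
}

/// Portable entry point: uses the process's standard streams directly
/// instead of opening the POSIX-only path /dev/stdin.
pub fn run_on_stdio() -> io::Result<usize> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let num_lines = pass_through(stdin.lock(), stdout.lock())?;
    Ok(num_lines)
}
```

Writing against the trait also makes the logic easy to unit test with an in-memory reader.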

Clarify README summary

I think the summary would help potential users understand the package better if "random operations" and "secure your analysis" were reworded/expanded to be a bit more clear.

This covers the "statement of need" for documentation in the JOSS checklist: openjournals/joss-reviews#6030

fasten_mutate should always give -m mutations?

For fasten_mutate, it is not clear whether the number of SNPs is guaranteed or just a maximum.

From the tests it looks like the second:

for i in {1..30}; 
do
  ./target/debug/fasten_mutate < testdata/four_reads.fastq  --snps 5 -m | awk 'NR % 4 == 2' | \
   sed 's/[a-z]//g' | perl -ne 'chomp; if (length($_)!=5){die "Error: unexpected SNPs count: $_\n"}'; 
done

If this is the intended behaviour I would specify so in the help page

 -s, --snps INT      Number of SNPs (point mutations) to include per read.

linked to #20

add progress meter

A simple script to read in reads and print them out again but print something to stderr
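One possible shape for such a script, as a rough sketch only: the function name `copy_with_progress`, the `every` parameter, and the message text are all made up for illustration, and four-line records are assumed.

```rust
use std::io::{self, BufRead, Write};

/// Echoes FASTQ lines from `reader` to `out` unchanged and writes a status
/// line to `status` after every `every` reads. Returns the number of reads.
/// Assumption: strict four-line records.
pub fn copy_with_progress<R: BufRead, W: Write, S: Write>(
    reader: R,
    mut out: W,
    mut status: S,
    every: usize,
) -> io::Result<usize> {
    let every = every.max(1); // guard against a zero interval
    let mut num_reads = 0usize;
    for (i, line) in reader.lines().enumerate() {
        writeln!(out, "{}", line?)?;
        // The fourth line of a record completes one read.
        if i % 4 == 3 {
            num_reads += 1;
            if num_reads % every == 0 {
                writeln!(status, "{} reads processed", num_reads)?;
            }
        }
    }
    Ok(num_reads)
}
```

In a pipeline, `status` would be `std::io::stderr()` so the progress messages never mix with the fastq data on stdout.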

docs on what fasten_validate does?

I'm curious as to what fasten_validate checks on fastq files. I think it would be useful to have a little blurb or bullet points on what is validated when running that tool.

Just a thought, nothing urgent!

[suggestion] Document, eventually enforce, input format

I would specify when a tool supports FASTA and FASTQ or only FASTQ as input, including in the description (the --help page).

It's self-explanatory to have quality-related functions not working on FASTA files. Still, some tools like fasten_sort do not work on FASTA files, but there's no evident mention and no input validation.

I understand that sometimes the philosophy of keeping it simple can be the right thing to do; in this case, just a line in the helpfile would suffice :)
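If input validation were added, a lightweight check on the first header character might be enough. The sketch below is hypothetical (`SeqFormat`, `sniff_format`, and `require_fastq` are illustrative names, not Fasten functions) and relies only on the convention that FASTA headers start with `>` and FASTQ headers start with `@`.

```rust
/// Input formats distinguishable from the first header character.
#[derive(Debug, PartialEq)]
pub enum SeqFormat {
    Fasta,
    Fastq,
    Unknown,
}

/// Guesses the format from the first line of input.
pub fn sniff_format(first_line: &str) -> SeqFormat {
    match first_line.as_bytes().first() {
        Some(b'>') => SeqFormat::Fasta,
        Some(b'@') => SeqFormat::Fastq,
        _ => SeqFormat::Unknown,
    }
}

/// Returns an error message for tools that only support FASTQ input.
pub fn require_fastq(first_line: &str) -> Result<(), String> {
    match sniff_format(first_line) {
        SeqFormat::Fastq => Ok(()),
        SeqFormat::Fasta => Err("input looks like FASTA, but this tool expects FASTQ".to_string()),
        SeqFormat::Unknown => Err("unrecognized input format".to_string()),
    }
}
```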

fasten_trim reporting what it trimmed

Have an option to write what fasten_trim removes to a file or files. Maybe an output directory, since they will undoubtedly just be intermediate files.
