This project forked from shenwei356/seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Home Page: https://bioinf.shenwei.me/seqkit

License: MIT License

SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Features

  • Easy to install (download)
    • Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
    • Lightweight and out-of-the-box: no dependencies, no compilation, no configuration
    • conda install -c bioconda seqkit
  • Easy to use
    • Ultrafast (see technical-details and benchmark)
    • Seamlessly parsing both FASTA and FASTQ formats
    • Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines
    • Reproducible results (configurable rand seed in sample and shuffle)
    • Supporting custom sequence ID via regular expression
    • Supporting Bash/Zsh autocompletion
  • Versatile commands (usages and examples)
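
To make the "custom sequence ID via regular expression" feature concrete, here is a minimal Python sketch of the idea, not SeqKit's actual Go implementation. It assumes the pattern `^(\S+)\s?` (believed to match the documented default of the `--id-regexp` flag) and takes the first capture group as the ID. The function name `parse_id`, the example header, and the fallback to the whole header when nothing matches are choices made for this illustration.

```python
import re

# Assumed default: the ID is the leading run of non-whitespace characters.
DEFAULT_ID_PATTERN = re.compile(r"^(\S+)\s?")


def parse_id(header: str, pattern: re.Pattern = DEFAULT_ID_PATTERN) -> str:
    """Return the first capture group of `pattern` in `header`.

    Falling back to the whole header when nothing matches is an
    assumption of this sketch.
    """
    match = pattern.search(header)
    return match.group(1) if match else header


header = "gi|110645304|ref|NC_002516.2| Pseudomonas aeruginosa PAO1 chromosome"

# Default pattern: everything up to the first whitespace.
print(parse_id(header))

# Custom pattern: the pipe-delimited field that is followed by whitespace.
print(parse_id(header, re.compile(r"\|([^|]+)\|\s")))
```

With the default pattern the ID is `gi|110645304|ref|NC_002516.2|`; the custom pattern narrows it to the accession `NC_002516.2`.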

Installation

Go to Download Page for more download options and changelogs, or install via conda:

conda install -c bioconda seqkit

Subcommands

| category | command | function | input | strand-sensitivity | multi-threads | popularity |
|---|---|---|---|---|---|---|
| basic | seq | transform sequences: extract ID/seq, filter by length/quality, remove gaps, reverse complement… | FASTA/Q | | | ★★★★★ |
| | stats | simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… | FASTA/Q | | | ★★★★★ |
| | sum | compute message digest for all sequences in FASTA/Q files | FASTA/Q | + or both | | |
| | subseq | extract subsequences or flanking sequences by region/gtf/bed | FASTA/Q | + or/and - | | ★★★ |
| | sliding | extract subsequences in sliding windows | FASTA/Q | + only | | ★★ |
| | faidx | create FASTA index file and extract subsequence (with more features than samtools faidx) | FASTA | + or/and - | | |
| | watch | monitoring and online histograms of sequence features | FASTA/Q | | | |
| | sana | sanitize broken single line FASTQ files | FASTQ | | | |
| | scat | real time concatenation and streaming of fastx files | FASTA/Q | | | |
| format conversion | fq2fa | convert FASTQ to FASTA | FASTQ | | | ★★ |
| | fa2fq | retrieve corresponding FASTQ records by a FASTA file | FASTA/Q | | | |
| | fx2tab | convert FASTA/Q to tabular format | FASTA/Q | | | ★★ |
| | tab2fx | convert tabular format to FASTA/Q format | FASTA/Q | | | |
| | convert | convert FASTQ quality encoding between Sanger, Solexa and Illumina | FASTA/Q | | | |
| | translate | translate DNA/RNA to protein sequence | FASTA/Q | + or/and - | | ★★ |
| searching | grep | search sequences by ID/name/sequence/sequence motifs, mismatch allowed | FASTA/Q | + and - | partly, -m | ★★★★★ |
| | locate | locate subsequences/motifs, mismatch allowed | FASTA/Q | + and - | partly, -m | ★★★★★ |
| | amplicon | extract amplicon (or specific region around it), mismatch allowed | FASTA/Q | + and - | partly, -m | |
| | fish | look for short sequences in larger sequences | FASTA/Q | + and - | | |
| set operation | sample | sample sequences by number or proportion | FASTA/Q | | | ★★★★ |
| | rmdup | remove duplicated sequences by ID/name/sequence | FASTA/Q | + and - | | ★★★ |
| | common | find common sequences of multiple files by id/name/sequence | FASTA/Q | + and - | | |
| | duplicate | duplicate sequences N times | FASTA/Q | | | |
| | split | split sequences into files by id/seq region/size/parts (mainly for FASTA) | FASTA preferred | | | |
| | split2 | split sequences into files by size/parts (FASTA, PE/SE FASTQ) | FASTA/Q | | | ★★ |
| | head | print first N FASTA/Q records | FASTA/Q | | | |
| | head-genome | print sequences of the first genome with common prefixes in name | FASTA/Q | | | |
| | range | print FASTA/Q records in a range (start:end) | FASTA/Q | | | |
| | pair | match up paired-end reads from two fastq files | FASTA/Q | | | |
| edit | concat | concatenate sequences with the same ID from multiple files | FASTA/Q | + only | | ★★★ |
| | replace | replace name/sequence by regular expression | FASTA/Q | + only | | ★★ |
| | restart | reset start position for circular genome | FASTA/Q | + only | | |
| | mutate | edit sequence (point mutation, insertion, deletion) | FASTA/Q | + only | | |
| | rename | rename duplicated IDs | FASTA/Q | | | |
| ordering | sort | sort sequences by id/name/sequence/length | FASTA preferred | | | ★★ |
| | shuffle | shuffle sequences | FASTA preferred | | | |
| BAM processing | bam | monitoring and online histograms of BAM record features | BAM | | | |

Notes:

  • Strand-sensitivity:
    • + only: only processing on the positive/forward strand.
    • + and -: searching on both strands.
    • + or/and -: depends on users' flags/options/arguments.
  • Multi-threads: The default of 4 threads is fast enough for most commands; some commands can benefit from extra threads.
  • Popularity: Based on statistics of 227 publications citing seqkit since 2020.
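
The "+ and -" strand-sensitivity above can be pictured with a small Python sketch: the motif is searched as given (forward strand) and as its reverse complement (reverse strand). This is an illustration of the concept only, not SeqKit's algorithm. The function names, the exact-match search, and the choice to report 1-based start positions on the forward strand are assumptions of this example.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_both_strands(seq: str, motif: str) -> list[tuple[str, int]]:
    """Return (strand, 1-based start on the forward strand) for every exact hit.

    Overlapping hits are included. A motif equal to its own reverse
    complement is reported once per strand.
    """
    hits = []
    for strand, query in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(query)
        while start != -1:
            hits.append((strand, start + 1))
            start = seq.find(query, start + 1)
    return hits


print(reverse_complement("GTCA"))
print(find_both_strands("ACGTTTGACCGTCA", "GTCA"))
```

A "+ only" command corresponds to running just the first iteration of the loop, so the hit on the reverse strand would not be found.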

Citation

W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

Contributors

Acknowledgements

We thank Lei Zhang for testing SeqKit. We also thank Jim Hester, author of fasta_utilities, for advice on early performance improvements for FASTA parsing, and Brian Bushnell, author of BBMaps, for advice on naming SeqKit and adding accuracy evaluation in benchmarks. We also thank Nicholas C. Wu from the Scripps Research Institute, USA, for commenting on the manuscript, and Guangchuang Yu from State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, HK, for advice on the manuscript.

We thank Li Peng for reporting many bugs.

We appreciate Klaus Post for his fantastic packages (compress and pgzip), which accelerate gzip file reading and writing.

Contact

Create an issue to report bugs, propose new functions or ask for help.

License

MIT License
