Giter Site home page Giter Site logo

shenwei356 / seqkit Goto Github PK

View Code? Open in Web Editor NEW
1.2K 27.0 156.0 63.66 MB

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Home Page: https://bioinf.shenwei.me/seqkit

License: MIT License

Go 91.23% R 2.20% Perl 0.67% Shell 5.66% Dockerfile 0.23% HLSL 0.01%
bioinformatics cross-platform fasta golang sequence toolkit fastq tool manipulation

seqkit's Introduction

SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Subcommands of SeqKit2

Features

  • Easy to install (download)
    • Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
    • Light weight and out-of-the-box, no dependencies, no compilation, no configuration
    • conda install -c bioconda seqkit
  • Easy to use
    • Ultrafast (see technical-details and benchmark)
    • Seamlessly parsing both FASTA and FASTQ formats
    • Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output file, easily integrated in pipe
    • Reproducible results (configurable rand seed in sample and shuffle)
    • Supporting custom sequence ID via regular expression
    • Supporting Bash/Zsh autocompletion
  • Versatile commands (usages and examples)

Installation

Go to Download Page for more download options and changelogs, or install via conda:

conda install -c bioconda seqkit

Subcommands

Category Command Function Input Strand-sensitivity Multi-threads
Basic operation seq Transform sequences: extract ID/seq, filter by length/quality, remove gaps… FASTA/Q
stats Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… FASTA/Q
subseq Get subsequences by region/gtf/bed, including flanking sequences FASTA/Q + or/and -
sliding Extract subsequences in sliding windows FASTA/Q + only
faidx Create the FASTA index file and extract subsequences (with more features than samtools faidx) FASTA + or/and -
translate translate DNA/RNA to protein sequence FASTA/Q + or/and -
watch Monitoring and online histograms of sequence features FASTA/Q
scat Real time concatenation and streaming of fastx files FASTA/Q
Format conversion fq2fa Convert FASTQ to FASTA format FASTQ
fx2tab Convert FASTA/Q to tabular format FASTA/Q
fa2fq Retrieve corresponding FASTQ records by a FASTA file FASTA/Q + only
tab2fx Convert tabular format to FASTA/Q format TSV
convert Convert FASTQ quality encoding between Sanger, Solexa and Illumina FASTA/Q
Searching grep Search sequences by ID/name/sequence/sequence motifs, mismatch allowed FASTA/Q + and - partly, -m
locate Locate subsequences/motifs, mismatch allowed FASTA/Q + and - partly, -m
amplicon Extract amplicon (or specific region around it), mismatch allowed FASTA/Q + and - partly, -m
fish Look for short sequences in larger sequences FASTA/Q + and -
Set operation sample Sample sequences by number or proportion FASTA/Q
rmdup Remove duplicated sequences by ID/name/sequence FASTA/Q + and -
common Find common sequences of multiple files by id/name/sequence FASTA/Q + and -
duplicate Duplicate sequences N times FASTA/Q
split Split sequences into files by id/seq region/size/parts (mainly for FASTA) FASTA preffered
split2 Split sequences into files by size/parts (FASTA, PE/SE FASTQ) FASTA/Q
head Print first N FASTA/Q records FASTA/Q
head-genome Print sequences of the first genome with common prefixes in name FASTA/Q
range Print FASTA/Q records in a range (start:end) FASTA/Q
pair Patch up paired-end reads from two fastq files FASTA/Q
Edit replace Replace name/sequence by regular expression FASTA/Q + only
rename Rename duplicated IDs FASTA/Q
concat Concatenate sequences with same ID from multiple files FASTA/Q + only
restart Reset start position for circular genome FASTA/Q + only
mutate Edit sequence (point mutation, insertion, deletion) FASTA/Q + only
sana Sanitize broken single line FASTQ files FASTQ
Ordering sort Sort sequences by id/name/sequence/length FASTA preffered
shuffle Shuffle sequences FASTA preffered
BAM processing bam Monitoring and online histograms of BAM record features BAM
Miscellaneous sum Compute message digest for all sequences in FASTA/Q files FASTA/Q
merge-slides Merge sliding windows generated from seqkit sliding TSV

Notes:

  • Strand-sensitivity:
    • + only: only processing on the positive/forward strand.
    • + and -: searching on both strands.
    • + or/and -: depends on users' flags/options/arguments.
  • Multiple-threads: Using the default 4 threads is fast enough for most commands, some commands can benefit from extra threads.

Citation

  1. Wei Shen*, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.
  2. Wei Shen, Shuai Le, Yan Li*, and Fuquan Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

Contributors

Acknowledgements

We thank all users for their valuable feedback and suggestions. We thank all contributors for improving the code and documentation.

We appreciate Klaus Post for his fantastic packages ( compress and pgzip ) which accelerate gzip file reading and writing.

Contact

Create an issue to report bugs, propose new functions or ask for help.

License

MIT License

Starchart

Stargazers over time

seqkit's People

Contributors

amblina avatar botond-sipos avatar bricoletc avatar bsipos avatar chenrui333 avatar corneliusroemer avatar ctava avatar dependabot[bot] avatar elliotwutingfeng avatar likelet avatar nileshpatra avatar ocxtal avatar photocyte avatar samuell avatar shenwei356 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

seqkit's Issues

split error

I am using seqkit version 0.7.1. Attempting to split a FASTA file of protein causes the following error.
$ seqkit split -2 -p 2 test.faa
[ERRO] different line length in sequence

Apparently, it seems that an error occurs if the number of amino residues per line is different. In the base sequence, no error occurs even if the number of bases per line is different. Is there a solution to this problem?

[stats] Could you add N50 and N90 for Fasta format file?

Prerequisite

version : 0.8.0

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

For Fasta format files, usually need to know its N50 and N90 after assembly.

  • describe the problem
  • provide a reproducible example

[Feature] duplicating sequences XX times

Hi there:

I think that a great feature that can be added to seqkit sample is duplicating the samples, that is, if I want to copy XX times the whole multifasta or the sampled sequences, for example:

seqkit sample -s 10 -t 2

Thinking as the t argument as times. With that option, I am sampling 10 sequences two times. Something similar can be used for duplicating the whole multifasta XX times. Currently, I need to generate a multifasta copying the initial multifasta 675 times, only for testing purposes.

By now, I couldn't find an option for doing that, so... I will use the shell.

Thanks in advance.
Best.
~g

Possible bug seqkit stats?

Tool used: seqkit

Dummy fasta file (fasta.fa):

>test1
GCATCGATCAGCTACGATCATCACTA
GNNNNNNTACATCAGCACTACATCACTNNNNN
>test2
GTACGCTACGANNNGCTACGACTACGATATATATATATATATATATATATATATATATATATAT
GCTACGATCACNTACATCGACTA
>test3
GTGTGCTACATCATCACTACGTACTACAT
>test4
AA

Command:

./seqkit stat fasta.fa

Output:

file      format  type  num_seqs  sum_len  min_len  avg_len  max_len
fasta.fa  FASTA   DNA          4      176        0       44       87

Problem:
min_len =0 (however, minimum length should be 2; sequence id "test4")

Validation using seqkit:

Command:

./seqkit fx2tab -l fasta.fa

Output:

test1	GCATCGATCAGCTACGATCATCACTAGNNNNNNTACATCAGCACTACATCACTNNNNN		58
test2	GTACGCTACGANNNGCTACGACTACGATATATATATATATATATATATATATATATATATATATGCTACGATCACNTACATCGACTA		87
test3	GTGTGCTACATCATCACTACGTACTACAT		29
test4	AA		2

Notice: length of sequence test4 is "2"

Is it a bug or I misunderstood something?

Way to "natural sort" Fasta with seqkit sort ?

Hello,

using version: seqkit v0.8.1

I havent figured out a way to Natural Sort fasta file with seqkit. Am I missing something ?

tried -s and -n flags with sort

Example:

seqkit sort -2 -n -j 8 -t dna -o Homo_sapiens.GRCh38.dna.primary_assembly_sorted.fa  Homo_sapiens.GRCh38.dna.primary_assembly.fa 

Gives:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
>13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
>14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
>15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
>16 dna:chromosome chromosome:GRCh38:16:1:90338345:1 REF
>17 dna:chromosome chromosome:GRCh38:17:1:83257441:1 REF
>18 dna:chromosome chromosome:GRCh38:18:1:80373285:1 REF
>19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
>2 dna:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>20 dna:chromosome chromosome:GRCh38:20:1:64444167:1 REF
>21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF
>22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF
>3 dna:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna:chromosome chromosome:GRCh38:8:1:145138636:1 REF
>9 dna:chromosome chromosome:GRCh38:9:1:138394717:1 REF

When I want:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>2 dna:chromosome chromosome:GRCh38:2:1:242193529:1 REF
>3 dna:chromosome chromosome:GRCh38:3:1:198295559:1 REF
>4 dna:chromosome chromosome:GRCh38:4:1:190214555:1 REF
>5 dna:chromosome chromosome:GRCh38:5:1:181538259:1 REF
>6 dna:chromosome chromosome:GRCh38:6:1:170805979:1 REF
>7 dna:chromosome chromosome:GRCh38:7:1:159345973:1 REF
>8 dna:chromosome chromosome:GRCh38:8:1:145138636:1 REF
>9 dna:chromosome chromosome:GRCh38:9:1:138394717:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
>13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
>14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
>15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
>16 dna:chromosome chromosome:GRCh38:16:1:90338345:1 REF
>17 dna:chromosome chromosome:GRCh38:17:1:83257441:1 REF
>18 dna:chromosome chromosome:GRCh38:18:1:80373285:1 REF
>19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
>20 dna:chromosome chromosome:GRCh38:20:1:64444167:1 REF
>21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF
>22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF

What am I missing ?

Thanks,

B.

Improvement: seqkit replace

  • Show warning when flag --replacement not given if flag --kv-file is used.
  • Fix bug of cases that pattern has multiple matches

mask fasta

Hello,

it would be nice if seqkit could implement something similar to bedtools maskfasta, but also support streaming of input fasta and output fasta.

fin swimmer

seqkit common and grep alternative do not return the same results

Good day, Wei.

Thanks for providing and continuously improving seqkit.

My question is regarding the documentation on seqkit common (I'm using the latest version 0.8.0): https://bioinf.shenwei.me/seqkit/usage/#common

Maybe I'm missing something but I get different results when I run the following:

seqkit common -s sample.txt sample.txt | seqkit stats
image

seqkit grep -s -f <(seqkit seq -s sample.txt) sample.txt | seqkit stats
image

The alternative syntax using grep is not returning all the sequences in sample.txt.

I've placed here sample.txt for your reference.

Filtering by sequence length (for FASTQ/FASTA)

Hi there,

Thanks for your work on seqkit. It is a really nice toolkit for quickly working with sequences.

A major manipulation that I've been wishing was in the toolkit is filtering by sequence length. Would it be possible to add this? E.g. keep if greater than length, or less than, or have the ability to invert the matching criteria

Thank you!
-Tim

"[ERRO] fastx: bad fastq format" when sampling fastq by proportion

This is the version of Seqkit I am using right now.
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.9.0

the current stable view installed by conda

I am not sure that this is the issue of seqkit, probably my data files are problematic.

Here is the problem which I had:

I was trying to sample sequence from a fastq file ( the original files have very large sizes),
the original files

cat original1.fastq | seqkit sample -p 0.001 -o new1_0.001.fastq &
cat original2.fastq | seqkit sample -p 0.001 -o new2_0.001.fastq &

it worked in the beginning,

[INFO] sample by proportion
[INFO] sample by proportion

but exited with an error:

[ERRO] fastx: bad fastq format
[ERRO] fastx: bad fastq format
[2]  + Broken pipe                   cat original1.fastq |
       Exit 255                      seqkit sample -p 0.001 -o original1_0.001.fastq

I searched with the term "fastx: bad fastq format",
found something like https://github.com/shenwei356/bio/blob/master/seqio/fastx/reader.go ,
but could no make sense what went wrong.
Any idea about what went wrong or suggestion for solving the problem?

Thanks!

Feature request for `seqkit stats -a`

Dear @shenwei356,
First of all I would like to thank you for the SeqKit toolkit, it's really gorgeous!

It would be great if you could add some additional statistics of FASTA/Q files to the output of seqkit stats -a. In particular, the first (Q1) and the third (Q3) quartiles of the sequence length distribution will be very useful. Given this quartiles one will have a sense of sequence length variability and can easily estimate where the central 50% of lengths are dispersed within the dataset.

Thank you,
Vladimir

Split a big MSA file into smaller ones every 5000 bp

Hi,
Thanks for this great tool.

I have a large multiple sequence alignment file for 200 bacterial strains (3.25 Mb).
Is that possible to split this alignment file each 5000 bp
so at the end the output will be 700 alignment files for the 200 strains. 1st file from 0 to 5000, 2nd from 5001 to 10000 and so on.

When using seqkit sliding -s 5000 -W 5000 file.fasta, the output is gathered into a single file. Also the remaining sequences, which are less than 5000 bp are ignored.

Thanks and best regards,

Chain subcommands/operations

Hi, this is a simple feature suggestion which might be easy to implement and would speed up usage even further. I sometimes end up chaining different operations of seqkit using shell pipes. However, this involves serialization and deserialization with each pipe which is actually unnecessary. What about allowing to chain operations on the seqkit command line by doing:

seqkit -general-opts sub_command1 -opt1 ... sub_command2 -opt2 ...

I don't know how you pass sequence objects internally but if there is a general representation this should be no more than adjusting the command line parsing.

Keep up the good work, this is a brilliant piece of software!

Best,
Johannes

fx2tab slower than shell commands

fx2tab seems to be about 4 times slower than shell commands. Is that expected?

$ time (zcat R1_001.fastq.gz | paste - - - - | cut -f 1,2,4 > /dev/null)

real    0m3.000s
user    0m4.673s
sys     0m0.324s

$ time seqkit fx2tab R1_001.fastq.gz  > /dev/null

real    0m12.991s
user    0m13.025s
sys     0m0.202s

$ time seqkit -j 2 fx2tab R1_001.fastq.gz  > /dev/null

real    0m12.672s
user    0m12.693s
sys     0m0.233s

subseq --chr does not work with fasta headers containing spaces

Input is:

cat test.fa
>seq1   sample sequence 1
ACGT
>seq2   sample sequence 2
TGCA

seqkit subseq test.fa -r 1:3 works as expected:

>seq1   sample sequence 1
ACG
>seq2   sample sequence 2
TGC

seqkit subseq test.fa -r 1:3 --chr seq1 does not work because it does not skip after ID:

>seq1_1-3 seq1
sam

Feature Request: automatic quality score conversion

A common issue encountered by many involves determining and then converting quality scores from various older sequencing datasets to modern Phread+33 (Sanger / Illumina 1.8+) encoding. There does not yet appear to be a single unified system to perform both of these functions, as part of a larger pipeline on the Linux command line. seqtk itself provides the ability to convert from Illumina1.5 to Illumina 1.8+ formats, for instance, but does not convert from formats prior to Illumina 1.3. A useful Python script exists to guess the correct encoding. Conversion is further complicated by the fact that valid conversion between Sanger and Solexa involves a non-linear transformation and may require some precautions to ensure numerical stability.

SeqKit seems like an ideal toolkit to include this functionality, which many would likely find highly useful.

seqkit sample: line length

Hi Wei,
Is there a rationale for breaking the reads in a FASTQ file into 60-bp lines? I find that some downstream tools can not correctly recognize the sequence sections of the FASTQ files generated by seqkit sample command.
Thank you

seqkit doesn't consider strand under some circumstances

seqkit subseq doesn't work for "-" strand

Hi, I was using seqkit to extract ortholog sequences among several species, I found this issue when some species doesn't look like ortholog compare to others from time to time, the extracted sequence seems to be wrong, but seqkit considers +/- most of the time, I don't know what happened on that scaffold.

  • I was extracting a sequence using seqkit subseq --bed my.bed my.scaffold, in some locus, the "-" strand hasn't been considered, the extracted sequence won't get reversed and complemented
  • please remove the .txt suffix and test the following command : seqkit subseq --bed tmp.bed scaffold.fa

Best,
Lingfeng

scaffold.fa.txt
tmp.bed.txt

setting maximum memory usage to seqkit

Hi there,

I am using seqkit rmdup to reduce redundancy and file size of illumina fastq reads. Is there a way to limit the amount of memory that will be used by seqkit?

When I provide 48GB (maximum on my computational node) to process a fastq.gz of 30 GB size the process collapses soon, because seqkit exceeds this limit.

Thanks!

seqkit handles empty files inconsitently

Hi!

seqkit exits with 1 when getting an empty file. I use seqkit in scripts and makefiles, and sometimes the fasta files processed are empty and imho they should be treated like any file containing sequences.

seqkit also handles empty files and files just contain one newline differently:

$ touch file.fasta; seqkit seq file.fasta; echo $?
2018/08/01 11:34:36 EOF
1
$ echo > file.fasta; seqkit seq file.fasta; echo $?
0

Would be nice if the first command would exit with 0 without any messages? or to have a flag to enable this behaviour?

Thanks :)
Bela

N50 and L50 jargon is confusing

Prerequisites

  • [ X ] make sure you're are using the latest version by seqkit version
  • [ X ] read the usage

Describe your issue

Thanks for building seqkit, it is an extremely useful tool that I use every day.

seqkit stats -a produces N50 and L50 statistics. These labels are very confusing; 'N50' is the 'N50 length', the length of read such that 50% of the bases are in reads of this length or longer. 'L50' is the 'N50 number', the number of reads in this set. The term L50 has no connection with its meaning and in fact suggests it is to do with a length, which is not true. It would be much better to to use the terms 'N50 length' and 'N50 number' (or similar terms) to make the meaning of these statistics clear. I realise other tools use the same jargon but it is unclear and would be better replaced.

seqkit split consumes lots of memory

seqkit version v0.7.2

Describe your issue

I try to use seqkit to split a fastq.gz into small parts
the command is as follows:

seqkit split -j 4 -p 10 $fq -O split

$fq is about 2G, and it use about 10 minutes to finish the job, which is quite fast

the problem is that it consumes a lot of memory, about 60G

can you help reduce the memory it consumes.

one possible function

Hi Shenwei,

I really like your seqkit tool, a really neat implement in golang. I was using pyfaidx for some simple sequence query work, such as obtain sequence record based on the sequence ID. I am sure you should have this function implemented in your package, but I could not find the function description after skimming through your documentation. Could you please show me how to use this function in seqkit?

Thanks,

Jie

Calling seqkit from other go programs

Hi Wei Shen,

Thank you for your hard work in developing seqkit. It's really a nice set of tools for handing various sequence formats.

I want to use seqkit functions inside my go program just like I would do from the command-line.

For example, from the command line:

seqkit grep -p "header" big.fasta -o small.fasta

Now, I would like to do the same from inside my go program, something similar to:

import (
..
..
..
	"github.com/shenwei356/bio/seq"
	"github.com/shenwei356/bio/seqio/fastx"
	"github.com/shenwei356/breader"
	"github.com/shenwei356/xopen"
	"github.com/spf13/cobra"
)
..
..
..
seqkit.grep("header", bigFasta, smallFasta)

Is this possible?

sliding -g added empty characters to the last sequence.

Hi, dear shenwei

Your seqkit does help me a lot.

However, when using sliding command with greedy options (-g), it seems that the program added some null characters to the end of the last sequence ( in order to fill the sequence to the windows size? ).

For example: ( I am using seqkit v0.9.1 )
By runing
seqkit sliding -g -s 1000 -W 1000 chr7.fa -o - | seqkit fx2tab | tail -n 1 | seqkit tab2fx | less
I got:

>chr7_sliding:177932001-177933000
GGGCTTTAGGGTTTGGGGCTCGGGATTTAGAGTTTTGGATTTTGGGGTTCGAGGTTTAGG
GTTTCGGTTTGAGGGTTAGCGTCTTGGGTTTAATGTTTAGGGTTTGGAGTTTGGGATTTG
AGATTTTGTGTTTTGGGTGTCGGATTTTAGTTTTAGGGTTCGGTGTTTGGGGTTTATGCT
TTAGTATTTGGGGCTCGAGGATAAGGATTTTGGGGTTTGGGGTTTAGTGTGATGTAGCTC
ACATGAGGCTAAAAACTTGTTTCAGTGGCCAAAACATGGGTGGCAGGCGCGAGGCGTGAA
TTTTTTGTTCGACGACGAAAACATGTGTCGCGGGAGATCAAATTTGGAGTTTGGGGTTTG
GAGTTTTGTTTTCTGGTGATCGGGTTTTTGTTTTAGGGTTCGAGGTATAGGGTTCATAAT
TTAGGGTTCGGGGTTCAGGGTTTGGGCTTTAGGGTTTGGGGCTCGGTGTTTGGCGTTCGA
GGTTTAGGGTTTCGGTTTGAGGGTTAGCGTTTGGGGATTAAGGTTTAGGGTTTGGAGTTT
GGCATTTAAGATTTTGTGTTTTGGGGGTCAGATTTTAGTTTTAGGGTTCGGTGTTTTATG
GTTTAGGGTTTGGTGCTCGAGGTTTAGGGTTTTGGGGTTTAGGGTTTGCTGTGATGTAGC
TCATGGGAGGCCAAAACCAGCCTTACCTGTTTCAGTGGTGAAAATATGGGTGGCCGACGA
GAGGCATGAAATCTTTGTTCGATGACAAAAACATGTGTCGCAGAAGGCCAAAAATAGTAT
TACCGGCCTCCTCGGACCAAAACGTGTACTATGGTTTGGACAATAAAATTATACTGTCTT
GAAATTTACAGAAT^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

This "^@" in the less command seems to be some null character, it may not influence the downstream analysis, but it caused several empty lines. Any idea to avoid this?

Thank you.

Best wishes,
SongtaoGui

seqkit locate

Example 1, this is OK.

$ echo -ne ">test\nAACGcgcA" | seqkit locate -i -P -p "CGC"
seqID   patternName     pattern strand  start   end     matched
test    CGC     CGC     +       3       5       CGc
test    CGC     CGC     +       5       7       cgc

Example 2, this is OK.

$ echo -ne ">test\nGGCGGCGG" | seqkit locate -i -P -p "GGCGG"
seqID   patternName     pattern strand  start   end     matched
test    GGCGG   GGCGG   +       1       5       GGCGG
test    GGCGG   GGCGG   +       4       8       GGCGG

Example 3, position 5-6 is redundant.

$ echo -ne ">test\nAACGcgcA" | seqkit locate -i -P -p "(CG)+"
seqID   patternName     pattern strand  start   end     matched
test    (CG)+   (CG)+   +       3       6       CGcg
test    (CG)+   (CG)+   +       5       6       cg

Well, the only way is to remove redundant intervals by segment tree.

--two-pass flag only uses two sequnces in sort

Prerequisites

I am using v0.9.1-dev and have read "seqkit sort --help"

Describe your issue

I am following the examples given in the chapter on SEQKIT in the Biostar Handbook Bioinformatics course.
When I use the --two-pass (-2) flag with the sort function the program just returns two sequences. This does not happen when the two-pass flag is omitted.

Example with two-pass flag:
Fri Oct 05 snerx@kali-linux:~/BioPython/biostar-handbook-September-2017/Lectures/seqkit_fastq
$ seqkit sort --by-length --two-pass viral.1.1.genomic.fna.gz > viral.genomic.sorted_2pass.fa
[INFO] read and write sequences to tempory file: viral.1.1.genomic.fna.gz.fastx ...
[INFO] create and read FASTA index ...
[INFO] create FASTA index for viral.1.1.genomic.fna.gz.fastx
[INFO] read sequence IDs and lengths from FASTA index ...
[INFO] 2 sequences loaded
[INFO] sorting ...
[INFO] output ...

Example without two-pass flag:
$ seqkit sort --by-length viral.1.1.genomic.fna.gz > viral.genomic.sorted.fa
[INFO] read sequences ...
[INFO] 7 sequences loaded
[INFO] sorting ...
[INFO] output ...

Seems like a bug of some kind to me.

Regards

TEF
Bodø
Norway

seqkit faidx with multiple regions on same sequence does not work

With version 0.9.0 (and 0.8.0), it is not possible to extract multiple sub-sequences from the same sequence.

samtools works as expected.

To reproduce this 🐛

test.fasta

>test1
actgatgcggctagcatcgtatatgctagc
>test2
ggctagctagctagctagatgctagctagctgatcg

test.fasta.fai

test1	30	7	30	31
test2	36	45	36	37

✔️ Multiple regions on different sequences

seqkit faidx test.fasta test1:1-5 test2:5-10

>test1:1-5
actga
>test2:5-10
agctag

❌ Multiple regions on same sequence

seqkit faidx test.fasta test1:1-5 test1:5-10

>test1:5-10
atgcgg
>test1:5-10
atgcgg

Here are instructions on installing via homebrew

brew tap homebrew/science
brew tap tseemann/bioinformatics-linux
brew install seqkit

(it works for OSX and Linux despite the -linux tap name)
(eventually we will build from source and add to homebrew/science directly)

Bug: seqkit locate misses locations

It's a classic mistake in motif searching.

$ echo -ne ">seq\nACGACGACGA\n" | seqkit locate -i -p acga
seqID   patternName     pattern strand  start   end     matched
seq     acga    acga    +       1       4       ACGA
seq     acga    acga    +       7       10      ACGA

Position 4-7 is missing.

regular expression syntax

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

Could you kindly provide more details on syntax of regular expressed used by seqkit? Does it use the
perl syntax or python, or other languages? I find the examples are useful, however, limited.

Thanks!

Changing TMPDIR

When I tried to split a file,

$ seqkit split -f -i Pabies1.0-genome.fasta
...
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_7828943.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_2441276.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_6321954.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_1366762.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_2176025.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_6910928.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_5226871.fasta
[INFO] write 1 sequences to file: /tmp/tmph43ji8m2/Pabies1.0-genome.id_MA_5799640.fasta
No space left on device

I run out of disk space in /tmp. The following code didn't work:

$ TEMPDIR=./mytemp seqkit split -f -i Pabies1.0-genome.fasta

Is there anyway to change the temporary directory which seqkit uses?

Thanks.

使用seqkit split的功能,结果文件输出问题

我想均等切分fq文件,使用seqkit split -p fq,结果输出的fa格式变成了如下格式:
@CL100006359L1C001R001_1/2
AAAAATTGAACATCAGGATGCAGCTAGGAAAAGGGGTTCCTGGAAACACCCTCACAAGCT
TTGTAAGTAATACAACTAAGTAGAAATAAACTAATATAAA
+
FDDFBDF:9A8CF;D7=DDF8F*ADF>@;DE07?<7FF>'E6FEC8CFD568DFFDFFAC
E4E@DEEBFE?F(EE4FF>@<D&A:FDF,D1F@F7D;'EA
@CL100006359L1C001R001_11/2
CCTAGTTTTCTGCTGAGCCTGCCGTAAAGTATTTTCAGCACCTTCGACCGTGACAAAGCT
CCAACCTGACTGGAGGCATTGTGTAGCGAAGGGAGCTTGG
+
24=9>89446=1686&B-;6=638C60?=17<7,65):?7)8'1,%))+8'&(:6=95
3=&,&3/.'183F0=(.&<'C/B19=19)A9-3=9>;1.&

也就是说不是标准的fq格式了,这个要怎么设置才能换换成标准fq格式?

Enhancement request

Thank you for a wonderful parsing tool. After much usage, I would like to request following features.
Please add

  • extract sequence by number (for fasta file). Use case: I would like to extract n th file
  • Tail function. Head function is already implemented. Tail function is missing
  • inverse selection for bases in a sequence. Current version allows user to select bases at the start, in the middle and at the end. It would be difficult for user to choose first few bases and last few bases. This would help user in removing sequences in the start and end of a read or a long sequence
  • Allow user to search by stop codon *. Stop codon * is present in few of predicted sequences. Currently user cannot search by * in the sequences.

stats in machine readable format?

seqkit stat:
File.fastq.gz  FASTQ   DNA   1,091,110  162,522,654       35      149      151

The numbers have commas (,) in them and the results are space formatted to be easy for humans to read.

Do you have an --option to write in TSV without the commas?
So scripts can safely use cut,paste,mlr etc?

grep with pattern file not finding matches in sequence names

Hello,

I'm trying to use seqkit to retrieve a few sequences using a pattern file. However, seqkit is not returning any matches. For example, here are a few sequence names:

lcl|NC_001566.1_prot_NP_008082.1_22444 [gene=ND2] [db_xref=GeneID:807701] [protein=NADH dehydrogenase subunit 2] [protein_id=NP_008082.1] [location=502..1503] [gbkey=CDS]
lcl|NC_001566.1_prot_NP_008083.1_22445 [gene=COX1] [db_xref=GeneID:807695] [protein=cytochrome c oxidase subunit I] [protein_id=NP_008083.1] [location=1794..3359] [gbkey=CDS]
lcl|NC_001566.1_prot_NP_008084.1_22446 [gene=COX2] [db_xref=GeneID:807700] [protein=cytochrome c oxidase subunit II] [protein_id=NP_008084.1] [location=3618..4295] [gbkey=CDS]

And I have a list with a string per line:

ND2
COX1
COX2

When I use grep, it finds the expected matches:
grep -w -f list.txt input.fasta

But when I try finding matches with seqkit, I don't get the expected result. I have tried a couple of commands:

seqkit grep --by-name --pattern-file list.txt input.fasta
seqkit grep --id-ncbi --by-name --pattern-file list.txt input.fasta

I appreciate any help.

sort fasta file by sequence_name

I got a fasta file whose header are like

MG958507.1 Cladonia apodocarpa voucher NYBG:Lendemer 48789 mitochondrion, complete genome
MG941021.1 Cladonia petrophila voucher NYBG:Lendemer 49138 mitochondrion, complete genome
MG725377.1 Cladonia leporina voucher 6543 mitochondrion, complete genome
MG949117.1 Cladonia subtenuis voucher NYBG:Lendemer 49895 mitochondrion, complete genome

I want to sort it by seqname(all characters after the id like Cladonia leporina voucher 6543 mitochondrion, complete genome)
I used the command 'seqkit sort --id-regexp "/ (.+?)+ /" test.fa > test_1.fa'
but got a result like this

MG725377.1 Cladonia leporina voucher 6543 mitochondrion, complete genome
MG941021.1 Cladonia petrophila voucher NYBG:Lendemer 49138 mitochondrion, complete genome
MG949117.1 Cladonia subtenuis voucher NYBG:Lendemer 49895 mitochondrion, complete genome
MG958507.1 Cladonia apodocarpa voucher NYBG:Lendemer 48789 mitochondrion, complete genome
The result is sorted by id,
I have checked my regex and it's correct on a regex online website.
I want to get a result like this
MG958507.1 Cladonia apodocarpa voucher NYBG:Lendemer 48789 mitochondrion, complete genome
MG725377.1 Cladonia leporina voucher 6543 mitochondrion, complete genome
MG941021.1 Cladonia petrophila voucher NYBG:Lendemer 49138 mitochondrion, complete genome
MG949117.1 Cladonia subtenuis voucher NYBG:Lendemer 49895 mitochondrion, complete genome

what command line should I use?

seqkit grep -r not working consistently between -f and -p

Hi there,

I'm trying to extract a set of FASTA records by matching a string in the record descriptions. It seems to work with a single string using "-p", but if I put the same string in a file and use "-f", it doesn't work. See below. Using seqkit v0.5.2 on linux. Same issue happens on Mac OS X.

seqkit grep -n -r -p "DN12594_" latest_peptides.fasta | seqkit stat
file  format  type     num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   Protein         1      513      513      513      513

(working as intended)

less IDs.txt
DN12594_

hexdump IDs.txt
0000000 4e44 3231 3935 5f34 000a               
0000009
seqkit grep -n -r -f DNelson_ids3.txt latest_peptides.fasta | seqkit stat
file  format  type     num_seqs     sum_len  min_len  avg_len  max_len
-     FASTA   Protein    68,107  25,884,756       99    380.1    8,773

(Now it pulls out all the records, not intended behavior)

seqkit stat  latest_peptides
file                                                      format  type     num_seqs     sum_len  min_len  avg_len  max_len
151_Ppyr_v5_Trinity.fasta.transdecoder.pep_cleaned.fasta  FASTA   Protein    68,107  25,884,756       99    380.1    8,773

sliding command

Hi there:

I have been using seqkit for a couple of days and I have to say that it works like a charm, works very well. I have an issue for the "sliding" command. Would be great to add a parameter to the sliding command in order to skeep over those windows with length less than the windows size specified. For example, por the given sequence:

GHTELKAVTPDVSSMAMNHKADVEALVGATFSTFEPVSYSTQVVAGTNYKILVNVGNDMQ
VDMVVHKTLPHAGGNTSLTSAAHLPVQVQANVNLGVGIGL

A slidiing with windows size = 2 and step = 1, the last window will be only the letter "L".

I think that a simple condition would be enough for that option (and adding the argument). Maybe something like if e > l { continue } before the condition of line 114 works well.

Thanks in advance.

seqkit split with pair of FASTQ and check pair-endness

This is a feature request for seqkit, which I am exploring for some of our production pipelines:

Would it be possible to have the seqkit command split pairs of FASTQ files in n parts, but checking that the pair-endness of file1 and file2 is kept for each file? E.g. something like:

seqtk splitpe -1 Sample1_S1_L001_R1_001.fastq.gz -2 Sample1_S1_L001_R2_001.fastq.gz -p 4 -j 4

The current implementation is useful, but memory intensive for FASTQ (since two-pass mode does not support FASTQ format), and does not check for pair-endness after splitting the files, since it has to be done one file at a time, and the pair-endness would have to be checked with an external tool like seqtk mergepe.

TODO

  • add subcommand to reset start position for circular genome, like fasta_reset_start_position_for_circular_genome.pl
  • seqkit stat or add a new command seqkit stat2, add N50, N90.. column for FASTA file
  • filter sequence by sequence length
  • seqkit seq -g -G and seqkit stat -a -g : bug when given non-ASCII gaps letters

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.