This is a question, not an issue. I would like to convert a paired-end set of FAST

convert paired-end FASTQ to FASTA and split in one step

vkkodali commented on June 13, 2024

convert paired-end FASTQ to FASTA and split in one step

Comments (3)

shenwei356 commented on June 13, 2024

Just download the latest version.

shenwei356 commented on June 13, 2024

Will this always produce read_1.part_###.fasta with a matching set of reads in read_2.part_###.fasta?

Yes.

time seqkit fq2fa read_1.fastq.gz \
     | seqkit split2 --by-size 500000 --out-dir split_seqs --by-size-prefix read_1.part_ --extension .gz
time seqkit fq2fa read_2.fastq.gz \
     | seqkit split2 --by-size 500000 --out-dir split_seqs --by-size-prefix read_2.part_ --extension .gz
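The same pipeline has to be run once per mate (read_1 and read_2), so it can be wrapped in a small function. The sketch below is only illustrative: `split_mate` and the `DRY_RUN` variable are hypothetical names introduced here, not seqkit features, and the seqkit flags are exactly those quoted above.

```shell
# Sketch only: split_mate and DRY_RUN are hypothetical names, not part of seqkit.
# With DRY_RUN set, the command is printed instead of executed.
split_mate() {
    mate=$1   # mate number: 1 or 2
    if [ -n "${DRY_RUN:-}" ]; then
        echo "seqkit fq2fa read_${mate}.fastq.gz | seqkit split2 --by-size 500000 --out-dir split_seqs --by-size-prefix read_${mate}.part_ --extension .gz"
    else
        seqkit fq2fa "read_${mate}.fastq.gz" \
            | seqkit split2 --by-size 500000 --out-dir split_seqs \
                  --by-size-prefix "read_${mate}.part_" --extension .gz
    fi
}

# Usage: process both mates with identical settings.
# for mate in 1 2; do split_mate "$mate"; done
```

Using identical settings for both mates is what keeps the part numbers aligned, provided both input files list the reads in the same order.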

Is there a way to check/validate the split files to make sure the reads are in correct order?

Use seqkit pair, which matches up paired-end reads from two FASTQ files; with -u it also saves any unpaired reads.

$ seqkit pair -1 split_seqs/read_1.part_010.fasta.gz -2 split_seqs/read_2.part_010.fasta.gz -u
[INFO] 500000 paired-end reads saved to split_seqs/read_1.part_010.paired.fasta.gz and split_seqs/read_2.part_010.paired.fasta.gz
[INFO] no unpaired reads in split_seqs/read_1.part_010.fasta.gz
[INFO] no unpaired reads in split_seqs/read_2.part_010.fasta.gz

$ seqkit sum split_seqs/read_[12].part_010.fasta.gz split_seqs/read_[12].part_010.paired.fasta.gz | more
processed files:  4 / 4 [======================================] ETA: 0s. done
seqkit.v0.1_DLS_k0_e734aaf2f526e889d5da00a7df2ccdde     split_seqs/read_1.part_010.fasta.gz
seqkit.v0.1_DLS_k0_f97ee32096bade173d093c37f4f592c8     split_seqs/read_2.part_010.fasta.gz
seqkit.v0.1_DLS_k0_e734aaf2f526e889d5da00a7df2ccdde     split_seqs/read_1.part_010.paired.fasta.gz
seqkit.v0.1_DLS_k0_f97ee32096bade173d093c37f4f592c8     split_seqs/read_2.part_010.paired.fasta.gz
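
As a tool-independent cross-check, the read IDs of the two files in a part can also be compared in order. This is a minimal sketch using only gzip and awk. The helper names `fasta_ids` and `same_read_order` are hypothetical, and it assumes that both mates share the same ID (the first word of the header), optionally followed by a /1 or /2 suffix.

```shell
# Hypothetical helpers (not part of seqkit).

# Print the read IDs of a gzipped FASTA file, one per line, in file order.
# A trailing /1 or /2 mate suffix is removed so that mates compare equal.
fasta_ids() {
    gzip -dc "$1" | awk '/^>/ { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }'
}

# Exit status 0 only if both files list the same IDs in the same order;
# 2 if either file is unreadable.
same_read_order() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    [ "$(fasta_ids "$1")" = "$(fasta_ids "$2")" ]
}

# Usage (file names follow the naming used in this thread):
# same_read_order split_seqs/read_1.part_010.fasta.gz \
#                 split_seqs/read_2.part_010.fasta.gz && echo "order OK"
```

Unlike the checksum comparison above, this only looks at headers, so it reports whether the order matches but says nothing about the sequences themselves.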

vkkodali commented on June 13, 2024

I am using seqkit version 2.3.0 and I get the following error when I use seqkit pair with FASTA files:

$ seqkit stats split_seqs/SRR25005537_1.part_001.fasta.gz split_seqs/SRR25005537_2.part_001.fasta.gz
file                                        format  type  num_seqs      sum_len  min_len  avg_len  max_len
split_seqs/SRR25005537_1.part_001.fasta.gz  FASTA   DNA    500,000  125,500,000      251      251      251
split_seqs/SRR25005537_2.part_001.fasta.gz  FASTA   DNA    500,000  125,500,000      251      251      251
$ seqkit pair -1 split_seqs/SRR25005537_1.part_001.fasta.gz -2 split_seqs/SRR25005537_2.part_001.fasta.gz -u 
[ERRO] fastq files needed
