opengene / fastp
An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
License: MIT License
Hi all,
Trying to run fastp on a PE150 sample.
Here is the exact line I'm running:
fastp -i Emx1_1_11_CTRL_USPD16084012-4_HHG33BBXX_L6_1.fq.gz -I Emx1_1_11_CTRL_USPD16084012-4_HHG33BBXX_L6_2.fq.gz -o r1.fq.gz -O r2.fq.gz
Here is the error I get:
ERROR: 'r2.fq.gz' is a folder, not a file, quit now
I have no problems running fastp on either of these fastq files in single-end mode. Also tried using --out2 instead of -O, I get the same result.
Any idea how I can get this to run?
Best,
David
When I run fastp there is no output until the end.
I would like to see what is happening.
Would it be possible to add some progress messages?
-V / --verbose
Thanks
fastqreader.cpp:32: undefined reference to `gzoffset'
./obj/peprocessor.o: In function `PairEndProcessor::initOutput()':
peprocessor.cpp:35: undefined reference to `gzbuffer'
peprocessor.cpp:38: undefined reference to `gzbuffer'
collect2: error: ld returned 1 exit status
make: *** [fastp] Error 1
This is not really a bug, but the JSON and HTML reports are not generated if a directory is given to the --json and --html options.
I wanted to write the JSON and HTML reports into a specific directory under their default names.
I passed a directory to the --json and --html options, but no reports were generated.
Maybe an --output_directory option could be added (defaulting to the current directory), so that --out1, --out2, --json and --html would only need to set file names. Just an idea 🙂!
With version 0.13.0, I ran the following command:
fastp -z 1 -i test_1.fq -I test_2.fq --out1 out.R1.fq.gz --out2 out.R2.fq.gz
and got the error "'out.R1.fq.gz' is not a writable file, quit now".
If I run touch out.R1.fq.gz first, the error no longer occurs.
When the DNA library inserts are very short, most read pairs are likely to overlap.
Can fastp stitch these reads together (instead of just correcting errors)?
Input R1 and R2 would then produce output R1, R2 and SR (stitched, longer single-end reads).
Hi, I would like to know the rRNA content of my data. Can fastp calculate it?
Hi sfchen,
Today I compared three tools (fastp, cutadapt, trimmomatic) and found that fastp is very fast but does not remove adapters cleanly.
I have uploaded my results and hope you find them useful!
compare_software.xlsx
sfchen commented:
Thanks for the result, from the data, I can see:
for short adapters (7bp), Trimmomatic removes the most adapter, then fastp removes less, and Cutadapt removes the least.
for longer adapters (>7bp), fastp removes much more than Trimmomatic and Cutadapt.
in total, fastp removes the most adapters.
Am I right?
Reply:
Hi sfchen,
My sample's real adapter sequence is GATCGGAAGAGCACACGTCTGAACTCCAGTCAC********ATCTCGTATGCCGTCTTCTGCTTG, where '*' marks the 8bp barcode.
And the Trimmomatic adapter file is:
adapter.list.xlsx
After getting the clean fastq data, I split the adapter sequence into short substrings, like:
7bp (AGATCGG)
8bp (AGATCGGA)
9bp (AGATCGGAA)
10bp (AGATCGGAAG)
11bp (AGATCGGAAGA)
12bp (AGATCGGAAGAG)
13bp (AGATCGGAAGAGC)
and then counted the reads that contain each of these adapter substrings.
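The counting step described above can be sketched in a few lines of Python. This is only an illustration of the approach: the function names and the three toy reads are made up here and are not part of the attached spreadsheet.

```python
from typing import Dict, Iterable, List

# 13 bp prefix listed in the post; shorter prefixes are derived from it.
ADAPTER = "AGATCGGAAGAGC"


def adapter_prefixes(adapter: str, min_len: int = 7) -> List[str]:
    """Return the prefixes of `adapter` from min_len up to its full length."""
    return [adapter[:n] for n in range(min_len, len(adapter) + 1)]


def count_reads_with_prefixes(seqs: Iterable[str], prefixes: List[str]) -> Dict[str, int]:
    """Count how many sequences contain each prefix at least once."""
    counts = {p: 0 for p in prefixes}
    for seq in seqs:
        for p in prefixes:
            if p in seq:
                counts[p] += 1
    return counts


# Toy reads (hypothetical data): the first carries the full 13 bp prefix,
# the second only the 7 bp prefix, the third none.
reads = ["TTTTAGATCGGAAGAGCAAAA", "CCCCAGATCGGTTTT", "ACGTACGTACGT"]
counts = count_reads_with_prefixes(reads, adapter_prefixes(ADAPTER))
for prefix, n in counts.items():
    print(len(prefix), prefix, n)
```

Note that a 7 bp substring also matches genuine genomic sequence fairly often, so counts for the shortest prefixes overestimate residual adapter.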
One of the great things about FastQC is that MultiQC can be used to integrate all the quality control data into a single useful HTML.
Is this sort of integration available for Fastp or will it be implemented in the future?
I am benchmarking fastp against other read trimmers using the workflow I developed for the Atropos paper (https://github.com/jdidion/atropos/tree/master/paper/workflow). I find that fastp has a high rate of read overtrimming. Example fastq input and output are attached. The command I used is:
fastp
-i {fastq1} -I {fastq2} -o {prefix}.1.fq.gz -O {prefix}.2.fq.gz
--adapter_sequence {adapter1} --adapter_sequence_r2 {adapter2}
--thread {threads} --length_required 25 --disable_quality_filtering
Nearly all of these overtrimming events involve the spurious removal of up to 10 bases from one or both reads:
I suspect this might be due to overzealous alignment of the reads to each other, and could probably be fixed with an option to require a minimum insert overlap before trimming. Another approach (which is offered as an option in Atropos) is to compute the random match probability of each alignment and compare against a user-specified threshold value.
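As a rough illustration of the random-match idea, the sketch below uses a plain binomial model: each aligned position matches by chance with probability 0.25 (uniform, independent bases, which is an assumption), and the alignment is accepted only if the probability of seeing at least that many matches by chance is below a threshold. The function name and the 1e-6 threshold are hypothetical, and this is not claimed to be the exact formula used by Atropos.

```python
from math import comb


def random_match_probability(matches: int, length: int, p: float = 0.25) -> float:
    """Probability of at least `matches` matching positions out of `length`
    when each position matches independently with probability `p`."""
    return sum(
        comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(matches, length + 1)
    )


def accept_alignment(matches: int, length: int, threshold: float = 1e-6) -> bool:
    """Accept an overlap only if it is unlikely to have arisen by chance."""
    return random_match_probability(matches, length) <= threshold


# A 3-base perfect match is easy to get by chance; a 30-base one is not.
print(random_match_probability(3, 3))
print(accept_alignment(3, 3), accept_alignment(30, 30))
```

Under this model a short overlap of a few bases is never accepted, which is the behaviour that would prevent the spurious removal of a handful of bases.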
Hi,
I am seeing a large portion of NextSeq reads that have the poly-G tail, and have successfully trimmed that off with fastp. Most sequences also have a poly-A run just before the poly-G tail, which is apparently due to reduction in signal strength (lower quality) from clusters before they fail altogether (see https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/). I tried to use the -x function in the same trimming command, but it doesn't work (presumably because it isn't at the end of the original read). I suppose I could try it as a second trimming process, but wanted to know if this is a common issue that people with poly-G reads from NextSeq data see, and if so, could it be incorporated into the options.
The data look like this after trimming:
@NS500704:337:HGG2HBGX3:1:11101:10170:2997 1:N:0:GACGAGG+CGGAAT
GCAAGGTCTTAATCAAATTTTGTCAGCTGCAAGATCGAAGAGCACACGTCTGAACTCCAGTCACGACGAGGATCTCGTATGCCGTCTTCTGCGTGAAAAAAAAAA
+
AAAAAEEEEEAEEEE6EEAEEEEEEEEEEEEEEEEEEE/EEEEEEEEAEEEEE/EEEEEEEEEEEAEE/EEEA</AE/AEE<E</6<E//EE/EE/<AAEEEEA/
@NS500704:337:HGG2HBGX3:1:11101:8528:3344 1:N:0:GACGAGG+CGGAAT
ACAGAAACAGGTGCACAGTTCCCCATCAAGATCGGAAGACACACGTCTGAACTCCAGTCACGACGAGGCTCTCGTATGCCGTCTTCTGCATGAAAAAAAAAA
+
AAAAAAEEEEEE6E/AEEE/EEEEEEEEEE/EEEEEEEEE/EAEEEEEE/AE/EAE/AEEEEEEA/EE/AE<///A<E<E//<EE//////A/EEEEEE///
Thanks,
Phil Morin
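The "second trimming process" mentioned above could look roughly like the following sketch, which removes a trailing run of one base once the poly-G tail is already gone. The function name and the minimum run length of 10 are assumptions made for illustration; this is not how fastp's -x option is implemented.

```python
from typing import Tuple


def trim_trailing_run(seq: str, qual: str, base: str = "A", min_len: int = 10) -> Tuple[str, str]:
    """Remove a trailing run of `base` if it is at least `min_len` long.

    Caveat: a genuine `base` directly before the run is removed as well,
    because it cannot be told apart from the artefact.
    """
    run = len(seq) - len(seq.rstrip(base))
    if run >= min_len:
        keep = len(seq) - run
        return seq[:keep], qual[:keep]
    return seq, qual


# Toy read ending in ten A's, similar to the tails shown in the post.
seq, qual = trim_trailing_run("CTTCTGCGTG" + "A" * 10, "E" * 20)
print(seq, qual)
```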
I think these two filenames may always be the same except the suffix
Thank you for making your wonderful tool!
For dual-UMI experiments, there may/should be different UMI tags on the forward and reverse read of a pair. Is there an option (now or in development) to remove the UMI tags from each read and place them on both of the resultant reads? Downstream tools require that the read names be the same so if there are different UMI tags on the forward and reverse of a pair, it will fail. Instead it should have the read name, followed by a delimiter between the forward and reverse UMI tags.
For instance, in fastq_1.fq.gz
read_1_name:etc:etc:etc:etc:etc:etc:read_1_tagread_2_tag
And in the pair, fastq_2.fq.gz
read_2_name:etc:etc:etc:etc:etc:etc:read_1_tagread_2_tag
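A minimal Python sketch of the requested behaviour follows. It assumes the UMI is the first `umi_len` bases of each read and that "_" is an acceptable delimiter between the two tags; both are assumptions for illustration, as are the function and variable names.

```python
from typing import Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def tag_pair_with_umis(r1: Record, r2: Record, umi_len: int,
                       delimiter: str = "_") -> Tuple[Record, Record]:
    """Cut the UMI off both reads and append the combined tag to both names."""
    umi1, umi2 = r1[1][:umi_len], r2[1][:umi_len]
    tag = f"{umi1}{delimiter}{umi2}"

    def retag(rec: Record) -> Record:
        header, seq, qual = rec
        # Keep any comment after the first space (e.g. "1:N:0:...") untouched.
        name, sep, comment = header.partition(" ")
        return (f"{name}:{tag}{sep}{comment}", seq[umi_len:], qual[umi_len:])

    return retag(r1), retag(r2)


out1, out2 = tag_pair_with_umis(
    ("@r1 1:N:0", "AAAACCGGTT", "IIIIJJJJJJ"),
    ("@r1 2:N:0", "TTTTGGCCAA", "JJJJIIIIII"),
    umi_len=4,
)
print(out1)
print(out2)
```

Both output reads end up with the same name part, so downstream tools that compare read names across the pair see identical identifiers.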
Dear Developer,
I have paired-end fastq files: R1.fq and R2.fq contain 34225961 read pairs (10.2G bases in total), and R1.fq is 12Gb in size.
When I run the fastp command:
fastp -i R1.fq -I R2 -o trim.R1.fastq.gz -O trim.R2.fastq.gz -5 -3 -M 30 -q 30 -l 36 -n 5 -c --html trim.html --json trim.json --report_title "Fastp Report" --thread 10 > trim.log
I get the error "Segmentation fault (core dumped)".
But when I remove either the '-5' or the '-3' option, no error is reported.
So I extracted 100000 reads to build test.1.fq and test.2.fq, and ran fastp with the same command:
fastp -i test.1.fq -I test.2.fq -o trim.R1.fastq.gz -O trim.R2.fastq.gz -5 -3 -M 30 -q 30 -l 36 -n 5 -c --html trim.html --json trim.json --report_title "Fastp Report" --thread 10 > trim.log
It ran OK without error.
How can I solve this problem?
This parameter seems to require uppercase letters. For instance:
$ fastp -i test-umi_1.fastq.gz -I test-umi_2.fastq.gz -o test-r1-out.fastq -O test-r2-out.fastq -U --umi_loc=read1 --umi_len=8 --umi_prefix=mbc
ERROR: UMI prefix can only have characters and numbers, but the given is: mbc
But, uppercase MBC works fine
Hello,
We are using unmapped bam files for storing and archiving our sequence data. It would be really nice if fastp can take bam files as input and output fastq/bam files.
Best,
Bekir
Shifu;
Thanks for this great tool and adding pre-processing for UMIs. I've been looking for faster options to replace our use of umis (https://github.com/vals/umis) for pre-processing UMI outputs and adding into read headers. We typically end up with UMIs in a 3rd file as outputs from bcl2fastq when the UMIs are present in the input reads, and I wondered if this is possible to support?
I had a quick dig into the code to start implementing but realized you have specialized iterators for pairs so didn't want to break too much by trying to have a 3 input iterator, thinking there might be a better way to integrate.
Here is an example case with R1/R3 as the first/second read pair and R2 as the UMI:
https://s3.amazonaws.com/chapmanb/testcases/fastp_umi_example.tar.gz
Thanks for any thoughts and suggestions for processing these with fastp.
Could you add a function to deduplicate reads for de novo analysis? It seems that only FastUniq can do this at the moment.
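To make the request concrete, here is a small sketch of exact-duplicate removal for read pairs, keyed on the two sequences. It is an illustration only: it keeps every distinct pair in memory, ignores sequencing errors, and is not a description of how FastUniq works.

```python
from typing import Iterable, Iterator, Set, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def dedup_pairs(pairs: Iterable[Tuple[Record, Record]]) -> Iterator[Tuple[Record, Record]]:
    """Yield only the first pair seen for each (R1 sequence, R2 sequence)."""
    seen: Set[Tuple[str, str]] = set()
    for r1, r2 in pairs:
        key = (r1[1], r2[1])
        if key not in seen:
            seen.add(key)
            yield r1, r2


pairs = [
    (("@a/1", "ACGT", "IIII"), ("@a/2", "TTGG", "IIII")),
    (("@b/1", "ACGT", "HHHH"), ("@b/2", "TTGG", "HHHH")),  # same sequences as pair a
    (("@c/1", "ACGT", "IIII"), ("@c/2", "CCAA", "IIII")),
]
unique = list(dedup_pairs(pairs))
print([r1[0] for r1, _ in unique])
```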
Shifu;
Congrats on the paper in bioRxiv and thanks for all the great work on fastp. We've been working on improving the runtimes for somatic variant calling workflows and exploring quality and polyX trimming. We did a test run with fastp and atropos and found that the major improvements in runtime were due to removal of polyX sequences at the 3' ends of reads:
https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim
We'd used the new polyG trimming functionality (thank you), but a crude method of 3' polyA/T/C adapter removal, which appears to be less effective with fastp compared to atropos trimming. When additional polyX stretches get removed we get much better runtimes for alignment and variant calling.
I saw general polyX and low complexity trimming are on the roadmap for fastp and would like to express my support for this. We've been making great use of fastp for adapter conversion and would like to offer trimming as part of an effort to speed up alignment and variant calling both on NovaSeqs and more generally.
As a secondary help for integration, is streaming trimming a possibility for paired ends? To help improve preparation runtimes I'd been thinking of including trimming and streaming directly into bwa/minimap2 alignment, or being able to stream outputs into bgzip so we can index and parallelize variant calling.
Thanks again for all the work on fastp.
Can you add the Q20 or Q30 ratio at every position to xxx.json?
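For reference, the requested per-position statistic can be computed as in the sketch below. The function name and the Phred+33 default are assumptions; this does not describe the layout of fastp's JSON report.

```python
from typing import Iterable, List


def per_position_q_ratio(quals: Iterable[str], threshold: int, offset: int = 33) -> List[float]:
    """For each read position, return the fraction of bases with quality >= threshold."""
    totals: List[int] = []
    passed: List[int] = []
    for q in quals:
        for i, c in enumerate(q):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                passed.append(0)
            totals[i] += 1
            if ord(c) - offset >= threshold:
                passed[i] += 1
    return [p / t for p, t in zip(passed, totals)]


# 'I' is Q40 and '5' is Q20 in Phred+33.
q30 = per_position_q_ratio(["II5", "I5"], threshold=30)
q20 = per_position_q_ratio(["II5", "I5"], threshold=20)
print(q30, q20)
```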
Hello,
I have a question about the base trimming of fastp. Does it have an option for low-quality base trimming from both sides like in trimmomatic (LEADING and TRAILING options)? I would like to trim bad bases (Q<20) for both sides of the reads.
The "--cut_by_quality5/3" options in fastp and the "SLIDINGWINDOW" option in Trimmomatic seem quite different. The former trims the bases in a window if their quality is low, while with the latter, if the low-quality bases are at the beginning of the read, the whole read will be removed.
For example:
Input fq file: Test.fq (The last 4 bases for read 1 and first 4 bases for read 2 are bad ones)
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeEFCB
@2\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
EFCBeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
##If run with LEADING and TRAILING options in Trimmomatic (Exactly what I want, the bad bases are removed)
java -jar $Trimmomatic SE -phred64 Test.fq tt.fq LEADING:20 TRAILING:20
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGC
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@2\1
ATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
##If run with SLIDINGWINDOW option in Trimmomatic (The 2nd read will be removed totally, as the bad bases are at the beginning of the read)
java -jar $Trimmomatic SE -phred64 Test.fq tt.fq SLIDINGWINDOW:4:20
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGC
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
fastp --phred64 --in1 Test.fq --out1 Test_Trimed.fq --cut_by_quality5 --cut_by_quality3 --cut_window_size 4 --cut_mean_quality 20
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCAT
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF&'
@2\1
GATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Thanks!
Best,
Wenyu
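The LEADING/TRAILING behaviour asked for above amounts to removing bases from each end until the first base at or above the threshold is reached. A minimal Python sketch, with a hypothetical function name and using the two reads from the example (Phred+64), could be:

```python
from typing import Tuple


def trim_ends_by_quality(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop bases below `min_q` from both ends; interior bases are never touched."""
    scores = [ord(c) - offset for c in qual]
    start = 0
    while start < len(scores) and scores[start] < min_q:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


seq = "AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG"
# Read 1: last four bases are bad; read 2: first four bases are bad.
print(trim_ends_by_quality(seq, "e" * 40 + "EFCB", min_q=20, offset=64))
print(trim_ends_by_quality(seq, "EFCB" + "e" * 40, min_q=20, offset=64))
```

Unlike a sliding-window mean, this per-base check does not let neighbouring high-quality bases rescue a bad base at the very end of the read.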
Amplicon sequencing uses a set of artificial, amplicon-specific primers. If a read begins with one of these primers, it is a target read and the artificial primer should be removed. Otherwise the read should be filtered out.
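The rule described above can be sketched as follows. Exact prefix matching and the names used here are assumptions for illustration; a practical implementation would probably need to tolerate mismatches in the primer.

```python
from typing import Iterable, Optional, Tuple


def strip_primer(seq: str, qual: str, primers: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return the read without its leading primer, or None if no primer matches."""
    for primer in primers:
        if seq.startswith(primer):
            return seq[len(primer):], qual[len(primer):]
    return None  # not a target read: the caller should discard it


primers = ["ACGTAC", "TTGGCC"]  # hypothetical primer set
print(strip_primer("ACGTACGGGTTT", "IIIIIIJJJJJJ", primers))
print(strip_primer("GGGGGGGGGTTT", "IIIIIIJJJJJJ", primers))
```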
Would love a Galaxy wrapper so it can be used in place of trimmomatic in our current workflows
Hi,
I noticed that after trimming with fastp -g -p -c, the trimmed fq.gz files are larger than the raw ones. About 0.55 M reads were filtered.
Perhaps this tool already supports this, but if not, it would be very useful to implement automatic quality score format detection and conversion. For further details, please see my analogous request for SeqKit, for which this has now been fully-implemented.
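A common heuristic for guessing the quality encoding looks only at the range of quality characters. The sketch below is an assumption-laden illustration (the cut-offs 59 and 74 are the usual rule-of-thumb values, and the function name is made up); it is not a description of what fastp or SeqKit actually do.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_lines: Iterable[str]) -> Optional[int]:
    """Guess the Phred offset (33 or 64) from observed quality characters.

    Returns None when the observed range fits both encodings.
    """
    codes = [ord(c) for line in quality_lines for c in line]
    if not codes:
        return None
    if min(codes) < 59:   # characters below ';' cannot occur in +64 encodings
        return 33
    if max(codes) > 74:   # above 'J' is unusual for Illumina Phred+33 data
        return 64
    return None


print(guess_phred_offset(["eeeeEFCB"]))       # lowercase letters suggest +64
print(guess_phred_offset(["AAAAAEEEEE6E/"]))  # '/' and '6' imply +33
```

In practice such a guess should be made on a sample of many reads, since a handful of high-quality reads can fall entirely inside the ambiguous range.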
Hi,
This looks like a great tool and I would like to give it a try, but I encountered an issue when attempting to install it. This issue might be due to our old server, which is a CentOS6.9 with gcc 4.8.2. Are there other system parameters you want to know? Installation went fine on my own Ubuntu 17.10.
I downloaded release 0.5.0 (the same happens after cloning from GitHub) and executed make:
g++ -std=c++11 -g -I./inc -O3 -c src/adaptertrimmer.cpp -o obj/adaptertrimmer.o
g++ -std=c++11 -g -I./inc -O3 -c src/evaluator.cpp -o obj/evaluator.o
g++ -std=c++11 -g -I./inc -O3 -c src/fastqreader.cpp -o obj/fastqreader.o
g++ -std=c++11 -g -I./inc -O3 -c src/filter.cpp -o obj/filter.o
g++ -std=c++11 -g -I./inc -O3 -c src/filterresult.cpp -o obj/filterresult.o
g++ -std=c++11 -g -I./inc -O3 -c src/htmlreporter.cpp -o obj/htmlreporter.o
g++ -std=c++11 -g -I./inc -O3 -c src/jsonreporter.cpp -o obj/jsonreporter.o
g++ -std=c++11 -g -I./inc -O3 -c src/main.cpp -o obj/main.o
g++ -std=c++11 -g -I./inc -O3 -c src/options.cpp -o obj/options.o
g++ -std=c++11 -g -I./inc -O3 -c src/overlapanalysis.cpp -o obj/overlapanalysis.o
g++ -std=c++11 -g -I./inc -O3 -c src/peprocessor.cpp -o obj/peprocessor.o
g++ -std=c++11 -g -I./inc -O3 -c src/processor.cpp -o obj/processor.o
g++ -std=c++11 -g -I./inc -O3 -c src/read.cpp -o obj/read.o
g++ -std=c++11 -g -I./inc -O3 -c src/seprocessor.cpp -o obj/seprocessor.o
g++ -std=c++11 -g -I./inc -O3 -c src/sequence.cpp -o obj/sequence.o
g++ -std=c++11 -g -I./inc -O3 -c src/stats.cpp -o obj/stats.o
g++ -std=c++11 -g -I./inc -O3 -c src/threadconfig.cpp -o obj/threadconfig.o
g++ -std=c++11 -g -I./inc -O3 -c src/unittest.cpp -o obj/unittest.o
g++ -std=c++11 -g -I./inc -O3 -c src/writer.cpp -o obj/writer.o
g++ ./obj/adaptertrimmer.o ./obj/evaluator.o ./obj/fastqreader.o ./obj/filter.o ./obj/filterresult.o ./obj/htmlreporter.o ./obj/jsonreporter.o ./obj/main.o ./obj/options.o ./obj/overlapanalysis.o ./obj/peprocessor.o ./obj/processor.o ./obj/read.o ./obj/seprocessor.o ./obj/sequence.o ./obj/stats.o ./obj/threadconfig.o ./obj/unittest.o ./obj/writer.o -lz -lpthread -o fastp
./obj/peprocessor.o: In function `PairEndProcessor::initOutput()':
/home/wdecoster/bin/fastp-0.5.0/src/peprocessor.cpp:32: undefined reference to `gzbuffer'
/home/wdecoster/bin/fastp-0.5.0/src/peprocessor.cpp:35: undefined reference to `gzbuffer'
./obj/fastqreader.o: In function `FastqReader::getBytes(unsigned long&, unsigned long&)':
/home/wdecoster/bin/fastp-0.5.0/src/fastqreader.cpp:38: undefined reference to `gzoffset'
collect2: error: ld returned 1 exit status
make: *** [fastp] Error 1
Do you have suggestions on how to fix this?
Cheers,
Wouter
Hi,
I'm wondering if it is possible to add a new split option: is it possible to split files by a certain number of reads and not in a certain number of sub-files?
It could be useful if you want to parallelize and standardize the downstream alignments (guess the execution time of each sub-sample) and you don't know the size of your input fastq.gz file...
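Splitting by a fixed number of reads, rather than into a fixed number of files, can be sketched like this. The helper name is hypothetical and the sketch assumes strict four-line FASTQ records; it only groups lines and leaves writing the sub-files to the caller.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_fastq_lines(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of lines holding at most `reads_per_chunk` four-line records."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


# Five toy reads split into chunks of two reads each.
lines = []
for i in range(5):
    lines += [f"@read{i}", "ACGT", "+", "IIII"]
chunks = list(chunk_fastq_lines(lines, reads_per_chunk=2))
print([len(c) // 4 for c in chunks])
```

Because the input is consumed as a stream, the total size of the input file does not need to be known in advance.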
Hi,
Would it be possible to have an option to add a report name at the top of the report, just before the summary? I use the -h option, and having this specified name at the top of the report would help. Alternatively, a new option could be added: -R Report name included in generated report (string [= reportname])
Thanks,
B.
Even when I set -w to a different number, the program keeps running on 3 cores...
Hi, I have read the source of adaptertrimmer. The principle is to find the largest overlap between the PE reads while allowing a maximum number of differences, and to remove the rest as adapter. But there is a problem with comparing the PE reads base by base: if R1 contains an erroneous indel, all the following bases differ from R2, so the number of differences increases, the detected overlap length decreases, and the adapter cannot be removed.
So could you add an argument to account for erroneous indels in the next version?
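The effect described above can be reproduced with a deliberately simplified overlap search. The sketch below is not fastp's adaptertrimmer code: it only handles the case where the insert is shorter than the reads, compares bases position by position, and uses made-up sequences and parameter names.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_insert_length(r1: str, r2: str, min_overlap: int = 10,
                       max_diff: int = 2) -> Optional[int]:
    """Return the largest L such that the first L bases of R1 and the last L
    bases of revcomp(R2) differ at no more than `max_diff` positions."""
    rc2 = reverse_complement(r2)
    for length in range(min(len(r1), len(rc2)), min_overlap - 1, -1):
        a = r1[:length]
        b = rc2[len(rc2) - length:]
        if sum(x != y for x, y in zip(a, b)) <= max_diff:
            return length
    return None


insert = "ACGTTGCAAGGCTTAACCGGATCGATTACG"        # 30 bp toy insert
adapter1 = adapter2 = "AGATCGGAAG"               # toy adapter read-through
r1 = insert + adapter1
r2 = reverse_complement(insert) + adapter2
clean = find_insert_length(r1, r2)               # finds the 30 bp insert

# Delete one base from R1: every later base is shifted against R2.
r1_indel = insert[:5] + insert[6:] + adapter1 + "A"
shifted = find_insert_length(r1_indel, r2)       # no longer reports 30
print(clean, shifted)
```

A gapped alignment, or retrying the comparison with a one-base shift after the mismatch count is exceeded, would be two possible ways to tolerate such an indel.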
Hello,
First, let me just say I have been working with fastp for over a month now and am very pleased with the performance and the direction the tool is going. It also appears to be quite accurate, and I look forward to the forthcoming publication.
However, I have noticed that the adapter trimming does not efficiently trim under certain conditions where quality dips but there is still an exact match to the adapters on both read pairs. In this case I have synthetic data centered at 35Q and 20Q.
I was going through a tool evaluation comparison using this data and found that fastp excels in most cases with >99% sensitivity trimming and near perfect specificity. However, as the qualities drop near 20Q the sensitivity also drops dramatically. Some tools have no loss of adapter clipping with respect to quality shifting. Note my usage and that I am not doing any quality trimming.
Any ideas on how we can better address the below scenario?
Here is one example of where clipping fails, adapter starts at pos 114. You can see that fastp successfully trims in one case but not the other. PS- I have more examples if needed.
Adapters:
> TruSeq_Index_Adapter_5p
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
> TruSeq_Index_Adapter_3p
ATCTCGTATGCCGTCTTCTGCTTG
fastp -Q -i 1m_150bp_35q_R1.fq.gz -I 1m_150bp_35q_R2.fq.gz -o fastp.1m_150bp_35q_R1.fastq.gz -O fastp.1m_150bp_35q_R2.fastq.gz
fastp -Q -i 1m_150bp_20q_R1.fq.gz -I 1m_150bp_20q_R2.fq.gz -o fastp.1m_150bp_20q_R1.fastq.gz -O fastp.1m_150bp_20q_R2.fastq.gz
Post-trimming results
Read1_35q:
@999465_150_114 1:
AGTCTCAGGATACAAAATCAATGTACAAAAATCACAAGCATTCTTATACACCAATAACAGACAAACAGAGAGCCAAACCATGAATGAACTCCCATTCACAATTGCTTCAAAGAG
+
GJJJ?JAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJ>JJJJJJJJJJJJJGJJJJJJJJJJJJJJEJHJJJJHJJJJJJJJJHJJJJJCJCJJEJJJJJJJJJF
Read2_35q:
@999465_150_114 2:
CTCTTTGAAGCAATTGTGAATGGGAGTTCATTCATGGTTTGGCTCTCTGTTTGTCTGTTATTGGTGTATAAGAATGCTTGTGATTTTTGTACATTGATTTTGTATCCTGAGACT
+
GGJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJFJJJGJJJGJJJJJ?JJJJJJJJJGJCJJJJJJJIJJJ?J?JJJGJJBJJJGJJJJJJJ
Read1_20q:
@999465_150_114 1:
AGTCTCAGGACACAAAATCAATGTACAAAAATCACAAGCATTCTTTTACACCAATAACAGACAAACAGAGAGCCAAACCATGAATGAACTCCCATTCACAATTGCTTCAAAGAGGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGC
+
0...4689..17428;:=5:383<=?<755199;6;6367>509329;1<4.<553816619:2<0.36:.6663.;.75:6.5:7.:9..0665/25..:22.0:1.18=22.12.8799844.2.549/5.7..94/3...0../...
Read2_20q:
@999465_150_114 2:
CTCTTTGAAACAATTGTGAATGGGAGTTCATTCATGGTTTGGCTCTCTGTTTGTCTGTTATTGGTGTATAAGAATGCTTGTGATTTTTGTACATTGATTTTGTATCCTCAGACTATCTCGTATGCCGTCTTCTGCTTGCAACATTCACCA
+
-6--1--/;38<576-62<87<:6/6:3<--4831>3-;--32<;2-<57999.</6864<23.-67-007-84<858485.0..534/24.4/-16?:9-7:6;0;3/2-46/21732;0--2463-5/8:57403---.3-6------
Hi,
would it be possible to add a calculated %GC content on the base contents graphs near the base index (top right) ?
Thanks
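For clarity, the %GC figure requested above is simply the share of G and C among the called bases. A minimal sketch, with a hypothetical function name and with N bases excluded from the denominator (an assumption), is:

```python
from typing import Iterable


def gc_percent(seqs: Iterable[str]) -> float:
    """Percentage of G and C among all A/C/G/T bases (N is ignored)."""
    gc = total = 0
    for s in seqs:
        s = s.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(b) for b in "ACGT")
    return 100.0 * gc / total if total else 0.0


print(gc_percent(["GGCC", "AATT"]))
```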
Hi,
I love your tool and I have a question:
Currently if there is no disk space left, fastp continues running with no error. I think it should report error and exit.
I'd like to know how adapter trimming works? Both for single end and paired end data.
Hi,
Can you put the "evaluated" adapter sequence in the top summary part of the report ?
Thanks,
B.
FYI - Instructions:
brew install brewsci/bio/fastp
Hi,
I have a personal laptop and a work laptop. The personal one is a windows 10 hosting Ubuntu 16.04 on a Oracle virtual box (with anaconda and both python 2.7 and 3.6 installed) and the work laptop is a windows 7 hosting Ubuntu 16.04 on a VMware workstation (with anaconda and python 3.6 installed). I had no problem installing either AfterQC or fastp on my personal laptop.
However, when I was trying to install fastp (thinking that I don't want to install python 2.7 for AfterQC) on my work laptop through bioconda, it tells me that I have conflicts with other packages, as follows:
"Solving environment: failed
UnsatisfiableError: The following specifications were found to be in conflict:
When I was trying to install directly from cloning from github, it also failed, with the following error message:
" /usr/bin/ld: cannot find -lz
collect2: error: ld returned 1 exit status
Makefile:17: recipe for target 'fastp' failed
make: *** [fastp] Error 1 "
I realize that there must be some insufficiency within my work laptop but I don't know what it is, as I am pretty new to Linux system. I would really like to have it fixed because fastp is a really nicely made package for cleaning up and QC WGS reads. If you have any suggestions or useful tips for resolving this issue, please help me out.
Really appreciate your help. Thanks in advance.
Yun
How can one keep these overrepresented sequences? For example, I used this sample: "SRR3100237"
fastp -i SRA/out/SRR3100237_1.fastq.gz
-I SRA/out/SRR3100237_2.fastq.gz
-o FASTP/SRR3100237_1_trim.fastq.gz
-O FASTP/SRR3100237_2_trim.fastq.gz
-h Reports/SRR3100237 -R "SRR3100237" -l 36 -c -g -p -M 30 -w 6 -5 -3
looking forward!
Dear Developer,
I have some trouble with this application: I'm trying to filter my paired reads fastq files with fastp with two different sets of filters.
Here is a sample of my 2 fastq files (gunzip -c file_X.fastq.gz | tail -n 50):
ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 1:N:0:ATGCCT NTCTTTCTGACCCTCACTGAGAGCGACCTGAAGGAAATTGGCATCACGTGCGTCCAGAAGGGCCGCTCTGGCCCTCAGCCCGGGGTTGGGGCAAACTCCCA + #1=DFFFFGHFHHIJEIJJIEIJGIIIGIIJFHJICEICGGICIII@GFHIGCHGAHECHFBF=A@?BDDDDDDD8<CCCDD<@BB<>BBBB9?<<ACCD( @HWI-1KL150:70:C74KBACXX:1:1101:3880:1976 1:N:0:ATGCCT NACTTTCTGTTTTTCCTTTATAGCAAGCAACCCAGTGATAGCAGCCCAGCTCTGGTGAGTGTCCTTGAGCTCTAGAGCACAGCTCTCCTCTCTAAGNNNNN + #1=DFFFFHHHHHJJJJJJJJJJJJJJJIJJJJJJIIJJJIJJJJJJJJJJIJJJ@GGIDFFGHIIJEIJIJHGHAA>DFFFFECCCCCCDDDEDCCACC3 @HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 1:N:0:ATGCCT NACTCCGGGGCTGCTCTGGACCAGTTTCCATTCCCGTCTCCCCACCCTCACCATCCCTCAGGACATCACGAGTGGTTGCTTGGACCTGAGGTGGACATTCT + #1=DFFFFHHHHHJJJJJJGIJIJFHIGIIIJIJJJGIIIJJJJIJJIJJJHHHHHFFFEECEDEDDDC@?@<BBCDDDCDDBCDDCCCDD>BB<A(4:>( @HWI-1KL150:70:C74KBACXX:1:1101:4438:1971 1:N:0:ATGCCT NCCCAACCAATCAGCCCCAATTTACGATCTATGTAACTCACCAGTTCGATATGCCAATAACCTGGCCTGAACCATGCAGTGCCTTGCAATTTCCTGTGGCA + #1=DDFFFHHHHHJJJJJJJJJJJIJJJJJJJJIJDGIJJJIJJIJIJJIJIIIIIIIGGICHIGHHHGHCFFFE66;@ACCCC@CCCAAC@CDCCCC?B? 
@HWI-1KL150:70:C74KBACXX:1:1101:4539:1970 1:N:0:ATGCCT NGCAGGCCGCGGACGGAGAGCACGTGAGGGAAGGGGAAGCCGCTCCGGCCTGCGTAGGGGGGGGGGCGGGGCCCCCCGGGACACCCGGGAGGGGGGCGGGN + #4=DDDDDHHDHAGGIDG:;=<FF8C@DDE4?CDEC6>?=;983;?8:&57?85)+0<()5-&&)))0-)&)0&))&&))&&((())&)0&05&0&&&05< @HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 1:N:0:ATGCCT NGGGGTCCTCTGCGGCCAGGGCAGCGCTGCTCAGCATGATGAAGACAAGGATGAGGTTGGTGAAGATGTGGTGGTTGATGAGCTTGTGGGAGCCTACGCGN + #4=DDDFFHHHHHJJJJIHHIJIIJJJJJJJJIIIIJJJJJJJJJJJJHICHHEHFCDFD;?AAC@CCACD(8?',5(4:4>ACC+++(&2?&8((+&&)5 @HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 1:N:0:ATGCCT NGTCCTCCCACCAGCCGGGCACTACTTACATGACGATGAGAGCAGCGTCTCGGGAGTAATCCAGCACAATCTCCTTCAGCCTCACCTGCCGAAGGGCCTGN + #1:BDFFFHHHHGIIIIGFBGIBEH@GECHCHGGGDFHIIIHICEGHGEHIHEH6?5;@3(>(.-(;(55>AC((,,55>?<<C??:9<A9&5)5<(2+(2 @HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 1:N:0:ATGCCT NATCCCGTTGGCTTTCCAGGAGGCTCTGCAGCATCTGCAGGGTCCTGGGGTCCTGGTAAGGGGCTTCCAGGAGTGGAGAAGGGGGGCAGTGAGGTTGGGCC + #1=DFFFDHHHHHJJJJJJJJJJJJIJIJJIFIIIJJJJIJJGIHIIIIJGGIJJJAHIIJHHHFFFEDCE;@?B;5<ABDBBDDB@BB3@ACD?BD>B?( @HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 1:N:0:ATGCCT NGTGGCAATTCTCTTCAGTAGGTTGGCCAAGTCAGCAGACACGGTGCTGGTCTTATAGCTGTCAAATTCAGGAAGGGTCTTGGGCTTAAAATACTCAAACA + #1=DDFFDFHHHHJJJJJHIJJHJJJJJJJJIJJJJJJJJJJJJGHIJJJFGIIIJJJIJIHIJICGGCGGEFHH;B;CC@CDDDDDDCCCEDCDCD:CC5 @HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 1:N:0:ATGCCT NTTTTTCACTCCATTGTTGTTGTTTACCCAGTTTATGGGGGTTGTAATGTTTATCACACTCCTTGGATGATTTCCGAAGGTAAGATATCTGGAATGGTTTT + #4=DDFFFHHHHHGHHFHIFHIGIIIIJJJJBHGIIIIGGI?FGFEHJGGAHIJJJGIGGHHHHHFFFFDCC@CE;3;>@:@>C>ACA@CDCD<AC>ACA< @HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 1:N:0:ATGCCT NGCTGTTTTACAAGTTGGTAGTTTTCTCTTCTTGGCATGGTGAACGTGCCCTAAAGGCCTGATGTCAGGCTCCATCCTCCATGTTAAAATAGTGAGTTCTT + #1=DFDFFHHHHHJJIJJCFHCFIJEHGIJIJJGHEGIIJFEGHDGEIIJHCGIJEGIJGEGHIGIIHIG<AECA7?DDFFFECC(>CC@CC>CCC>C@D: @HWI-1KL150:70:C74KBACXX:1:1101:9989:1970 1:N:0:ATGCCT NTTGTCAACTTTGCTTTTGCTCATGTTGTAATGTTTGGCAATATATGACACATCCACTTGTTTATCGAATCCCTGTCAAAAAGAAGAACAGCAAAAACATN + 
#1:B=DDDFFFHDGIIIIIGGIEGHBH@9<9FFFHGIIG>FGEGIGDDHG<:9?D@D8?>?FHGGGGGBAG)@;77;4?A;(9?@DFEEA>55=>=;'((, @HWI-1KL150:70:C74KBACXX:1:1101:10460:1982 1:N:0:ATGCCT NCCTATGCAACCTCAGTGTCCACTGAGAAGGGAATCTTGTGGTATGGAACAATGTGGCAAAAAGGTACAAAGTATTCTTACACCTGGAATTCTTAACCTGN
ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 2:N:0:ATGCCT ACAGCCTGCGGGGGGAATGTGACCAGGATATGCCTCAGCGTCCCAAGAGCGCTTACATGAGTGGGAGTTTGCCCCAACCCCGGGCTGAGGGCCAGAGCGGC + @CCFFFFFHGHHGID9@BCDEDDDDDBBCCC@@C@CDDABDDDBDDDD?90:BDDCDECDD@A?BDD<CCDCDABB@BDDDDBB>9>BDDDDD<B?<CBD9 @HWI-1KL150:70:C74KBACXX:1:1101:3880:1976 2:N:0:ATGCCT AAAGGGGAAAAAAATTACCAGATGACACACTTCCTGATTTCACTGTAGTAAGGAAAAAGTCAACATTGCAAATAAATACGATCCTTAGAGAGGAGAGCTGT + CCCFFFFFHHHHHJJJJJJJJIJIJJJJJJJJJJJJGIJJJJJJJGHGFHGFIIGJJJJFEHHHGFFFFDF@CCEEEEDDBDBDDCCCDDDDDBBB??CD+ @HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 2:N:0:ATGCCT ACACTAGCCACTCACGTTCCATCTCTTCCTCGGAGAAATCCTCAGGCCCAGCCAAGGGCAGGAGCAAAAAGGGGAGAATGTCCACCTCAGGTCCAAGCAAC + CCCFFFFFHHHHHJJJHIIJJJIJJJJJJIJJIJJJJJJIJJJJJJIIJJFHIJIJJJIIHHGFFFFDDDDDDD@BBDDDDDDDDDDDDDD@CDD>CBBDA @HWI-1KL150:70:C74KBACXX:1:1101:4438:1971 2:N:0:ATGCCT GTTTGGAGAACCTGTGTGAAAATCCATACTTTAGCAATCTAAGGCAAAACATGAAAGACCTTATCCTACTTTTGGCCACAGTAGCTTCCAGTGTGCCGAAC + CCCFFFFFHHHHHJJJHIJJJJJJJIJJJIJJIJIJJJJJIIIIJJIJIIDDEIGGHJHGGGIIIGGDGHIJJJHHHHFFFCDECCCEECDCCDCCCD??3 @HWI-1KL150:70:C74KBACXX:1:1101:4539:1970 2:N:0:ATGCCT GGCCTCGTGCGCTCGGGCCCGCACGCCGTTGTTCGCGTCACCCCCACCCAGCTCCCTTCCGCGTGTGCTCGGAGGGCGCGGCGCACCGCCTACGCAGGCCN + CCCFFFFFHHHHHJJJJIJJJJJJJJJJIJHEHHFFDDBDDDDDDDDDDDDDDDDDDDDDDDDBDBDDDCDD;BBDDDDDDBD@BDDDD<<>CDDBBDDD> @HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 2:N:0:ATGCCT CAAGCAGCGGCTTTTCCCTGCAGGATCCGCGTAGGCTGCCACAAGCTCATCAACCACCACATCTTCACCAACCTCATCCTTGTCTTCATCATGCTGAGCAN + CCCFFFFFHHHHHJJJIIIIJIJJJIJJIJJEHEIIIIDGHIGIAEHHHH?@DEFFDDDDDCDCDEDDD>B@BCDCCDDACDCDDDDEEE@CDDCCCBDD3 @HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 2:N:0:ATGCCT AGACAGGAGACTCTATAAGAATTTATGAGGCAGCAGAGTCTACAAGTAAATCATGAATCCAGTTGAAAATGTTAATGAGGCCATAGACGTGGTGAAGGATT + @C@DFFFFHHHGHJJJJIIIHIIJJJJIGIJJJJIIIJGHIJIIIJIIIHDGGIJJIIJHDGEHGIIGCGGGGECEAEHHHFFFDDCECABB?@DCCC?A3 @HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 2:N:0:ATGCCT GTGTGCAGCGGAGCCCTGCACGGGAGACAGGTCTGTCTTCTGCCAGATGGAAGTGCTCGATCGCTACTGCTCCATTCCCGGCTACCACCGGCTCTGCTGTN + 
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJDHIIFHJJIIJIJJJJIIJEHHAEEHFFFECCBDDDCCDDDDDDFEEDDDDDDDDDDDDDDDDDCDDDA9 @HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 2:N:0:ATGCCT GGTGGATCTTATATGGGAGGATGCACTGTTCATGTTTGAGTATTTTAAGCCCAAGACCCTTCCTGAATTTGACAGCTATAAGACCAGCACCGTGTCTGCTN + BC@FFFFFHHHGHJJJJIIJHIJJJJJJIJJJJJJJJJJJDHIJJJJJJJGHJJJJJIJIJIJJJJJJJJJIJGHHGHFFFFFFEEEDEDDDDDBBDDDD: @HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 2:N:0:ATGCCT GAAAATAATTCACAAATAGTGTTACAGCTCCATCCACTGAAAATTGTCATAAAAGACATTTTTTCAATGAGTTCATTTTTAGAGAAACCATTCCAGATATC + @CCFFFFFHGHHHJJGHIJCJJIJJJJJJIHGIJJIJJIGIIIIIJIIHHDGGGJJIGIIJJJJGIEHIGJIHIJIJCHEECB@;?BCACECDCDCCCCD- @HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 2:N:0:ATGCCT CACTTTGAAAACTAGAAATCATTACACAAAGTTAAGAACTCACTATTTTAACATGGAGGATGGAGCCTGACATCAGGCCTTTAGGGCACGTTCACCATGCC + CCCFFFFFHHHHHJJIJJJJJJJJJIJJJJJJJIIJJIJJIJIIIJJIJJHGCHIHGIIIJJGHIJJJJJJIJJGGHGHFFFFFDEEDDDDDDDDDDDDDA @HWI-1KL150:70:C74KBACXX:1:1101:9989:1970 2:N:0:ATGCCT AAGAACAAGTTTCTGTACATCTCATTATCATTCTGCCTGTTCACTTGCCTCATGTTTTTGCTGTTCTTCTTTTTGACAGGGATTCGATAAACAAGTGGATN + @@@DDEEDHDHHHEFCEH?FHIIIIHIIIIIGGIIIIIIFFIGGHIECEHBDHGBGCHIIIIIIIIIIIIHGGGE;CCHGHHCFFF@DCA6>CBC@CCCC> @HWI-1KL150:70:C74KBACXX:1:1101:10460:1982 2:N:0:ATGCCT TACATAGGAAGAAAATGCCAATCAAAAATGAAAGTCAGTTAAAACCACTTGAAAGCAATGTCTGTTCCTTTTTAGAATGGAAAGTTGGAGGAAACTTCAGC
As you can see, the files seem to be well formatted.
I'm applying 2 different sets of filter:
`fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.html -W 20 -M 30 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_1.fastq.gz -O fastq_filtered/ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_2.fastq.gz
fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.html -q 30 -u 50 -n 5 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I input_files/ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_1.fastq.gz -O ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_2.fastq.gz`
here is the log file:
`Read1 before filtering:
total reads: 69722014
total bases: 7041923414
Q20 bases: 6419265369(91.1578%)
Q30 bases: 5704731454(81.011%)
Read1 after filtering:
total reads: 67489520
total bases: 6816441520
Q20 bases: 6312681343(92.6096%)
Q30 bases: 5624025162(82.5068%)
Read2 before filtering:
total reads: 69722014
total bases: 4505725297
Q20 bases: 4505725297(100%)
Q30 bases: 4505725297(100%)
Read2 aftering filtering:
total reads: 67489520
total bases: 4360073359
Q20 bases: 4360073359(100%)
Q30 bases: 4360073359(100%)
Filtering result:
reads passed filter: 134979040
reads failed due to low quality: 4110650
reads failed due to too many N: 354338
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
JSON report: ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.json
HTML report: ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.html
fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.html -W 20 -M 30 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_1.fastq.gz -O ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_2.fastq.gz
fastp v0.6.0, time used: 910 seconds
Read1 before filtering:
total reads: 69722014
total bases: 7041923414
Q20 bases: 6419265369(91.1578%)
Q30 bases: 5704731454(81.011%)
Read1 after filtering:
total reads: 63210865
total bases: 6384297365
Q20 bases: 6031554512(94.4748%)
Q30 bases: 5446376816(85.3089%)
Read2 before filtering:
total reads: 69722014
total bases: 4505725297
Q20 bases: 4505725297(100%)
Q30 bases: 4505725297(100%)
Read2 aftering filtering:
total reads: 63210865
total bases: 4083554985
Q20 bases: 4083554985(100%)
Q30 bases: 4083554985(100%)
Filtering result:
reads passed filter: 126421730
reads failed due to low quality: 12684916
reads failed due to too many N: 337382
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
JSON report: ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.json
HTML report: ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.html
fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.html -q 30 -u 50 -n 5 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_1.fastq.gz -O ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_2.fastq.gz
fastp v0.6.0, time used: 918 seconds`
The result files look like this:
ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_1.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 1:N:0:ATGCCT
NTCTTTCTGACCCTCACTGAGAGCGACCTGAAGGAAATTGGCATCACGTGCGTCCAGAAGGGCCGCTCTGGCCCTCAGCCCGGGGTTGGGGCAAACTCCCA
+
#1=DFFFFGHFHHIJEIJJIEIJGIIIGIIJFHJICEICGGICIII@GFHIGCHGAHECHFBF=A@?BDDDDDDD8<CCCDD<@BB<>BBBB9?<<ACCD(
@HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 1:N:0:ATGCCT
NACTCCGGGGCTGCTCTGGACCAGTTTCCATTCCCGTCTCCCCACCCTCACCATCCCTCAGGACATCACGAGTGGTTGCTTGGACCTGAGGTGGACATTCT
+
#1=DFFFFHHHHHJJJJJJGIJIJFHIGIIIJIJJJGIIIJJJJIJJIJJJHHHHHFFFEECEDEDDDC@?@<BBCDDDCDDBCDDCCCDD>BB<A(4:>(
@HWI-1KL150:70:C74KBACXX:1:1101:4438:1971 1:N:0:ATGCCT
NCCCAACCAATCAGCCCCAATTTACGATCTATGTAACTCACCAGTTCGATATGCCAATAACCTGGCCTGAACCATGCAGTGCCTTGCAATTTCCTGTGGCA
+
#1=DDFFFHHHHHJJJJJJJJJJJIJJJJJJJJIJDGIJJJIJJIJIJJIJIIIIIIIGGICHIGHHHGHCFFFE66;@ACCCC@CCCAAC@CDCCCC?B?
@HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 1:N:0:ATGCCT
NGGGGTCCTCTGCGGCCAGGGCAGCGCTGCTCAGCATGATGAAGACAAGGATGAGGTTGGTGAAGATGTGGTGGTTGATGAGCTTGTGGGAGCCTACGCGN
+
#4=DDDFFHHHHHJJJJIHHIJIIJJJJJJJJIIIIJJJJJJJJJJJJHICHHEHFCDFD;?AAC@CCACD(8?',5(4:4>ACC+++(&2?&8((+&&)5
@HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 1:N:0:ATGCCT
NGTCCTCCCACCAGCCGGGCACTACTTACATGACGATGAGAGCAGCGTCTCGGGAGTAATCCAGCACAATCTCCTTCAGCCTCACCTGCCGAAGGGCCTGN
+
#1:BDFFFHHHHGIIIIGFBGIBEH@GECHCHGGGDFHIIIHICEGHGEHIHEH6?5;@3(>(.-(;(55>AC((,,55>?<<C??:9<A9&5)5<(2+(2
@HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 1:N:0:ATGCCT
NATCCCGTTGGCTTTCCAGGAGGCTCTGCAGCATCTGCAGGGTCCTGGGGTCCTGGTAAGGGGCTTCCAGGAGTGGAGAAGGGGGGCAGTGAGGTTGGGCC
+
#1=DFFFDHHHHHJJJJJJJJJJJJIJIJJIFIIIJJJJIJJGIHIIIIJGGIJJJAHIIJHHHFFFEDCE;@?B;5<ABDBBDDB@BB3@ACD?BD>B?(
@HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 1:N:0:ATGCCT
NGTGGCAATTCTCTTCAGTAGGTTGGCCAAGTCAGCAGACACGGTGCTGGTCTTATAGCTGTCAAATTCAGGAAGGGTCTTGGGCTTAAAATACTCAAACA
+
#1=DDFFDFHHHHJJJJJHIJJHJJJJJJJJIJJJJJJJJJJJJGHIJJJFGIIIJJJIJIHIJICGGCGGEFHH;B;CC@CDDDDDDCCCEDCDCD:CC5
@HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 1:N:0:ATGCCT
NTTTTTCACTCCATTGTTGTTGTTTACCCAGTTTATGGGGGTTGTAATGTTTATCACACTCCTTGGATGATTTCCGAAGGTAAGATATCTGGAATGGTTTT
+
#4=DDFFFHHHHHGHHFHIFHIGIIIIJJJJBHGIIIIGGI?FGFEHJGGAHIJJJGIGGHHHHHFFFFDCC@CE;3;>@:@>C>ACA@CDCD<AC>ACA<
@HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 1:N:0:ATGCCT
NGCTGTTTTACAAGTTGGTAGTTTTCTCTTCTTGGCATGGTGAACGTGCCCTAAAGGCCTGATGTCAGGCTCCATCCTCCATGTTAAAATAGTGAGTTCTT
+
#1=DFDFFHHHHHJJIJJCFHCFIJEHGIJIJJGHEGIIJFEGHDGEIIJHCGIJEGIJGEGHIGIIHIG<AECA7?DDFFFECC(>CC@CC>CCC>C@D:
@HWI-1KL150:70:C74KBACXX:1:1101:9989:1970 1:N:0:ATGCCT
NTTGTCAACTTTGCTTTTGCTCATGTTGTAATGTTTGGCAATATATGACACATCCACTTGTTTATCGAATCCCTGTCAAAAAGAAGAACAGCAAAAACATN
+
#1:B=DDDFFFHDGIIIIIGGIEGHBH@9<9FFFHGIIG>FGEGIGDDHG<:9?D@D8?>?FHGGGGGBAG)@;77;4?A;(9?@DFEEA>55=>=;'((,
@HWI-1KL150:70:C74KBACXX:1:1101:10460:1982 1:N:0:ATGCCT
NCCTATGCAACCTCAGTGTCCACTGAGAAGGGAATCTTGTGGTATGGAACAATGTGGCAAAAAGGTACAAAGTATTCTTACACCTGGAATTCTTAACCTGN
+
#4BDFFFFHHHHHJJJIJJJJJJJJGJIJJJIFHGIJJJHIIFHIGIGJJGIIIIGGJJJJJGGG)=;CDHE=AC7ADEFFFFDDEE<CCA@DED;C@@BD
@HWI-1KL150:70:C74KBACXX:1:1101:11860:1969 1:N:0:ATGCCT
NACCTTGTCCTTGGCACTGCGGCAGCCTTGCAGGCTGGCAAGGATCTGGGCCTGCACACTCTGAACCCACAGCTCCCGCTCCTCCGCCGTTGAAGCCTCNN
+
#1=DDFFFHHHHHJJIJJJJJJJIJIJIJJIJJJIJJJEFHGI=CFGEGF2CCACEHHGBFDECAABB?@?ABC>58?BDDBACABBD>99?2@A:<A<0)
@HWI-1KL150:70:C74KBACXX:1:1101:12222:1966 1:N:0:ATGCCT
NAGCTTAAACAGTGGGTTTTTCAATGTCTCTCTTTAGGATTTTTGCTGGGTAAAAGCCTGTTTTACGCGTGGAATGCACACCTCCGGCCAACGGAGACTCC
ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_2.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 2:N:0:ATGCCT ACAGCCTGCGGGGGGAATGTGACCAGGATATGCCTCAGCGTCCCAAGAGCGCTTACATGAGTGGGAGTTTGCCCCAACCCCGGGCTGAGGGCCAGAGCGGC + KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJJJJJJJIJIJJJJJJJJJJJJGIJJJJJJJGHGFHGFIIGJJJJFEHHHGFFFFDF@CCEEEEDDBDBDDCCCDDDDDBBB??CD+ @HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 2:N:0:ATGCCT KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK ACACTAGCCACTCACGTTCCATCTCTTCCTCGGAGAAATCCTCAGGCCCAGCCAAGGGCAGGAGCAAAAAGGGGAGAATGTCCACCTCAGGTCCAAGCAAC + CCCFFFFFHHHHHJJJHIIJJJIJJJJJJIJJIJJJJJJIJJJJJJIIJJFHIJIJJJIIHHGFFFFDDDDDDD@BBDDDDDDDDDDDDDD@CDD>CBBDA K CCCFFFFFHHHHHJJJHIJJJJJJJIJJJIJJIJIJJJJJIIIIJJIJIIDDEIGGHJHGGGIIIGGDGHIJJJHHHHFFFCDECCCEECDCCDCCCD??3 @HWI-1KL150:70:C74KBACXX:1:1101:4539:1970 2:N:0:ATGCCT GGCCTCGTGCGCTCGGGCCCGCACGCCGTTGTTCGCGTCACCCCCACCCAGCTCCCTTCCGCGTGTGCTCGGAGGGCGCGGCGCACCGCCTACGCAGGCCN KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJJJIJJJJJJJJJJIJHEHHFFDDBDDDDDDDDDDDDDDDDDDDDDDDDBDBDDDCDD;BBDDDDDDBD@BDDDD<<>CDDBBDDD> @HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 2:N:0:ATGCCT KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK CAAGCAGCGGCTTTTCCCTGCAGGATCCGCGTAGGCTGCCACAAGCTCATCAACCACCACATCTTCACCAACCTCATCCTTGTCTTCATCATGCTGAGCAN + CCCFFFFFHHHHHJJJIIIIJIJJJIJJIJJEHEIIIIDGHIGIAEHHHH?@DEFFDDDDDCDCDEDDD>B@BCDCCDDACDCDDDDEEE@CDDCCCBDD3 K @HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 2:N:0:ATGCCT AGACAGGAGACTCTATAAGAATTTATGAGGCAGCAGAGTCTACAAGTAAATCATGAATCCAGTTGAAAATGTTAATGAGGCCATAGACGTGGTGAAGGATT + KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK @C@DFFFFHHHGHJJJJIIIHIIJJJJIGIJJJJIIIJGHIJIIIJIIIHDGGIJJIIJHDGEHGIIGCGGGGECEAEHHHFFFDDCECABB?@DCCC?A3 @HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 2:N:0:ATGCCT 
GTGTGCAGCGGAGCCCTGCACGGGAGACAGGTCTGTCTTCTGCCAGATGGAAGTGCTCGATCGCTACTGCTCCATTCCCGGCTACCACCGGCTCTGCTGTN KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJDHIIFHJJIIJIJJJJIIJEHHAEEHFFFECCBDDDCCDDDDDDFEEDDDDDDDDDDDDDDDDDCDDDA9 @HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 2:N:0:ATGCCT KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK GGTGGATCTTATATGGGAGGATGCACTGTTCATGTTTGAGTATTTTAAGCCCAAGACCCTTCCTGAATTTGACAGCTATAAGACCAGCACCGTGTCTGCTN + BC@FFFFFHHHGHJJJJIIJHIJJJJJJIJJJJJJJJJJJDHIJJJJJJJGHJJJJJIJIJIJJJJJJJJJIJGHHGHFFFFFFEEEDEDDDDDBBDDDD: K @HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 2:N:0:ATGCCT GAAAATAATTCACAAATAGTGTTACAGCTCCATCCACTGAAAATTGTCATAAAAGACATTTTTTCAATGAGTTCATTTTTAGAGAAACCATTCCAGATATC + KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK @CCFFFFFHGHHHJJGHIJCJJIJJJJJJIHGIJJIJJIGIIIIIJIIHHDGGGJJIGIIJJJJGIEHIGJIHIJIJCHEECB@;?BCACECDCDCCCCD- @HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 2:N:0:ATGCCT CACTTTGAAAACTAGAAATCATTACACAAAGTTAAGAACTCACTATTTTAACATGGAGGATGGAGCCTGACATCAGGCCTTTAGGGCACGTTCACCATGCC KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJIJJJJJJJJJIJJJJJJJIIJJIJJIJIIIJJIJJHGCHIHGIIIJJGHIJJJJJJIJJGGHGHFFFFFDEEDDDDDDDDDDDDDA
The other files look similar (file_1 is normal, file_2 has some extra "KKKKKK" lines...).
When I try to align the data to the genome with BWA MEM, only 3 sequences seem to be well formatted in my file.
This is probably due to the non-canonical format of my FASTQ reads, with the extra lines.
Do you have any idea why this doesn't work?
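For anyone who wants to confirm this kind of corruption before running an aligner, here is a minimal sketch of a structural check. It is not part of fastp; the function name `validate_fastq` and its `max_problems` parameter are made up for illustration. It only assumes the usual 4-line FASTQ layout (header starting with "@", sequence, separator starting with "+", quality string of the same length as the sequence).

```python
import gzip
import sys
from itertools import islice


def validate_fastq(lines, max_problems=5):
    """Check that `lines` form 4-line FASTQ records.

    Returns a list of human-readable problem strings (empty if none found).
    Hypothetical helper for illustration; not part of fastp.
    """
    problems = []
    it = iter(lines)
    record_no = 0
    while len(problems) < max_problems:
        # Take the next four lines as one candidate record.
        chunk = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not chunk:
            break  # clean end of input
        record_no += 1
        if len(chunk) < 4:
            problems.append(
                f"record {record_no}: truncated ({len(chunk)} of 4 lines)"
            )
            break
        header, seq, plus, qual = chunk
        if not header.startswith("@"):
            problems.append(f"record {record_no}: header does not start with '@'")
        elif not plus.startswith("+"):
            problems.append(f"record {record_no}: third line does not start with '+'")
        elif len(seq) != len(qual):
            problems.append(
                f"record {record_no}: sequence length {len(seq)} "
                f"!= quality length {len(qual)}"
            )
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_fastq.py some_file.fastq.gz
    with gzip.open(sys.argv[1], "rt") as handle:
        for problem in validate_fastq(handle):
            print(problem)
```

Because records are read in fixed groups of four lines, one extra line shifts everything after it, so only the first reported problem is a reliable pointer to where the file goes wrong.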
Hi sfchen,
I want to know why the "-3" option reduces the number of reads that pass the filters.
fastp -i SRR1770413_1.fastq -I SRR1770413_2.fastq -q 20 -u 20 -o out.SRR1770413_1.fastq -O out.SRR1770413_2.fastq
fastp -i SRR1770413_1.fastq -I SRR1770413_2.fastq -q 20 -u 20 -3 -o out.SRR1770413_1.fastq -O out.SRR1770413_2.fastq
Can I run different steps in one single call? E.g. quality, adaptor, poly G and "global" trimming? If yes, what is the order of execution? Or do I have to run the distinct trimming step each on its own?
This is a great tool. Adding the pre- and post-trimming average read length would be super helpful. It is better to get all the information in one pass through the read file(s) than in two.
Thanks!
Hello,
Thanks for the program. I'm currently working with NovaSeq data and would like to try out the polyG trimming. After trimming, it looks like fastp still retains reads with 8 or fewer Gs at the ends of reads. Is that a default set by fastp, and what is the reason for doing so? Is there any way I can change the number of Gs that fastp lets through its filter?
Cheers,
Mun
Could you please add a -v or --version argument to fastp that outputs the full version of the program? Currently, there is no way to know which version of the program is being run. Thanks!
Hi,
I'm trying to reconcile the on-screen results printed when fastp finishes with the results summary in the generated report file.
The numbers don't add up...
Thanks for the help.
Read1 before filtering:
total reads: 4000000
total bases: 400000000
Q20 bases: 392516035(98.129%)
Q30 bases: 376358571(94.0896%)
Read1 after filtering:
total reads: 3102308
total bases: 309754624
Q20 bases: 306341522(98.8981%)
Q30 bases: 294745910(95.1546%)
Read2 before filtering:
total reads: 4000000
total bases: 400000000
Q20 bases: 387859490(96.9649%)
Q30 bases: 373512577(93.3781%)
Read2 aftering filtering:
total reads: 3102308
total bases: 309754624
Q20 bases: 305723840(98.6987%)
Q30 bases: 295222362(95.3085%)
Filtering result:
reads passed filter: 6204616
reads failed due to low quality: 235470
reads failed due to too many N: 1559914
reads failed due to too short: 0
reads with adapter trimmed: 1355796
bases trimmed due to adapters: 17275112
fastp report
Summary
General
fastp version: 0.7.0
sequencing: paired end (100 cycles + 100 cycles)
Before filtering
total reads: 7.629395 M
total bases: 762.939453 M
Q20 bases: 744.224095 M (97.546941%)
Q30 bases: 715.132854 M (93.733893%)
After filtering
total reads: 5.917183 M
total bases: 590.810059 M
Q20 bases: 583.711016 M (98.798422%)
Q30 bases: 562.637589 M (95.231552%)
Filtering result
reads passed filters: 5.917183 M (77.557700%)
reads with low quality: 229.951172 K (2.943375%)
reads with too many N: 1.487650 M (19.498925%)
reads too short: 0 (0.000000%)
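One possible reading of the mismatch, which nothing in this thread confirms: the report figures are consistent with the console counts summed over read1 and read2 and then divided by 1024-based units (K = 1024, M = 1024 × 1024) rather than by 1000 and 1,000,000. The sketch below just redoes that arithmetic on the numbers pasted above; `binary_units` is a made-up helper, not fastp's code.

```python
# Console counts copied from the on-screen output above.
READ1_BEFORE = 4000000
READ2_BEFORE = 4000000
PASSED = 6204616
LOW_QUALITY = 235470
TOO_MANY_N = 1559914


def binary_units(n):
    """Hypothetical helper: format n assuming K = 1024 and M = 1024 * 1024."""
    if n >= 1024 * 1024:
        return f"{n / (1024 * 1024):.6f} M"
    if n >= 1024:
        return f"{n / 1024:.6f} K"
    return str(n)


total_before = READ1_BEFORE + READ2_BEFORE

# The three filtering outcomes account for every input read.
print("sum of outcomes:", PASSED + LOW_QUALITY + TOO_MANY_N, "of", total_before)

# Compare with the report: 7.629395 M, 5.917183 M, 229.951172 K, 1.487650 M.
for label, count in [
    ("total reads before", total_before),
    ("reads passed filters", PASSED),
    ("reads with low quality", LOW_QUALITY),
    ("reads with too many N", TOO_MANY_N),
]:
    share = 100.0 * count / total_before
    print(f"{label}: {binary_units(count)} ({share:.6f}%)")
```

Under that assumption the two summaries describe the same counts; only the unit convention differs, and the percentages in the report match the raw console numbers directly.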