opengene / fastp

An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)

License: MIT License

Makefile 0.20% C++ 59.54% C 40.26%

fastq qc preprocessing filtering adapter overlap quality trimming splitting quality-control

fastp's Issues

requirement: support STDIN / STDOUT

read2 -O "is a folder, not a file, quit now"

Hi all,

Trying to run fastp on a PE150 sample.

Here is the exact line I'm running:
fastp -i Emx1_1_11_CTRL_USPD16084012-4_HHG33BBXX_L6_1.fq.gz -I Emx1_1_11_CTRL_USPD16084012-4_HHG33BBXX_L6_2.fq.gz -o r1.fq.gz -O r2.fq.gz

Here is the error I get:
ERROR: 'r2.fq.gz' is a folder, not a file, quit now

I have no problems running fastp on either of these fastq files in single-end mode. Also tried using --out2 instead of -O, I get the same result.

Any idea how I can get this to run?

Best,
David

Add verbose output option (-V)

When I run fastp there is no output until the end.

I would like to see what is happening.
Would it be possible to add some progress messages?
-V / --verbose

Thanks

ld returned 1 exit status while make

fastqreader.cpp:32: undefined reference to `gzoffset'
./obj/peprocessor.o: In function `PairEndProcessor::initOutput()':
peprocessor.cpp:35: undefined reference to `gzbuffer'
peprocessor.cpp:38: undefined reference to `gzbuffer'
collect2: error: ld returned 1 exit status
make: *** [fastp] Error 1

JSON and HTML reports are not generated if a directory is given to --json and --html options

It's not really a bug but JSON and HTML reports are not generated if a directory is given to --json and --html options.

I wanted to generate the JSON and HTML reports in a specific directory with their default names.
I just gave a directory to the --json and --html options, but the reports were not generated.

Maybe it's possible to add an --output_directory option (current directory by default), so that we only have to change the file names with the --out1, --out2, --json and --html options. Just an idea 🙂 !
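
Until such an option exists, a small wrapper script can emulate it. The Python sketch below is only an illustration: `build_fastp_command` and the file naming scheme are hypothetical, and the only fastp options used are the ones already mentioned in these issues (`-i`, `-I`, `--out1`, `--out2`, `--json`, `--html`).

```python
from pathlib import Path


def build_fastp_command(in1, in2, out_dir, sample):
    """Return a fastp argument list whose outputs all live in out_dir.

    Hypothetical helper: the naming scheme is an assumption, only the
    command-line options come from the issues above.
    """
    out = Path(out_dir)
    return [
        "fastp",
        "-i", str(in1),
        "-I", str(in2),
        "--out1", str(out / f"{sample}_R1.fq.gz"),
        "--out2", str(out / f"{sample}_R2.fq.gz"),
        "--json", str(out / f"{sample}.json"),
        "--html", str(out / f"{sample}.html"),
    ]


cmd = build_fastp_command("test_1.fq", "test_2.fq", "results", "test")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists.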

'out.R1.fq.gz' is not a writable file, quit now

When I use version 0.13.0 and run the following command

fastp -z 1 -i test_1.fq -I test_2.fq --out1 out.R1.fq.gz --out2 out.R2.fq.gz

I get the error "'out.R1.fq.gz' is not a writable file, quit now".

If I first create the file with touch out.R1.fq.gz, the problem goes away.

Stitch together overlapping reads?

When the fragments in a DNA library are very short, most read pairs are likely to overlap.

Can fastp stitch these reads together (instead of just correcting errors) ?

So input R1, R2 would produce output R1, R2 and SR (stitched, longer single end reads)
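
To make the request concrete, here is a minimal Python sketch of what "stitching" a pair could look like. It is not fastp's algorithm: the function names are made up for illustration, the overlap must match exactly, and base qualities are ignored, whereas a real merger would tolerate mismatches and use the qualities to resolve them.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def stitch(r1, r2, min_overlap=10):
    """Merge a read pair into one sequence, or return None if no overlap.

    Simplified: only exact overlaps between the 3' end of r1 and the
    reverse complement of r2 are accepted.
    """
    r2_rc = reverse_complement(r2)
    # Try the longest overlap first: suffix of r1 == prefix of r2_rc.
    for length in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-length:] == r2_rc[:length]:
            return r1 + r2_rc[length:]
    return None
```

Pairs for which `stitch` returns `None` would stay in the regular R1/R2 output, and merged pairs would go to the SR file.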

Content of rRNA

Hi, I want to know the rRNA content of my sample. Can fastp calculate it?

How to remove adapters cleanly?

Hi sfchen,
Today I compared three tools (fastp, cutadapt, trimmomatic) and found that fastp is very fast but does not remove the adapters completely.
I have uploaded my results and hope you find them useful!
compare_software.xlsx

sfchen commented:
Thanks for the result, from the data, I can see:

for short adapters (7bp), Trimmomatic removes the most adapter, then fastp removes less, and Cutadapt removes the least.
for longer adapters (>7bp), fastp removes much more than Trimmomatic and Cutadapt.
in total, fastp removes the most adapters.

Am I right?

Reply:

Hi sfchen,
The real adapter sequence of my sample is GATCGGAAGAGCACACGTCTGAACTCCAGTCAC********ATCTCGTATGCCGTCTTCTGCTTG, where '*' marks the 8bp barcode.
The trimmomatic adapter file is:
adapter.list.xlsx

After getting the clean fastq data, I split the adapter sequence into short substrings, such as:
7bp (AGATCGG)
8bp (AGATCGGA)
9bp (AGATCGGAA)
10bp (AGATCGGAAG)
11bp (AGATCGGAAGA)
12bp (AGATCGGAAGAG)
13bp (AGATCGGAAGAGC)

and then counted the reads that contain each of these adapter substrings.
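
The counting step described above can be sketched in a few lines of Python. This is an assumed reconstruction of the procedure, not the poster's actual script; `count_reads_with_substrings` is a hypothetical name, and reads are plain sequence strings.

```python
def count_reads_with_substrings(reads, substrings):
    """Count, for each substring, how many reads contain it at least once."""
    counts = {s: 0 for s in substrings}
    for read in reads:
        for s in substrings:
            if s in read:
                counts[s] += 1
    return counts


# Adapter prefixes of increasing length (7bp to 13bp), as listed above.
ADAPTER = "AGATCGGAAGAGC"
PREFIXES = [ADAPTER[:n] for n in range(7, 14)]
```

Keep in mind that a 7bp substring can also occur in genuine insert sequence by chance, so counts for the shortest prefixes overestimate the number of reads with leftover adapter.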

HTML report integration into MultiQC

One of the great things about FastQC is that MultiQC can be used to integrate all the quality control data into a single useful HTML.

Is this sort of integration available for Fastp or will it be implemented in the future?

overtrimming of reads

I am benchmarking fastp against other read trimmers using the workflow I developed for the Atropos paper (https://github.com/jdidion/atropos/tree/master/paper/workflow). I find that fastp has a high rate of read overtrimming. Example fastq input and output are attached. The command I used is:

fastp
-i {fastq1} -I {fastq2} -o {prefix}.1.fq.gz -O {prefix}.2.fq.gz
--adapter_sequence {adapter1} --adapter_sequence_r2 {adapter2}
--thread {threads} --length_required 25 --disable_quality_filtering

Nearly all of these overtrimming events involve the spurious removal of up to 10 bases from one or both reads:

I suspect this might be due to overzealous alignment of the reads to each other, and could probably be fixed with an option to require a minimum insert overlap before trimming. Another approach (which is offered as an option in Atropos) is to compute the random match probability of each alignment and compare against a user-specified threshold value.
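
One plausible formulation of such a random match probability is a binomial tail: the chance of seeing at least a given number of matching positions in an alignment of a given length if the bases were random. The Python sketch below assumes independent positions and a per-position match chance of 0.25; it is not claimed to be the exact formula used by Atropos.

```python
from math import comb


def random_match_probability(matches, length, p_match=0.25):
    """P(at least `matches` matching positions out of `length`) by chance.

    Binomial tail, assuming independent positions and uniform bases.
    """
    return sum(
        comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )
```

A trimmer would only accept an overlap when this value falls below the user-specified threshold, which rules out very short overlaps that match by accident.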

example.zip

poly-A before poly-G in NextSeq reads

Hi,
I am seeing a large portion of NextSeq reads that have the poly-G tail, and have successfully trimmed that off with fastp. Most sequences also have a poly-A run just before the poly-G tail, which is apparently due to reduction in signal strength (lower quality) from clusters before they fail altogether (see https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/). I tried to use the -x function in the same trimming command, but it doesn't work (presumably because it isn't at the end of the original read). I suppose I could try it as a second trimming process, but wanted to know if this is a common issue that people with poly-G reads from NextSeq data see, and if so, could it be incorporated into the options.
The data look like this after trimming:
@NS500704:337:HGG2HBGX3:1:11101:10170:2997 1:N:0:GACGAGG+CGGAAT
GCAAGGTCTTAATCAAATTTTGTCAGCTGCAAGATCGAAGAGCACACGTCTGAACTCCAGTCACGACGAGGATCTCGTATGCCGTCTTCTGCGTGAAAAAAAAAA
+
AAAAAEEEEEAEEEE6EEAEEEEEEEEEEEEEEEEEEE/EEEEEEEEAEEEEE/EEEEEEEEEEEAEE/EEEA</AE/AEE<E</6<E//EE/EE/<AAEEEEA/
@NS500704:337:HGG2HBGX3:1:11101:8528:3344 1:N:0:GACGAGG+CGGAAT
ACAGAAACAGGTGCACAGTTCCCCATCAAGATCGGAAGACACACGTCTGAACTCCAGTCACGACGAGGCTCTCGTATGCCGTCTTCTGCATGAAAAAAAAAA
+
AAAAAAEEEEEE6E/AEEE/EEEEEEEEEE/EEEEEEEEE/EAEEEEEE/AE/EAE/AEEEEEEA/EE/AE<///A<E<E//<EE//////A/EEEEEE///
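
The second trimming pass mentioned above could look roughly like the following Python sketch, applied to reads whose poly-G tail has already been removed. It is a hypothetical helper, not the behaviour of fastp's -x option, and it only handles an uninterrupted run of one base.

```python
def trim_poly_tail(seq, qual, base="A", min_len=10):
    """Remove a 3' run of `base` if it is at least `min_len` long.

    Exact-match version for illustration; it tolerates no mismatches.
    """
    run = len(seq) - len(seq.rstrip(base))
    if run >= min_len:
        keep = len(seq) - run
        return seq[:keep], qual[:keep]
    return seq, qual
```

Because sequencing errors can interrupt a poly-A run, a production version would probably allow a small number of mismatches inside the tail.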

Thanks,
Phil Morin

-j and -h can be merged together

I think these two file names are almost always the same except for the suffix.

Read names do not match with dual UMIs

Thank you for making your wonderful tool!

For dual-UMI experiments, there may/should be different UMI tags on the forward and reverse read of a pair. Is there an option (now or in development) to remove the UMI tags from each read and place them on both of the resultant reads? Downstream tools require that the read names be the same so if there are different UMI tags on the forward and reverse of a pair, it will fail. Instead it should have the read name, followed by a delimiter between the forward and reverse UMI tags.

For instance, in fastq_1.fq.gz
read_1_name:etc:etc:etc:etc:etc:etc:read_1_tagread_2_tag

And in the pair, fastq_2.fq.gz
read_2_name:etc:etc:etc:etc:etc:etc:read_1_tagread_2_tag
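
As an illustration of the requested naming scheme, the Python sketch below appends the same combined tag to both read names. The function name and both separators are assumptions chosen for the example; the issue does not say which delimiter downstream tools expect.

```python
def tag_pair_with_umis(name1, name2, umi1, umi2, sep=":", umi_sep="_"):
    """Append the same combined UMI tag to both read names of a pair."""
    tag = f"{umi1}{umi_sep}{umi2}"
    return f"{name1}{sep}{tag}", f"{name2}{sep}{tag}"
```

With identical input names, both output names stay identical, which is what tools that check read-name pairing need.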

Error: Segmentation fault (core dumped)

Dear Developer,
I have paired-end fastq files: R1.fq and R2.fq contain 34225961 read pairs (10.2G bases in total), and the R1.fq file is 12Gb in size.
When I run the fastp command:
fastp -i R1.fq -I R2 -o trim.R1.fastq.gz -O trim.R2.fastq.gz -5 -3 -M 30 -q 30 -l 36 -n 5 -c --html trim.html --json trim.json --report_title "Fastp Report" --thread 10 > trim.log
I get the error "Segmentation fault (core dumped)".

But when I remove either the '-5' or the '-3' option, no error is reported.

So I extracted 100000 reads to build test.1.fq and test.2.fq, and ran fastp with the same command:
fastp -i test.1.fq -I test.2.fq -o trim.R1.fastq.gz -O trim.R2.fastq.gz -5 -3 -M 30 -q 30 -l 36 -n 5 -c --html trim.html --json trim.json --report_title "Fastp Report" --thread 10 > trim.log
It ran without error.

How can I solve this problem?

--umi_prefix requires uppercase

This parameter seems to require uppercase letters. For instance:

$ fastp -i test-umi_1.fastq.gz -I test-umi_2.fastq.gz -o test-r1-out.fastq -O test-r2-out.fastq -U --umi_loc=read1 --umi_len=8 --umi_prefix=mbc
ERROR: UMI prefix can only have characters and numbers, but the given is: mbc

But, uppercase MBC works fine
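
Given the wording of the error message ("characters and numbers"), a case-insensitive check seems to be what was intended. The Python sketch below shows such a check; it is a hypothetical illustration and not fastp's actual validation code.

```python
import re

_PREFIX_RE = re.compile(r"[A-Za-z0-9]+")


def is_valid_umi_prefix(prefix):
    """True if prefix consists only of ASCII letters and digits."""
    return _PREFIX_RE.fullmatch(prefix) is not None
```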

BAM support

Hello,

We are using unmapped bam files for storing and archiving our sequence data. It would be really nice if fastp can take bam files as input and output fastq/bam files.

Best,
Bekir

Support UMI pre-processing with a 3rd file containing UMIs

Shifu;
Thanks for this great tool and adding pre-processing for UMIs. I've been looking for faster options to replace our use of umis (https://github.com/vals/umis) for pre-processing UMI outputs and adding into read headers. We typically end up with UMIs in a 3rd file as outputs from bcl2fastq when the UMIs are present in the input reads, and I wondered if this is possible to support?

I had a quick dig into the code to start implementing but realized you have specialized iterators for pairs so didn't want to break too much by trying to have a 3 input iterator, thinking there might be a better way to integrate.

Here is an example case with R1/R3 as the first/second read pair and R2 as the UMI:

https://s3.amazonaws.com/chapmanb/testcases/fastp_umi_example.tar.gz

Thanks for any thoughts and suggestions for processing these with fastp.

FastUniq-like function

Could you add a function that deduplicates reads for de novo analysis? It seems that only FastUniq can do this at the moment.
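
For reference, exact duplicate removal on read pairs can be sketched as follows in Python. This only illustrates the idea of keeping the first occurrence of each (read1, read2) sequence pair; it is not how FastUniq or fastp work internally, and the function name is hypothetical.

```python
def deduplicate_pairs(pairs):
    """Yield read pairs whose (read1, read2) sequences were not seen before.

    `pairs` is an iterable of (seq1, seq2); only exact duplicates are dropped.
    """
    seen = set()
    for seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key not in seen:
            seen.add(key)
            yield seq1, seq2
```

Storing every pair in memory is costly for large runs, so a real implementation would more likely store hashes of the sequences.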

Trimming for polyA/T/C; improved runtimes for downstream alignment and variant calling

Shifu;
Congrats on the paper in bioRxiv and thanks for all the great work on fastp. We've been working on improving the runtimes for somatic variant calling workflows and exploring quality and polyX trimming. We did a test run with fastp and atropos and found that the major improvements in runtime were due to removal of polyX sequences at the 3' ends of reads:

https://github.com/bcbio/bcbio_validations/tree/master/somatic_trim

We'd used the new polyG trimming functionality (thank you), but a crude method of 3' polyA/T/C adapter removal, which appears to be less effective with fastp compared to atropos trimming. When additional polyX stretches get removed we get much better runtimes for alignment and variant calling.

I saw general polyX and low complexity trimming are on the roadmap for fastp and would like to express my support for this. We've been making great use of fastp for adapter conversion and would like to offer trimming as part of an effort to speed up alignment and variant calling both on NovaSeqs and more generally.

As a secondary help for integration, is streaming trimming a possibility for paired ends? To help improve preparation runtimes I'd been thinking of including trimming and streaming directly into bwa/minimap2 alignment, or being able to stream outputs into bgzip so we can index and parallelize variant calling.

Thanks again for all the work on fastp.

Q20 or Q30 ratio

Can you add the Q20 and Q30 ratios for every position to xxx.json?
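
What is being asked for can be computed as below. This Python sketch assumes Phred+33 quality strings by default and is only meant to pin down the definition of the per-position ratio; the function name is hypothetical.

```python
def per_position_q_ratio(quality_strings, threshold, offset=33):
    """Return, per cycle, the fraction of bases with quality >= threshold."""
    passed, total = [], []
    for qual in quality_strings:
        # Grow the counters when a longer read is seen.
        while len(total) < len(qual):
            passed.append(0)
            total.append(0)
        for i, ch in enumerate(qual):
            total[i] += 1
            if ord(ch) - offset >= threshold:
                passed[i] += 1
    return [p / t for p, t in zip(passed, total)]
```

Calling it with `threshold=20` and `threshold=30` gives the Q20 and Q30 curves respectively.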

Low-quality base trimming from leading and trailing side

Hello,
I have a question about the base trimming of fastp. Does it have an option for low-quality base trimming from both sides like in trimmomatic (LEADING and TRAILING options)? I would like to trim bad bases (Q<20) for both sides of the reads.

The "--cut_by_quality5/3" options in fastp and the "SLIDINGWINDOW" option in trimmomatic seem quite different. The former trims the bases in a window if their quality is low, while with the latter the whole read is removed if the low-quality bases are at the beginning of the read.

For example:
Input fq file: Test.fq (The last 4 bases for read 1 and first 4 bases for read 2 are bad ones)
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeEFCB
@2\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
EFCBeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

##If run with LEADING and TRAILING options in Trimmomatic (Exactly what I want, the bad bases are removed)

java -jar $Trimmomatic SE -phred64 Test.fq tt.fq LEADING:20 TRAILING:20
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGC

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@2\1
ATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

##If run with the SLIDINGWINDOW option in Trimmomatic (the 2nd read is removed entirely, as the bad bases are at the beginning of the read)

java -jar $Trimmomatic SE -phred64 Test.fq tt.fq SLIDINGWINDOW:4:20
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGC

eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

##If run with "--cut_by_quality5/3" in fastp (it is weird that only the last 2 bases of read 1 and the first 2 bases of read 2 were clipped off)

fastp --phred64 --in1 Test.fq --out1 Test_Trimed.fq --cut_by_quality5 --cut_by_quality3 --cut_window_size 4 --cut_mean_quality 20
@1\1
AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCAT

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF&'
@2\1
GATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
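
The per-base behaviour requested here (Trimmomatic's LEADING and TRAILING) can be sketched as below. This Python function is an illustration of that behaviour with a hypothetical name; it is not an existing fastp option. A window-based cut averages the quality over several bases, which would explain why a window of 4 with a mean of 20 can leave some low-quality bases in place.

```python
def trim_leading_trailing(seq, qual, min_quality=20, offset=33):
    """Drop bases below min_quality from both ends, one base at a time."""
    scores = [ord(ch) - offset for ch in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return seq[start:end], qual[start:end]


# Read 1 from the example above (Phred+64): 40 good bases, then "EFCB".
read1_seq = "AATGATCGTAGCGATGCAAGCTAGCCCGATGCCCGATCGCATCG"
read1_qual = "e" * 40 + "EFCB"
trimmed_seq, trimmed_qual = trim_leading_trailing(
    read1_seq, read1_qual, min_quality=20, offset=64
)
```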

Thanks!

Best,
Wenyu

Request: supporting for amplicon sequencing

Amplicon sequencing uses a set of artificial, amplicon-specific primers. If a read begins with one of these primers, it is a target read and the artificial primer should be removed. Otherwise the read should be filtered out.
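
The requested behaviour can be summarised by the Python sketch below. It is a hypothetical helper that only accepts exact primer matches at the start of the read; a practical version would probably allow a few mismatches.

```python
def filter_by_primer(records, primers):
    """Keep reads that start with a known primer and strip that primer.

    `records` yields (name, seq, qual) tuples. Exact prefix match only.
    """
    # Longest primer first, so a primer that extends another one wins.
    ordered = sorted(primers, key=len, reverse=True)
    for name, seq, qual in records:
        for primer in ordered:
            if seq.startswith(primer):
                n = len(primer)
                yield name, seq[n:], qual[n:]
                break
```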

requirement: over representation analysis

Galaxy wrapper & commit to Tool Shed

Would love a Galaxy wrapper so it can be used in place of trimmomatic in our current workflows

why trimmed fq.gz files are larger than the raw ones

Hi,
I noticed that after trimming with fastp -g -p -c, the trimmed fq.gz files are larger than the raw ones, even though about 0.55 M reads were filtered out.

Feature Request: automatic quality score conversion

Perhaps this tool already supports this, but if not, it would be very useful to implement automatic quality score format detection and conversion. For further details, please see my analogous request for SeqKit, for which this has now been fully-implemented.
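
A common heuristic for detecting the encoding looks at the lowest quality character in a sample of reads. The Python sketch below uses that heuristic with assumed cut-offs (characters below ';' cannot occur in 64-based encodings); it is a guess-based illustration with hypothetical function names, not the method used by SeqKit or fastp.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality encoding offset (33 or 64) from observed characters.

    Returns None when the data are empty or ambiguous.
    """
    lowest = min((min(q) for q in quality_strings if q), default=None)
    if lowest is None:
        return None
    if ord(lowest) < 59:   # below ';' cannot occur in 64-based data
        return 33
    if ord(lowest) >= 64:  # plausible Phred+64, but still only a guess
        return 64
    return None


def to_phred33(qual, offset):
    """Re-encode one quality string to Phred+33."""
    return "".join(chr(ord(ch) - offset + 33) for ch in qual)
```

High-quality Phred+33 data with no low scores would be misclassified as Phred+64 by this rule, so an explicit option such as --phred64 remains useful as an override.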

installation issue on CentOS6.9

Hi,

This looks like a great tool and I would like to give it a try, but I encountered an issue when attempting to install it. This issue might be due to our old server, which is a CentOS6.9 with gcc 4.8.2. Are there other system parameters you want to know? Installation went fine on my own Ubuntu 17.10.

I downloaded the release 0.5.0 (the same happens after cloning from GitHub), and executed make:

g++ -std=c++11 -g -I./inc -O3 -c  src/adaptertrimmer.cpp -o obj/adaptertrimmer.o
g++ -std=c++11 -g -I./inc -O3 -c  src/evaluator.cpp -o obj/evaluator.o
g++ -std=c++11 -g -I./inc -O3 -c  src/fastqreader.cpp -o obj/fastqreader.o
g++ -std=c++11 -g -I./inc -O3 -c  src/filter.cpp -o obj/filter.o
g++ -std=c++11 -g -I./inc -O3 -c  src/filterresult.cpp -o obj/filterresult.o
g++ -std=c++11 -g -I./inc -O3 -c  src/htmlreporter.cpp -o obj/htmlreporter.o
g++ -std=c++11 -g -I./inc -O3 -c  src/jsonreporter.cpp -o obj/jsonreporter.o
g++ -std=c++11 -g -I./inc -O3 -c  src/main.cpp -o obj/main.o
g++ -std=c++11 -g -I./inc -O3 -c  src/options.cpp -o obj/options.o
g++ -std=c++11 -g -I./inc -O3 -c  src/overlapanalysis.cpp -o obj/overlapanalysis.o
g++ -std=c++11 -g -I./inc -O3 -c  src/peprocessor.cpp -o obj/peprocessor.o
g++ -std=c++11 -g -I./inc -O3 -c  src/processor.cpp -o obj/processor.o
g++ -std=c++11 -g -I./inc -O3 -c  src/read.cpp -o obj/read.o
g++ -std=c++11 -g -I./inc -O3 -c  src/seprocessor.cpp -o obj/seprocessor.o
g++ -std=c++11 -g -I./inc -O3 -c  src/sequence.cpp -o obj/sequence.o
g++ -std=c++11 -g -I./inc -O3 -c  src/stats.cpp -o obj/stats.o
g++ -std=c++11 -g -I./inc -O3 -c  src/threadconfig.cpp -o obj/threadconfig.o
g++ -std=c++11 -g -I./inc -O3 -c  src/unittest.cpp -o obj/unittest.o
g++ -std=c++11 -g -I./inc -O3 -c  src/writer.cpp -o obj/writer.o
g++ ./obj/adaptertrimmer.o ./obj/evaluator.o ./obj/fastqreader.o ./obj/filter.o ./obj/filterresult.o ./obj/htmlreporter.o ./obj/jsonreporter.o ./obj/main.o ./obj/options.o ./obj/overlapanalysis.o ./obj/peprocessor.o ./obj/processor.o ./obj/read.o ./obj/seprocessor.o ./obj/sequence.o ./obj/stats.o ./obj/threadconfig.o ./obj/unittest.o ./obj/writer.o  -lz -lpthread -o fastp
./obj/peprocessor.o: In function `PairEndProcessor::initOutput()':
/home/wdecoster/bin/fastp-0.5.0/src/peprocessor.cpp:32: undefined reference to `gzbuffer'
/home/wdecoster/bin/fastp-0.5.0/src/peprocessor.cpp:35: undefined reference to `gzbuffer'
./obj/fastqreader.o: In function `FastqReader::getBytes(unsigned long&, unsigned long&)':
/home/wdecoster/bin/fastp-0.5.0/src/fastqreader.cpp:38: undefined reference to `gzoffset'
collect2: error: ld returned 1 exit status
make: *** [fastp] Error 1

Do you have suggestions on how to fix this?
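
The undefined symbols gzbuffer and gzoffset suggest that the zlib on this server is older than what fastp needs. Both functions appear to have been introduced in zlib 1.2.4, and CentOS 6 typically ships zlib 1.2.3; treat both version numbers as assumptions to verify on the machine. A minimal Python sketch for checking an installed version string against that threshold (the helper name is hypothetical):

```python
def zlib_has_gzbuffer(version):
    """True if a zlib version string is at least 1.2.4."""
    numbers = []
    for part in version.split(".")[:3]:
        digits = "".join(ch for ch in part if ch.isdigit())
        numbers.append(int(digits) if digits else 0)
    return tuple(numbers) >= (1, 2, 4)
```

If the check fails, building against a newer zlib installed in a separate prefix is the usual workaround.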

Cheers,
Wouter

Sort overrepresented sequences by count

Hi,

In the HTML report would it be possible to output the overrepresented sequences sorted by the count?

As currently high counts can be distributed through the table making them harder to see, see the html example:

requirement: splitting on the number of lines/reads

Hi,

I'm wondering if it is possible to add a new split option: is it possible to split files by a certain number of reads and not in a certain number of sub-files?
It could be useful if you want to parallelize and standardize the downstream alignments (guess the execution time of each sub-sample) and you don't know the size of your input fastq.gz file...
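
For clarity, splitting by read count rather than by file count could work as in the Python sketch below. It assumes standard four-line FASTQ records and is only an illustration of the requested behaviour with a hypothetical function name, not an existing fastp option.

```python
from itertools import islice


def split_by_read_count(lines, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding at most reads_per_chunk reads.

    Assumes the usual four-line FASTQ records (no wrapped sequences).
    """
    lines = iter(lines)
    while True:
        chunk = list(islice(lines, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk
```

Every chunk except possibly the last one has the same number of reads, which makes the run time of downstream alignment jobs easier to predict.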

Not an Issue but a function request

Hi,

Would it be possible to have an option to add a report name at the top of the report, just before the summary? I use the -h option, and having this name at the top of the report would help. Alternatively, a new option could be added: -R Report name included in generated report (string [= reportname])

Thanks,

multiple threads (-w) does not work

Even when I set -w to a different number, the program keeps running on 3 cores...

About adapter trimmer

Hi, I have read the source of adaptertrimmer. The principle is to find the largest overlap between the paired-end reads while allowing a maximum number of differences (diff), and to remove the rest as adapter. But there is a problem with comparing the PE reads base by base: if R1 contains an erroneous indel, all the following bases differ from R2, so the number of differences increases, the overlap length decreases, and the adapter cannot be removed.
So could you add an argument to account for erroneous indels in the next version?
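
The effect described here can be reproduced with a toy version of gap-free overlap detection. The Python sketch below is written from the description in this issue, not from fastp's source; the function names and default values are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_overlap(r1, r2_rc, max_diff=5, min_overlap=10):
    """Longest suffix(r1)/prefix(r2_rc) overlap with at most max_diff mismatches.

    r2_rc is read 2 already reverse-complemented. Returns the overlap
    length, or 0 if none qualifies. The comparison is gap-free.
    """
    for length in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if hamming(r1[-length:], r2_rc[:length]) <= max_diff:
            return length
    return 0


clean_r1 = "ACGTTGCAAGGCTTAA"
indel_r1 = "ACGTTGCAAGGCTAA"   # same read with one T deleted
r2_rc = "AGGCTTAACCGGTATC"
```

With `max_diff=0` and `min_overlap=5`, the clean read overlaps `r2_rc` by 8 bases, while the read with the deletion yields no overlap at all. An alignment that permits gaps (for example a banded edit-distance comparison) would be needed to handle such reads.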

adapter trimming with respect to quality

Hello,

First, let me just say I have been working with fastp for over a month now and am very pleased with the performance and direction the tool is going. It also appears to be quite accurate and look forward to the forthcoming publication.

However, I have noticed that the adapter trimming does not efficiently trim under certain conditions where quality dips but there is still an exact match to the adapters on both read pairs. In this case I've synthetic data that is centered at 35Q and 20Q.

I was going through a tool evaluation comparison using this data and found that fastp excels in most cases with >99% sensitivity trimming and near perfect specificity. However, as the qualities drop near 20Q the sensitivity also drops dramatically. Some tools have no loss of adapter clipping with respect to quality shifting. Note my usage and that I am not doing any quality trimming.

Any ideas on how we can better address the below scenario?

Here is one example of where clipping fails, adapter starts at pos 114. You can see that fastp successfully trims in one case but not the other. PS- I have more examples if needed.

Adapters:
> TruSeq_Index_Adapter_5p
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
> TruSeq_Index_Adapter_3p
ATCTCGTATGCCGTCTTCTGCTTG

fastp -Q -i 1m_150bp_35q_R1.fq.gz -I 1m_150bp_35q_R2.fq.gz -o fastp.1m_150bp_35q_R1.fastq.gz -O fastp.1m_150bp_35q_R2.fastq.gz

fastp -Q -i 1m_150bp_20q_R1.fq.gz -I 1m_150bp_20q_R2.fq.gz -o fastp.1m_150bp_20q_R1.fastq.gz -O fastp.1m_150bp_20q_R2.fastq.gz

Post-trimming results

Read1_35q:
@999465_150_114 1:
AGTCTCAGGATACAAAATCAATGTACAAAAATCACAAGCATTCTTATACACCAATAACAGACAAACAGAGAGCCAAACCATGAATGAACTCCCATTCACAATTGCTTCAAAGAG
+
GJJJ?JAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJ>JJJJJJJJJJJJJGJJJJJJJJJJJJJJEJHJJJJHJJJJJJJJJHJJJJJCJCJJEJJJJJJJJJF

Read2_35q:
@999465_150_114 2:
CTCTTTGAAGCAATTGTGAATGGGAGTTCATTCATGGTTTGGCTCTCTGTTTGTCTGTTATTGGTGTATAAGAATGCTTGTGATTTTTGTACATTGATTTTGTATCCTGAGACT
+
GGJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHJJJJFJJJGJJJGJJJJJ?JJJJJJJJJGJCJJJJJJJIJJJ?J?JJJGJJBJJJGJJJJJJJ

Read1_20q:
@999465_150_114 1:
AGTCTCAGGACACAAAATCAATGTACAAAAATCACAAGCATTCTTTTACACCAATAACAGACAAACAGAGAGCCAAACCATGAATGAACTCCCATTCACAATTGCTTCAAAGAGGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGC
+
0...4689..17428;:=5:383<=?<755199;6;6367>509329;1<4.<553816619:2<0.36:.6663.;.75:6.5:7.:9..0665/25..:22.0:1.18=22.12.8799844.2.549/5.7..94/3...0../...

Read2_20q:
@999465_150_114 2:
CTCTTTGAAACAATTGTGAATGGGAGTTCATTCATGGTTTGGCTCTCTGTTTGTCTGTTATTGGTGTATAAGAATGCTTGTGATTTTTGTACATTGATTTTGTATCCTCAGACTATCTCGTATGCCGTCTTCTGCTTGCAACATTCACCA
+
-6--1--/;38<576-62<87<:6/6:3<--4831>3-;--32<;2-<57999.</6864<23.-67-007-84<858485.0..534/24.4/-16?:9-7:6;0;3/2-46/21732;0--2463-5/8:57403---.3-6------

%GC content

Hi,

would it be possible to add a calculated %GC content on the base contents graphs near the base index (top right) ?
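
For reference, the value in question is simply the share of G and C among the called bases. A minimal Python sketch follows; the function name is hypothetical and N bases are excluded from the denominator by assumption.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0
```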

Thanks

question: activate/deactivate filter types

Hi,

I love your tool and I have a question:

When using the -W and -M options, is it possible to disable the -q, -u and -n options? I tried a -W -M -Q line, but all the options were disabled, including the -W and -Q... (I want to use the "read cutting by quality" options and disable the "quality filtering" options at the same time)

fastp should report error and exit when there is no disk space left for output

Currently if there is no disk space left, fastp continues running with no error. I think it should report error and exit.

How adapter trimming works?

I'd like to know how adapter trimming works? Both for single end and paired end data.

Addition in the summary part of the report

Hi,

Can you put the "evaluated" adapter sequence in the top summary part of the report ?

Thanks,

Now packaged in Homebrew (brewsci/bio)

FYI - Instructions:

brew install brewsci/bio/fastp

https://github.com/brewsci/homebrew-bio/pull/91/files

installation issue

Hi,

I have a personal laptop and a work laptop. The personal one is a windows 10 hosting Ubuntu 16.04 on a Oracle virtual box (with anaconda and both python 2.7 and 3.6 installed) and the work laptop is a windows 7 hosting Ubuntu 16.04 on a VMware workstation (with anaconda and python 3.6 installed). I had no problem installing either AfterQC or fastp on my personal laptop.

However, when I was trying to install fastp (thinking that I don't want to install python 2.7 for AfterQC) on my work laptop through bioconda, it tells me that I have conflicts with other packages, as follows:

"Solving environment: failed

UnsatisfiableError: The following specifications were found to be in conflict:

blaze -> pytables[version='>=3.0.0'] -> zlib[version='>=1.2.11,<1.3.0a0']
fastp
Use "conda info <package>" to see the dependencies for each package."

When I was trying to install directly from cloning from github, it also failed, with the following error message:

" /usr/bin/ld: cannot find -lz
collect2: error: ld returned 1 exit status
Makefile:17: recipe for target 'fastp' failed
make: *** [fastp] Error 1 "

I realize that there must be some insufficiency within my work laptop but I don't know what it is, as I am pretty new to Linux system. I would really like to have it fixed because fastp is a really nicely made package for cleaning up and QC WGS reads. If you have any suggestions or useful tips for resolving this issue, please help me out.

Really appreciate your help. Thanks in advance.

Yun

Overrepresented sequences belong to highly expressed chloroplast-related genes

How can one keep these overrepresented sequences? For example, I used the sample "SRR3100237":

fastp -i SRA/out/SRR3100237_1.fastq.gz
-I SRA/out/SRR3100237_2.fastq.gz
-o FASTP/SRR3100237_1_trim.fastq.gz
-O FASTP/SRR3100237_2_trim.fastq.gz
-h Reports/SRR3100237 -R "SRR3100237" -l 36 -c -g -p -M 30 -w 6 -5 -3

fastp-report_SRR3100237.zip

looking forward!

bug (?): non standard FASTQ file on paired read

Dear Developer,

I have some trouble with this application: I'm trying to filter my paired reads fastq files with fastp with two different sets of filters.

Here is a sample of my 2 fastq files (gunzip -c file_X.fastq.gz | tail -n 50):
ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 1:N:0:ATGCCT
NTCTTTCTGACCCTCACTGAGAGCGACCTGAAGGAAATTGGCATCACGTGCGTCCAGAAGGGCCGCTCTGGCCCTCAGCCCGGGGTTGGGGCAAACTCCCA
+
#1=DFFFFGHFHHIJEIJJIEIJGIIIGIIJFHJICEICGGICIII@GFHIGCHGAHECHFBF=A@?BDDDDDDD8<CCCDD<@BB<>BBBB9?<<ACCD(
@HWI-1KL150:70:C74KBACXX:1:1101:3880:1976 1:N:0:ATGCCT
NACTTTCTGTTTTTCCTTTATAGCAAGCAACCCAGTGATAGCAGCCCAGCTCTGGTGAGTGTCCTTGAGCTCTAGAGCACAGCTCTCCTCTCTAAGNNNNN
+
#1=DFFFFHHHHHJJJJJJJJJJJJJJJIJJJJJJIIJJJIJJJJJJJJJJIJJJ@GGIDFFGHIIJEIJIJHGHAA>DFFFFECCCCCCDDDEDCCACC3
@HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 1:N:0:ATGCCT
NACTCCGGGGCTGCTCTGGACCAGTTTCCATTCCCGTCTCCCCACCCTCACCATCCCTCAGGACATCACGAGTGGTTGCTTGGACCTGAGGTGGACATTCT
+
#1=DFFFFHHHHHJJJJJJGIJIJFHIGIIIJIJJJGIIIJJJJIJJIJJJHHHHHFFFEECEDEDDDC@?@<BBCDDDCDDBCDDCCCDD>BB<A(4:>(
@HWI-1KL150:70:C74KBACXX:1:1101:4438:1971 1:N:0:ATGCCT
NCCCAACCAATCAGCCCCAATTTACGATCTATGTAACTCACCAGTTCGATATGCCAATAACCTGGCCTGAACCATGCAGTGCCTTGCAATTTCCTGTGGCA
+
#1=DDFFFHHHHHJJJJJJJJJJJIJJJJJJJJIJDGIJJJIJJIJIJJIJIIIIIIIGGICHIGHHHGHCFFFE66;@ACCCC@CCCAAC@CDCCCC?B?
@HWI-1KL150:70:C74KBACXX:1:1101:4539:1970 1:N:0:ATGCCT
NGCAGGCCGCGGACGGAGAGCACGTGAGGGAAGGGGAAGCCGCTCCGGCCTGCGTAGGGGGGGGGGCGGGGCCCCCCGGGACACCCGGGAGGGGGGCGGGN
+
#4=DDDDDHHDHAGGIDG:;=<FF8C@DDE4?CDEC6>?=;983;?8:&57?85)+0<()5-&&)))0-)&)0&))&&))&&((())&)0&05&0&&&05<
@HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 1:N:0:ATGCCT
NGGGGTCCTCTGCGGCCAGGGCAGCGCTGCTCAGCATGATGAAGACAAGGATGAGGTTGGTGAAGATGTGGTGGTTGATGAGCTTGTGGGAGCCTACGCGN
+
#4=DDDFFHHHHHJJJJIHHIJIIJJJJJJJJIIIIJJJJJJJJJJJJHICHHEHFCDFD;?AAC@CCACD(8?',5(4:4>ACC+++(&2?&8((+&&)5
@HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 1:N:0:ATGCCT
NGTCCTCCCACCAGCCGGGCACTACTTACATGACGATGAGAGCAGCGTCTCGGGAGTAATCCAGCACAATCTCCTTCAGCCTCACCTGCCGAAGGGCCTGN
+
#1:BDFFFHHHHGIIIIGFBGIBEH@GECHCHGGGDFHIIIHICEGHGEHIHEH6?5;@3(>(.-(;(55>AC((,,55>?<<C??:9<A9&5)5<(2+(2
@HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 1:N:0:ATGCCT
NATCCCGTTGGCTTTCCAGGAGGCTCTGCAGCATCTGCAGGGTCCTGGGGTCCTGGTAAGGGGCTTCCAGGAGTGGAGAAGGGGGGCAGTGAGGTTGGGCC
+
#1=DFFFDHHHHHJJJJJJJJJJJJIJIJJIFIIIJJJJIJJGIHIIIIJGGIJJJAHIIJHHHFFFEDCE;@?B;5<ABDBBDDB@BB3@ACD?BD>B?(
@HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 1:N:0:ATGCCT
NGTGGCAATTCTCTTCAGTAGGTTGGCCAAGTCAGCAGACACGGTGCTGGTCTTATAGCTGTCAAATTCAGGAAGGGTCTTGGGCTTAAAATACTCAAACA
+
#1=DDFFDFHHHHJJJJJHIJJHJJJJJJJJIJJJJJJJJJJJJGHIJJJFGIIIJJJIJIHIJICGGCGGEFHH;B;CC@CDDDDDDCCCEDCDCD:CC5
@HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 1:N:0:ATGCCT
NTTTTTCACTCCATTGTTGTTGTTTACCCAGTTTATGGGGGTTGTAATGTTTATCACACTCCTTGGATGATTTCCGAAGGTAAGATATCTGGAATGGTTTT
+
#4=DDFFFHHHHHGHHFHIFHIGIIIIJJJJBHGIIIIGGI?FGFEHJGGAHIJJJGIGGHHHHHFFFFDCC@CE;3;>@:@>C>ACA@CDCD<AC>ACA<
@HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 1:N:0:ATGCCT
NGCTGTTTTACAAGTTGGTAGTTTTCTCTTCTTGGCATGGTGAACGTGCCCTAAAGGCCTGATGTCAGGCTCCATCCTCCATGTTAAAATAGTGAGTTCTT
+
#1=DFDFFHHHHHJJIJJCFHCFIJEHGIJIJJGHEGIIJFEGHDGEIIJHCGIJEGIJGEGHIGIIHIG<AECA7?DDFFFECC(>CC@CC>CCC>C@D:
@HWI-1KL150:70:C74KBACXX:1:1101:9989:1970 1:N:0:ATGCCT
NTTGTCAACTTTGCTTTTGCTCATGTTGTAATGTTTGGCAATATATGACACATCCACTTGTTTATCGAATCCCTGTCAAAAAGAAGAACAGCAAAAACATN
+
#1:B=DDDFFFHDGIIIIIGGIEGHBH@9<9FFFHGIIG>FGEGIGDDHG<:9?D@D8?>?FHGGGGGBAG)@;77;4?A;(9?@DFEEA>55=>=;'((,
@HWI-1KL150:70:C74KBACXX:1:1101:10460:1982 1:N:0:ATGCCT
NCCTATGCAACCTCAGTGTCCACTGAGAAGGGAATCTTGTGGTATGGAACAATGTGGCAAAAAGGTACAAAGTATTCTTACACCTGGAATTCTTAACCTGN

ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 2:N:0:ATGCCT
ACAGCCTGCGGGGGGAATGTGACCAGGATATGCCTCAGCGTCCCAAGAGCGCTTACATGAGTGGGAGTTTGCCCCAACCCCGGGCTGAGGGCCAGAGCGGC
+
@CCFFFFFHGHHGID9@BCDEDDDDDBBCCC@@C@CDDABDDDBDDDD?90:BDDCDECDD@A?BDD<CCDCDABB@BDDDDBB>9>BDDDDD<B?<CBD9
@HWI-1KL150:70:C74KBACXX:1:1101:3880:1976 2:N:0:ATGCCT
AAAGGGGAAAAAAATTACCAGATGACACACTTCCTGATTTCACTGTAGTAAGGAAAAAGTCAACATTGCAAATAAATACGATCCTTAGAGAGGAGAGCTGT
+
CCCFFFFFHHHHHJJJJJJJJIJIJJJJJJJJJJJJGIJJJJJJJGHGFHGFIIGJJJJFEHHHGFFFFDF@CCEEEEDDBDBDDCCCDDDDDBBB??CD+
@HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 2:N:0:ATGCCT
ACACTAGCCACTCACGTTCCATCTCTTCCTCGGAGAAATCCTCAGGCCCAGCCAAGGGCAGGAGCAAAAAGGGGAGAATGTCCACCTCAGGTCCAAGCAAC
+
CCCFFFFFHHHHHJJJHIIJJJIJJJJJJIJJIJJJJJJIJJJJJJIIJJFHIJIJJJIIHHGFFFFDDDDDDD@BBDDDDDDDDDDDDDD@CDD>CBBDA
@HWI-1KL150:70:C74KBACXX:1:1101:4438:1971 2:N:0:ATGCCT
GTTTGGAGAACCTGTGTGAAAATCCATACTTTAGCAATCTAAGGCAAAACATGAAAGACCTTATCCTACTTTTGGCCACAGTAGCTTCCAGTGTGCCGAAC
+
CCCFFFFFHHHHHJJJHIJJJJJJJIJJJIJJIJIJJJJJIIIIJJIJIIDDEIGGHJHGGGIIIGGDGHIJJJHHHHFFFCDECCCEECDCCDCCCD??3
@HWI-1KL150:70:C74KBACXX:1:1101:4539:1970 2:N:0:ATGCCT
GGCCTCGTGCGCTCGGGCCCGCACGCCGTTGTTCGCGTCACCCCCACCCAGCTCCCTTCCGCGTGTGCTCGGAGGGCGCGGCGCACCGCCTACGCAGGCCN
+
CCCFFFFFHHHHHJJJJIJJJJJJJJJJIJHEHHFFDDBDDDDDDDDDDDDDDDDDDDDDDDDBDBDDDCDD;BBDDDDDDBD@BDDDD<<>CDDBBDDD>
@HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 2:N:0:ATGCCT
CAAGCAGCGGCTTTTCCCTGCAGGATCCGCGTAGGCTGCCACAAGCTCATCAACCACCACATCTTCACCAACCTCATCCTTGTCTTCATCATGCTGAGCAN
+
CCCFFFFFHHHHHJJJIIIIJIJJJIJJIJJEHEIIIIDGHIGIAEHHHH?@DEFFDDDDDCDCDEDDD>B@BCDCCDDACDCDDDDEEE@CDDCCCBDD3
@HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 2:N:0:ATGCCT
AGACAGGAGACTCTATAAGAATTTATGAGGCAGCAGAGTCTACAAGTAAATCATGAATCCAGTTGAAAATGTTAATGAGGCCATAGACGTGGTGAAGGATT
+
@C@DFFFFHHHGHJJJJIIIHIIJJJJIGIJJJJIIIJGHIJIIIJIIIHDGGIJJIIJHDGEHGIIGCGGGGECEAEHHHFFFDDCECABB?@DCCC?A3
@HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 2:N:0:ATGCCT
GTGTGCAGCGGAGCCCTGCACGGGAGACAGGTCTGTCTTCTGCCAGATGGAAGTGCTCGATCGCTACTGCTCCATTCCCGGCTACCACCGGCTCTGCTGTN
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJDHIIFHJJIIJIJJJJIIJEHHAEEHFFFECCBDDDCCDDDDDDFEEDDDDDDDDDDDDDDDDDCDDDA9
@HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 2:N:0:ATGCCT
GGTGGATCTTATATGGGAGGATGCACTGTTCATGTTTGAGTATTTTAAGCCCAAGACCCTTCCTGAATTTGACAGCTATAAGACCAGCACCGTGTCTGCTN
+
BC@FFFFFHHHGHJJJJIIJHIJJJJJJIJJJJJJJJJJJDHIJJJJJJJGHJJJJJIJIJIJJJJJJJJJIJGHHGHFFFFFFEEEDEDDDDDBBDDDD:
@HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 2:N:0:ATGCCT
GAAAATAATTCACAAATAGTGTTACAGCTCCATCCACTGAAAATTGTCATAAAAGACATTTTTTCAATGAGTTCATTTTTAGAGAAACCATTCCAGATATC
+
@CCFFFFFHGHHHJJGHIJCJJIJJJJJJIHGIJJIJJIGIIIIIJIIHHDGGGJJIGIIJJJJGIEHIGJIHIJIJCHEECB@;?BCACECDCDCCCCD-
@HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 2:N:0:ATGCCT
CACTTTGAAAACTAGAAATCATTACACAAAGTTAAGAACTCACTATTTTAACATGGAGGATGGAGCCTGACATCAGGCCTTTAGGGCACGTTCACCATGCC
+
CCCFFFFFHHHHHJJIJJJJJJJJJIJJJJJJJIIJJIJJIJIIIJJIJJHGCHIHGIIIJJGHIJJJJJJIJJGGHGHFFFFFDEEDDDDDDDDDDDDDA
@HWI-1KL150:70:C74KBACXX:1:1101:9989:1970 2:N:0:ATGCCT
AAGAACAAGTTTCTGTACATCTCATTATCATTCTGCCTGTTCACTTGCCTCATGTTTTTGCTGTTCTTCTTTTTGACAGGGATTCGATAAACAAGTGGATN
+
@@@DDEEDHDHHHEFCEH?FHIIIIHIIIIIGGIIIIIIFFIGGHIECEHBDHGBGCHIIIIIIIIIIIIHGGGE;CCHGHHCFFF@DCA6>CBC@CCCC>
@HWI-1KL150:70:C74KBACXX:1:1101:10460:1982 2:N:0:ATGCCT
TACATAGGAAGAAAATGCCAATCAAAAATGAAAGTCAGTTAAAACCACTTGAAAGCAATGTCTGTTCCTTTTTAGAATGGAAAGTTGGAGGAAACTTCAGC

As you can see, the file seems to be well formatted.

I'm applying two different sets of filters:
`fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.html -W 20 -M 30 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_1.fastq.gz -O fastq_filtered/ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_2.fastq.gz

fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.html -q 30 -u 50 -n 5 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I input_files/ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_1.fastq.gz -O ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_2.fastq.gz`

Here is the log file:
`Read1 before filtering:
total reads: 69722014
total bases: 7041923414
Q20 bases: 6419265369(91.1578%)
Q30 bases: 5704731454(81.011%)

Read1 after filtering:
total reads: 67489520
total bases: 6816441520
Q20 bases: 6312681343(92.6096%)
Q30 bases: 5624025162(82.5068%)

Read2 before filtering:
total reads: 69722014
total bases: 4505725297
Q20 bases: 4505725297(100%)
Q30 bases: 4505725297(100%)

Read2 aftering filtering:
total reads: 67489520
total bases: 4360073359
Q20 bases: 4360073359(100%)
Q30 bases: 4360073359(100%)

Filtering result:
reads passed filter: 134979040
reads failed due to low quality: 4110650
reads failed due to too many N: 354338
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

JSON report: ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.json
HTML report: ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.html

fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_report.html -W 20 -M 30 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_1.fastq.gz -O ffcf607a-7b70-4e16-a60b-c09197fa1601_windows_W20M30_2.fastq.gz
fastp v0.6.0, time used: 910 seconds
Read1 before filtering:
total reads: 69722014
total bases: 7041923414
Q20 bases: 6419265369(91.1578%)
Q30 bases: 5704731454(81.011%)

Read1 after filtering:
total reads: 63210865
total bases: 6384297365
Q20 bases: 6031554512(94.4748%)
Q30 bases: 5446376816(85.3089%)

Read2 before filtering:
total reads: 69722014
total bases: 4505725297
Q20 bases: 4505725297(100%)
Q30 bases: 4505725297(100%)

Read2 aftering filtering:
total reads: 63210865
total bases: 4083554985
Q20 bases: 4083554985(100%)
Q30 bases: 4083554985(100%)

Filtering result:
reads passed filter: 126421730
reads failed due to low quality: 12684916
reads failed due to too many N: 337382
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

JSON report: ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.json
HTML report: ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.html

fastp -A -L -w 4 -j ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.json -h ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_report.html -q 30 -u 50 -n 5 -i ffcf607a-7b70-4e16-a60b-c09197fa1601_1.fastq.gz -I ffcf607a-7b70-4e16-a60b-c09197fa1601_2.fastq.gz -o ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_1.fastq.gz -O ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_2.fastq.gz
fastp v0.6.0, time used: 918 seconds`

The result files look like this:
ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_1.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 1:N:0:ATGCCT NTCTTTCTGACCCTCACTGAGAGCGACCTGAAGGAAATTGGCATCACGTGCGTCCAGAAGGGCCGCTCTGGCCCTCAGCCCGGGGTTGGGGCAAACTCCCA + #1=DFFFFGHFHHIJEIJJIEIJGIIIGIIJFHJICEICGGICIII@GFHIGCHGAHECHFBF=A@?BDDDDDDD8<CCCDD<@BB<>BBBB9?<<ACCD( @HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 1:N:0:ATGCCT NACTCCGGGGCTGCTCTGGACCAGTTTCCATTCCCGTCTCCCCACCCTCACCATCCCTCAGGACATCACGAGTGGTTGCTTGGACCTGAGGTGGACATTCT + #1=DFFFFHHHHHJJJJJJGIJIJFHIGIIIJIJJJGIIIJJJJIJJIJJJHHHHHFFFEECEDEDDDC@?@<BBCDDDCDDBCDDCCCDD>BB<A(4:>( @HWI-1KL150:70:C74KBACXX:1:1101:4438:1971 1:N:0:ATGCCT NCCCAACCAATCAGCCCCAATTTACGATCTATGTAACTCACCAGTTCGATATGCCAATAACCTGGCCTGAACCATGCAGTGCCTTGCAATTTCCTGTGGCA + #1=DDFFFHHHHHJJJJJJJJJJJIJJJJJJJJIJDGIJJJIJJIJIJJIJIIIIIIIGGICHIGHHHGHCFFFE66;@ACCCC@CCCAAC@CDCCCC?B? @HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 1:N:0:ATGCCT NGGGGTCCTCTGCGGCCAGGGCAGCGCTGCTCAGCATGATGAAGACAAGGATGAGGTTGGTGAAGATGTGGTGGTTGATGAGCTTGTGGGAGCCTACGCGN + #4=DDDFFHHHHHJJJJIHHIJIIJJJJJJJJIIIIJJJJJJJJJJJJHICHHEHFCDFD;?AAC@CCACD(8?',5(4:4>ACC+++(&2?&8((+&&)5 @HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 1:N:0:ATGCCT NGTCCTCCCACCAGCCGGGCACTACTTACATGACGATGAGAGCAGCGTCTCGGGAGTAATCCAGCACAATCTCCTTCAGCCTCACCTGCCGAAGGGCCTGN + #1:BDFFFHHHHGIIIIGFBGIBEH@GECHCHGGGDFHIIIHICEGHGEHIHEH6?5;@3(>(.-(;(55>AC((,,55>?<<C??:9<A9&5)5<(2+(2 @HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 1:N:0:ATGCCT NATCCCGTTGGCTTTCCAGGAGGCTCTGCAGCATCTGCAGGGTCCTGGGGTCCTGGTAAGGGGCTTCCAGGAGTGGAGAAGGGGGGCAGTGAGGTTGGGCC + #1=DFFFDHHHHHJJJJJJJJJJJJIJIJJIFIIIJJJJIJJGIHIIIIJGGIJJJAHIIJHHHFFFEDCE;@?B;5<ABDBBDDB@BB3@ACD?BD>B?( @HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 1:N:0:ATGCCT NGTGGCAATTCTCTTCAGTAGGTTGGCCAAGTCAGCAGACACGGTGCTGGTCTTATAGCTGTCAAATTCAGGAAGGGTCTTGGGCTTAAAATACTCAAACA + #1=DDFFDFHHHHJJJJJHIJJHJJJJJJJJIJJJJJJJJJJJJGHIJJJFGIIIJJJIJIHIJICGGCGGEFHH;B;CC@CDDDDDDCCCEDCDCD:CC5 @HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 1:N:0:ATGCCT NTTTTTCACTCCATTGTTGTTGTTTACCCAGTTTATGGGGGTTGTAATGTTTATCACACTCCTTGGATGATTTCCGAAGGTAAGATATCTGGAATGGTTTT + 
#4=DDFFFHHHHHGHHFHIFHIGIIIIJJJJBHGIIIIGGI?FGFEHJGGAHIJJJGIGGHHHHHFFFFDCC@CE;3;>@:@>C>ACA@CDCD<AC>ACA< @HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 1:N:0:ATGCCT NGCTGTTTTACAAGTTGGTAGTTTTCTCTTCTTGGCATGGTGAACGTGCCCTAAAGGCCTGATGTCAGGCTCCATCCTCCATGTTAAAATAGTGAGTTCTT + #1=DFDFFHHHHHJJIJJCFHCFIJEHGIJIJJGHEGIIJFEGHDGEIIJHCGIJEGIJGEGHIGIIHIG<AECA7?DDFFFECC(>CC@CC>CCC>C@D: @HWI-1KL150:70:C74KBACXX:1:1101:9989:1970 1:N:0:ATGCCT NTTGTCAACTTTGCTTTTGCTCATGTTGTAATGTTTGGCAATATATGACACATCCACTTGTTTATCGAATCCCTGTCAAAAAGAAGAACAGCAAAAACATN + #1:B=DDDFFFHDGIIIIIGGIEGHBH@9<9FFFHGIIG>FGEGIGDDHG<:9?D@D8?>?FHGGGGGBAG)@;77;4?A;(9?@DFEEA>55=>=;'((, @HWI-1KL150:70:C74KBACXX:1:1101:10460:1982 1:N:0:ATGCCT NCCTATGCAACCTCAGTGTCCACTGAGAAGGGAATCTTGTGGTATGGAACAATGTGGCAAAAAGGTACAAAGTATTCTTACACCTGGAATTCTTAACCTGN + #4BDFFFFHHHHHJJJIJJJJJJJJGJIJJJIFHGIJJJHIIFHIGIGJJGIIIIGGJJJJJGGG)=;CDHE=AC7ADEFFFFDDEE<CCA@DED;C@@BD @HWI-1KL150:70:C74KBACXX:1:1101:11860:1969 1:N:0:ATGCCT NACCTTGTCCTTGGCACTGCGGCAGCCTTGCAGGCTGGCAAGGATCTGGGCCTGCACACTCTGAACCCACAGCTCCCGCTCCTCCGCCGTTGAAGCCTCNN + #1=DDFFFHHHHHJJIJJJJJJJIJIJIJJIJJJIJJJEFHGI=CFGEGF2CCACEHHGBFDECAABB?@?ABC>58?BDDBACABBD>99?2@A:<A<0) @HWI-1KL150:70:C74KBACXX:1:1101:12222:1966 1:N:0:ATGCCT NAGCTTAAACAGTGGGTTTTTCAATGTCTCTCTTTAGGATTTTTGCTGGGTAAAAGCCTGTTTTACGCGTGGAATGCACACCTCCGGCCAACGGAGACTCC

ffcf607a-7b70-4e16-a60b-c09197fa1601_bases_q30u50n5_2.fastq.gz
@HWI-1KL150:70:C74KBACXX:1:1101:1931:1994 2:N:0:ATGCCT ACAGCCTGCGGGGGGAATGTGACCAGGATATGCCTCAGCGTCCCAAGAGCGCTTACATGAGTGGGAGTTTGCCCCAACCCCGGGCTGAGGGCCAGAGCGGC + KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJJJJJJJIJIJJJJJJJJJJJJGIJJJJJJJGHGFHGFIIGJJJJFEHHHGFFFFDF@CCEEEEDDBDBDDCCCDDDDDBBB??CD+ @HWI-1KL150:70:C74KBACXX:1:1101:4185:1976 2:N:0:ATGCCT KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK ACACTAGCCACTCACGTTCCATCTCTTCCTCGGAGAAATCCTCAGGCCCAGCCAAGGGCAGGAGCAAAAAGGGGAGAATGTCCACCTCAGGTCCAAGCAAC + CCCFFFFFHHHHHJJJHIIJJJIJJJJJJIJJIJJJJJJIJJJJJJIIJJFHIJIJJJIIHHGFFFFDDDDDDD@BBDDDDDDDDDDDDDD@CDD>CBBDA K CCCFFFFFHHHHHJJJHIJJJJJJJIJJJIJJIJIJJJJJIIIIJJIJIIDDEIGGHJHGGGIIIGGDGHIJJJHHHHFFFCDECCCEECDCCDCCCD??3 @HWI-1KL150:70:C74KBACXX:1:1101:4539:1970 2:N:0:ATGCCT GGCCTCGTGCGCTCGGGCCCGCACGCCGTTGTTCGCGTCACCCCCACCCAGCTCCCTTCCGCGTGTGCTCGGAGGGCGCGGCGCACCGCCTACGCAGGCCN KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJJJIJJJJJJJJJJIJHEHHFFDDBDDDDDDDDDDDDDDDDDDDDDDDDBDBDDDCDD;BBDDDDDDBD@BDDDD<<>CDDBBDDD> @HWI-1KL150:70:C74KBACXX:1:1101:4702:1995 2:N:0:ATGCCT KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK CAAGCAGCGGCTTTTCCCTGCAGGATCCGCGTAGGCTGCCACAAGCTCATCAACCACCACATCTTCACCAACCTCATCCTTGTCTTCATCATGCTGAGCAN + CCCFFFFFHHHHHJJJIIIIJIJJJIJJIJJEHEIIIIDGHIGIAEHHHH?@DEFFDDDDDCDCDEDDD>B@BCDCCDDACDCDDDDEEE@CDDCCCBDD3 K @HWI-1KL150:70:C74KBACXX:1:1101:6121:1971 2:N:0:ATGCCT AGACAGGAGACTCTATAAGAATTTATGAGGCAGCAGAGTCTACAAGTAAATCATGAATCCAGTTGAAAATGTTAATGAGGCCATAGACGTGGTGAAGGATT + KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK @C@DFFFFHHHGHJJJJIIIHIIJJJJIGIJJJJIIIJGHIJIIIJIIIHDGGIJJIIJHDGEHGIIGCGGGGECEAEHHHFFFDDCECABB?@DCCC?A3 @HWI-1KL150:70:C74KBACXX:1:1101:6748:1978 2:N:0:ATGCCT 
GTGTGCAGCGGAGCCCTGCACGGGAGACAGGTCTGTCTTCTGCCAGATGGAAGTGCTCGATCGCTACTGCTCCATTCCCGGCTACCACCGGCTCTGCTGTN KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJDHIIFHJJIIJIJJJJIIJEHHAEEHFFFECCBDDDCCDDDDDDFEEDDDDDDDDDDDDDDDDDCDDDA9 @HWI-1KL150:70:C74KBACXX:1:1101:6964:1994 2:N:0:ATGCCT KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK GGTGGATCTTATATGGGAGGATGCACTGTTCATGTTTGAGTATTTTAAGCCCAAGACCCTTCCTGAATTTGACAGCTATAAGACCAGCACCGTGTCTGCTN + BC@FFFFFHHHGHJJJJIIJHIJJJJJJIJJJJJJJJJJJDHIJJJJJJJGHJJJJJIJIJIJJJJJJJJJIJGHHGHFFFFFFEEEDEDDDDDBBDDDD: K @HWI-1KL150:70:C74KBACXX:1:1101:8404:1977 2:N:0:ATGCCT GAAAATAATTCACAAATAGTGTTACAGCTCCATCCACTGAAAATTGTCATAAAAGACATTTTTTCAATGAGTTCATTTTTAGAGAAACCATTCCAGATATC + KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK @CCFFFFFHGHHHJJGHIJCJJIJJJJJJIHGIJJIJJIGIIIIIJIIHHDGGGJJIGIIJJJJGIEHIGJIHIJIJCHEECB@;?BCACECDCDCCCCD- @HWI-1KL150:70:C74KBACXX:1:1101:8836:1977 2:N:0:ATGCCT CACTTTGAAAACTAGAAATCATTACACAAAGTTAAGAACTCACTATTTTAACATGGAGGATGGAGCCTGACATCAGGCCTTTAGGGCACGTTCACCATGCC KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK + CCCFFFFFHHHHHJJIJJJJJJJJJIJJJJJJJIIJJIJJIJIIIJJIJJHGCHIHGIIIJJGHIJJJJJJIJJGGHGHFFFFFDEEDDDDDDDDDDDDDA

The other files look similar (file_1 is normal, file_2 has some extra "KKKKKK" lines).

When I try to align the data to the genome with BWA MEM, only 3 sequences seem to be well formatted in my file.
This is probably due to the non-canonical format of my FASTQ reads with the extra lines.

Do you have any idea why this doesn't work?
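
To narrow down where the output goes wrong, it may help to check the basic FASTQ layout directly: every record should be four lines (header starting with `@`, sequence, a line starting with `+`, quality), and the sequence and quality strings should have the same length. The sketch below is only an illustration of that check. The function name `check_fastq_records` and the toy records are made up for this example and are not part of fastp.

```python
from typing import Iterable, List, Tuple


def check_fastq_records(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (record_index, problem) pairs for records that break the 4-line layout."""
    rows = [line.rstrip("\n") for line in lines]
    problems: List[Tuple[int, str]] = []

    # A well-formed FASTQ file has exactly four lines per record.
    if len(rows) % 4 != 0:
        problems.append((len(rows) // 4, "line count is not a multiple of 4"))

    # Check each complete group of four lines.
    for i in range(0, len(rows) - len(rows) % 4, 4):
        header, seq, plus, qual = rows[i:i + 4]
        rec = i // 4
        if not header.startswith("@"):
            problems.append((rec, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((rec, "third line does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((rec, "sequence and quality lengths differ"))
    return problems


# Toy input: one clean record, and one record with a stray run of K characters.
good = ["@r1", "ACGT", "+", "IIII"]
bad = ["@r2", "ACGT", "+", "KKKKKKKK", "IIII"]

print(check_fastq_records(good))
print(check_fastq_records(bad))
```

To run it on a real file, the lines could be read with `gzip.open(path, "rt")` and passed to the function. The first reported record index then shows where the output stops being valid.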

Why does "--cut_by_quality3" reduce read counts?

Hi sfchen,

I want to know why the "-3" option reduces the number of reads that pass the filters.

fastp -i SRR1770413_1.fastq -I SRR1770413_2.fastq -q 20 -u 20 -o out.SRR1770413_1.fastq -O out.SRR1770413_2.fastq

fastp -i SRR1770413_1.fastq -I SRR1770413_2.fastq -q 20 -u 20 -3 -o out.SRR1770413_1.fastq -O out.SRR1770413_2.fastq

Order of execution?

Can I run different steps in one single call? E.g. quality, adaptor, poly G and "global" trimming? If yes, what is the order of execution? Or do I have to run the distinct trimming step each on its own?

Average read length pre- and post-trimming

This is a great tool. Adding the pre- and post-trimming average read length would be super helpful. It is better to get all the information in one pass through the read file(s) than in two.

Thanks!
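
Until such a field exists, the mean read length can be derived from the totals fastp already prints, as total bases divided by total reads. The snippet below is a minimal sketch of that arithmetic. The helper `mean_read_length` is a hypothetical name, and the figures are the Read1 totals from a fastp screen summary quoted in another issue on this page.

```python
def mean_read_length(total_bases: int, total_reads: int) -> float:
    """Mean read length as total bases divided by total reads."""
    if total_reads == 0:
        return 0.0
    return total_bases / total_reads


# Read1 totals copied from a fastp screen summary (before and after filtering).
before = mean_read_length(400_000_000, 4_000_000)
after = mean_read_length(309_754_624, 3_102_308)

print(f"mean length before: {before:.2f}")
print(f"mean length after:  {after:.2f}")
```

This only gives the mean. A length distribution would still need a pass over the reads.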

PolyG trimming?

Hello,

Thanks for the program. I'm currently working with NovaSeq data and would like to try out the polyG trimming. After trimming, it looks like fastp still retains reads with 8 or fewer Gs at the ends of reads. Is that a default set by fastp, and what is the reason for it? Is there any way I can change the number of Gs fastp lets through its filter?

Cheers,
Mun

Add -v/--version option

Could you please add a -v or --version argument to fastp that outputs the full version of the program? Currently, there is no way to know which version of the program is being run. Thanks!

requirement: support UMI preprocessing

Discrepancy in results between screen output and report output

Hi,

I'm trying to reconcile the on-screen results printed when fastp finishes with the results summary in the generated report file.

The numbers don't add up.

Thanks for the help.

================
Example when fastp finishes and the output on screen gives:

Read1 before filtering:
total reads: 4000000
total bases: 400000000
Q20 bases: 392516035(98.129%)
Q30 bases: 376358571(94.0896%)

Read1 after filtering:
total reads: 3102308
total bases: 309754624
Q20 bases: 306341522(98.8981%)
Q30 bases: 294745910(95.1546%)

Read2 before filtering:
total reads: 4000000
total bases: 400000000
Q20 bases: 387859490(96.9649%)
Q30 bases: 373512577(93.3781%)

Read2 aftering filtering:
total reads: 3102308
total bases: 309754624
Q20 bases: 305723840(98.6987%)
Q30 bases: 295222362(95.3085%)

Filtering result:
reads passed filter: 6204616
reads failed due to low quality: 235470
reads failed due to too many N: 1559914
reads failed due to too short: 0
reads with adapter trimmed: 1355796
bases trimmed due to adapters: 17275112

=================================
These are the results in the report:

fastp report
Summary
General
fastp version: 0.7.0
sequencing: paired end (100 cycles + 100 cycles)

Before filtering
total reads: 7.629395 M
total bases: 762.939453 M
Q20 bases: 744.224095 M (97.546941%)
Q30 bases: 715.132854 M (93.733893%)

After filtering
total reads: 5.917183 M
total bases: 590.810059 M
Q20 bases: 583.711016 M (98.798422%)
Q30 bases: 562.637589 M (95.231552%)

Filtering result
reads passed filters: 5.917183 M (77.557700%)
reads with low quality: 229.951172 K (2.943375%)
reads with too many N: 1.487650 M (19.498925%)
reads too short: 0 (0.000000%)
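
One observation from the figures above: the report values match the screen values if "M" is taken as 1024 × 1024 and "K" as 1024, rather than 10^6 and 10^3. This is inferred from the numbers only and has not been checked against the fastp source. The helper names below are made up for the illustration.

```python
def report_millions(count: int) -> float:
    """Scale a raw count by 1024 * 1024 (assumed meaning of 'M' in the report)."""
    return count / (1024 * 1024)


def report_thousands(count: int) -> float:
    """Scale a raw count by 1024 (assumed meaning of 'K' in the report)."""
    return count / 1024


# Raw counts from the screen output (Read1 + Read2 where applicable).
reads_before = 4_000_000 + 4_000_000
bases_before = 400_000_000 + 400_000_000
reads_passed = 6_204_616
low_quality = 235_470
too_many_n = 1_559_914

print(f"{report_millions(reads_before):.6f} M")   # report shows 7.629395 M
print(f"{report_millions(bases_before):.6f} M")   # report shows 762.939453 M
print(f"{report_millions(reads_passed):.6f} M")   # report shows 5.917183 M
print(f"{report_thousands(low_quality):.6f} K")   # report shows 229.951172 K
print(f"{report_millions(too_many_n):.6f} M")     # report shows 1.487650 M

# The percentages agree with the raw counts: passed / all reads.
print(f"{100 * reads_passed / reads_before:.4f}%")  # report shows 77.557700%
```

If that reading is right, the two summaries describe the same counts and differ only in the unit scaling.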


If run with "--cut_by_quality5/3" in fastp (it is weird that only the last 2 bases for read 1 and the first 2 bases for read 2 were clipped off)
