alastair-droop / fqtools
An efficient FASTQ manipulation suite
License: GNU General Public License v3.0
The second entry of this FASTQ file still passes the validator. Is this intentional?
@LD5V2:07687:11026
CGGGGGTCTTAGCTTTGGCTCTCCTTGCAAAGTTATTTCTAGTTAATTCATTATGCAGAAGGTATAGGGGTTAGTCCTTGCTTATATTATGCTTGGTTATAATTTTTCATCTTTCCCTTGCGGTACTATATCTATTGCGACCA
+
35977*6772999990959:;:<6<.53::19;39891845..*-36159<6;::::;;6;6>:95333)52577957*...*/6774999894726268858=>=-99:99:2;>3>5:::5:;;<;;::;7/,*,3***+,
@LD5V2:07687:11043
+
@LD5V2:07688:11020
AAAATTTAACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCCCAACACCCACTACCTAAAAAATCCCAAACATATAACTGAACT
+
:::0:?4>7==<<4<;;;=<=7;78888.9;@><5;;4:4:4:4:;;2;<<<6<<6<<=4>6;;<=2;;;;;5565533'5::/;<2<?ABD?7<<<;5<=
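One possible reading of the example above is that a parser which collects the sequence up to the `+` line sees a zero-length sequence and a zero-length quality string, and the two lengths agree. That is an assumption about the cause, not something confirmed in this thread. A minimal sketch of a stricter per-record check that rejects empty sequences could look like this (`check_record` is a hypothetical helper, not part of fqtools):

```python
def check_record(lines):
    """Return a list of problems found in one four-line FASTQ record.

    `lines` holds the record's lines with newlines already stripped.
    This is an illustrative strict check, not the logic fqtools uses.
    """
    if len(lines) != 4:
        return ["record has %d lines, expected 4" % len(lines)]
    header, seq, plus, qual = lines
    problems = []
    if not header.startswith("@"):
        problems.append("header does not start with '@'")
    if not plus.startswith("+"):
        problems.append("separator does not start with '+'")
    if len(seq) == 0:
        problems.append("empty sequence")
    if len(seq) != len(qual):
        problems.append("sequence and quality lengths differ")
    return problems


# The second entry from the report, read as a strict four-line record
# (the last line is shortened here for readability).
print(check_record(["@LD5V2:07687:11043", "+", "@LD5V2:07688:11020", "AAAATTT"]))
```

Whether an empty sequence should count as valid is a policy decision; the sketch simply treats it as an error.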
Would there be interest in adding a flag to fqtools validate to optionally output the total count of sequences if all records are valid? I can work on submitting a pull request, but would like some direction on whether you would prefer the change to go in fqtools validate or fqtools count (with an added option to validate), or neither 🥲.
The motivation is that I need to count sequences and validate records as fast as possible. It seems to me that this information could be available from fqtools validate, since we have to traverse the file regardless.
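The idea of counting during validation can be sketched as a single pass that keeps a counter and stops at the first bad record. This is only an illustration of the approach for strict four-line FASTQ; `validate_and_count` is a hypothetical name and does not reflect how fqtools is implemented.

```python
def validate_and_count(lines):
    """Return the number of records, or raise ValueError at the first bad one.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumes strict four-line records (no wrapped sequence or quality lines).
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    count = 0
    for header in stripped:
        seq = next(stripped, None)
        plus = next(stripped, None)
        qual = next(stripped, None)
        number = count + 1
        if qual is None:
            raise ValueError("record %d is truncated" % number)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record %d has a malformed header or separator" % number)
        if len(seq) == 0 or len(seq) != len(qual):
            raise ValueError("record %d has bad sequence/quality lengths" % number)
        count += 1
    return count


print(validate_and_count(["@a\n", "ACGT\n", "+\n", "IIII\n"]))
```

Because the count is only returned when every record passes, a caller gets both answers from one traversal of the file.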
gcc --version
gcc (Homebrew gcc 5.3.0) 5.3.0
In file included from src/fqheader.h:21:0,
from src/fqfile.c:14:
/usr/include/zlib.h:1324:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
ZEXTERN int ZEXPORT gzwrite OF((gzFile file,
^
src/fqfile.c: In function 'fqfile_eof_file_fastq_compressed':
src/fqfile.c:285:18: warning: passing argument 1 of 'gzeof' from incompatible pointer type [-Wincompatible-pointer-types]
return gzeof((gzFile*)(((fqfile*)f)->handle));
^
In file included from src/fqheader.h:21:0,
from src/fqfile.c:14:
/usr/include/zlib.h:1458:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
ZEXTERN int ZEXPORT gzeof OF((gzFile file));
^
src/fqfile.c: In function 'fqfile_flush_file_fastq_compressed':
src/fqfile.c:302:13: warning: passing argument 1 of 'gzflush' from incompatible pointer type [-Wincompatible-pointer-types]
gzflush((gzFile*)(((fqfile*)f)->handle), 0);
^
In file included from src/fqheader.h:21:0,
from src/fqfile.c:14:
/usr/include/zlib.h:1395:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
ZEXTERN int ZEXPORT gzflush OF((gzFile file, int flush));
^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include -c -o src/fqfsin.o src/fqfsin.c
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include -c -o src/fqfsout.o src/fqfsout.c
src/fqfsout.c: In function 'fqfsout_writechar':
src/fqfsout.c:165:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
fqstatus result;
^
src/fqfsout.c: In function 'fqfsout_write':
src/fqfsout.c:178:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
fqstatus result;
^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include -c -o src/fqfileprep.o src/fqfileprep.c
src/fqfileprep.c: In function 'prepare_filesets':
src/fqfileprep.c:86:15: warning: 'informat_2' may be used uninitialized in this function [-Wmaybe-uninitialized]
if((outformat_2 == FQ_FORMAT_UNKNOWN) && (options.input_interleaving == FQ_INTERLEAVED)) outformat_2 = info
^
The installation doc is not clear.
Would you be able to elaborate on what criteria the validate subcommand checks?
fqtools is missing a man page, or even one for each sub-command. help2man should get you started.
This is needed for packaging on brew/conda etc.
Hello,
I have lists of sequences, and I would like to find the FASTQ reads that contain them.
Would it be possible to use the fqtools find option to do this?
My list of sequences looks like the following:
GATAAAAAAAAAAAAAAAC
GATAAAAAAAAAAAAAACC
GATAAAAAAAAAAAAAATC
GATAAAAAAAAAAAAAAGC
GATAAAAAAAAAAAAACAC
GATAAAAAAAAAAAAACCC
GATAAAAAAAAAAAAACTC
GATAAAAAAAAAAAAATAC
GATAAAAAAAAAAAAATCC
GATAAAAAAAAAAAAATGC
GATAAAAAAAAAAAAAGAC
GATAAAAAAAAAAAAAGCC
GATAAAAAAAAAAAAAGGC
GATAAAAAAAAAAAACAAC
GATAAAAAAAAAAAACACC
GATAAAAAAAAAAAACCAC
GATAAAAAAAAAAAACCCC
GATAAAAAAAAAAAACCTC
GATAAAAAAAAAAAATAAC
GATAAAAAAAAAAAATCAC
GATAAAAAAAAAAAATTAC
GATAAAAAAAAAAAAGAAC
GATAAAAAAAAAAAAGACC
GATAAAAAAAAAAACAAAC
GATAAAAAAAAAAACCCCC
GATAAAAAAAAAAATAAAC
GATAAAAAAAAAAAGAAAC
GATAAAAAAAAAACAAAAC
.
.
.
.
I have used grep to do this one sequence at a time, but it is taking too long:
grep -A 2 -B 1 "CTCAAAAAAAAACAAAGGA" input.fastq |grep -v "^\-\-$" > output.fastq
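Independently of what fqtools find supports, the repeated grep runs can be replaced by one pass over the FASTQ file that tests every read against the whole list. The sketch below assumes strict four-line records and exact (not approximate) matches; `read_fastq` and `find_reads` are hypothetical helpers. Patterns are grouped by length so that each window of a read is looked up in a set instead of being compared against every pattern in turn.

```python
def read_fastq(lines):
    """Yield (header, sequence, separator, quality) from four-line FASTQ."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        yield header, next(stripped, ""), next(stripped, ""), next(stripped, "")


def find_reads(lines, patterns):
    """Yield the records whose sequence contains at least one pattern."""
    by_length = {}
    for pattern in patterns:
        if pattern:  # skip empty lines in the pattern list
            by_length.setdefault(len(pattern), set()).add(pattern)
    for record in read_fastq(lines):
        seq = record[1]
        if any(seq[i:i + k] in wanted
               for k, wanted in by_length.items()
               for i in range(len(seq) - k + 1)):
            yield record


patterns = ["GATAAAAAAAAAAAAAAAC", "GATAAAAAAAAAAAAAACC"]
read = "TT" + patterns[0] + "TT"
demo = ["@r1\n", read + "\n", "+\n", "I" * len(read) + "\n"]
for record in find_reads(demo, patterns):
    print("\n".join(record))
```

With real files, `lines` would be an open FASTQ handle and `patterns` the stripped lines of the sequence list.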
make tests
mkdir -p bin
cc -O2 -g -L/bio/linuxbrew/opt/htslib -o./bin/fqtools src/fqprocess_view.o src/fqprocess_head.o src/fqprocess_count.o src/fqprocess_blockview.o src/fqprocess_fasta.o src/fqprocess_basetab.o src/fqprocess_qualtab.o src/fqprocess_lengthtab.o src/fqprocess_type.o src/fqprocess_validate.o src/fqprocess_find.o src/fqprocess_trim.o src/fqprocess_qualmap.o src/fqbuffer.o src/fqfile.o src/fqfsin.o src/fqfsout.o src/fqfileprep.o src/fqparser.o src/fqgenerics.o src/fqhelp.o src/fqtools.o -lz -lhts -lm
cc -O2 -g -L./src -i/bio/linuxbrew/opt/htslib -o./tests/test-fqbuffer fqtools tests/test-fqbuffer.c -lz -lhts -lm
cc: error: fqtools: No such file or directory
cc: error: unrecognized command line option '-i/bio/linuxbrew/opt/htslib'
The test set indicated in the paper appears to have become inaccessible. Can you confirm whether interleaved FASTQ is processed correctly (i.e. that it checks that adjacent records are from the same read pair)?
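For readers who want to run such a check themselves, a minimal sketch of an adjacent-pair test is given below. It assumes strict four-line records and that mates share a read name once any comment after the first whitespace and any trailing `/1` or `/2` are removed. Both are assumptions about the naming scheme, and `pair_key` and `check_interleaved` are hypothetical names, not fqtools functions.

```python
def pair_key(header):
    """Reduce a FASTQ header line to the name shared by both mates."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def check_interleaved(lines):
    """Return the number of pairs, or raise ValueError on the first mismatch."""
    pairs = 0
    pending = None  # key of the first mate while waiting for the second
    for index, line in enumerate(lines):
        if index % 4 != 0:  # only header lines carry the read name
            continue
        key = pair_key(line.rstrip("\r\n"))
        if pending is None:
            pending = key
        elif key != pending:
            raise ValueError("records %r and %r are not a pair" % (pending, key))
        else:
            pairs += 1
            pending = None
    if pending is not None:
        raise ValueError("odd number of records; %r has no mate" % pending)
    return pairs


demo = ["@r1/1\n", "AC\n", "+\n", "II\n", "@r1/2\n", "GT\n", "+\n", "II\n"]
print(check_interleaved(demo))
```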
Hi,
First, I would like to thank you for this awesome tool!
I recently started using FASTQ files with mixed uppercase and lowercase bases.
I'm using fqtools 2.3 2019-05-08 (zlib 1.2.8; htslib 1.8).
I created a test file named test_lowercase.fastq, which contains:
@A00740:65:HNY73DSXX:3:1103:7925:25645 1:N:0:GTCCTTGA+TAATCTTA
agccatgcactctgtaatgaagagttcacAATCTTCAACAGAGTAGATATTTCAAGAAGTCAACTGATAGATGAATTGGCAGATAAATTTAACCGGCTTCTTGAAGATTTTCTGCAAGAGGTATATATTATAACTATTACAAGTATTTTGTCAGTTgagcccctctactgcaggaa
+
FFFFFFFFFFFFFFFFFFFF:FFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFF
The command I'm using:
fqtools -l count test_lowercase.fastq
And the error is:
ERROR [line 2]: invalid sequence character (a)
I also tried:
fqtools -l -F count test_lowercase.fastq
Which outputs a different error:
ERROR: unknown command: "test_lowercase.fastq"
usage: fqtools [-hvdramuli] [-b BUFSIZE] [-B BUFSIZE] [-q QUALTYPE] [-f FORMAT] [-F FORMAT] COMMAND [...] [FILE] [FILE]
I should say that I first used an older version that I already had, fqtools 2.1 2016-10-04 (zlib 1.2.7; htslib 1.8), which gave the same results, so I decided to clone the repository again.
Am I using the command in the wrong way or is it a bug?
Thank you
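Two observations, both hedged. The usage line above lists `-F FORMAT`, so in the second command `count` was probably consumed as the format value, which would explain why the file name was then reported as an unknown command. As for the lowercase bases, a possible workaround while the question is open is to uppercase the sequence lines before handing the file to fqtools. The sketch below assumes strict four-line records; `uppercase_sequences` is a hypothetical helper, and note that it discards any soft-masking information carried by the lowercase letters.

```python
def uppercase_sequences(lines):
    """Yield FASTQ lines with the sequence line of each record uppercased.

    The sequence is the second line of every four-line record; headers,
    separators and quality strings pass through unchanged.
    """
    for index, line in enumerate(lines):
        yield line.upper() if index % 4 == 1 else line


demo = ["@read1\n", "agccATGC\n", "+\n", "FFFFJJJJ\n"]
print("".join(uppercase_sequences(demo)), end="")
```

Used as a filter, the same generator can be applied to standard input, for example with `sys.stdout.writelines(uppercase_sequences(sys.stdin))`.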
For those using Anaconda/conda to manage their environment, fqtools and htslib are available in the bioconda channel.
Easy to install:
conda install -c bioconda fqtools htslib
Hi,
I have some large fq.gz files that take a long time to align, so I tried splitting them into smaller files. It worked, but my script consumes a lot of memory. Would you consider adding a new "split fq" feature to your tools?
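Until such a feature exists, the memory use of a splitting script can be kept roughly constant by copying one line at a time instead of collecting records in a list. The following is a minimal sketch that assumes strict four-line records; the function name `split_fastq` and the output naming scheme are made up for illustration.

```python
import gzip


def split_fastq(in_path, out_prefix, records_per_chunk):
    """Split a gzipped FASTQ file into gzipped chunks; return the chunk paths.

    Lines are streamed from input to output one at a time, so memory use
    does not grow with the size of the input file or of a chunk.
    """
    lines_per_chunk = 4 * records_per_chunk
    paths = []
    dst = None
    try:
        with gzip.open(in_path, "rt") as src:
            for index, line in enumerate(src):
                if index % lines_per_chunk == 0:  # start a new chunk
                    if dst is not None:
                        dst.close()
                    path = "%s.%04d.fq.gz" % (out_prefix, len(paths))
                    dst = gzip.open(path, "wt")
                    paths.append(path)
                dst.write(line)
    finally:
        if dst is not None:
            dst.close()
    return paths
```

For example, `split_fastq("reads.fq.gz", "reads.part", 1000000)` would write `reads.part.0000.fq.gz`, `reads.part.0001.fq.gz`, and so on. For paired-end data, both files would need the same chunk size so that mates stay in matching chunks.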
It appears that the current algorithm for fqtools type can get the FASTQ quality format wrong. Here's a reproducible example:
fastq-dump --split-files ERR719681
# Read 300355 spots for ERR719681
# Written 300355 spots for ERR719681
fqtools type ERR719681_1.fastq
# fastq-illumina
fqtools type ERR719681_2.fastq
# fastq-sanger
If Bio.SeqIO is then used to read these FASTQ files with the "type" reported by fqtools type, the following error occurs:
File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 611, in parse
for r in i:
File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/QualityIO.py", line 1255, in FastqIlluminaIterator
raise ValueError("Invalid character in quality string")
Maybe using the min & max of qual values (the full range) for all sequences in the fastq file would help prevent these mis-calls?
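The min/max suggestion can be sketched as follows. The thresholds used are the commonly cited ASCII ranges (Sanger/Phred+33 starting at 33, Solexa+64 starting at 59, Illumina 1.3+ Phred+64 starting at 64); they are not taken from the fqtools source, and the function names are hypothetical. A file whose lowest quality character is 64 or above cannot be told apart from a high-quality Sanger file by range alone, so the last branch is only a guess.

```python
def quality_range(lines):
    """Return (lowest, highest) quality character code over all records.

    Assumes strict four-line FASTQ; returns (None, None) if there are no
    quality characters at all.
    """
    lowest = highest = None
    for index, line in enumerate(lines):
        if index % 4 != 3:  # quality is the fourth line of each record
            continue
        codes = [ord(char) for char in line.rstrip("\r\n")]
        if not codes:
            continue
        lowest = min(codes) if lowest is None else min(lowest, min(codes))
        highest = max(codes) if highest is None else max(highest, max(codes))
    return lowest, highest


def guess_encoding(lowest, highest):
    """Map a quality range to a Biopython-style format name (heuristic)."""
    if lowest is None:
        return "unknown"
    if lowest < 33 or highest > 126:
        return "invalid"
    if lowest < 59:
        return "fastq-sanger"
    if lowest < 64:
        return "fastq-solexa"
    return "fastq-illumina"  # ambiguous: could also be high-quality Sanger


demo = ["@r\n", "ACGT\n", "+\n", ":::0\n"]
print(guess_encoding(*quality_range(demo)))
```

Scanning the whole file costs one full pass, but it avoids calling a file Illumina-encoded when low-quality characters only appear late in the file.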
Hi, during the "./configure" command I got the following error: "config.status: error: cannot find input file: `config.h.in' htslib"
I fixed this by also running "autoheader" prior to configure.
Following the instructions on the main page:
git clone https://github.com/alastair-droop/fqtools
cd fqtools/
git clone https://github.com/samtools/htslib
cd htslib/
autoheader
autoconf
./configure
make
make install
cd ..
make
Fails at the last step with:
In file included from src/fqprocess_view.c:14:0:
src/fqheader.h:22:10: fatal error: sam.h: No such file or directory
#include <sam.h>
I cannot find a sam.h definition in htslib. I do, however, find a sam.h definition within the separate samtools project. However, if I modify the Makefile to use that location as well, I receive an error about too many parameters:
In file included from ../samtools-1.10/sam.h:29:0,
from src/fqheader.h:22,
from src/fqfile.c:14:
src/fqfile.c: In function ‘fqfile_open_read_file_bam’:
../samtools-1.10/bam.h:209:22: error: too many arguments to function ‘samtools_sam_open’
#define sam_open samtools_sam_open
Could this be because samtools, htslib, etc. are now separately maintained projects?
Hi,
I am using fqtools to validate FASTQ files for our pipeline. I found the fqtools validate function very useful and fast. I just want to know: is there any support for automatic error fixing (at least for some types of errors, e.g. unpaired reads)? Thanks a lot!
PS: I had a problem linking htslib, so my solution was to use bioconda to install fqtools, which is completely automatic.
-hh1985
could not build it easily