alastair-droop / fqtools

An efficient FASTQ manipulation suite

License: GNU General Public License v3.0

Languages: Makefile 0.57%, C 83.89%, Shell 0.58%, Python 13.41%, Objective-C 0.64%, C++ 0.91%

Topics: fastq, fastq-files, next-generation-sequencing, bioinformatics

fqtools's Issues

an empty sequence line still passes

The second record in this FASTQ file has an empty sequence line, yet the file still passes the validator. Is this intentional?

@LD5V2:07687:11026
CGGGGGTCTTAGCTTTGGCTCTCCTTGCAAAGTTATTTCTAGTTAATTCATTATGCAGAAGGTATAGGGGTTAGTCCTTGCTTATATTATGCTTGGTTATAATTTTTCATCTTTCCCTTGCGGTACTATATCTATTGCGACCA
+
35977*6772999990959:;:<6<.53::19;39891845..*-36159<6;::::;;6;6>:95333)52577957*...*/6774999894726268858=>=-99:99:2;>3>5:::5:;;<;;::;7/,*,3***+,
@LD5V2:07687:11043

+

@LD5V2:07688:11020
AAAATTTAACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCCCAACACCCACTACCTAAAAAATCCCAAACATATAACTGAACT
+
:::0:?4>7==<<4<;;;=<=7;78888.9;@><5;;4:4:4:4:;;2;<<<6<<6<<=4>6;;<=2;;;;;5565533'5::/;<2<?ABD?7<<<;5<=
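
A stricter check for the case reported above could be sketched as follows. This is only an illustration in Python, not how fqtools validate is implemented; the function name `empty_sequence_records` is hypothetical, and the sketch assumes strict four-line records (header, sequence, `+` line, quality).

```python
from itertools import islice


def empty_sequence_records(handle):
    """Return (record_number, header) for each record with an empty sequence line.

    Assumes strict four-line records: header, sequence, '+' line, quality.
    """
    lines = iter(handle)  # a single iterator, so islice resumes where it stopped
    flagged = []
    record_number = 0
    while True:
        record = [line.rstrip("\r\n") for line in islice(lines, 4)]
        if len(record) < 4:  # end of input (or a truncated final record)
            break
        record_number += 1
        header, sequence, _separator, _quality = record
        if sequence == "":
            flagged.append((record_number, header))
    return flagged
```

Run on the three records above, this sketch should flag record 2 (`@LD5V2:07687:11043`). Whether an empty sequence ought to count as invalid is a policy decision for the tool; the sketch only shows that the condition is cheap to detect.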

Add flag to fqtools validate to output sequence count

Would there be interest in adding a flag to fqtools validate that optionally outputs the total sequence count when all records are valid? I can work on submitting a pull request, but I would like a bit of direction on whether you would prefer the change to go in fqtools validate, in fqtools count (with an added option to validate), or in neither 🥲.

The motivation is that I need to count sequences and validate records as fast as possible. It seems to me like this information could be available from fqtools validate since we have to traverse the file regardless.
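
The idea of counting during the validation pass can be sketched like this. It is a minimal Python illustration, not fqtools code: the function name `validate_and_count` is hypothetical, and the three checks it performs (header starts with `@`, separator starts with `+`, sequence and quality have equal length) are a simplified stand-in for whatever fqtools validate actually checks.

```python
def validate_and_count(handle):
    """Return the number of records, raising ValueError on the first bad one.

    The checks here are a simplified assumption, not the fqtools rule set.
    """
    lines = iter(handle)
    count = 0
    for header in lines:
        try:
            sequence = next(lines).rstrip("\r\n")
            separator = next(lines).rstrip("\r\n")
            quality = next(lines).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"record {count + 1}: truncated record") from None
        if not header.startswith("@"):
            raise ValueError(f"record {count + 1}: header does not start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {count + 1}: separator does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"record {count + 1}: sequence/quality length mismatch")
        count += 1  # only incremented once the record has passed every check
    return count
```

Because the counter is updated in the same loop that validates each record, the count costs nothing beyond one integer increment per record.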

Compiler warnings

gcc --version
gcc (Homebrew gcc 5.3.0) 5.3.0

In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1324:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzwrite OF((gzFile file,
                     ^
src/fqfile.c: In function 'fqfile_eof_file_fastq_compressed':
src/fqfile.c:285:18: warning: passing argument 1 of 'gzeof' from incompatible pointer type [-Wincompatible-pointer-types]
     return gzeof((gzFile*)(((fqfile*)f)->handle));
                  ^
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1458:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzeof OF((gzFile file));
                     ^
src/fqfile.c: In function 'fqfile_flush_file_fastq_compressed':
src/fqfile.c:302:13: warning: passing argument 1 of 'gzflush' from incompatible pointer type [-Wincompatible-pointer-types]
     gzflush((gzFile*)(((fqfile*)f)->handle), 0);
             ^
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1395:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzflush OF((gzFile file, int flush));
                     ^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfsin.o src/fqfsin.c
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfsout.o src/fqfsout.c
src/fqfsout.c: In function 'fqfsout_writechar':
src/fqfsout.c:165:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
     fqstatus result;
              ^
src/fqfsout.c: In function 'fqfsout_write':
src/fqfsout.c:178:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
     fqstatus result;
              ^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfileprep.o src/fqfileprep.c
src/fqfileprep.c: In function 'prepare_filesets':
src/fqfileprep.c:86:15: warning: 'informat_2' may be used uninitialized in this function [-Wmaybe-uninitialized]
             if((outformat_2 == FQ_FORMAT_UNKNOWN) && (options.input_interleaving == FQ_INTERLEAVED)) outformat_2 = info
               ^

How to build without SAM/htslib?

The installation doc is not clear.

What does the validate subcommand check?

Would you be able to elaborate on what criteria the validate subcommand checks?

Missing man page

fqtools is missing a man page, or even one for each sub-command. help2man should get you started.

Tag a release

A tagged release is needed for packaging on brew, conda, etc.

find fastq reads from specific sequences

Hello,

I have a list of sequences, and I would like to find the FASTQ reads that contain them.

Would it be possible to use the fqtools find option to do this?

My list of sequences looks like the following:

GATAAAAAAAAAAAAAAAC
GATAAAAAAAAAAAAAACC
GATAAAAAAAAAAAAAATC
GATAAAAAAAAAAAAAAGC
GATAAAAAAAAAAAAACAC
GATAAAAAAAAAAAAACCC
GATAAAAAAAAAAAAACTC
GATAAAAAAAAAAAAATAC
GATAAAAAAAAAAAAATCC
GATAAAAAAAAAAAAATGC
GATAAAAAAAAAAAAAGAC
GATAAAAAAAAAAAAAGCC
GATAAAAAAAAAAAAAGGC
GATAAAAAAAAAAAACAAC
GATAAAAAAAAAAAACACC
GATAAAAAAAAAAAACCAC
GATAAAAAAAAAAAACCCC
GATAAAAAAAAAAAACCTC
GATAAAAAAAAAAAATAAC
GATAAAAAAAAAAAATCAC
GATAAAAAAAAAAAATTAC
GATAAAAAAAAAAAAGAAC
GATAAAAAAAAAAAAGACC
GATAAAAAAAAAAACAAAC
GATAAAAAAAAAAACCCCC
GATAAAAAAAAAAATAAAC
GATAAAAAAAAAAAGAAAC
GATAAAAAAAAAACAAAAC
.
.
.
.

I have used grep to do this one sequence at a time, but it is taking too long:
grep -A 2 -B 1 "CTCAAAAAAAAACAAAGGA" input.fastq |grep -v "^\-\-$" > output.fastq
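
One way to avoid a separate pass per sequence is to test every read against the whole list in a single traversal. The sketch below is a Python illustration, not a feature of fqtools; the function name `matching_reads` is hypothetical, and it assumes four-line records and exact substring matches on the sequence line only.

```python
import re
from itertools import islice


def matching_reads(handle, patterns):
    """Yield each four-line record whose sequence contains any of the patterns."""
    wanted = [p.strip() for p in patterns if p.strip()]
    if not wanted:
        return
    # One alternation regex, so each read is scanned once for all patterns.
    matcher = re.compile("|".join(re.escape(p) for p in wanted))
    lines = iter(handle)
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break
        if matcher.search(record[1]):  # record[1] is the sequence line
            yield "".join(record)
```

Passing open file handles for input.fastq and for the list of sequences, and writing each yielded record to output.fastq, would reproduce the grep pipeline above in one pass. Unlike grep, it cannot match a header or quality line by accident.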

make test error

make tests
mkdir -p bin
cc -O2 -g -L/bio/linuxbrew/opt/htslib -o./bin/fqtools src/fqprocess_view.o src/fqprocess_head.o src/fqprocess_count.o src/fqprocess_blockview.o src/fqprocess_fasta.o src/fqprocess_basetab.o src/fqprocess_qualtab.o src/fqprocess_lengthtab.o src/fqprocess_type.o src/fqprocess_validate.o src/fqprocess_find.o src/fqprocess_trim.o src/fqprocess_qualmap.o src/fqbuffer.o src/fqfile.o src/fqfsin.o src/fqfsout.o src/fqfileprep.o src/fqparser.o src/fqgenerics.o src/fqhelp.o src/fqtools.o -lz -lhts -lm
cc -O2 -g -L./src -i/bio/linuxbrew/opt/htslib -o./tests/test-fqbuffer fqtools tests/test-fqbuffer.c -lz -lhts -lm
cc: error: fqtools: No such file or directory
cc: error: unrecognized command line option '-i/bio/linuxbrew/opt/htslib'

Is there support for interleaved FASTQ?

The test set indicated in the paper appears to have become inaccessible. Can you confirm whether interleaved FASTQ is processed correctly (i.e. that fqtools checks that adjacent records are from the same read pair)?
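
For reference, the kind of pairing check described in the question could look like the following Python sketch. It says nothing about what fqtools does internally; the names `read_name` and `check_interleaved` are hypothetical, and it assumes that mates share the first whitespace-delimited token of the header, optionally followed by a `/1` or `/2` suffix.

```python
from itertools import islice


def read_name(header):
    """Return the read name with the leading '@' and any /1 or /2 suffix removed."""
    fields = header.lstrip("@").split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def check_interleaved(handle):
    """Return the number of pairs, raising ValueError if adjacent records differ."""
    lines = iter(handle)
    pairs = 0
    while True:
        block = [line.rstrip("\r\n") for line in islice(lines, 8)]
        if not block:
            return pairs
        if len(block) < 8:
            raise ValueError("input ends with an unpaired or truncated record")
        pairs += 1
        first, second = read_name(block[0]), read_name(block[4])
        if first != second:
            raise ValueError(f"pair {pairs}: {first!r} is followed by {second!r}")
```

Header conventions vary between sequencers and tools, so the name normalisation in `read_name` is the part most likely to need adjusting.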

Lowercase option not working

Hi,
First, I would like to thank you for this awesome tool!
I recently started using FASTQ files with mixed uppercase and lowercase bases.
I'm using fqtools 2.3 2019-05-08 (zlib 1.2.8; htslib 1.8).

I created a test file named test_lowercase.fastq, which contains:

@A00740:65:HNY73DSXX:3:1103:7925:25645 1:N:0:GTCCTTGA+TAATCTTA
agccatgcactctgtaatgaagagttcacAATCTTCAACAGAGTAGATATTTCAAGAAGTCAACTGATAGATGAATTGGCAGATAAATTTAACCGGCTTCTTGAAGATTTTCTGCAAGAGGTATATATTATAACTATTACAAGTATTTTGTCAGTTgagcccctctactgcaggaa
+
FFFFFFFFFFFFFFFFFFFF:FFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFF

The command I'm using:
fqtools -l count test_lowercase.fastq
And the error is:
ERROR [line 2]: invalid sequence character (a)

I also tried:
fqtools -l -F count test_lowercase.fastq
Which outputs a different error:

ERROR: unknown command: "test_lowercase.fastq"
usage: fqtools [-hvdramuli] [-b BUFSIZE] [-B BUFSIZE] [-q QUALTYPE] [-f FORMAT] [-F FORMAT] COMMAND [...] [FILE] [FILE]

I should mention that I first used an older version that I already had:
fqtools 2.1 2016-10-04 (zlib 1.2.7; htslib 1.8), which gave the same results.
I then decided to clone the repository again.

Am I using the command incorrectly, or is this a bug?

Thank you

PSA - fqtools is available to install via bioconda

For those running Anaconda/conda to manage your environment, fqtools (and htslib) is available in the bioconda channel.

Easy to install:
conda install -c bioconda fqtools htslib

add a new feature 'split fq'

Hi,

I have some large fq.gz files that take a long time to align, so I tried splitting them into smaller files. It worked, but my script consumes a lot of memory. Would you consider adding a new "split fq" feature to your tools?
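
Until such a feature exists, a streaming split keeps memory use low because only one line is held at a time. The following is a minimal Python sketch, not part of fqtools; the function name `split_fastq_gz`, its parameters, and the output naming scheme are all assumptions made for illustration, and it assumes four-line records.

```python
import gzip


def split_fastq_gz(path, records_per_file, prefix):
    """Split a gzipped FASTQ into numbered gzipped chunks; return their paths."""
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    lines_per_file = 4 * records_per_file  # assumes four-line records
    outputs = []
    out = None
    try:
        with gzip.open(path, "rt") as source:
            for index, line in enumerate(source):
                if index % lines_per_file == 0:
                    # Start a new chunk; close the previous one first.
                    if out is not None:
                        out.close()
                    out_path = f"{prefix}.{len(outputs) + 1:04d}.fastq.gz"
                    out = gzip.open(out_path, "wt")
                    outputs.append(out_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

For example, `split_fastq_gz("reads.fq.gz", 1000000, "reads.part")` would write chunks of one million records each. For paired-end data, both files need the same `records_per_file` so that the chunks stay in step.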

`fqtools type` incorrectly typing fastq

It appears that the current algorithm for fqtools type can get the fastq quality format wrong. Here's a reproducible example:

fastq-dump --split-files ERR719681
# Read 300355 spots for ERR719681
# Written 300355 spots for ERR719681
fqtools type ERR719681_1.fastq
# fastq-illumina
fqtools type ERR719681_2.fastq
# fastq-sanger

If Bio.SeqIO is then used to read these fastq files with the "type" specified by fqtools type, then the following error occurs:

  File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 611, in parse
    for r in i:
  File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/QualityIO.py", line 1255, in FastqIlluminaIterator
    raise ValueError("Invalid character in quality string")

Maybe using the min & max of qual values (the full range) for all sequences in the fastq file would help prevent these mis-calls?
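
The suggestion above can be sketched as follows. This is a Python illustration of the full-range idea, not the algorithm fqtools type uses; the function name `guess_quality_encoding` is hypothetical. It relies on the usual ASCII ranges: Phred+33 (Sanger) qualities start at character code 33, Solexa at 59, and Phred+64 (Illumina 1.3+) at 64. The returned labels follow the Bio.SeqIO format names.

```python
from itertools import islice


def guess_quality_encoding(handle):
    """Guess the quality encoding from the range of codes over the whole file.

    Returns (guess, lowest_code, highest_code), or None if no qualities were seen.
    """
    lines = iter(handle)
    lowest = highest = None
    while True:
        record = list(islice(lines, 4))
        if len(record) < 4:
            break
        codes = [ord(char) for char in record[3].rstrip("\r\n")]
        if not codes:
            continue
        lowest = min(codes) if lowest is None else min(lowest, min(codes))
        highest = max(codes) if highest is None else max(highest, max(codes))
    if lowest is None:
        return None
    if lowest < 59:
        guess = "fastq-sanger"  # codes below 59 only occur with a +33 offset
    elif lowest < 64:
        guess = "fastq-solexa"  # codes 59..63 rule out a plain +64 offset
    else:
        guess = "fastq-illumina"  # ambiguous: could be uniformly high Sanger
    return guess, lowest, highest
```

A file whose qualities never drop below code 64 remains ambiguous under any range-based rule, so returning the observed range alongside the guess lets the caller judge how much to trust it.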

htslib failed to install but this fixed it

Hi, during the "./configure" command I got the following error: "config.status: error: cannot find input file: `config.h.in' htslib".
I fixed this by also running "autoheader" prior to configure.

Does not build per instructions, missing sam.h file.

Following the instructions on the main page:
git clone https://github.com/alastair-droop/fqtools
cd fqtools/
git clone https://github.com/samtools/htslib
cd htslib/
autoheader
autoconf
./configure
make
make install
cd ..
make

Fails at the last step with:
In file included from src/fqprocess_view.c:14:0:
src/fqheader.h:22:10: fatal error: sam.h: No such file or directory
 #include <sam.h>

I do not find a sam.h definition in htslib. I do, however, find a sam.h definition within the separate samtools project. If I modify the Makefile to use that location as well, I receive an error about too many arguments:

In file included from ../samtools-1.10/sam.h:29:0,
                 from src/fqheader.h:22,
                 from src/fqfile.c:14:
src/fqfile.c: In function ‘fqfile_open_read_file_bam’:
../samtools-1.10/bam.h:209:22: error: too many arguments to function ‘samtools_sam_open’
 #define sam_open samtools_sam_open

Could this be because samtools, htslib, etc. are now separately maintained projects?

Any support for fq fix?

Hi,

I am using fqtools to validate FASTQ files for our pipeline, and I have found fqtools validate very useful and fast. Is there any support for automatic error fixing (at least for some types of errors, e.g. unpaired reads)? Thanks a lot!

PS: I had a problem linking htslib, so my solution was to install fqtools with bioconda, which is completely automatic.

-hh1985

Any binary available?

I could not build it easily.