alastair-droop / fqtools

An efficient FASTQ manipulation suite

License: GNU General Public License v3.0

Makefile 0.57% C 83.89% Shell 0.58% Python 13.41% Objective-C 0.64% C++ 0.91%
fastq fastq-files next-generation-sequencing bioinformatics

fqtools's People

Contributors

alastair-droop, kloetzl

fqtools's Issues

an empty sequence line still passes

The second entry of this FASTQ file still passes the validator. Is this intentional?

@LD5V2:07687:11026
CGGGGGTCTTAGCTTTGGCTCTCCTTGCAAAGTTATTTCTAGTTAATTCATTATGCAGAAGGTATAGGGGTTAGTCCTTGCTTATATTATGCTTGGTTATAATTTTTCATCTTTCCCTTGCGGTACTATATCTATTGCGACCA
+
35977*6772999990959:;:<6<.53::19;39891845..*-36159<6;::::;;6;6>:95333)52577957*...*/6774999894726268858=>=-99:99:2;>3>5:::5:;;<;;::;7/,*,3***+,
@LD5V2:07687:11043

+

@LD5V2:07688:11020
AAAATTTAACACCCATAGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTTCAAGCCCAACACCCACTACCTAAAAAATCCCAAACATATAACTGAACT
+
:::0:?4>7==<<4<;;;=<=7;78888.9;@><5;;4:4:4:4:;;2;<<<6<<6<<=4>6;;<=2;;;;;5565533'5::/;<2<?ABD?7<<<;5<=
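
For anyone who needs to catch such records in the meantime, here is a minimal sketch of an external check. It is not how fqtools validates files; it assumes strict four-line records without line wrapping, and the function name find_empty_sequences is made up for this example.

```python
import io


def find_empty_sequences(handle):
    """Yield (record_number, header) for records whose sequence line is empty.

    Assumption: strict four-line FASTQ records (no wrapped sequence lines).
    """
    record_number = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        handle.readline()  # quality line, ignored here
        record_number += 1
        if sequence == "":
            yield record_number, header.rstrip("\n")


# Small in-memory example with an empty second record.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\n\n+\n\n"
    "@r3\nGG\n+\nII\n"
)
empty_records = list(find_empty_sequences(example))
print(empty_records)
```

Whether an empty record should count as invalid is a policy decision for the validator; the sketch only reports where such records occur.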

Add flag to fqtools validate to output sequence count

Would there be interest in adding a flag to fqtools validate that optionally outputs the total sequence count when all records are valid? I can work on a pull request, but would like some direction on whether you would prefer the change to go in fqtools validate, in fqtools count (with an added option to validate), or neither 🥲.

The motivation is that I need to count sequences and validate records as fast as possible. It seems to me like this information could be available from fqtools validate since we have to traverse the file regardless.
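
To illustrate the idea, here is a rough sketch of counting while validating in a single pass. It is not the fqtools code; the function name validate_and_count, the accepted alphabet (ACGTN) and the specific checks are assumptions chosen for this example.

```python
import io

# Assumption for this sketch: only uppercase ACGTN are accepted.
VALID_BASES = set("ACGTN")


def validate_and_count(handle):
    """Return the number of records; raise ValueError at the first invalid one."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            return count  # clean end of input: every record was valid
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        number = count + 1
        if not header.startswith("@"):
            raise ValueError(f"record {number}: header does not start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {number}: missing '+' separator line")
        if not set(sequence) <= VALID_BASES:
            raise ValueError(f"record {number}: invalid sequence character")
        if len(sequence) != len(quality):
            raise ValueError(f"record {number}: sequence/quality length mismatch")
        count = number


total = validate_and_count(io.StringIO("@a\nACGT\n+\nIIII\n@b\nNN\n+\n!!\n"))
print(total)
```

The count is a by-product of the traversal, so reporting it adds no extra pass over the file.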

Compiler warnings

gcc --version
gcc (Homebrew gcc 5.3.0) 5.3.0
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1324:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzwrite OF((gzFile file,
                     ^
src/fqfile.c: In function 'fqfile_eof_file_fastq_compressed':
src/fqfile.c:285:18: warning: passing argument 1 of 'gzeof' from incompatible pointer type [-Wincompatible-pointer-types]
     return gzeof((gzFile*)(((fqfile*)f)->handle));
                  ^
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1458:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzeof OF((gzFile file));
                     ^
src/fqfile.c: In function 'fqfile_flush_file_fastq_compressed':
src/fqfile.c:302:13: warning: passing argument 1 of 'gzflush' from incompatible pointer type [-Wincompatible-pointer-types]
     gzflush((gzFile*)(((fqfile*)f)->handle), 0);
             ^
In file included from src/fqheader.h:21:0,
                 from src/fqfile.c:14:
/usr/include/zlib.h:1395:21: note: expected 'gzFile {aka struct gzFile_s *}' but argument is of type 'struct gzFile_s *'
 ZEXTERN int ZEXPORT gzflush OF((gzFile file, int flush));
                     ^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfsin.o src/fqfsin.c
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfsout.o src/fqfsout.c
src/fqfsout.c: In function 'fqfsout_writechar':
src/fqfsout.c:165:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
     fqstatus result;
              ^
src/fqfsout.c: In function 'fqfsout_write':
src/fqfsout.c:178:14: warning: variable 'result' set but not used [-Wunused-but-set-variable]
     fqstatus result;
              ^
cc -O2 -g -Wall -Wextra -Wno-unused-parameter -I/bio/linuxbrew/opt/htslib/include  -c -o src/fqfileprep.o src/fqfileprep.c
src/fqfileprep.c: In function 'prepare_filesets':
src/fqfileprep.c:86:15: warning: 'informat_2' may be used uninitialized in this function [-Wmaybe-uninitialized]
             if((outformat_2 == FQ_FORMAT_UNKNOWN) && (options.input_interleaving == FQ_INTERLEAVED)) outformat_2 = info
               ^

find fastq reads from specific sequences

Hello,

I have lists of sequences, and I would like to find the FASTQ reads that contain them.

Would it be possible to use the fqtools find option to do this?

My list of sequences looks like the following:

GATAAAAAAAAAAAAAAAC
GATAAAAAAAAAAAAAACC
GATAAAAAAAAAAAAAATC
GATAAAAAAAAAAAAAAGC
GATAAAAAAAAAAAAACAC
GATAAAAAAAAAAAAACCC
GATAAAAAAAAAAAAACTC
GATAAAAAAAAAAAAATAC
GATAAAAAAAAAAAAATCC
GATAAAAAAAAAAAAATGC
GATAAAAAAAAAAAAAGAC
GATAAAAAAAAAAAAAGCC
GATAAAAAAAAAAAAAGGC
GATAAAAAAAAAAAACAAC
GATAAAAAAAAAAAACACC
GATAAAAAAAAAAAACCAC
GATAAAAAAAAAAAACCCC
GATAAAAAAAAAAAACCTC
GATAAAAAAAAAAAATAAC
GATAAAAAAAAAAAATCAC
GATAAAAAAAAAAAATTAC
GATAAAAAAAAAAAAGAAC
GATAAAAAAAAAAAAGACC
GATAAAAAAAAAAACAAAC
GATAAAAAAAAAAACCCCC
GATAAAAAAAAAAATAAAC
GATAAAAAAAAAAAGAAAC
GATAAAAAAAAAACAAAAC
.
.
.
.

I have used grep to do this one sequence at a time, but it takes too long:
grep -A 2 -B 1 "CTCAAAAAAAAACAAAGGA" input.fastq |grep -v "^\-\-$" > output.fastq
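
One way to avoid running one search per sequence is to load the whole list into a set and scan each read once. The sketch below shows that approach; it is independent of fqtools find, assumes four-line records and exact (forward-strand) matches, and all function names are made up for this example.

```python
import io


def load_patterns(lines):
    """Group non-empty patterns by length so each length needs one window scan."""
    by_length = {}
    for line in lines:
        pattern = line.strip()
        if pattern:
            by_length.setdefault(len(pattern), set()).add(pattern)
    return by_length


def contains_any(sequence, by_length):
    """True if any window of the sequence equals one of the patterns."""
    for k, patterns in by_length.items():
        for start in range(len(sequence) - k + 1):
            if sequence[start:start + k] in patterns:
                return True
    return False


def filter_fastq(handle, by_length):
    """Yield whole records (as one string) whose sequence contains a pattern."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break
        if contains_any(record[1].rstrip("\n"), by_length):
            yield "".join(record)


# In-memory example: the first read embeds a pattern, the second does not.
pattern = "GATAAAAAAAAAAAAAAAC"
patterns = load_patterns([pattern, "GATAAAAAAAAAAAAAACC"])
read = "TT" + pattern + "TT"
fastq = io.StringIO(
    f"@r1\n{read}\n+\n{'I' * len(read)}\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
)
hits = list(filter_fastq(fastq, patterns))
print("".join(hits), end="")
```

Each read is scanned once per distinct pattern length, so the cost no longer grows with the number of patterns in the list.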

make test error

make tests
mkdir -p bin
cc -O2 -g -L/bio/linuxbrew/opt/htslib -o./bin/fqtools src/fqprocess_view.o src/fqprocess_head.o src/fqprocess_count.o src/fqprocess_blockview.o src/fqprocess_fasta.o src/fqprocess_basetab.o src/fqprocess_qualtab.o src/fqprocess_lengthtab.o src/fqprocess_type.o src/fqprocess_validate.o src/fqprocess_find.o src/fqprocess_trim.o src/fqprocess_qualmap.o src/fqbuffer.o src/fqfile.o src/fqfsin.o src/fqfsout.o src/fqfileprep.o src/fqparser.o src/fqgenerics.o src/fqhelp.o src/fqtools.o -lz -lhts -lm
cc -O2 -g -L./src -i/bio/linuxbrew/opt/htslib -o./tests/test-fqbuffer fqtools tests/test-fqbuffer.c -lz -lhts -lm
cc: error: fqtools: No such file or directory
cc: error: unrecognized command line option '-i/bio/linuxbrew/opt/htslib'

Is there support for interleaved FASTQ?

The test set indicated in the paper appears to have become inaccessible. Can you confirm whether interleaved FASTQ is processed correctly (i.e. that adjacent records are checked to be from the same read pair)?
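
For reference, the kind of check meant here can be sketched as follows. This is not taken from fqtools; it assumes four-line records and read names that differ only by an optional trailing /1 or /2, and the function names are hypothetical.

```python
import io


def base_name(header):
    """Strip '@', any comment after whitespace, and a trailing /1 or /2."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def check_interleaved(handle):
    """Return (pair_number, name1, name2) for each mismatched adjacent pair.

    A trailing record without a mate is reported with name2 set to None.
    """
    mismatches = []
    pair_number = 0
    while True:
        first = [handle.readline() for _ in range(4)]
        if not first[0]:
            break
        second = [handle.readline() for _ in range(4)]
        pair_number += 1
        if not second[0]:
            mismatches.append((pair_number, base_name(first[0]), None))
            break
        name1, name2 = base_name(first[0]), base_name(second[0])
        if name1 != name2:
            mismatches.append((pair_number, name1, name2))
    return mismatches


good = io.StringIO("@r1/1\nA\n+\nI\n@r1/2\nC\n+\nI\n")
bad = io.StringIO("@r1/1\nA\n+\nI\n@r2/2\nC\n+\nI\n")
good_result = check_interleaved(good)
bad_result = check_interleaved(bad)
print(good_result, bad_result)
```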

Lowercase option not working

Hi,
First, I would like to thank you for this awesome tool!
I recently started using fastq files with mixed uppercase and lowercase.
I'm using fqtools 2.3 2019-05-08 (zlib 1.2.8; htslib 1.8).

I created a test file named test_lowercase.fastq, which contains:

@A00740:65:HNY73DSXX:3:1103:7925:25645 1:N:0:GTCCTTGA+TAATCTTA
agccatgcactctgtaatgaagagttcacAATCTTCAACAGAGTAGATATTTCAAGAAGTCAACTGATAGATGAATTGGCAGATAAATTTAACCGGCTTCTTGAAGATTTTCTGCAAGAGGTATATATTATAACTATTACAAGTATTTTGTCAGTTgagcccctctactgcaggaa
+
FFFFFFFFFFFFFFFFFFFF:FFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFFFFFFFFFFFFFFFFFFFFFFF

The command I'm using:
fqtools -l count test_lowercase.fastq
And the error is:
ERROR [line 2]: invalid sequence character (a)

I also tried:
fqtools -l -F count test_lowercase.fastq
Which outputs a different error:

ERROR: unknown command: "test_lowercase.fastq"
usage: fqtools [-hvdramuli] [-b BUFSIZE] [-B BUFSIZE] [-q QUALTYPE] [-f FORMAT] [-F FORMAT] COMMAND [...] [FILE] [FILE]

I should add that I first used an older version that I already had,
fqtools 2.1 2016-10-04 (zlib 1.2.7; htslib 1.8), which gave the same results.
I then decided to clone the repository again.

Am I using the command incorrectly, or is this a bug?

Thank you
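
Until the option is sorted out, a possible workaround is to uppercase the sequence lines before passing the file to fqtools. The sketch below assumes four-line records and a made-up function name; note that it discards any soft-masking information carried by the lowercase bases.

```python
import io


def uppercase_sequences(in_handle, out_handle):
    """Copy a four-line FASTQ stream, uppercasing only the sequence lines."""
    for line_number, line in enumerate(in_handle):
        # Line index 1 of every group of four is the sequence line.
        out_handle.write(line.upper() if line_number % 4 == 1 else line)


source = io.StringIO("@r1\nacgtACGTn\n+\nFFFFJJJJF\n")
result = io.StringIO()
uppercase_sequences(source, result)
print(result.getvalue(), end="")
```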

add a new feature 'split fq'

Hi,

I have some large fq.gz files that take a long time to align, so I tried splitting them into smaller files. It worked, but my script consumes a lot of memory. Would you consider adding a new "split fq" feature to your tools?
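
Splitting does not have to be memory-hungry if the input is streamed line by line. Here is a minimal sketch of that approach, not an fqtools feature; it assumes four-line records, and the function name, the chunk naming scheme and the records_per_chunk parameter are all invented for this example.

```python
import gzip
import os
import tempfile


def split_fastq(path, records_per_chunk, out_prefix):
    """Stream a (possibly gzipped) FASTQ file into numbered gzip chunks.

    Only the current line is held in memory, so usage stays constant
    regardless of the input size. Returns the list of chunk paths.
    """
    opener = gzip.open if path.endswith(".gz") else open
    lines_per_chunk = 4 * records_per_chunk
    out_paths = []
    out = None
    try:
        with opener(path, "rt") as handle:
            for line_number, line in enumerate(handle):
                if line_number % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    out_path = f"{out_prefix}.{len(out_paths):04d}.fq.gz"
                    out = gzip.open(out_path, "wt")
                    out_paths.append(out_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return out_paths


# Demo: five records split into chunks of two records each.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "reads.fq.gz")
    with gzip.open(source, "wt") as fh:
        for i in range(5):
            fh.write(f"@r{i}\nACGT\n+\nIIII\n")
    chunks = split_fastq(source, 2, os.path.join(tmp, "part"))
    sizes = []
    for chunk in chunks:
        with gzip.open(chunk, "rt") as fh:
            sizes.append(sum(1 for _ in fh) // 4)
print(sizes)
```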

`fqtools type` incorrectly typing fastq

It appears that the current algorithm for fqtools type can get the fastq quality format wrong. Here's a reproducible example:

fastq-dump --split-files ERR719681
# Read 300355 spots for ERR719681
# Written 300355 spots for ERR719681
fqtools type ERR719681_1.fastq
# fastq-illumina
fqtools type ERR719681_2.fastq
# fastq-sanger

If Bio.SeqIO is then used to read these fastq files with the "type" specified by fqtools type, then the following error occurs:

  File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 611, in parse
    for r in i:
  File "/ebio/abt3_projects/software/dev/llmgqc/.snakemake/conda/a289c738/lib/python3.6/site-packages/Bio/SeqIO/QualityIO.py", line 1255, in FastqIlluminaIterator
    raise ValueError("Invalid character in quality string")

Maybe using the min & max of qual values (the full range) for all sequences in the fastq file would help prevent these mis-calls?
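
A rough sketch of that min/max idea is below. It is a heuristic and not the algorithm fqtools uses; the thresholds follow the usual conventions (offset 33 for Sanger, offset 64 for Illumina 1.3+, Solexa scores starting at ';'), the returned labels borrow the Bio.SeqIO format names, and "ambiguous" is an extra label introduced here for ranges that fit more than one encoding.

```python
import io


def quality_range(handle):
    """Return (lowest, highest) character code over all quality lines."""
    lowest, highest = None, None
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:
            continue  # only every fourth line holds quality characters
        codes = [ord(char) for char in line.rstrip("\n")]
        if not codes:
            continue
        lowest = min(codes) if lowest is None else min(lowest, min(codes))
        highest = max(codes) if highest is None else max(highest, max(codes))
    return lowest, highest


def guess_encoding(lowest, highest):
    """Heuristic guess of the quality encoding from the full observed range."""
    if lowest is None:
        return "unknown"
    if lowest < 59:
        return "fastq-sanger"    # characters below ';' only occur with offset 33
    if highest <= 74:
        return "ambiguous"       # narrow range fits offset 33 as well as 64
    if lowest < 64:
        return "fastq-solexa"    # ';' to '?' would be negative Solexa scores
    return "fastq-illumina"


lowest, highest = quality_range(io.StringIO("@a\nACGT\n+\n!5?I\n"))
print(lowest, highest, guess_encoding(lowest, highest))
```

Because the range is taken over the whole file, a single low-quality base anywhere is enough to rule out the offset-64 encodings.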

htslib failed to install but this fixed it

Hi, during the "./configure" command I got the following error: "config.status: error: cannot find input file: `config.h.in' htslib"
I fixed this by also running "autoheader" before configure.

Does not build per instructions, missing sam.h file.

Following the instructions on the main page:
git clone https://github.com/alastair-droop/fqtools
cd fqtools/
git clone https://github.com/samtools/htslib
cd htslib/
autoheader
autoconf
./configure
make
make install
cd ..
make

Fails at the last step with:
In file included from src/fqprocess_view.c:14:0:
src/fqheader.h:22:10: fatal error: sam.h: No such file or directory
 #include <sam.h>

I do not find a sam.h definition in htslib. I do, however, find a sam.h definition within the separate samtools project. But if I modify the Makefile to use that location as well, I receive an error about too many arguments:

In file included from ../samtools-1.10/sam.h:29:0,
                 from src/fqheader.h:22,
                 from src/fqfile.c:14:
src/fqfile.c: In function ‘fqfile_open_read_file_bam’:
../samtools-1.10/bam.h:209:22: error: too many arguments to function ‘samtools_sam_open’
 #define sam_open samtools_sam_open

Could this be because samtools, htslib, etc. are now separately maintained projects?

Any support for fq fix?

Hi,

I am using fqtools to validate FASTQ files for our pipeline. I find fqtools validate very useful and fast. Is there any support for automatic error fixing (at least for some types of errors, e.g. unpaired reads)? Thanks a lot!

PS: I had a problem linking htslib, so my solution was to install fqtools with bioconda, which is completely automatic.

-hh1985
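
As an illustration of what fixing unpaired reads could look like outside fqtools, here is a small sketch that keeps only reads present in both files of a pair. It assumes four-line records, seekable inputs, read names that are unique within each file and differ only by a trailing /1 or /2, and the same read order in both files; all function names are hypothetical.

```python
import io


def read_records(handle):
    """Yield four-line FASTQ records as lists of lines."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record


def read_name(header):
    """Strip '@', any comment after whitespace, and a trailing /1 or /2."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def drop_unpaired(in1, in2, out1, out2):
    """Write only reads whose name occurs in both inputs; return the pair count.

    Both inputs are read twice, so they must be seekable. Only the read
    names are held in memory, not the records themselves.
    """
    shared = {read_name(record[0]) for record in read_records(in1)}
    shared &= {read_name(record[0]) for record in read_records(in2)}
    for source, sink in ((in1, out1), (in2, out2)):
        source.seek(0)
        for record in read_records(source):
            if read_name(record[0]) in shared:
                sink.write("".join(record))
    return len(shared)


reads1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGG\n+\nII\n")
reads2 = io.StringIO("@b/2\nTT\n+\nII\n@c/2\nCC\n+\nII\n")
fixed1, fixed2 = io.StringIO(), io.StringIO()
pairs = drop_unpaired(reads1, reads2, fixed1, fixed2)
print(pairs)
```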
