
nunofonseca / fastq_utils


Validation and manipulation of FASTQ files, scRNA-seq barcode pre-processing and UMI quantification.

License: GNU General Public License v3.0

Makefile 1.56% C 75.56% Shell 22.74% Dockerfile 0.15%
sequencing fastq validation ngs high-throughput-sequencing fastq-filterpair umi single-cell drop-seq 10x

fastq_utils's People

Contributors

nunofonseca, pinin4fjords

fastq_utils's Issues

Duplicated Arguments

Hi Nuno,
fastq2bam has a duplicated argument, '-s'.
It is used both for the schema and for the sample barcode offset, which was added in response to Anja's issue #1.

Thanks.

Fatal Error Issue

I get a fatal error when trying to install on my local machine or on a remote server:

-bash-4.2$ make && make install
make -C src clean
make[1]: Entering directory `/home/pmc_research/jbakerhernandez/fastq_utils/src'
rm -f *.o fastq_truncate fastq_filterpair  fastq_filter_n fastq_num_reads fastq_not_empty fastq_pre_barcodes fastq_trim_poly_at fastq_split_interleaved fastq_tests bam_add_tags bam_umi_count bam2fastq fastq_info *~
make[1]: Leaving directory `/home/pmc_research/jbakerhernandez/fastq_utils/src'
make -C src
make[1]: Entering directory `/home/pmc_research/jbakerhernandez/fastq_utils/src'
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_truncate.c
gcc -O3 -Wunused-result  -Wall -c hash.c
gcc -O3 -Wunused-result  -Wall -I ../zlib-1.2.11 -lz -c fastq.c
gcc  -O3 -Wunused-result  -Wall fastq_truncate.o hash.o fastq.o -lz -o fastq_truncate
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_filterpair.c
gcc  -O3 -Wunused-result  -Wall hash.o fastq_filterpair.o fastq.o -lz -o fastq_filterpair
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_filter_n.c
gcc  -O3 -Wunused-result  -Wall fastq_filter_n.o fastq.o hash.o -lz -o fastq_filter_n
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_num_reads.c
gcc  -O3 -Wunused-result  -Wall fastq_num_reads.o hash.o fastq.o -lz -o fastq_num_reads
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_not_empty.c
gcc  -O3 -Wunused-result  -Wall fastq_not_empty.o hash.o fastq.o -lz -o fastq_not_empty
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_pre_barcodes.c
fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory
 #include "bam.h"
                 ^
compilation terminated.
make[1]: *** [fastq_pre_barcodes.o] Error 1
make[1]: Leaving directory `/home//username/fastq_utils/src'
make: *** [all] Error 2

Add support for Nanopore RNA-seq data to FASTQ validator

Hi Nuno (hope you are doing well!),

We have a new request for you. People are submitting Oxford Nanopore RNA-sequencing data in FASTQ format, and these files naturally contain 'U' bases instead of 'T's. The fastq validator currently rejects them because it treats 'U' as an unknown character (invalid character 'U' (hex. code:'55'), expected ACGTacgt0123nN.).
ENA is happy to accept such sequences (example file: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR431/ERR4319899/PCreads.fastq.gz) and thus, we would like to relax the fastq validation rules and have it accept files with sequences containing U characters.
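
A minimal sketch of the relaxed rule, assuming the accepted alphabet is the one quoted in the error message (ACGTacgt0123nN plus '.') with U and u added. The function name is_valid_seq is hypothetical, and this is a shell illustration of the check, not the validator's actual C code.

```shell
# Hypothetical helper: succeed only if every character of the sequence
# belongs to the relaxed alphabet (the quoted set plus U/u).
is_valid_seq() {
    case $1 in
        # Reject empty input and anything containing a character outside the set.
        '' | *[!ACGTUacgtu0123nN.]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: an RNA read passes, a read with an unexpected character does not.
is_valid_seq "ACGUUGCA" && echo "ACGUUGCA: ok"
is_valid_seq "ACGXUGCA" || echo "ACGXUGCA: invalid character"
```

Whether 'U' should be accepted for every file or only when an explicit RNA option is given is a design choice left open here.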

validate_fastq.sh doesn't catch bzip2 decompression failure

Hi Nuno,

A minor bug report. Once in a blue moon we have bzip2 files running through the validator, and the time has come again. ;)

I've noticed that there is a check that is supposed to end the program when the file fails bzip2 decompression:

bunzip2 -c $f | gzip -c > $tmp_file
if [ $? -ne 0 ]; then
    echo "ERROR: $f: error uncompressing bzip2 file"
    exit 2
fi

but it didn't run. The error we got happened much later when an empty tmp gzip file was generated/checked. So instead of "error uncompressing bzip2 file" we got "No reads found in xxx.tmp.gz" for the broken bzip2 file.

I've tested the bunzip2 | gzip command with the broken file and it did give exit code 0, presumably because $? only reports the status of the last command in the pipeline (gzip). So maybe this is something that can be fixed at some point.

Best wishes,
Anja
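
One possible way to catch the failure, sketched under assumptions: test the archive with bzip2 -t before converting, so a broken input is reported before any output file is written. The function name bz2_to_gz is hypothetical and this is not the actual code of validate_fastq.sh.

```shell
# Hypothetical helper: recompress a bzip2 file as gzip.
# $1 = bzip2 input, $2 = gzip output. Returns 2 if the input fails the integrity test.
bz2_to_gz() {
    # bzip2 -t checks the archive without writing decompressed data.
    if ! bzip2 -t "$1" 2>/dev/null; then
        echo "ERROR: $1: error uncompressing bzip2 file" >&2
        return 2
    fi
    # The status of this pipeline is gzip's, which is why the test above is needed.
    bunzip2 -c "$1" | gzip -c > "$2"
}
```

The input is read twice with this approach. In bash, `set -o pipefail` or checking `${PIPESTATUS[0]}` right after the pipeline would avoid the second pass.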

Conda channels for install

Hi! Commenting here in case someone runs into the same thing.
In my environment, installing from the bioconda channel alone failed with errors related to libgcc-ng -> __glibc.

For me it worked like this:

conda install -c conda-forge -c bioconda fastq_utils

Schema name for drop-seq in fastq2bam

Hi Nuno,

We've run the fastq2bam converter for a drop-seq experiment and noticed that the schema name differs between the usage string ("drop-seq") and the schema name that is actually expected ("dropseq").
Can you change that to "drop-seq" everywhere to be consistent with our nomenclature?

Thanks and best wishes,
Anja

bam2fastq doesn't give back the original fastq files

Hi!
I'm processing publicly available data in the form of BAM files that originate from the fastq2bam command.
I'm thus trying to retrieve the original FASTQ files using the bam2fastq command.
I'm not very familiar with what that command does, but it seems that the FASTQ files output by bam2fastq have shorter reads than the ones given to fastq2bam when they were converted to BAM. They are thus too short for processing with cellranger.

I might be using the package incorrectly, but I believed that fastq2bam is used for lossless conversion of FASTQ files to BAM files, and bam2fastq to retrieve the FASTQ files afterwards.

Thank you in advance for your help.

Best,

Eloi

Fatal error when attempting 'make'

After downloading and attempting the make step I get the following error:
fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory

I first tried with latest, and then tried with the previous minor version.

Full output

mshadbolt@ip-172-31-71-222:~/software/fastq_utils-0.23.0$ make
make -C src clean
make[1]: Entering directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
rm -f *.o fastq_truncate fastq_filterpair  fastq_filter_n fastq_num_reads fastq_not_empty fastq_pre_barcodes fastq_trim_poly_at fastq_split_interleaved fastq_tests bam_add_tags bam_umi_count bam2fastq fastq_info *~
make[1]: Leaving directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
make -C src
make[1]: Entering directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_truncate.c
gcc -O3 -Wunused-result  -Wall -c hash.c
gcc -O3 -Wunused-result  -Wall -I ../zlib-1.2.11 -lz -c fastq.c 
gcc  -O3 -Wunused-result  -Wall fastq_truncate.o hash.o fastq.o -lz -o fastq_truncate  
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_filterpair.c
gcc  -O3 -Wunused-result  -Wall hash.o fastq_filterpair.o fastq.o -lz -o fastq_filterpair
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_filter_n.c
gcc  -O3 -Wunused-result  -Wall fastq_filter_n.o fastq.o hash.o -lz -o fastq_filter_n 
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_num_reads.c
gcc  -O3 -Wunused-result  -Wall fastq_num_reads.o hash.o fastq.o -lz -o fastq_num_reads 
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_not_empty.c
gcc  -O3 -Wunused-result  -Wall fastq_not_empty.o hash.o fastq.o -lz -o fastq_not_empty 
gcc -I ../samtools-0.1.19 -O3 -Wunused-result  -Wall   -c fastq_pre_barcodes.c
fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory
compilation terminated.
Makefile:106: recipe for target 'fastq_pre_barcodes.o' failed
make[1]: *** [fastq_pre_barcodes.o] Error 1
make[1]: Leaving directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
Makefile:4: recipe for target 'all' failed
make: *** [all] Error 2

PacBio quality encoding range

Hi Nuno,

Long time no see :) Hope you are well. I've got a new issue for you:

We've now seen a few cases where FASTQ files from PacBio fail fastq validation with the error
Unable to determine quality encoding - unknown range [34,123].
The actual numbers in the range vary from file to file.

Can this be added as allowed?
Unfortunately, I haven't found out what the maximum value or the expected range should be.
Let me know if you need more details.

Anja
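
For reference, the Sanger FASTQ variant (Phred+33) spans ASCII 33 to 126, so a range of [34,123] would correspond to Phred 1 to 90 if an offset of 33 is assumed. The sketch below shows how such a range can be computed and checked; the function names qual_range and fits_phred33 are hypothetical, and this is not how the validator itself is implemented.

```shell
# Hypothetical helper: print the smallest and largest ASCII codes in a quality string.
qual_range() {
    # od -v prints every byte, with no '*' shorthand for repeated lines.
    printf '%s' "$1" | od -An -v -tu1 | awk '
        { for (i = 1; i <= NF; i++) {
              v = $i + 0
              if (n++ == 0 || v < min) min = v
              if (v > max) max = v
          } }
        END { if (n) print min, max }'
}

# Hypothetical helper: succeed if the whole string fits ASCII 33..126 (Phred+33).
fits_phred33() {
    set -- $(qual_range "$1")
    [ $# -eq 2 ] && [ "$1" -ge 33 ] && [ "$2" -le 126 ]
}

# Example with the two extreme characters from the reported range.
qual_range '"{'
```

Accepting anything inside 33..126 as Phred+33 is an assumption; it cannot distinguish such files from other encodings whose ranges overlap.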

fastq2bam - flexibility with barcode/read lengths

Hi Nuno,

We discovered a few cases of 10x experiments, where the R1 and R2 reads don't have the expected lengths. While it isn't a problem if the reads are longer (I guess the extra bases are simply disregarded during the conversion), the fastq2bam conversion fails if the reads are shorter than expected.

Here is one example of a 10X2 library sequenced with non-standard parameters:
Cell barcode offset = 0
Cell barcode size = 16
UMI barcode offset = 16
UMI barcode size = 9

Do you think it will be possible to pass these numbers as parameters if they are non-default? Or can you think of another workaround to make the script generate BAM files for this library?

Many thanks for looking into this.
Anja
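
To illustrate what passing offsets and sizes as parameters would mean, here is a small sketch using the numbers above. The helper extract_field is hypothetical, offsets are taken as 0-based, and the read sequence is made up; this is not fastq2bam's implementation.

```shell
# Hypothetical helper: cut a barcode out of a read.
# $1 = read sequence, $2 = 0-based offset, $3 = size.
extract_field() {
    # Fail instead of returning a truncated barcode when the read is too short.
    if [ "${#1}" -lt $(( $2 + $3 )) ]; then
        echo "ERROR: read of length ${#1} is shorter than $(( $2 + $3 ))" >&2
        return 1
    fi
    # cut counts characters from 1, hence the +1 on the start position.
    printf '%s\n' "$1" | cut -c "$(( $2 + 1 ))-$(( $2 + $3 ))"
}

# Made-up R1 read: 16 bp cell barcode, 9 bp UMI, 3 extra bases.
read_seq="AAAACCCCGGGGTTTTACGTACGTAGGG"
extract_field "$read_seq" 0 16    # cell barcode
extract_field "$read_seq" 16 9    # UMI
```

Bases beyond offset + size are simply ignored here, which matches the guess above that longer reads are not a problem.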

Add 10xV3 schema

Hi Nuno,

10x has now released version 3 and we've been getting a few submissions already with the new library schema.
Could you add the 10xV3 schema to the converter? I think all that has changed from our point of view is that the UMI barcode is now 12 bp instead of 10 bp (compared to 10xV2).

Thanks and best wishes,
Anja

I've made you a Bioconda package :-)

Hi Nuno,

I've made a Bioconda package for fastq_utils. What this means is that you should soon be able to install fastq_utils simply like 'conda install fastq_utils' (assuming you have Bioconda installed). This is nice because it allows you to provide the associated dependencies more easily.

You can see the PR by which this was done here: bioconda/bioconda-recipes#13473. It's passing tests (after much work), will merge soon and shortly thereafter the install command will work.

Thanks,

Jon
