nunofonseca / fastq_utils
Validation and manipulation of FASTQ files, scRNA-seq barcode pre-processing and UMI quantification.
License: GNU General Public License v3.0
Hi Nuno,
fastq2bam has a duplicated argument, '-s': it is used both for the schema and for the sample barcode offset, which was added in response to Anja's issue #1.
Thanks.
I get a fatal error when trying to install on either my local machine or a remote server:
-bash-4.2$ make && make install
make -C src clean
make[1]: Entering directory `/home/pmc_research/jbakerhernandez/fastq_utils/src'
rm -f *.o fastq_truncate fastq_filterpair fastq_filter_n fastq_num_reads fastq_not_empty fastq_pre_barcodes fastq_trim_poly_at fastq_split_interleaved fastq_tests bam_add_tags bam_umi_count bam2fastq fastq_info *~
make[1]: Leaving directory `/home/pmc_research/jbakerhernandez/fastq_utils/src'
make -C src
make[1]: Entering directory `/home/pmc_research/jbakerhernandez/fastq_utils/src'
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_truncate.c
gcc -O3 -Wunused-result -Wall -c hash.c
gcc -O3 -Wunused-result -Wall -I ../zlib-1.2.11 -lz -c fastq.c
gcc -O3 -Wunused-result -Wall fastq_truncate.o hash.o fastq.o -lz -o fastq_truncate
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_filterpair.c
gcc -O3 -Wunused-result -Wall hash.o fastq_filterpair.o fastq.o -lz -o fastq_filterpair
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_filter_n.c
gcc -O3 -Wunused-result -Wall fastq_filter_n.o fastq.o hash.o -lz -o fastq_filter_n
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_num_reads.c
gcc -O3 -Wunused-result -Wall fastq_num_reads.o hash.o fastq.o -lz -o fastq_num_reads
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_not_empty.c
gcc -O3 -Wunused-result -Wall fastq_not_empty.o hash.o fastq.o -lz -o fastq_not_empty
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_pre_barcodes.c
fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory
#include "bam.h"
^
compilation terminated.
make[1]: *** [fastq_pre_barcodes.o] Error 1
make[1]: Leaving directory `/home//username/fastq_utils/src'
make: *** [all] Error 2
Hi Nuno (hope you are doing well!),
We have a new request for you. We have people submitting their Oxford Nanopore RNA-sequencing data in FASTQ format. These naturally contain 'U' bases instead of 'T's. The fastq validator currently rejects these files because it treats 'U' as an unknown character (invalid character 'U' (hex. code:'55'), expected ACGTacgt0123nN.).
ENA is happy to accept such sequences (example file: ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR431/ERR4319899/PCreads.fastq.gz) and thus, we would like to relax the fastq validation rules and have it accept files with sequences containing U characters.
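Until the validator accepts 'U', a local workaround could look like the sketch below. It is only an illustration, not something fastq_utils provides: u_to_t is a hypothetical helper name, and it assumes an uncompressed FASTQ stream with exactly four lines per record. It rewrites U/u to T/t on the sequence line only, so a 'U' that appears as a quality character is left alone. Note that this changes the data, so it is only suitable for checking a copy of the file.

```shell
# Hypothetical helper (not part of fastq_utils): read FASTQ on stdin and
# replace U/u with T/t on sequence lines only (line 2 of each 4-line record).
u_to_t() {
    awk 'NR % 4 == 2 { gsub(/U/, "T"); gsub(/u/, "t") } { print }'
}
```

Usage, assuming a gzip-compressed input: gzip -dc PCreads.fastq.gz | u_to_t | gzip -c > PCreads.T.fastq.gz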
Hi Nuno,
A minor bug report. Once in a blue moon we have bzip2 files running through the validator, and the time has come again. ;)
I've noticed that there is a check that is supposed to end the program when the file fails bzip2 decompression
bunzip2 -c $f | gzip -c > $tmp_file
if [ $? -ne 0 ]; then
    echo "ERROR: $f: error uncompressing bzip2 file"
    exit 2
fi
but it didn't run. The error we got happened much later when an empty tmp gzip file was generated/checked. So instead of "error uncompressing bzip2 file" we got "No reads found in xxx.tmp.gz" for the broken bzip2 file.
I've tested the bunzip2 | gzip command with the broken file and it did give exit code 0. So maybe this is something that can be fixed at some point.
Best wishes,
Anja
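The behaviour described above is consistent with how shells report pipeline status: by default $? after a pipeline is the exit status of the last command (here gzip), so a failing bunzip2 goes unnoticed. A minimal sketch of one possible fix follows, assuming the script runs under bash. bz2_to_gz is a hypothetical helper name, not the function used in the fastq_utils script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fastq_utils): recompress a bzip2 file as
# gzip and return non-zero if ANY stage of the pipeline fails.
bz2_to_gz() {
    local in_file=$1 out_file=$2
    (
        # With pipefail, the pipeline status is that of the last stage that
        # failed, instead of the status of gzip alone.
        set -o pipefail
        bunzip2 -c "$in_file" | gzip -c > "$out_file"
    )
}
```

The existing check would then read: bz2_to_gz "$f" "$tmp_file" followed by the same test on $?. The subshell keeps pipefail from leaking into the rest of the script. In bash, inspecting ${PIPESTATUS[0]} right after the pipeline is an alternative that avoids changing shell options.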
Hi! Commenting here in case someone runs into the same thing.
In my environment, installing from the bioconda channel alone failed with errors related to libgcc-ng -> __glibc.
For me it worked like this:
conda install -c conda-forge -c bioconda fastq_utils
Hi Nuno,
We've run the fastq2bam converter for a drop-seq experiment and we've noticed that the schema name differs between the usage string ("drop-seq") and the actual expected schema name ("dropseq").
Can you change that to "drop-seq" everywhere to be consistent with our nomenclature?
Thanks and best wishes,
Anja
Hi!
I'm processing publicly available data in the form of BAM files that were produced by the fastq2bam command.
I'm therefore trying to retrieve the original FASTQ files using the bam2fastq command.
I'm not very familiar with what that command does, but the FASTQ files produced by bam2fastq seem to have shorter reads than the ones that were given to fastq2bam. They are thus too short for processing with cellranger.
I might be using the package incorrectly, but I believed that fastq2bam performs a lossless conversion of FASTQ files to BAM files and that bam2fastq then recovers the original FASTQ files.
Thank you in advance for your help.
Best,
Eloi
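To make a report like this easier to pin down, the read lengths of the recovered files can be summarised and compared with what cellranger expects. The sketch below is a generic diagnostic, not a fastq_utils command. read_length_counts is a hypothetical helper name, and it assumes uncompressed four-line FASTQ records on stdin.

```shell
# Hypothetical helper (not part of fastq_utils): print "length count" pairs
# for the sequence lines of a FASTQ stream, sorted by read length.
read_length_counts() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' | sort -n
}
```

Usage, assuming a gzip-compressed file named reads_1.fastq.gz: gzip -dc reads_1.fastq.gz | read_length_counts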
After downloading and attempting the make step I get the following error:
fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory
I first tried with latest, and then tried with the previous minor version.
Full output
mshadbolt@ip-172-31-71-222:~/software/fastq_utils-0.23.0$ make
make -C src clean
make[1]: Entering directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
rm -f *.o fastq_truncate fastq_filterpair fastq_filter_n fastq_num_reads fastq_not_empty fastq_pre_barcodes fastq_trim_poly_at fastq_split_interleaved fastq_tests bam_add_tags bam_umi_count bam2fastq fastq_info *~
make[1]: Leaving directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
make -C src
make[1]: Entering directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_truncate.c
gcc -O3 -Wunused-result -Wall -c hash.c
gcc -O3 -Wunused-result -Wall -I ../zlib-1.2.11 -lz -c fastq.c
gcc -O3 -Wunused-result -Wall fastq_truncate.o hash.o fastq.o -lz -o fastq_truncate
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_filterpair.c
gcc -O3 -Wunused-result -Wall hash.o fastq_filterpair.o fastq.o -lz -o fastq_filterpair
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_filter_n.c
gcc -O3 -Wunused-result -Wall fastq_filter_n.o fastq.o hash.o -lz -o fastq_filter_n
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_num_reads.c
gcc -O3 -Wunused-result -Wall fastq_num_reads.o hash.o fastq.o -lz -o fastq_num_reads
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_not_empty.c
gcc -O3 -Wunused-result -Wall fastq_not_empty.o hash.o fastq.o -lz -o fastq_not_empty
gcc -I ../samtools-0.1.19 -O3 -Wunused-result -Wall -c fastq_pre_barcodes.c
fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory
compilation terminated.
Makefile:106: recipe for target 'fastq_pre_barcodes.o' failed
make[1]: *** [fastq_pre_barcodes.o] Error 1
make[1]: Leaving directory '/home/mshadbolt/software/fastq_utils-0.23.0/src'
Makefile:4: recipe for target 'all' failed
make: *** [all] Error 2
I'm not sure whether your UMI tags are meant to support 10X or another technology, but just in case: according to 'https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam', the tags for the Chromium molecular barcode are UB (corrected barcode) and UY, rather than RX and QX. Perhaps of interest to you.
Kind regards
Hi Nuno,
Long time no see :) Hope you are well. I've got a new issue for you:
We've now seen a few cases where FASTQ files from PacBio fail fastq validation with the error "Unable to determine quality encoding - unknown range [34,123]", although the actual numbers in the range vary from file to file.
Can this be added as allowed?
Unfortunately, I haven't found out what the maximum value or the expected range should be.
Let me know if you need more details.
Anja
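One way to collect the ranges seen across the affected files is to scan the quality lines directly. The sketch below is a standalone diagnostic, not the validator's own logic. qual_range is a hypothetical helper name, and it assumes uncompressed four-line FASTQ records on stdin. If a file used a Phred+33 offset, the range [34,123] would correspond to scores 1 to 90.

```shell
# Hypothetical helper (not part of fastq_utils): print the minimum and
# maximum ASCII codes found on the quality lines of a FASTQ stream.
qual_range() {
    awk '
        BEGIN {
            # awk has no ord(); build a lookup table for printable ASCII.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127; max = 0
        }
        NR % 4 == 0 {
            n = length($0)
            for (j = 1; j <= n; j++) {
                ch = substr($0, j, 1)
                if (!(ch in ord)) continue   # skip anything outside 33..126
                if (ord[ch] < min) min = ord[ch]
                if (ord[ch] > max) max = ord[ch]
            }
        }
        END { if (max > 0) printf "%d %d\n", min, max }
    '
}
```

Usage, assuming a gzip-compressed file named reads.fastq.gz: gzip -dc reads.fastq.gz | qual_range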
Hi Nuno,
We discovered a few cases of 10x experiments, where the R1 and R2 reads don't have the expected lengths. While it isn't a problem if the reads are longer (I guess the extra bases are simply disregarded during the conversion), the fastq2bam conversion fails if the reads are shorter than expected.
Here is one example of a 10X2 library sequenced with non-standard parameters:
Cell barcode offset = 0
Cell barcode size = 16
UMI barcode offset = 16
UMI barcode size = 9
Do you think it will be possible to pass these numbers as parameters if they are non-default? Or can you think of another workaround to make the script generate BAM files for this library?
Many thanks for looking into this.
Anja
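To illustrate what the four numbers above mean for a single read, here is a small sketch that slices a segment out of a read by 0-based offset and size, and reports reads that are too short. It is an illustration only and does not reflect how fastq2bam is implemented. extract_segment is a hypothetical helper name and the example read is made up.

```shell
# Hypothetical helper (not part of fastq_utils): print SIZE bases of SEQ
# starting at the 0-based OFFSET; return 1 if the read is too short.
extract_segment() {
    local seq=$1 offset=$2 size=$3
    if [ "${#seq}" -lt $((offset + size)) ]; then
        echo "read too short: length ${#seq}, need $((offset + size))" >&2
        return 1
    fi
    # cut uses 1-based, inclusive character positions.
    printf '%s\n' "$seq" | cut -c "$((offset + 1))-$((offset + size))"
}

# Made-up 25-base read: 16-base cell barcode followed by a 9-base UMI.
read_seq=AAAACCCCGGGGTTTTACGTACGTA
cell_barcode=$(extract_segment "$read_seq" 0 16)
umi=$(extract_segment "$read_seq" 16 9)
```

With the library described above, a read shorter than 25 bases would trigger the "read too short" branch, which matches the kind of failure reported here.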
Hi Nuno,
10x has now released version 3 and we've been getting a few submissions already with the new library schema.
Could you add the 10xV3 schema to the converter? I think all that has changed from our point of view is that the UMI barcode is now 12 bp instead of 10 bp (compared to 10xV2).
Thanks and best wishes,
Anja
Hi Nuno,
I've made a Bioconda package for fastq_utils. What this means is that you should soon be able to install fastq_utils simply like 'conda install fastq_utils' (assuming you have Bioconda installed). This is nice because it allows you to provide the associated dependencies more easily.
You can see the PR by which this was done here: bioconda/bioconda-recipes#13473. It's passing tests (after much work), will merge soon and shortly thereafter the install command will work.
Thanks,
Jon