deplanckelab / brb-seqtools Goto Github PK

View Code? Open in Web Editor NEW

25.0 5.0 10.0 285.99 MB

Suite of tools for processing BRB-seq data

License: Other

Java 100.00%

brb-seqtools's Introduction

BRB-seq Tools

A suite of tools for the pre-processing of BRB-seq data (bulk RNA-seq)

⚠️ Recent notes (and for Alithea users)

BRB-seq tools suite was created in the early days of multiplexed libraries, when there was not many other alternatives to analyze BRB-seq libraries. Now, this is not the case anymore, so we would recommend using STARsolo instead, which should produce similar results in a single command.

In particular, do NOT use the output of the Trimming step of BRB-seq Tools as input for STARsolo, as this will not produce correct UMI values (without displaying an error message). STARsolo can provide read trimming that matches BRB-seq specificities using the --clipAdapterType CellRanger4 option

An example of command that you can run to align + demultiplex BRB-seq libraries:

STAR --runMode alignReads --soloUMIdedup 1MM_Directional --soloCBmatchWLtype 1MM --clipAdapterType CellRanger4 --outSAMmapqUnique 60 --outSAMunmapped Within --soloStrand Forward --quantMode GeneCounts --genomeDir STAR_Index/ --soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 12 --soloUMIstart 13 --soloUMIlen 16 --soloCellFilter None --soloCBwhitelist lib_example_barcodes.txt --soloFeatures Gene --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM --outFilterMultimapNmax 1 --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/lib_example/ --readFilesIn lib_example_R2.fastq.gz lib_example_R1.fastq.gz

Note: Here, the lib_example_barcodes.txt file should be a list of barcodes, one per line.

Of course, you can also add some other options like:

--runThreadN 12 --outBAMsortingThreadN 12 to run STAR in parallel on 12 threads (or more, or less)
--soloCBstart 1 --soloCBlen 12 --soloUMIstart 13 --soloUMIlen 16 this depends on your R1 sequence (check in the fastq file), and the barcodes that you are using

Download software

BRB-seq command-line tools are provided as a single executable jar file. The .jar file contains all required materials and can be run on any terminal.

Dependencies

Java version

For the tools to run properly, you must have Java >=1.8 installed.

To check your java version, open your terminal application and r un the following command:

java -version

If the output looks something like java version "1.8.x", you are good to go. If not, you may need to update your version; see the Oracle Java website to download the latest JRE (for users) or JDK (for developers).

Picard

The software relies on Picard Tools, but the Picard JAR is already embedded in the released JAR, so no need to install it yourself.

Sequencing output

After sequencing your BRB-seq libraries, you should obtain two fastq files per library:

R1 fastq file: This file should contain the barcodes and UMIs of the BRBseq construct. Usually, the barcode comes before the UMI sequence. And the UMI sequence is optional.
R2 fastq file: This file should contain the exact same number of reads (and read names) than the R1 file. Except that the sequence of the reads are the sequences representing the RNA fragments.

BRB-seq tools is a suite dedicated to help you analyze these data, until the generation of the output count/UMI matrix. For further analyses (filtering, normalization, dimension reduction, clustering, differential expression), we recommend using ASAP web portal that you can freely access at asap.epfl.ch.

Aligner

BRB-seqTools was mainly tested with the STAR aligner. So far it seems to not working with the hisat2 aligner. Please contact me if it does not work with any other aligner, I'll try to have a look in the future, if this is a recurrent problem.

Installation

To check that BRB-seq Tools is working properly, run the following command:

java -jar BRBseqTools.jar

As shown in the output of this command, BRB-seq tool suite allows the user to run 4 possible tools.

Demultiplex     Create demultiplexed fastq files from R1+R2 fastq (use this if you want to process them independently, thus doing the next steps without BRBseqTools)
CreateDGEMatrix Create the DGE Matrix (counts + UMI) from R2 aligned BAM and R1 fastq
Trim            For trimming BRBseq construct in R2 fastq file or demultiplexed fastq files
AnnotateBAM     For annotating the R2 BAM file using UMIs/Barcodes from the R1 fastq and/or GTF and/or Barcode files
ExtractReadCountMatrix  For generating a read count matrix from a STAR-aligned BAM file.

Example of full pipeline (using STAR)

First if you don't already have an indexed genome, you need to build the index (for STAR)

# Download last release of your species of interest .fasta file from Ensembl (or any other database that you'd prefer to use). Here e.g. Homo sapiens from Ensembl
wget ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Unzip
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# Download last release of homo sapiens .gtf file from Ensembl
wget ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz
# Unzip
gzip -d Homo_sapiens.GRCh38.90.gtf.gz
# Generate the STAR genome index (in 'STAR_Index' folder) => This require ~30G RAM
mkdir STAR_Index/
STAR --runMode genomeGenerate --genomeDir STAR_Index/ --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.90.gtf
# Note: you can add the argument '--runThreadN 4' for running 4 threads in parallel (or more if your computer has enough cores)
# Note: if genome is not human, you need to add/tune the argument '--genomeSAindexNbases xx' with xx = log2(nbBasesInGenome)/2 - 1
#	For e.g. for drosophila melanogaster xx ~ 12.43, thus you should add the argument '--genomeSAindexNbases 12'

When the index is built, you will never need to rebuild it. Then you can use only the following script:

# (Optional) Trim the read containing the sequence fragments (generates a 'lib_example_R2.trimmed.fastq.gz' file)
java -jar BRBseqTools.jar Trim -f lib_example_R2.fastq.gz
# Create output folder
mkdir BAM/
# Align only the R2 fastq file (using STAR, no sorting/indexing is needed)
STAR --runMode alignReads --genomeDir STAR_Index/ --outFilterMultimapNmax 1 --readFilesCommand zcat --outSAMtype BAM Unsorted --outFileNamePrefix BAM/ --readFilesIn lib_example_R2.trimmed.fastq.gz
# Note: you can add the argument '--runThreadN 4' for running 4 threads in parallel (or more if your computer has enough cores)
# Note: the '--outFilterMultimapNmax 1' option is recommended for removing multiple mapping reads from the output BAM
# (optional) Rename the output aligned BAM
mv BAM/Aligned.out.bam BAM/lib_example_R2.bam
# Demultiplex and generate output count/UMI matrix
java -jar BRBseqTools.jar CreateDGEMatrix -f lib_example_R1.fastq.gz -b BAM/lib_example_R2.bam -c lib_example_barcodes.txt -gtf Homo_sapiens.GRCh38.90.gtf -p BU -UMI 14
# Note: This example suppose that R1 has barcode followed by 14bp UMI
# Note: The ‘-p’ option specifies the pattern in the R1 fastq files. If you sequenced only the 6bp barcode then you should use ‘-p B’, but if you sequenced the barcode + 10bp UMIs, then it should be ‘-p BU’ (for Barcode followed by UMI) and then you need to specify the UMI length with ‘-UMI 10’. 
# Note: In case you sequenced some bp you don’t want to use, you can use the ‘?’ character. For e.g. you sequenced 10bp UMI but the R1 contains extra 4bp after the UMI that are not UMI, then you should use “-p BU???? -UMI 10”
# Note: 'lib_example_barcodes.txt' should be created by the user and should contain the mapping between the barcode and the sample name¹

¹You can download/edit this example of barcode/samplename mapping file

Then you can load the generated 'output.dge.reads.txt' count matrix in R and performs your analyses (or 'output.dge.umis.txt' if you prefer working with UMIs). Or you can upload this file to https://asap.epfl.ch and run the analysis pipeline online.

Usage

Here follows the description of each tool in more details.

ExtractReadCountMatrix

This tool was recently added to BRBseq-Tools and is useful when using STARsolo for performing the whole pipeline (instead of BRBseq-Tools). The issue with STARsolo is that it does not generate the read count matrix (only the UMI count matrix). ExtractReadCountMatrix was specifically designed to retrieve the read count matrix from the BAM output from STARsolo.

Note: This tool allows to perform the whole pipeline with one STARsolo command (and eventually trim via cutadapt). It will probably become the method of choice for processing BRB-seq datasets, given STARsolo performances.

Options:

-b %s           [Required] Path of STAR-aligned BAM file to analyze.
-c %s           [Required] Path of Barcode/Samplename mapping file.
-gtf %s         [Required] Path of GTF file.
-o %s           Output folder
-chunkSize %i   Maximum number of reads to be stored in RAM (default = 1000000)

¹You can download/edit this example of barcode/samplename mapping file

Example:

java -jar BRBseqTools.jar ExtractReadCountMatrix -b STAR_solo_aligned.bam -c lib_example_barcodes.txt -gtf hg38.gtf

Demultiplex

This tool is used when you need to generate all the fastq files corresponding to all your multiplexed samples. You end up with one fastq file per sample.

Note: If you use the "AnnotateBAM" tool, you can also demultiplex the annotated BAM file using GATK SplitReads

Options:

-r1 %s          [Required] Path of R1 FastQ files (containing barcode and optionally UMIs) [can be gzipped or raw].
-r2 %s          [Required] Path of R2 FastQ files (containing read sequence) [can be gzipped or raw].
-c %s           [Required] Path of Barcode/Samplename mapping file¹.
-n %i           Number of allowed difference with the barcode [Default = 1]. Ambiguous barcodes (same distance from two or more existing barcodes) will be automatically discarded.
-o %s           Output folder
-p %s           Barcode pattern/order found in the reads of the R1 FastQ file. Barcode names should match the barcode file (default = 'BU' i.e. barcode followed by the UMI).
                        'B' [required] is used for specifying the barcode position.
                        'U' [optional] is used for specifying a UMI value position.
                        Character '?' [optional] can be used to ignore specific nucleotides.
-UMI %i         If your barcode pattern contains UMI ('U'), you should specify this parameter as the length of the UMI.

¹You can download/edit this example of barcode/samplename mapping file

Example:

java -jar BRBseqTools.jar Demultiplex -r1 lib_example_R1.fastq.gz -r2 lib_example_R2.fastq.gz -c lib_example_barcodes.txt -p BU -UMI 14

or, if no UMI:

java -jar BRBseqTools.jar Demultiplex -r1 lib_example_R1.fastq.gz -r2 lib_example_R2.fastq.gz -c lib_example_barcodes.txt -p B

Note: When using this tool, UMIs are kept as indexes in all output .fastq files

CreateDGEMatrix

This tool is used when you don't need the intermediary .fastq & .bam files from all your multiplexed samples. It greatly simplifies the demultiplexing and analysis, and directly generates a workable count/UMI matrix.

Options:

-f %s           [Required] Path of R1 FastQ file [can be gzipped or raw].
-b %s           [Required] Path of R2 aligned BAM file [do not need to be sorted or indexed].
-c %s           [Required] Path of Barcode/Samplename mapping file¹.
-gtf %s         [Required] Path of GTF file [can be gzipped or raw].
-s %s           Do you want to count only reads falling on same strand than gene? [no, yes, reverse] (default = yes since BRB-seq is stranded protocol)
-n %i           Number of allowed difference with the barcode [ambiguous reads will be automatically discarded].
-o %s           Output folder
-t %s           Path of existing folder for storing temporary files
-chunkSize %i   Maximum number of reads to be stored in RAM (default = 10000000)
-p %s           Barcode pattern/order found in the reads of the R1 FastQ file. Barcode names should match the barcode file (default = 'BU' i.e. barcode followed by the UMI).
                        'B' [required] is used for specifying the barcode position.
                        'U' [optional] is used for specifying a UMI value position.
                        Character '?' [optional] can be used to ignore specific nucleotides.
-UMI %i         If your barcode pattern contains UMI ('U'), you should specify this parameter as the length of the UMI.

¹You can download/edit this example of barcode/samplename mapping file

Example:

java -jar BRBseqTools.jar CreateDGEMatrix -f lib_example_R1.fastq.gz -b lib_example_R2.bam -c lib_example_barcodes.txt -gtf Homo_sapiens.GRCh38.90.gtf.gz -p BU -UMI 14

Note: The original BRB-seq protocol contains a UMI construct. But you can also generate a library without UMIs, or even not sequence the UMIs from R2 read. If UMIs are present, both UMI and read count matrices will be generated. If not, only the read count table will be generated.

AnnotateBAM

This tool is used when you need specific downstream analyses of your BAM files that are not handled by BRBseqTools. It creates an annotated BAM file containing the UMI/Barcodes and can also have annotated genes information.

Note: This tool allows you to use pipelines such as Picard MarkDuplicates, that can use the UMI/Barcode information for better quantifying the duplicated reads (Picard options BARCODE_TAG=BC READ_ONE_BARCODE_TAG=BX), you can also demultiplex the annotated BAM file using GATK SplitReads

Options:

-f %s           [Required] Path of R1 FastQ file [can be gzipped or raw].
-b %s           [Required] Path of R2 aligned BAM file [do not need to be sorted or indexed].
-c %s           Path of Barcode/Samplename mapping file¹.
-gtf %s         Path of GTF file [can be gzipped or raw] fir annoting mapped genes
-n %i           Number of allowed difference with the barcode [ambiguous reads will be automatically discarded].
-s %s           Do you want to annotate genes only in case where reads are falling on same strand? [no, yes, reverse] (default = yes since BRB-seq is stranded protocol
-rg             Add Read Group information (for GATK / Picard MarkDuplicates / etc...). Can replace gatk AddOrReplaceReadGroups.
-o %s           Output folder
-t %s           Path of existing folder for storing temporary files
-chunkSize %i   Maximum number of reads to be stored in RAM (default = 10000000)
-p %s           Barcode pattern/order found in the reads of the R1 FastQ file. Barcode names should match the barcode file (default = 'BU' i.e. barcode followed by the UMI).
                        'B' [required] is used for specifying the barcode position.
                        'U' [optional] is used for specifying a UMI value position.
                        Character '?' [optional] can be used to ignore specific nucleotides.
-UMI %i         If your barcode pattern contains UMI ('U'), you should specify this parameter as the length of the UMI.

¹You can download/edit this example of barcode/samplename mapping file

Example:

java -jar BRBseqTools.jar AnnotateBAM -f lib_example_R1.fastq.gz -b lib_example_R2.bam -p BU -UMI 14

Note: Most of the options are optionals here. If not put, then the corresponding tag will simply not be added to the annotated BAM.

Trim

This tool is used to trim the BRB-seq construct (SMART oligo + Barcode + UMI) & the polyA sequences from the R2 fastq file.
It can also be used to trim individually all the demultiplexed .fastq files if you used the "Demultiplex" tool (in this case, use the -uniqueBarcode option).
The tool will TRIM the reads (i.e. cut the construct from the read, and write the remaining of the read). Some reads may not be written in case the remaining of the reads is below the minimum allowed length (-minLength option).

Options:

-f %s           [Required] Path of FastQ file to trim (or containing folder for processing all fastq files at once).
-o %s           Output folder
-uniqueBarcode  If the fastq file(s) contain(s) only one barcode (for e.g. after demultiplexing), this option can be used for searching the specific barcode (most occuring) in the construct and trimming it when present.
-polyA %i       Trim polyA strings that have more than this length (without mismatch), and all 3' string that follows [default=6]
-minLength %i   If resulting trimmed reads are < this number, it is removed from the output fastq file  [default=10]

Example:

java -jar BRBseqTools.jar Trim -f lib_example_R2.fastq.gz

Note: If you use STAR for alignment, this step is optional, as it will not change much the results of the alignment (our tests have shown that the improvement is real but very minor)

Directory content

src: all source files required for compilation
lib: all JAR dependencies required for compilation / execution
releases: latest release of BRB-seq Tools

Author

Vincent Gardeux - [email protected]

brb-seqtools's People

Contributors

Stargazers

Watchers

Forkers

hoohm vincentgardeux fengpku haroon123 antoniogps michamalecki lidkalee carolinewandinger disolis chenhsieh

brb-seqtools's Issues

TODO: Duplicate removal using UMIs

License?

Hello,
What license is BRB-seqTools distributed under (MIT, GPL, something else.)? I'd like to add a package for BRB-seqTools to bioconda, but in order to do so it would be helpful to know which license to list and that the software can indeed be (re)distributed.

Thank you!

Demultiplexing sample

I am trying to demultiplex the samples before the analysis to get samples first. This is the code:

java -jar /rsstu/users/r/rrellan/sara/nirwan_backup/ntanduk/BRBseqTools.jar Demultiplex -r1 /rsstu/users/r/rrellan/sara/RNA_Sequencing_raw/BZea_CLY23D1/NVS205B_RellanAlvarez/BZeaR2_S1_L004_R1_001.fastq.gz -r2 /rsstu/users/r/rrellan/sara/RNA_Sequencing_raw/BZea_CLY23D1/NVS205B_RellanAlvarez/BZeaR2_S1_L004_R2_001.fastq.gz -c /share/maize/ntanduk/BZea/rnaseq/barcodes.txt -p BU -UMI 14

this is my barcodes.txt

(base) [ntanduk@login02 rnaseq]$ more barcodes.txt 
Name	B1
B73_1	TACACTATAGCTAG	
B73_2	CAGACATGATAGAG
B73_3	TCTTGAACGACTAA
B73_4	ACCAACTTAGACAA
B73_5	CCATATTGAAGCAC
B73_6	TTCATGGTAGAGAG
B73_7	CACGTAGCTAGACG
B73_8	ATACGATGCTGCGC
B73_9	ACTTAAGGTCCAAC
B73_10	CTCTCAAGTCCTCC

This is my error:

Error while parsing R1 FastQ file
Read found in FastQ has length 151 while barcode pattern has length 28

I have UMI in my samples.

Any help is appreciated

BRBseqTools.jar 1.6 Final step of DGEMatrix generation issue

Hi, I've been trying to use the BRBseqTools, for my BRBseq data with 48 samples. All the steps have until the generation of the BAM file have worked, but when I get to the last step:
java -jar BRBseqTools.jar CreateDGEMatrix -f lib_example_R1.fastq.gz -b BAM/lib_example_R2.bam -c lib_example_barcodes.txt -gtf Homo_sapiens.GRCh38.90.gtf -p BU -UMI 14

I get the following error, which I have been unable to fix, my Barcode is 14BP long so is the UMI. Could you help me out?

Exception in thread "main" htsjdk.samtools.FileTruncatedException: Premature end of file: /bharadv/BRBseq/BAM/library_R2.bam
at htsjdk.samtools.util.BlockCompressedInputStream.processNextBlock(BlockCompressedInputStream.java:513)
at htsjdk.samtools.util.BlockCompressedInputStream.nextBlock(BlockCompressedInputStream.java:451)
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:441)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:194)
at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:322)
at java.io.DataInputStream.read(DataInputStream.java:149)
at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:404)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:366)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:199)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:813)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:787)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:781)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:751)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:569)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:543)
at DGEMatrixManager.readR2BAM(DGEMatrixManager.java:43)
at BRBseqTools.main(BRBseqTools.java:43)

CreateDGEMatrix error in demultiplexing

Hi @VincentGardeux

Thank you for your great work! The newest release seems to solve my GTF problem!

But I am so sorry to say that I still have another problem in read count.
I have some highly expressed genes, which is confirmed by usual Illumina library prep kit for same samples as in BRB-seq data set.
Inspection on IGV looks those genes are highly expressed in BRB-seq library too. I attached one example (gene_id: MSTRG.1281). Top row is bam from non-demultiplexed reads and bottom three rows are from demultiplexed reads using Demultiplex function.

However, if I use CreateDGEMatrix function, resultant file says no gene expression on this gene.

$ grep MSTRG.1281 output.dge.reads.detailed.txt
MSTRG.1281      LOC111223187    0       0       0       0

As you can see many reads in bams from demultiplexed reads(bottom three rows in attached file), Demultiplex function seem to work. So I think something happened during CreateDGEMatrix process.
Below is stdout I obtained in CreateDGEMatrix.

Barcode Pattern = BU
Maximum allowed Hamming distance for sequencing error correction [Barcode] = 1
Maximum allowed Hamming distance for sequencing error correction [UMI] = 0
ChunkSize = 1000000 i.e. no more than 1000000 reads will be stored in RAM.
Stranded = YES
Config: B1 contains 3 barcodes: [ GCGTGG AAACAT TTCATA ]
Config: B1 length [#nucleotides] = 6

Analyzing barcode Pattern...
According to barcode pattern, reads of R1 FastQ file should contain 16 characters.

Reading GTF file provided: /home/koyama/Documents/SeriolaSD/dumerili/12p045/Lib1/01mapping/hisat2/../../../8p113Sdu.stringtie_merge.sorted.gtf
[WARNING] l.25880       : No gene_id. This entry is ignored.
[WARNING] l.25881       : No gene_id. This entry is ignored.
[WARNING] l.25884       : No gene_id. This entry is ignored.
[WARNING] l.188387      : No gene_id. This entry is ignored.
[WARNING] l.188388      : No gene_id. This entry is ignored.
[WARNING] l.188390      : No gene_id. This entry is ignored.
[WARNING] l.307983      : No gene_id. This entry is ignored.
[WARNING] l.307984      : No gene_id. This entry is ignored.
[WARNING] l.307987      : No gene_id. This entry is ignored.
[WARNING] l.307989      : No gene_id. This entry is ignored.
[WARNING] l.307990      : No gene_id. This entry is ignored.
[WARNING] l.307993      : No gene_id. This entry is ignored.
[WARNING] l.360471      : No gene_id. This entry is ignored.
[WARNING] l.360472      : No gene_id. This entry is ignored.
[WARNING] l.360474      : No gene_id. This entry is ignored.
[WARNING] l.496370      : No gene_id. This entry is ignored.
[WARNING] l.496371      : No gene_id. This entry is ignored.
[WARNING] l.496373      : No gene_id. This entry is ignored.
[WARNING] l.496374      : No gene_id. This entry is ignored.
[WARNING] l.496375      : No gene_id. This entry is ignored.
[WARNING] l.496378      : No gene_id. This entry is ignored.
[WARNING] l.496380      : No gene_id. This entry is ignored.
[WARNING] l.496381      : No gene_id. This entry is ignored.
[WARNING] l.496384      : No gene_id. This entry is ignored.
[WARNING] l.496386      : No gene_id. This entry is ignored.
[WARNING] l.496387      : No gene_id. This entry is ignored.
[WARNING] l.496390      : No gene_id. This entry is ignored.
[WARNING] l.496394      : No gene_id. This entry is ignored.
[WARNING] l.496395      : No gene_id. This entry is ignored.
[WARNING] l.496397      : No gene_id. This entry is ignored.
[WARNING] l.507281      : No gene_id. This entry is ignored.
[WARNING] l.507282      : No gene_id. This entry is ignored.
[WARNING] l.507285      : No gene_id. This entry is ignored.
[WARNING] l.507287      : No gene_id. This entry is ignored.
[WARNING] l.507288      : No gene_id. This entry is ignored.
[WARNING] l.507291      : No gene_id. This entry is ignored.
[WARNING] l.507295      : No gene_id. This entry is ignored.
[WARNING] l.507296      : No gene_id. This entry is ignored.
[WARNING] l.507298      : No gene_id. This entry is ignored.
[WARNING] l.507299      : No gene_id. This entry is ignored.
[WARNING] l.507300      : No gene_id. This entry is ignored.
[WARNING] l.507303      : No gene_id. This entry is ignored.
[WARNING] l.507305      : No gene_id. This entry is ignored.
[WARNING] l.507306      : No gene_id. This entry is ignored.
[WARNING] l.507309      : No gene_id. This entry is ignored.
[WARNING] l.507311      : No gene_id. This entry is ignored.
[WARNING] l.507312      : No gene_id. This entry is ignored.
[WARNING] l.507315      : No gene_id. This entry is ignored.
[WARNING] l.555883      : No gene_id. This entry is ignored.
[WARNING] l.555884      : No gene_id. This entry is ignored.
[WARNING] l.555887      : No gene_id. This entry is ignored.
[WARNING] l.557958      : No gene_id. This entry is ignored.
[WARNING] l.557959      : No gene_id. This entry is ignored.
[WARNING] l.557962      : No gene_id. This entry is ignored.
[WARNING] l.662983      : No gene_id. This entry is ignored.
[WARNING] l.662984      : No gene_id. This entry is ignored.
[WARNING] l.662987      : No gene_id. This entry is ignored.
[WARNING] l.676602      : No gene_id. This entry is ignored.
[WARNING] l.676603      : No gene_id. This entry is ignored.
[WARNING] l.676605      : No gene_id. This entry is ignored.
[WARNING] l.676606      : No gene_id. This entry is ignored.
[WARNING] l.676607      : No gene_id. This entry is ignored.
[WARNING] l.676610      : No gene_id. This entry is ignored.
[WARNING] l.676612      : No gene_id. This entry is ignored.
[WARNING] l.676613      : No gene_id. This entry is ignored.
[WARNING] l.676616      : No gene_id. This entry is ignored.
[WARNING] l.676618      : No gene_id. This entry is ignored.
[WARNING] l.676619      : No gene_id. This entry is ignored.
[WARNING] l.676622      : No gene_id. This entry is ignored.
[WARNING] l.676624      : No gene_id. This entry is ignored.
[WARNING] l.676625      : No gene_id. This entry is ignored.
[WARNING] l.676628      : No gene_id. This entry is ignored.
[WARNING] l.676632      : No gene_id. This entry is ignored.
[WARNING] l.676633      : No gene_id. This entry is ignored.
[WARNING] l.676635      : No gene_id. This entry is ignored.
[WARNING] l.676636      : No gene_id. This entry is ignored.
[WARNING] l.676637      : No gene_id. This entry is ignored.
[WARNING] l.676640      : No gene_id. This entry is ignored.
[WARNING] l.676642      : No gene_id. This entry is ignored.
[WARNING] l.676643      : No gene_id. This entry is ignored.
[WARNING] l.676646      : No gene_id. This entry is ignored.
[WARNING] l.676648      : No gene_id. This entry is ignored.
[WARNING] l.676649      : No gene_id. This entry is ignored.
[WARNING] l.676652      : No gene_id. This entry is ignored.
[WARNING] l.676656      : No gene_id. This entry is ignored.
[WARNING] l.676657      : No gene_id. This entry is ignored.
[WARNING] l.676659      : No gene_id. This entry is ignored.
[WARNING] l.789064      : No gene_id. This entry is ignored.
[WARNING] l.789065      : No gene_id. This entry is ignored.
[WARNING] l.789067      : No gene_id. This entry is ignored.
[WARNING] l.789068      : No gene_id. This entry is ignored.
[WARNING] l.789069      : No gene_id. This entry is ignored.
[WARNING] l.789072      : No gene_id. This entry is ignored.
[WARNING] l.800604      : No gene_id. This entry is ignored.
[WARNING] l.800605      : No gene_id. This entry is ignored.
[WARNING] l.800607      : No gene_id. This entry is ignored.
[WARNING] l.800609      : No gene_id. This entry is ignored.
[WARNING] l.800653      : No gene_id. This entry is ignored.
[WARNING] l.800654      : No gene_id. This entry is ignored.
[WARNING] l.800657      : No gene_id. This entry is ignored.
[WARNING] l.802860      : No gene_id. This entry is ignored.
[WARNING] l.802861      : No gene_id. This entry is ignored.
[WARNING] l.802863      : No gene_id. This entry is ignored.
[WARNING] l.804662      : No gene_id. This entry is ignored.
[WARNING] l.804663      : No gene_id. This entry is ignored.
[WARNING] l.804665      : No gene_id. This entry is ignored.
[WARNING] l.804988      : No gene_id. This entry is ignored.
[WARNING] l.804989      : No gene_id. This entry is ignored.
[WARNING] l.804992      : No gene_id. This entry is ignored.
737343 'exons' are annotating 31689 unique gene_ids in the provided GTF file.
LOC111239202,LOC111239226,MSTRG.26637
plk1

Reading reads barcodes/UMI from the R1 fastq file...
1000000 reads were processed from fastq file [4 s 363 ms]
2000000 reads were processed from fastq file [12 s 383 ms]
3000000 reads were processed from fastq file [19 s 277 ms]
4000000 reads were processed from fastq file [23 s 861 ms]
5000000 reads were processed from fastq file [28 s 252 ms]
6000000 reads were processed from fastq file [33 s 723 ms]
7000000 reads were processed from fastq file [38 s 332 ms]
7919981 reads were processed from fastq file [43 s 23 ms]
Created 8 temporary fastq files
78552 reads have no matching barcodes (0.99%)

Reading the reads from the BAM file...
1000000 reads were processed from BAM file [14 s 231 ms]
2000000 reads were processed from BAM file [30 s 660 ms]
3000000 reads were processed from BAM file [38 s 822 ms]
4000000 reads were processed from BAM file [46 s 769 ms]
5000000 reads were processed from BAM file [54 s 885 ms]
6000000 reads were processed from BAM file [1 mn 3 s 112 ms]
7000000 reads were processed from BAM file [1 mn 9 s 585 ms]
7731142 reads were processed from BAM file [1 mn 14 s 184 ms]
Created 8 temporary BAM files
19839 Unique genes were detected (at least one read).
5662594 'Mapped' reads (73.24%)
2592 'Ambiguous' reads (0.03%)
1333107 'No Features' reads (17.24%)
0 'Not Unique' reads (0%)
0 'Not Aligned' reads (0%)
732849 'Too Low aQual' reads (9.48%)

Starting demultiplexing...
374 reads were demultiplexed [9 s 791 ms]


Starting writing output file (without sequencing error correction)...
267 UMI counts were written (107 duplicates = 28.61%)
DGE count & UMI matrices were created [10 s 122 ms]

Any suggestions are appreciated. Thank you!

Originally posted by @TakashiKoyama in #3 (comment)

Two issues in CreateDGEMatrix

Hi,
I have two issues when I use CreateDGEMatrix.

First, when I use symbolic link instead of real file path, BRBseqTools produce error like

The '-f' option should be followed by R1 FastQ file path. No file at path ...

This can be solved by specifying read file path. But it would be helpful if I can use symbolic link.

Second, I have error in reading gtf file as below

Reading GTF file provided: /path/to/gtf
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
	at model.GTF.readGTF(GTF.java:46)
	at BRBseqTools.main(BRBseqTools.java:41)

I have not solved this problem. It would be appreciated if you could help me.

Here is my code.

REF=/path/to/ref
READ1=/path/to/read1 #the path is symbolic link. real read data is in the NAS
READ2=/path/to/read2 #the path is symbolic link. real read data is in the NAS
BARCODE=/path/to/barcode
GTF=/path/to/gtf # the gtf file we previously made using Stringtie

# main
hisat2 -p 10 --dta -x $REF -U $READ2 --summary-file hisat2.summary | samtools view -bhT $REF -F 0x4 -@ 10 - | samtools sort -@ 10 -O bam -o mapped.sorted.bam -
BRBseqTools.1.3.jar CreateDGEMatrix -f $READ1 -b mapped.sorted.bam -c $BARCODE -gtf $GTF -p BU -UMI 20

Issure with barcode file when running CreateDGEMatrix

I ran:

$ java -jar BRBseqTools.1.3.jar CreateDGEMatrix -f
 ../Library1_R1_001.fastq.gz -b
 Library1_R2_001.trimmed_Aligned.out.bam -c ../lib_example_barcodes.txt
-gtf  ~/bioninformatics/genomes/hg38/Homo_sapiens.GRCh38.98.chr.gtf  -p BU
-UMI 10

And received the following output:

BRBSeqTools 1.3 [CreateDGEMatrix]

Barcode Pattern = BU
Maximum allowed Hamming distance for sequencing error correction [Barcode]= 1
Maximum allowed Hamming distance for sequencing error correction [UMI] = 0
ChunkSize = 1000000 i.e. no more than 1000000 reads will be stored in RAM.
Stranded = YES
No output Folder is specified, using default one:"/home/star_bam/..".
You can specify an output path by using the '-o' option.
No folder for temporary generated files is specified, using output folder
as default:"/home/star_bam/..".
You can specify a temporary folder by using the '-t' option.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
at tools.Utils.readConfig(Utils.java:333)
at BRBseqTools.main(BRBseqTools.java:34)

I'm attaching my barcode file, since the problem appears to be related.
lib_example_barcodes.txt

Thanks,
Avital

one issue in CreateDGEMatrix

Hi,
I ran 'java -jar BRBseqTools-1.6.jar CreateDGEMatrix -f BRBeq_raw_1.fq.gz -b BAM/Aligned.out.bam -c lib_barcodes.txt -gtf Mus_musculus.GRCm38.102.gtf -p BU -I 10' and got the following error
'BRBSeqTools 1.6 [CreateDGEMatrix]

Analyzing barcode Pattern...
According to barcode pattern, reads of R1 FastQ file should contain 16 characters.

Reading GTF file provided: Mus_musculus.GRCm38.102.gtf
843712 'exons' are annotating 55487 unique gene_ids in the provided GTF file.

Reading reads barcodes/UMI from the R1 fastq file...
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -2
at java.lang.String.substring(String.java:1967)
at tools.Utils.nextRead(Utils.java:96)
at tools.Utils.readR1Fastq(Utils.java:273)
at BRBseqTools.main(BRBseqTools.java:42)'

My barcode file is
lib_barcodes.txt
and my R1 file looks like this

Can you help me? Thank you!
Yan

BRBseqTools.jar 1.6 CreateDGEMatrix generation issue

Hi~
I have been trying to fix the issue for several days. When I come to the last step of CreateDGEMatrix, I alwasy get an error info.
The command I used was: “java -jar BRBseqTools-1.6.jar CreateDGEMatrix -f subsample_V350083497_L01_read_1.fq.gz -b BAM/subsample_V350083497_L01_read_2.bam -c PPI_sample_to_barcode.txt -gtf Mus_musculus.GRCm38.99.gtf -p BU -UMI 14”

The error information was:

“BRBSeqTools 1.6 [CreateDGEMatrix]

Barcode Pattern = BU
Maximum allowed Hamming distance for sequencing error correction [Barcode] = 1
Maximum allowed Hamming distance for sequencing error correction [UMI] = 0
ChunkSize = 1000000 i.e. no more than 1000000 reads will be stored in RAM.
Stranded = YES
Multiple mapped reads will be DISCARDED
Config: B1 contains 48 barcodes: [ CACGTAGCTAGACG CAGCAGCGCTGACA TATGTACGTAGAGA CAGATCGCTCTGCG AACATGGCACGGAA ATGCTATCTAGACA TAACCTAACTTCAC ACCAATCCGTATAA CTCGTGCCTCAGGA CTCCTAACACAGCC ATGCAATTACACCG ATACCGTTGGAGGA ACCAACTTAGACAA TCAGCGTGTCGTCA ATTCCGCTCTGAGG ACCGATGTGCCGGA CTAAGGAAGGACCG TTCTGAGGCGATGG ACGAGCAGATGCAG ATCCTAAGTGGAGG TTCAATCTCCTTAG CTCGGTTCGAATGC ATACGATGCTGCGC TACACTATAGCTAG CAAGTATAAGGAAC AAGCCTTAGGCCGC CTGATATGCAGCGA AATGTGTGGATACC ATGCTGCAACGGCA CCATATTGAAGCAC TCGATTGGTAATGG CTTGATAAGCCAGG TTCATGGTAGAGAG ACGCTCCTAACGGC CTCAATTAGAGACC TCGATCTTCGGCAG AACAACCGAAGTAA AACTAGTATCCGGA ACGCAGCGAGCAGA TCCAAGGCATTAAG TCGATGCTATATGA TATGACTGCATGCG ACAAGGTTCCGCCG ACCGGATAACGGAA CCACGTTCCGGTAG ACTTAAGGTCCAAC TCCAGAAGGCAACA CTGCTCATGAGAGA ]
Config: B1 length [#nucleotides] = 14

Analyzing barcode Pattern...
According to barcode pattern, reads of R1 FastQ file should contain 28 characters.

Reading GTF file provided: /home/projects/ku_00009/people/fanzha/PPI/RNAseq/L01/Mus_musculus.GRCm38.99.gtf
843183 'exons' are annotating 55471 unique gene_ids in the provided GTF file.

I have a read1 with 28bp, 14bp barcode and followed by UMI.

My Read1 file looks like this:

The barcode file looks like this:

How should I fix the issue? Thank you very much.

deplanckelab / brb-seqtools Goto Github PK

brb-seqtools's Introduction

BRB-seq Tools

⚠️ Recent notes (and for Alithea users)

Download software

Dependencies

Java version

Picard

Sequencing output

Aligner

Installation

Example of full pipeline (using STAR)

Usage

ExtractReadCountMatrix

Demultiplex

CreateDGEMatrix

AnnotateBAM

Trim

Directory content

Author

brb-seqtools's People

Contributors

Stargazers

Watchers

Forkers

brb-seqtools's Issues

Recommend Projects

Recommend Topics

Recommend Org