ncbi / tpmcalculator Goto Github PK

TPMCalculator quantifies mRNA abundance directly from the alignments by parsing BAM files

License: Other

Makefile 4.46% C++ 54.58% C 39.31% Shell 0.40% Dockerfile 0.53% Common Workflow Language 0.72%

ngs rna-seq tpm

tpmcalculator's Introduction

TPMCalculator

TPMCalculator quantifies mRNA abundance directly from the alignments by parsing BAM files. The input parameters are the same GTF files used to generate the alignments, and one or multiple input BAM file(s) containing either single-end or paired-end sequencing reads. The TPMCalculator output is comprised of four files per sample reporting the TPM values and raw read counts for genes, transcripts, exons and introns respectively.

Reference

Roberto Vera Alvarez, Lorinc Sandor Pongor, Leonardo Mariño-Ramírez, David Landsman; TPMCalculator: one-step software to quantify mRNA abundance of genomic features, Bioinformatics, , bty896, https://doi.org/10.1093/bioinformatics/bty896

Conda/Bioconda

TPMCalculator is available on Bioconda: https://bioconda.github.io/recipes/tpmcalculator/README.html

NIH Biowulf

NIH Biowulf users can load TPMcalculator as a module: https://hpc.nih.gov/apps/TPMCalculator.html

Requirements

BAMTools

Clone the BAMTools repository from GitHub: https://github.com/pezmaster31/bamtools

Compile it on this way and set the environment variables for TPMCalculator:

cd bamtools
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=../ ..
make
make install
cd ..
export BAMTOOLS_DIR=`pwd`
export CPPFLAGS="-I $BAMTOOLS_DIR/include/bamtools/"
export LDFLAGS="-L $BAMTOOLS_DIR/lib64 -Wl,-rpath,$BAMTOOLS_DIR/lib64"

That's it. BAMTools was compiled and the env variables were set for compiling TPMCalculator.

Installation

After the installation of BAMTools go to the TPMCalculator folder and do make:

make

A bin folder will be created with the TPMCalculator executable.

Docker

Use provided Dockerfile based on the BioContainers base image.

docker build -t biocontainers/tpmcalculator:0.0.1 https://raw.githubusercontent.com/ncbi/TPMCalculator/master/Dockerfile

docker run -v /path_to_data:/data --user=yourUID:your:GID biocontainers/tpmcalculator:0.0.1 TPMCalculator -g /data/path_to_GTF/genes.gtf -b /data/path_to_bam/sample1.bam

CWL

A CWL tool definition is also provided tpmcalculator.cwl

Use it like this:

cwl-runner tpmcalculator.cwl --out_stderr=test.stderr --out_stdout=test.stdout -g genes.gtf -b sample_1.bam

Usage

Usage: ./bin/TPMCalculator -g GTF_file [-d BAM_files_directory|-b BAM_file]

./bin/TPMCalculator options:

    -v    Print info
    -h    Display this usage information.
    -g    GTF file
    -d    Directory with the BAM files
    -b    BAM file
    -k    Gene key to use from GTF file. Default: gene_id
    -t    Transcript key to use from GTF file. Default: transcript_id
    -c    Smaller size allowed for an intron created for genes. Default: 16. We recommend to use the reads length
    -p    Use only properly paired reads. Default: No. Recommended for paired-end reads.
    -q    Minimum MAPQ value to filter out reads. Default: 0. This value depends on the aligner MAPQ value.
    -o    Minimum overlap between a reads and a feature. Default: 8.
    -e    Extended output. This will include transcript level TPM values. Default: No.
    -a    Print out all features with read counts equal to zero. Default: No.

Description

The model to describe the genomic features used for a gene is created from the GTF provided by the user. TPMCalculator performs two transformations which are executed on the genomic coordinates generating regions for the genes that include the exons and “pure” intron regions as shown in Figure S1. The first transformation creates overlapped exons for all alternative spliced forms of the genes. A single gene model is generated with unique exons and introns which includes the sequence of all exonic regions. The second transformation process creates a list of pure intron regions that replace those generated by the first transformation. We should indicate that only the intron regions are modified to generate regions not overlapped by exons of other genes. Reporting TPM values for these unique introns allows further identification of alternative splicing events like intron retention. Additionally, a set of non-overlapped gene features (exons and introns) are generated and used for TPM calculation.

Validation

For more detailed description and instalation guide lines see https://github.com/ncbi/TPMCalculator/wiki/

Credits

Roberto Vera Alvarez Email: [email protected]

Lorinc Pongor Email: [email protected]

Leonardo Mariño-Ramírez Email: [email protected]

David Landsman Email: [email protected]

Public Domain notice

National Center for Biotechnology Information.

This software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.

tpmcalculator's People

Contributors

Stargazers

Watchers

tpmcalculator's Issues

Add multithreads

TPMCalculater, the output files are empty

My bam file generated from Star, when I used TPMCalculater, the output files are empty. It complains "Chromosome with name: TraesCS4D01G266700.2 does not exist Chromosome" for all the queries. I am not sure what's wrong. Thank you in advance!

btw, I also attached image file for bam, gtf and genome fasta files:

Chromosome with name: ENST.... does not exist

Hello!

How i understand that gtf file not contain version number of transcipts. Can you help me, how can i solve this problem without correct bam file?

Sets the name of the output

How could I set the name of the output and specify full path for the output in TPMCalculator tool?

For the paired end reads, is it recommended to use option -p or should i go with default without it?

Hi
I need to have tpm counts for paired end reads mapped with star, should i use option -p (Use only properly paired reads. Default: No. Recommended for paired-end reads.) or by default without it ? whats pros and cons os using it ?

Thanks

Output file description

Hi, I used TPMCalculator release 0.0.4 and the outputs don't quite match the descriptions on the wiki, could you please clarify what the outputs are?

My command:

TPMCalculator -a -p \
        -q 255 \
        -g ${gtf} \
        -b ${bam}

I get 3 output files per BAM, e.g. Control_1.final_genes.out, Control_1.final_genes.ent, Control_1.final_genes.uni

final_genes.out seems to match the first output file described, headers are a little different:

Gene_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength IntronReads IntronTPM UniqueLength UniqueReads UniqueTPM UniqueExonLength UniqueExonReads UniqueExonTPM UniqueIntronLength UniqueIntronReads UniqueIntronTPM

The _genes.ent file and _genes.uni files have these headers - what are the differences and where could I find transcript ID?
Gene_Id Chr Type Type_Number start end Length Reads TPM

Thanks

Compilation error: collect2: error: ld returned 1 exit status

Hi, I got this error message while compiling TPMcalculator 0.0.4:

export ...
make
...
...
g++ -g -I /share/app/bamtools/2.5.1/include/bamtools/ -L /share/app/bamtools/2.5.1/lib64 -Wl,-rpath,/share/app/bamtools/2.5.1/lib64  -o bin/TPMCalculator build/Release/GNU-MacOSX/src/DiffExpIR.o build/Release/GNU-MacOSX/src/FastaFactory.o build/Release/GNU-MacOSX/src/RandomFactory.o build/Release/GNU-MacOSX/src/ReadFactory.o build/Release/GNU-MacOSX/src/Stats.o build/Release/GNU-MacOSX/src/TextParser.o build/Release/GNU-MacOSX/src/bd0.o build/Release/GNU-MacOSX/src/bratio.o build/Release/GNU-MacOSX/src/bstring.o build/Release/GNU-MacOSX/src/chebyshev.o build/Release/GNU-MacOSX/src/choose.o build/Release/GNU-MacOSX/src/dnorm.o build/Release/GNU-MacOSX/src/dt.o build/Release/GNU-MacOSX/src/gamma.o build/Release/GNU-MacOSX/src/lbeta.o build/Release/GNU-MacOSX/src/lgamma.o build/Release/GNU-MacOSX/src/lgammacor.o build/Release/GNU-MacOSX/src/main.o build/Release/GNU-MacOSX/src/pbeta.o build/Release/GNU-MacOSX/src/phyper.o build/Release/GNU-MacOSX/src/pnorm.o build/Release/GNU-MacOSX/src/pt.o build/Release/GNU-MacOSX/src/qnorm.o build/Release/GNU-MacOSX/src/qt.o build/Release/GNU-MacOSX/src/stirlerr.o build/Release/GNU-MacOSX/src/sunif.o build/Release/GNU-MacOSX/src/wilcox.o -L../../bamtools/lib -lbamtools -lm -lz
build/Release/GNU-MacOSX/src/ReadFactory.o: In function `ngs::ReadFactory::ReadFactory()':
/software/TPMCalculator/src/ReadFactory.cpp:39: undefined reference to `BamTools::SamHeader::SamHeader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
build/Release/GNU-MacOSX/src/ReadFactory.o: In function `ngs::ReadFactory::processReadsFromBAM(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, unsigned short, unsigned short)':
/software/TPMCalculator/src/ReadFactory.cpp:349: undefined reference to `BamTools::BamReader::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
collect2: error: ld returned 1 exit status
make[2]: *** [bin/TPMCalculator] Error 1
make[2]: Leaving directory `/software/TPMCalculator'
make[1]: *** [.build-conf] Error 2
make[1]: Leaving directory `/software/TPMCalculator'
make: *** [.build-impl] Error 2

OS: CentOS Linux release 7.8.2003 (Core)
GCC: gcc version 7.5.0 (GCC)

Any idea how to solve this problem? Thanks!

Key ID for gene name was not found on GTF line

Hello,

I am using tpmcalculator (version 0.0.4) in conda env with a .gff3 annotation file. After running couple times with different -k parameters (first time default, second time ID, third time Parent), I got the same error message "Key gene_id/ID/Parent for gene name was not found on GTF line. Error processing GTF line at Chromosome level:"

The parameters are as following:

tpmcalculator
-g data/annotation/Cs_genes_v2_annot.gff3
-d $input1_dir
-b Aligned.sortedByCoord.out.bam
-p
-k "Parent"

The first couple line of my .gff3 annotation file are:
Chr1 AAFC_NRC gene 1 6504 . - . ID=Csa01g001000;Name=Csa01g001000;Note=methyl-CPG-binding domain 9

Chr1 AAFC_NRC gene 1 6504 . - . ID=Csa01g001000;Name=Csa01g001000

Chr1 AAFC_NRC mRNA 1 6504 . - . ID=Csa01g001000.1;Name=Csa01g001000.1;Parent=Csa01g001000;Note=methyl-CPG-binding domain 9

Chr1 AAFC_NRC five_prime_UTR 6380 6504 . - . ID=Csa01g001000.1.utr5p1;Parent=Csa01g001000.1

Chr1 AAFC_NRC exon 5865 6504 . - . ID=Csa01g001000.1.exon1;Parent=Csa01g001000.1

Does TPMCalculator not work with .gff3 annotation file? or I got some setting wrong when running the program?

Thank you in advance.

Installation without docker

In the installation instructions it says "Use provided Dockerfile based on the BioContainers base image." Is this step necessary? I do not have docker and it is difficult to install on my work system. I do not have sudo access.

Adding +1 to exon end.

From a gtf where coordinate is chr3 52554724 52554879 for an exon.

Looking into _genes.ent output then for this same exon :
chr3 52554724 52554880

It's always adding 1 to the end coordinate.

Weird, no ?

Output read counts file?

I noticed that if using multiple bam files it generates a table '_data_per_samples.txt' of TPM values for each gene/transcript for each sample. In order to replace FeatureCounts in a workflow, was wondering if it be useful to have a similar table of read counts generated?

That way you could use the TPM table to identify genes/transcripts that had low TPM across samples (e.g. you only want to analyze genes that had a TPM of 2 or greater in 75% of your samples), filter out those genes from the read counts table and then go onto some differential gene analysis directly. Wouldn't that save a step of having to merge '_gene.out' files across each sample if you wanted read counts?

Maybe a simple table such as:

Gene_Id
Chr
sample1_Count_Reads
sample2_Count_Reads
sample3_Count_Reads

TMPcalculator mising genes and duplicating other genes

I am facing issues with TPMcalculator! First, I noticed that it is counting much less number of genes. Second, among the counted ones there are duplications (same gene_ID in two lines).

I've checked my gtf file and looks fine... What could it be?

Possible to use gff3 in TPMCalculator v 0.4?

Hello,
Thank you for developing this nice tool. I tried using a GFF3 file with the -g flag, but it didn't work as expected. I then converted my GFF3 file to GTF, but I noticed that a significant amount of information was lost in the GTF file. So, I prefer to use the GFF3 file.

I am just wondering when the new version, incorporating the feature to handle GFF3 files, will be released.
Thanks in advance for your nice effort.
Cheers,
Farid

Add the gene strand to the overlap model.

Threads

hi~ How do you add threads in Linux?

Raw counts from paired-end alignments seem to be inflated by a factor of 2

For paired-end alignments generated with STAR, the reported ExonReads number is almost exactly double the number that STAR generates using the --quantMode GeneCounts option. Counts generated using kallisto agree closely with those from STAR. This discrepancy does not occur with single-end alignments.

Add the read strand to the count model

Different results based on seq. depth

Hi,
I was trying to compare TPM counts from different experiments. Because they have different seq. depths, for sanity check, I downsampled the same file and analyzed TPM in both cases (10 million vs 30 million PE fragments).
It seems that at higher depth, there is underestimation of TPM. Is there something I should be adjusting when running the program to fix this?
Thanks!

How to use it?

Need some documentation on how to use it, how to install it, etc.

Add a new option to use a directory as output destination

New feature to be added

how to generate the Output file: [genes|transcripts]_data_per_samples.txt ?

Hi,

running TPMCalculator installed with bioconda) on a directory: how do I get / create the output file as described in the "Outputr file section"?
command I used:

TPMCalculator -g GENOME.gtf -d BAMS/ -e
-> with all bam files in the filder BAM.

It worked fine :

Reading GTF file ...
Done in 8.92957 seconds
Parsing BAM files
...
Printing results

BUT I did not find in the usage then how to create the
Output file: [genes|transcripts]_data_per_samples.txt

Any help appreiated!!

Cheers Karen

No TPM values, no reads processed

Hi,

I am using TPMCalculator using the following command:

TPMCalculator -g file.gtf -d ./ -a -p -e

In the current directory, I have sorted bam files deriving from HISAT2 alignment tool of RNA reads against the genome.

Here, you will find the head of samtools view on bam file.

Here, follows part of the log file:

Reading GTF file ...
Done in 12.0684 seconds
Parsing sample: trimmed_hisat_out.bam.sorted
0 reads processed in 23.4511 seconds
Printing results
Total time: 42.3443 seconds

Thanks for the support!

Br,
Marco

Multimapped Reads

How does this package deal with multi-mapped reads?

Installation api/BamReader.h: No such file or directory

Follow the protocol to install and get the following error when doing make in TPM calculator folder.

:~/bin/TPMCalculator$ make
"make" -f nbproject/Makefile-Release.mk QMAKE= SUBPROJECTS= .build-conf
make[1]: Entering directory '/home/jean-philippe.villemin/bin/TPMCalculator'
"make"  -f nbproject/Makefile-Release.mk bin/TPMCalculator
make[2]: Entering directory '/home/jean-philippe.villemin/bin/TPMCalculator'
mkdir -p build/Release/GNU-MacOSX/src
rm -f "build/Release/GNU-MacOSX/src/DiffExpIR.o.d"
g++ -g -I /home/jean-philippe.villemin/bin/bamtools/include  -c -O2 -Iincludes -std=c++14 -MMD -MP -MF "build/Release/GNU-MacOSX/src/DiffExpIR.o.d" -o build/Release/GNU-MacOSX/src/DiffExpIR.o src/DiffExpIR.cpp
src/DiffExpIR.cpp:26:27: fatal error: api/BamReader.h: No such file or directory
 #include "api/BamReader.h"
                           ^
compilation terminated.
nbproject/Makefile-Release.mk:92: recipe for target 'build/Release/GNU-MacOSX/src/DiffExpIR.o' failed
make[2]: *** [build/Release/GNU-MacOSX/src/DiffExpIR.o] Error 1
make[2]: Leaving directory '/home/jean-philippe.villemin/bin/TPMCalculator'
nbproject/Makefile-Release.mk:85: recipe for target '.build-conf' failed
make[1]: *** [.build-conf] Error 2
make[1]: Leaving directory '/home/jean-philippe.villemin/bin/TPMCalculator'
nbproject/Makefile-impl.mk:39: recipe for target '.build-impl' failed
make: *** [.build-impl] Error

A critical issue: different number of genes in output for different BAM files

Dear TPMCalculator team,

Thank you for a very useful tool, but I found that it produces a different number of genes for different BAM files with the same GTF file. It seems that the tool output only genes with a non-zero count. This is a wrong way - for example, one produced 10 bam files and calculated TPMs, next week he produced another 5 BAM files and calculated TPMs for them but it is very difficult to join all these 15 TPM count files because they have a different number of lines! The genes with non-zero counts should be present in the output and should always be in the same order based on the GTF file. It makes much easy to join them with just a "paste" Linux command.

best regards,
Andrey

Output files desctiption

Hello,

I found some details about output in your article: "The TPMCalculator output is comprised of five files per sample reporting the TPM values and raw read counts for genes, transcripts, exons and introns."
In my case the TPMCalculator generated three files: _genes.uni, _genes.out and _genes.ent. I found most of the values in the _genes.out file, but there are two other files. What are they need for? Could you give a more detailed description of the TPMCalculator output, please.

Question on TPMCalculator : bam pre-treatment needed ?

Do we need to index or sort the bam before using TPMCalculator ?

Thanks.

libbamtools.so.2.4.0: cannot open shared object file: No such file or directory

Hello I'm using TPMcalculator on a cluster, so I installed the software with a BAMtools version installed in a conda environnement (during the installation it gave something like :

**```
export BAMTOOLS_DIR=/path1/mycondaenv/
export CPPFLAGS="-I $BAMTOOLS_DIR/include/bamtools/"
export LDFLAGS="-L $BAMTOOLS_DIR/lib64 -Wl,-rpath,$BAMTOOLS_DIR/lib64"


and then I simply  ran a  make, so the `/path1/TOOLS/TPMCalculator/bin/TPMCalculator` workes locally but when I try to run the program on a cluster I get the following error message : 

` /path1/TOOLS/TPMCalculator/bin/TPMCalculatorr: error while loading shared libraries: libbamtools.so.2.4.0: cannot open shared object file: No such file or directory`

Do you have an idea of what is going on ? 

Thank you for your help

Chromosome with name: A01 does not exist

My bam file comes from the STAR. When I run TPMCalculator, I received an error message:
Reading GTF file ...
Done in 0.19 seconds
Parsing sample: R500.Aligned.sortedByCoord.out.bam
Chromosome with name: A01 does not exist
Chromosome with name: A01 does not exist
Chromosome with name: A01 does not exist
Chromosome with name: A01 does not exist
Chromosome with name: A01 does not exist
Chromosome with name: A01 does not exist
...
Thanks for your help!

TPMCalculator is not working with the Ensembl genome annotation

I tried ensembl/grch37/release-94.
Didn't work ;(

symbol lookup error

Hi,

I installed TPMcalculator using conda. I have aligned FASTQs to GRCh38 with bowtie and have obtained GTF from Ensembl. I am not too sure what to make of this error - any help would be great, thanks.

Tracy

[tc6463@gadi-login-02 Scripts]$ /scratch/er01/apps/miniconda3/bin/TPMCalculator -g ../References/GRCh38_v103/Homo_sapiens.GRCh38.103.gtf -b ../RNA_cutadapt_min17_align_Coenen-Stass/HB10-aln-hg38.sorted.dedup.bam
Reading GTF file ...
Done in 163.879 seconds
Parsing sample: HB10-aln-hg38.sorted.dedup/scratch/er01/apps/miniconda3/bin/TPMCalculator: symbol lookup error: /scratch/er01/apps/miniconda3/bin/TPMCalculator: undefined symbol: _ZN8BamTools11SamSequenceC1ERKS0_

unrecoginized command line option std=c++14 on compilation

Hi there,

Compilation seems to fail due to our cluster running a relatively old version of the gcc compiler (maybe)

Error:

make
"make" -f nbproject/Makefile-Release.mk QMAKE= SUBPROJECTS= .build-conf
make[1]: Entering directory `/SAN/vyplab/alb_projects/tools/TPMCalculator'
"make"  -f nbproject/Makefile-Release.mk bin/TPMCalculator
make[2]: Entering directory `/SAN/vyplab/alb_projects/tools/TPMCalculator'
mkdir -p build/Release/GNU-MacOSX/src
rm -f "build/Release/GNU-MacOSX/src/DiffExpIR.o.d"
g++ -g -I /home/annbrown/tools/bamtools/include/bamtools/  -c -O2 -Iincludes -I../../bamtools/include/bamtools -std=c++14 -MMD -MP -MF "build/Release/GNU-MacOSX/src/DiffExpIR.o.d" -o build/Release/GNU-MacOSX/src/DiffExpIR.o src/DiffExpIR.cpp
g++: error: unrecognized command line option ‘-std=c++14

Our cluster currenty has

gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Copyright (C) 2015 Free Software Foundation, Inc.

And a brief google of the error returns this SO
https://stackoverflow.com/questions/36245428/c-error-unrecognized-command-line-option-std-c14
which suggest changing

-std=c++14
to
-std=c++1y

However just editing the Makefile-Release.mk(which is the only place I can find calls to -std=c++14, and yes I see the warning I should not edit it ^^) leads to another error

In file included from src/DiffExpIR.cpp:33:0:
includes/Sequence.h: In member function ‘std::pair<std::shared_ptr<sequence::DNA>, bool> sequence::DNAContainer::addElement(std::string)’:
includes/Sequence.h:218:58: error: ‘make_unique’ is not a member of ‘std’
             result = sequences.insert(std::make_pair(id, std::make_unique<DNA>()));
                                                          ^
includes/Sequence.h:218:78: error: expected primary-expression before ‘>’ token
             result = sequences.insert(std::make_pair(id, std::make_unique<DNA>()));
                                                                              ^
includes/Sequence.h:218:80: error: expected primary-expression before ‘)’ token
             result = sequences.insert(std::make_pair(id, std::make_unique<DNA>()));
                                                                                ^
make[2]: *** [build/Release/GNU-MacOSX/src/DiffExpIR.o] Error 1
make[2]: Leaving directory `/SAN/vyplab/alb_projects/tools/TPMCalculator'
make[1]: *** [.build-conf] Error 2
make[1]: Leaving directory `/SAN/vyplab/alb_projects/tools/TPMCalculator'
make: *** [.build-impl] Error 2

I've got the bamtools installed already as well.

 echo $CPPFLAGS
-I /home/annbrown/tools/bamtools/include/bamtools/
echo $LDFLAGS
-L /home/annbrown/tools/bamtools/lib64 -Wl,-rpath,/home/annbrown/tools/bamtools/lib64

cheers and thanks!

Compatibility with NCBI GFF3 genome annotation

Meaning of "UniqueReads" in genes.out file?

I have been comparing HTSeq (htseq-count) counts to TPMcalcualtor UniqueReads counts.

I find them generally to be in very good agreement. However, there are occasionally some cases where HTSeq and TPMcalculator will be in agreement for the "Reads" column of TPMcalculator output. but will disagree in the last 8 columns (TPMcalculator output will report zeros).

For example:

DTO2_FUN_009269 scaffold_6 3072673 3073284 612 366201 9037.3 612 366201 7579.98 0 0 0 0 0 0 0 0 0 0 0 0

This can happen with genes that receive just a few reads or with genes have 10's of thousands of reads aligning. It can happen with intronless genes or with multi-exon genes.

Since I'm using de novo assembly and annotation, I'm wondering if there could be something about my gff file that is tripping up TPMcalculator?

Here's my gff file for the example case above

$ grep DTO2_FUN_009269 ../../final_assemblies/full_assemblies/DTO2/genome.masked.gff3 
scaffold_6	funannotate	gene	3072674	3073285	.	-	.	ID=DTO2_FUN_009269;
scaffold_6	funannotate	mRNA	3072674	3073285	.	-	.	ID=DTO2_FUN_009269-T1;Parent=DTO2_FUN_009269;product=hypothetical protein;
scaffold_6	funannotate	exon	3072674	3073285	.	-	.	ID=DTO2_FUN_009269-T1.exon1;Parent=DTO2_FUN_009269-T1;
scaffold_6	funannotate	CDS	3072674	3073285	.	-	0	ID=DTO2_FUN_009269-T1.cds;Parent=DTO2_FUN_009269-T1;

Pool BAM file and get TPM for each BAM experiment ?

Hello, I have different BAM files wich corresponds do different tissue transcript sequencing reads.
So my question is the following:

Is there a way to pool all the BAM file, get a TPM value for a list of specific genes (with coordinates), and to know if the genes are expressed globaly (pooled BAM) and (specificaly to certain tissues)?

And also I wondered if TPMCalculator can calculate TPM values for genes but just with start/stop informations ? (I do not have any informations about exon/intron - my genes should be only exons since they are blast hits from a BLASTP analysis)

Thank you very much for your help

Compatibility with gencode GTF

I know you didn't test TPMCalculator with Gencode gtf in your publication .

That would be great if you could confirm that there no incompatibilities using gtf from Gencode.

What main difference between ucsc and gencode gtf could lead to wrong results ?

From test I have done, it seems to work. I didn't get specific errors with human gtf v25. But still, I didn't go into details to check the counts are well done.

Thanks

GTF error: "Key for isoform name was not found on GTF line"

I am getting the following error when I try to run TPMcalculator:
Key for isoform name was not found on GTF line.
Error processing GTF line at Chromosome level:
scaffold11323 maker exon 2123 2371 . - . gene_ID "Ha_G028047"; transcript_ID "Ha_T028047_A"

My GTF doesn't have isoforms and I don't see a way to either specific an isoform key or skip it.

Build problems on Ubuntu and MacOSX

Following the README instructions for building TPMCalculator from source fails for a couple of reasons:

1: setting LDFLAGS="-L $BAMTOOLS_DIR/lib64 -Wl,-rpath,$BAMTOOLS_DIR/lib64"
is not correct - the build process for bamtools does not generate a lib64 directory, rather
libbamtools.a is in the lib directory.

2: the TPMCalculator make fails since the final build command contains -L../../bamtools/lib
whereas the instructions have the default dir as /TPMCalculator from where the command
should be -L../bamtools/lib.

Input files and GeneID Errors

Hi! Thanks for developing such a useful tool! :)
I've installed it with Bioconda in a linux server and I'm trying to use it to analyze the STAR BAM files from PE reads, but I got two errors.

When running the following command:
nohup /localhome/icmm/kns601/.local/bin/anaconda3/bin/TPMCalculator -g /path/to/gencode.v35.annotation.gtf -d BAMwithMT -p -k gene_id &> nohup_TPMCounts_withMT.out &

I got this in the nohup report:
nohup: ignoring input
Reading GTF file ...
Done in 286.83 seconds
Parsing BAM files
Processing sample: PT979-2_Aligned.sortedByCoord.out
(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)(NUL)
Chromosome with name: GL000008.2 does not exist
Chromosome with name: GL000008.2 does not exist
Chromosome with name: GL000008.2 does not exist
Chromosome with name: GL000008.2 does not exist
Chromosome with name: GL000008.2 does not exist
Chromosome with name: GL000008.2 does not exist
...
Chromosome with name: KI270757.1 does not exist
Chromosome with name: KI270757.1 does not exist
Chromosome with name: KI270757.1 does not exist
Chromosome with name: KI270757.1 does not exist
Chromosome with name: KI270757.1 does not exist
Chromosome with name: KI270757.1 does not exist
in 2180.64 seconds. Processed 50949830 reads
Processing sample: 2730-006_Aligned.sortedByCoord.outCan't open file BAMwithMT/2730-006_Aligned.sortedByCoord.out.bam

The first problem I had is that its taking the geneIDs as chromosomes, and it prompts the "does not exist" error. It seems that when analyzing the genes from GTF reference that are in the regular gene annotation it doesn't prompt any error, but when arriving to the ones that are outside that annotation gets the geneID as chromosome for some reson.

The other problem is that the following file is not read properly, as the BAMwithMT/2730-006_Aligned.sortedByCoord.out.bam exists in my folder and the program itself has printed the path as I was using the "-d" option.

Can you please help me solving this issues?

Thanks!
Quim

*_genes.out duplicate genes

Hello, thanks for your good tool!
My problem is some genes are duplicated in *_genes.out result. I used hisat2 to align reads.

TPMCalculator -g ${GRCh38GTF} -b ${BamDir}/${Sample}.bam \
-c 150 -k gene_id -t transcript_id -q 60 -p -a

Then, some genes have multiple records in *_genes.out, for example "ENSG00000135406" and "ENSG00000235538"

# first 7 columns
ENSG00000135406.14#1    chr12   49293251        49293996        746     0       0
ENSG00000135406.14#2    chr12   49295146        49298685        3540    0       0

# ENSG00000235538
ENSG00000235538.3#1     chr6    163671576       163798848       127273  4       0.00366198
ENSG00000235538.3#2     chr6    163927121       164009979       82859   2       0.00281244
ENSG00000235538.3#3     chr6    164228330       164231609       3280    0       0

This is my GTF record:

$ grep "ENSG00000235538" gencode.v35.annotation.gtf | awk -v FS="\t" '$3!="exon" {$NF="";print}'
chr6 HAVANA gene 163671577 164231610 . + . 
chr6 HAVANA transcript 163671577 163760739 . + . 
chr6 HAVANA transcript 163671603 163716199 . + . 
chr6 HAVANA transcript 163671609 163774630 . + . 
chr6 HAVANA transcript 163671613 163774630 . + . 
chr6 HAVANA transcript 163671641 163774628 . + . 
chr6 HAVANA transcript 163671642 163773768 . + . 
chr6 HAVANA transcript 163671643 163760376 . + . 
chr6 HAVANA transcript 163671658 163760376 . + . 
chr6 HAVANA transcript 163671666 163774624 . + . 
chr6 HAVANA transcript 163671678 163774626 . + . 
chr6 HAVANA transcript 163703904 163759841 . + . 
chr6 HAVANA transcript 163748718 163760382 . + . 
chr6 HAVANA transcript 163759267 163798849 . + . 
chr6 HAVANA transcript 163927122 164009980 . + . 
chr6 HAVANA transcript 164228331 164231610 . + .

THis is bug?

CPM output

--Hi,

is it possible to add an option to output CPM instead of TPM ?
thank you --

Is the read counting strand-specific?

Hello,

I have used TPMCalculator to quantify RNA expression using bam files obtained using HISAT2.

For some genes I am getting extremely high intronic TPM counts when compared to exon values in the same gene. I have inspected those genes and I found that some introns overlap other genes in the opposite strand, partially or completely (see attached picture). So, when the TPM is calculated, the intron siganl seems to be inflated and doesn't correlate with the exonic values which are very low or zero.

To run TPMCalculator I use this commans:
TPMCalculator -v -g $annotation -k gene_id -t transcript_id -p -b $bam

My question is, when TPM counts the reads per feature, is the counting strand-specific?

Thank you,
Manuel.

Nothing in the output files.

Hi ,
Thank you for a exceeding useful tool. I am running RNA-seq of Rice now. I use subjunc mapping ,then i use TPMCalculator to run.But the result files don't have anything . I don't know why .I will upload I'd appreciate it if you could help.
The run code .

The gtf file.
IRGSP-1.0_representative_transcript_exon_2018-11-26.gtf.zip
A portion of the bam.

Question about the -c fag and multi-mappers

Hello,
I am not sure I understand the meaning of the -c flag "-c Smaller size allowed for an intron created for genes. Default: 16. We recommend to use the reads length". Why do you recommend the reads length?

Also, how does the software treat multi-mapping reads, so reads matching multiple locations across the genome (value more than 1 in NH:i: field). Is that normally handled by MAPQ filtering?

All the best,
Agnieszka

Help me please

Hello everyone, can anyone tell me what the error is?

when building nginx packages with the vidayoot etu oshibku brotley module

objs/addon/static/ngx_http_brotli_static_module.o
objs/ngx_modules.o
-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie -ldl -lpthread -lpthread -lcrypt -L/usr/local/src/ngx_brotli/deps/brotli/c/../out -lbrotlienc -lbrotlicommon -lm -lpcre2-8 -lssl -lcrypto -ldl -lpthread -lz
-Wl,-E
/usr/bin/ld: cannot find -lbrotlienc
/usr/bin/ld: cannot find -lbrotlicommon
collect2: error: ld returned 1 exit status
make[2]: *** [objs/Makefile:333: objs/nginx] Error 1
make[2]: Leaving directory '/usr/local/src/nginx-1.24.0/debian/build-nginx'
make[1]: *** [Makefile:10: build] Error 2
make[1]: Leaving directory '/usr/local/src/nginx-1.24.0/debian/build-nginx'
make: *** [debian/rules:52: build-arch.nginx] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2

Gene number in input GTF differs from the TPMCalculator output

My input GTF and the GFF from which it was derived contains 62417 genes. I ran TPMCalculator on 15 BAM files corresponding to 15 RNA-seq experiments. The number of lines in the file alignments.sorted.pos.uniq_genes.out differs in each case:

31124
33419
33888
34408
34876
36120
36162
36412
37151
37192
37521
38853
40305
40581
40760

There was a repeated message to STDERR saying:

Chromosome with name: XXXXXX does not exist

But I don't know its meaning or whether it is an error or a warning.

Is it possible to provide a binary version of TPMCalculator

TPMCalculator is a useful tool to calculate expression of RNA-Seq datasets.
I try to install on our Linux system. But it failed.
So, could you provide binary version?

/usr/bin/ld: cannot find -lbamtools

Hi there,

I installed the BamTools and TPMCalculateor, but when I use make in the TPMCalulator fold, it reported some errors, as below:

"make" -f nbproject/Makefile-Release.mk QMAKE= SUBPROJECTS= .build-conf
make[1]: Entering directory '/home/congxiao/software/TPMCalculator-master'
"make" -f nbproject/Makefile-Release.mk bin/TPMCalculator
make[2]: Entering directory '/home/congxiao/software/TPMCalculator-master'
mkdir -p bin
g++ -g -I /home/congxiao/software/bamtools/include/bamtools/ -L /home/congxiao/software/bamtools/lib64 -Wl,-rpath,/home/congxiao/software/bamtools/lib64 -o bin/TPMCalculator build/Release/GNU-MacOSX/src/DiffExpIR.o build/Release/GNU-MacOSX/src/FastaFactory.o build/Release/GNU-MacOSX/src/RandomFactory.o build/Release/GNU-MacOSX/src/ReadFactory.o build/Release/GNU-MacOSX/src/Stats.o build/Release/GNU-MacOSX/src/TextParser.o build/Release/GNU-MacOSX/src/bd0.o build/Release/GNU-MacOSX/src/bratio.o build/Release/GNU-MacOSX/src/bstring.o build/Release/GNU-MacOSX/src/chebyshev.o build/Release/GNU-MacOSX/src/choose.o build/Release/GNU-MacOSX/src/dnorm.o build/Release/GNU-MacOSX/src/dt.o build/Release/GNU-MacOSX/src/gamma.o build/Release/GNU-MacOSX/src/lbeta.o build/Release/GNU-MacOSX/src/lgamma.o build/Release/GNU-MacOSX/src/lgammacor.o build/Release/GNU-MacOSX/src/main.o build/Release/GNU-MacOSX/src/pbeta.o build/Release/GNU-MacOSX/src/phyper.o build/Release/GNU-MacOSX/src/pnorm.o build/Release/GNU-MacOSX/src/pt.o build/Release/GNU-MacOSX/src/qnorm.o build/Release/GNU-MacOSX/src/qt.o build/Release/GNU-MacOSX/src/stirlerr.o build/Release/GNU-MacOSX/src/sunif.o build/Release/GNU-MacOSX/src/wilcox.o -L../../bamtools/lib -lbamtools -lm -lz
/usr/bin/ld: cannot find -lbamtools
collect2: error: ld returned 1 exit status
make[2]: *** [nbproject/Makefile-Release.mk:89: bin/TPMCalculator] Error 1
make[2]: Leaving directory '/home/congxiao/software/TPMCalculator-master'
make[1]: *** [nbproject/Makefile-Release.mk:85: .build-conf] Error 2
make[1]: Leaving directory '/home/congxiao/software/TPMCalculator-master'
make: *** [nbproject/Makefile-impl.mk:40: .build-impl] Error 2

I installed the bamtools just as your README file, I don't know why.

Thank you!

ID with "#" in _genes.out file

I use gtf file from Gencode database.
I found the TPMCalculator would prodeced two or more gene information of one gene id (ENSMUSG00000026162.7#1, ENSMUSG00000026162.7#2).
But the original id of gtf file is ENSMUSG00000026162.7.
Why TPMCalculator output two or more gene id with "#"?

77660 Segmentation fault

Hello, first of all thank you for the program it is very usefull for my work !

I installed TPMCalculator with conda and mapped transcriptomic data against a genome, wich gave me :

- 1 Transcriptomic_map.bam

and I also create a GTF file format :
-2 GTF_candidates..gtf

Then I used the code :

/data/anaconda3/envs/my_python3.6_env/bin/TPMCalculator -g /data/Genomes/Species/Mapping/Transcriptomic_reads/GFF_candidates.gtf -b /data/these/Genomes/Species/Mapping/Transcriptomic_reads/Transcriptomic_map.bam -e -a

but here I get the following issue :

Reading GTF file ... 
Done in 0.005154 seconds
Parsing sample: mapping_Species
/var/spool/slurmd/job1928015/slurm_script: line 46: 77660 Segmentation fault      /data/anaconda3/envs/my_python3.6_env/bin/TPMCalculator -g /data/Genomes/Species/Mapping/Transcriptomic_reads/GTF_candidates.
gff -b /data/Genomes/Species/Mapping/Transcriptomic_reads/Transcriptomic_map.bam -e -a

I already used the program with other genomes and mapping and everything worked well, i do not understand what is going on with this one ?

Thank you for your help

ncbi / tpmcalculator Goto Github PK

tpmcalculator's Introduction

TPMCalculator

Reference

Conda/Bioconda

NIH Biowulf

Requirements

BAMTools

Installation

Docker

Usage

Description

Validation

Credits

Public Domain notice

National Center for Biotechnology Information.

tpmcalculator's People

Contributors

Stargazers

Watchers

Forkers

tpmcalculator's Issues

Recommend Projects

Recommend Topics

Recommend Org