Comments (6)
Hi,
Thanks for reporting this.
I did some extra checks and TPMCalculator seems to be working fine. We count paired reads independently. If the read and its mate are in the same gene we will count it as 2 reads. Maybe this is the reason of the duplication.
I tried to run STAR with the --quantMode GeneCounts option but the output was not giving me any logical counts.
This was my STAR command line:
STAR --alignEndsType Local --alignSJDBoverhangMin 1 --alignSJoverhangMin 15 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --outFilterMismatchNoverLmax 0.3 --outFilterMultimapNmax 100 --outFilterScoreMinOverLread 0.3 --outFilterType BySJout --outSAMtype BAM Unsorted --outSAMunmapped Within --quantMode GeneCounts --readFilesCommand zcat --seedSearchStartLmax 12 --runThreadN 16 --twopassMode Basic --winAnchorMultimapNmax 50 --genomeDir ./STAR --outStd Log --readFilesIn 299_7_1.fastq.gz 299_7_2.fastq.gz --outFileNamePrefix 299_7
perseo:tpmcalculator> head 299_7ReadsPerGene.out.tab
N_unmapped 808563 808563 808563
N_multimapping 8128187 8128187 8128187
N_noFeature 8270340 33065941 8437845
N_ambiguous 550362 5759 253315
DDX11L1 0 0 0
WASH7P 24 0 24
MIR6859-3 0 0 0
MIR6859-2 0 0 0
MIR6859-1 0 0 0
MIR6859-4 0 0 0
The TPMCalculator output:
Gene_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength IntronReads IntronTPM UniqueLegth UniqueReads UniqueTPM UniqueExonLength UniqueExonReads UniqueExonTPM UniqueIntronLength UniqueIntronReads UniqueIntronTPM
DDX11L1 chr1 11873 14408 2536 121 0.284233 1652 73 0.239589 884 50 48.3515 2488 119 0.708173 1604 71 0.529718 884 50 62.9121
WASH7P chr1 14361 29369 15009 10961 4.35047 1769 5126 15.7111 13240 5903 381.133 14893 10944 10.8802 1721 5124 35.6303 13172 5885 496.949
MIR6859-3#1 chr1 17368 17435 68 26 2.27773 68 26 2.07309 0 0 0 0 0 0 0 0 0 0 0 0
MIR6859-4#1 chr1 17368 17435 68 23 2.01492 68 23 1.83389 0 0 0 0 0 0 0 0 0 0 0 0
MIR6859-1#1 chr1 17368 17435 68 23 2.01492 68 23 1.83389 0 0 0 0 0 0 0 0 0 0 0 0
MIR6859-2#1 chr1 17368 17435 68 23 2.01492 68 23 1.83389 0 0 0 0 0 0 0 0 0 0 0 0
FAM138A chr1 34610 36080 1471 4 0.0161989 1130 3 0.0143945 341 1 2.5069 0 0 0 0 0 0 0 0 0
FAM138C chr1 34610 36080 1471 4 0.0161989 1130 3 0.0143945 341 1 2.5069 0 0 0 0 0 0 0 0 0
FAM138F chr1 34610 36080 1471 4 0.0161989 1130 3 0.0143945 341 1 2.5069 0 0 0 0 0 0 0 0 0
We see for the first gene DDX11L1 73 reads whereas STAR report 0 reads. I tried counting reads with samtools directly on each exon and intron of the gene:
perseo:tpmcalculator> samtools view -bh -f2 299_7_sorted.bam > 299_7_sorted_pp.bam
perseo:tpmcalculator> samtools index 299_7_sorted_pp.bam
perseo:tpmcalculator> echo -e "Chr\tstart\tend\tTPMCalculator\tSamtools"; head 299_7_sorted_genes.ent | grep DDX11L1 | while read line; do s=`echo $line | awk '{print $5}'`; e=`echo $line | awk '{print $6}'`; c=`echo $line | awk '{print $2}'`; r=`samtools view 299_7_sorted_pp.bam $c:$s-$e | wc -l`; m=`echo $line | awk '{print $8}'`; echo -e "$c\t$s\t$e\t$m\t$r"; done
Chr start end TPMCalculator Samtools
chr1 11873 12226 27 29
chr1 12227 12611 29 29
chr1 12612 12720 1 1
chr1 12721 13219 21 21
chr1 13220 14408 45 46
TPMCalculator counts are very similar to the samtools reads, I could dig deeper but the one or two different reads should be due to boundaries limitations set by TPMCalculator.
Can you send me the STAR command you used for aligning the reads to see if I can reproduce the error, please?
Best,
Roberto
from tpmcalculator.
The STAR command I used was the following:
STAR --runThreadN 4 --genomeDir ./GRCm38.p6_Ensembl95_GENCODE_M20/GenomeIndex/STAR_2.7.0e_index_soft_masked/ --readFilesIn GF_10_R1.fastq.gz GF_10_R2.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --alignIntronMin 20 --alignIntronMax 1000000 --outFileNamePrefix GF_10 --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMattrIHstart 0 --outSAMmapqUnique Integer0to255 --quantMode GeneCounts --outReadsUnmapped Fastx --twopassMode Basic
But it sounds like counting paired reads independently is the most likely explanation for the behavior I saw.
from tpmcalculator.
Hi,
I aligned the BAM file using your STAR options. There is not big differences in the resulting alignment. However, I look at gene RN7SL2 which has a big differences in read count and I found something interesting.
TPMCalculator
Gene_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength IntronReads IntronTPM UniqueLegth UniqueReads UniqueTPM UniqueExonLength UniqueExonReads UniqueExonTPM UniqueIntronLength UniqueIntronReads UniqueIntronTPM
RN7SL2 chr14 49862550 49862848 299 4.01557e+06 91435.8 299 4.01557e+06 82218.8 0 0 0 299 4.01557e+06 268076 299 4.01557e+06 203392 0 0 0
STAR
RN7SL2 268186 96 268091
There is a huge difference in reads.
Samtools:
perseo:tpmcalculator> samtools view GF_10Aligned.sortedByCoord.out.bam "chr14:49862550-49862848" | wc -l
4287979
As you can see, samtools is also counting over 4M reads for that gene.
Finally, I look at the gene on IGV and the maximum coverage is over 800000. Therefore, there should be millions of reads at that gene.
In some way, STAR is not counting all reads in this gene. I don't know how STAR counts the reads per gene but this example shows that the number reported by STAR does not agree with the reads in the BAM file.
from tpmcalculator.
from tpmcalculator.
We did correlation with htseq-count for our paper using UCSC and Gencode genomes annotations.
Check this: https://github.com/ncbi/TPMCalculator/wiki/Validation#validation-using-gencode-v25-genome-annotation
This is the correlation with htseq-count for more than 1000 samples. As you can see, there is a high correlation which is totally different to the correlation we see now between STAR and TPMCalculator.
from tpmcalculator.
Hi calpan,
Alex Dobin answered my question in the STAR Github issue alexdobin/STAR#603 (comment)
Basically, STAR counts only reads in the same strand, therefore, it should be approximately half of the properly paired reads counted by TPMCalculator.
This is what you are seeing in your results. You can use either approach, just choose the one which is more convenient for your data analysis.
Thanks,
Roberto
from tpmcalculator.
Related Issues (20)
- Gene number in input GTF differs from the TPMCalculator output HOT 3
- symbol lookup error HOT 9
- Sets the name of the output HOT 2
- Add a new option to use a directory as output destination HOT 1
- For the paired end reads, is it recommended to use option -p or should i go with default without it? HOT 1
- Output file description HOT 2
- /usr/bin/ld: cannot find -lbamtools HOT 2
- *_genes.out duplicate genes HOT 3
- Compilation error: collect2: error: ld returned 1 exit status HOT 2
- Is the read counting strand-specific? HOT 2
- Key ID for gene name was not found on GTF line HOT 3
- No TPM values, no reads processed HOT 9
- Chromosome with name: ENST.... does not exist HOT 10
- Output files desctiption HOT 1
- Installation without docker HOT 1
- Meaning of "UniqueReads" in genes.out file? HOT 2
- Possible to use gff3 in TPMCalculator v 0.4? HOT 1
- Build problems on Ubuntu and MacOSX HOT 1
- Help me please HOT 1
- After installing version 0.0.4, -version still prints 0.0.3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tpmcalculator.