For paired-end alignments generated with STAR, the reported ExonReads number is almost

The STAR command I used was the following: <div class="snippet-clipboard-content n

STAR uses the same algorithm as htseq-count (<a href="https://github.co

Raw counts from paired-end alignments seem to be inflated by a factor of 2 about tpmcalculator HOT 6 CLOSED

ncbi commented on May 27, 2024

Raw counts from paired-end alignments seem to be inflated by a factor of 2

from tpmcalculator.

Comments (6)

r78v10a07 commented on May 27, 2024

Hi,
Thanks for reporting this.
I did some extra checks and TPMCalculator seems to be working fine. We count paired reads independently. If the read and its mate are in the same gene we will count it as 2 reads. Maybe this is the reason of the duplication.

I tried to run STAR with the --quantMode GeneCounts option but the output was not giving me any logical counts.
This was my STAR command line:

STAR --alignEndsType Local --alignSJDBoverhangMin 1 --alignSJoverhangMin 15 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --outFilterMismatchNoverLmax 0.3 --outFilterMultimapNmax 100 --outFilterScoreMinOverLread 0.3 --outFilterType BySJout --outSAMtype BAM Unsorted --outSAMunmapped Within --quantMode GeneCounts --readFilesCommand zcat --seedSearchStartLmax 12 --runThreadN 16 --twopassMode Basic --winAnchorMultimapNmax 50 --genomeDir ./STAR --outStd Log --readFilesIn 299_7_1.fastq.gz 299_7_2.fastq.gz --outFileNamePrefix 299_7

perseo:tpmcalculator> head 299_7ReadsPerGene.out.tab
N_unmapped	808563	808563	808563
N_multimapping	8128187	8128187	8128187
N_noFeature	8270340	33065941	8437845
N_ambiguous	550362	5759	253315
DDX11L1	0	0	0
WASH7P	24	0	24
MIR6859-3	0	0	0
MIR6859-2	0	0	0
MIR6859-1	0	0	0
MIR6859-4	0	0	0

The TPMCalculator output:

Gene_Id	Chr	Start	End	Length	Reads	TPM	ExonLength	ExonReads	ExonTPM	IntronLength	IntronReads	IntronTPM	UniqueLegth	UniqueReads	UniqueTPM	UniqueExonLength	UniqueExonReads	UniqueExonTPM	UniqueIntronLength	UniqueIntronReads	UniqueIntronTPM
DDX11L1	chr1	11873	14408	2536	121	0.284233	1652	73	0.239589	884	50	48.3515	2488	119	0.708173	1604	71	0.529718	884	50	62.9121
WASH7P	chr1	14361	29369	15009	10961	4.35047	1769	5126	15.7111	13240	5903	381.133	14893	10944	10.8802	1721	5124	35.6303	13172	5885	496.949
MIR6859-3#1	chr1	17368	17435	68	26	2.27773	68	26	2.07309	0	0	0	0	0	0	0	0	0	0	0	0
MIR6859-4#1	chr1	17368	17435	68	23	2.01492	68	23	1.83389	0	0	0	0	0	0	0	0	0	0	0	0
MIR6859-1#1	chr1	17368	17435	68	23	2.01492	68	23	1.83389	0	0	0	0	0	0	0	0	0	0	0	0
MIR6859-2#1	chr1	17368	17435	68	23	2.01492	68	23	1.83389	0	0	0	0	0	0	0	0	0	0	0	0
FAM138A	chr1	34610	36080	1471	4	0.0161989	1130	3	0.0143945	341	1	2.5069	0	0	0	0	0	0	0	0	0
FAM138C	chr1	34610	36080	1471	4	0.0161989	1130	3	0.0143945	341	1	2.5069	0	0	0	0	0	0	0	0	0
FAM138F	chr1	34610	36080	1471	4	0.0161989	1130	3	0.0143945	341	1	2.5069	0	0	0	0	0	0	0	0	0

We see for the first gene DDX11L1 73 reads whereas STAR report 0 reads. I tried counting reads with samtools directly on each exon and intron of the gene:

perseo:tpmcalculator> samtools view -bh -f2 299_7_sorted.bam > 299_7_sorted_pp.bam
perseo:tpmcalculator> samtools index 299_7_sorted_pp.bam
perseo:tpmcalculator> echo -e "Chr\tstart\tend\tTPMCalculator\tSamtools"; head 299_7_sorted_genes.ent | grep DDX11L1 | while read line; do s=`echo $line | awk '{print $5}'`; e=`echo $line | awk '{print $6}'`; c=`echo $line | awk '{print $2}'`; r=`samtools view 299_7_sorted_pp.bam $c:$s-$e | wc -l`; m=`echo $line | awk '{print $8}'`; echo -e "$c\t$s\t$e\t$m\t$r"; done
Chr	start	end	TPMCalculator	Samtools
chr1	11873	12226	27	29
chr1	12227	12611	29	29
chr1	12612	12720	1	1
chr1	12721	13219	21	21
chr1	13220	14408	45	46

TPMCalculator counts are very similar to the samtools reads, I could dig deeper but the one or two different reads should be due to boundaries limitations set by TPMCalculator.

Can you send me the STAR command you used for aligning the reads to see if I can reproduce the error, please?
Best,
Roberto

from tpmcalculator.

calpan commented on May 27, 2024

The STAR command I used was the following:

STAR --runThreadN 4 --genomeDir ./GRCm38.p6_Ensembl95_GENCODE_M20/GenomeIndex/STAR_2.7.0e_index_soft_masked/ --readFilesIn GF_10_R1.fastq.gz GF_10_R2.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --alignIntronMin 20 --alignIntronMax 1000000 --outFileNamePrefix GF_10 --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMattrIHstart 0 --outSAMmapqUnique Integer0to255 --quantMode GeneCounts --outReadsUnmapped Fastx --twopassMode Basic

But it sounds like counting paired reads independently is the most likely explanation for the behavior I saw.

from tpmcalculator.

r78v10a07 commented on May 27, 2024

Hi,
I aligned the BAM file using your STAR options. There is not big differences in the resulting alignment. However, I look at gene RN7SL2 which has a big differences in read count and I found something interesting.

TPMCalculator

Gene_Id	Chr	Start	End	Length	Reads	TPM	ExonLength	ExonReads	ExonTPM	IntronLength	IntronReads	IntronTPM	UniqueLegth	UniqueReads	UniqueTPM	UniqueExonLength	UniqueExonReads	UniqueExonTPM	UniqueIntronLength	UniqueIntronReads	UniqueIntronTPM
RN7SL2	chr14	49862550	49862848	299	4.01557e+06	91435.8	299	4.01557e+06	82218.8	0	0	0	299	4.01557e+06	268076	299	4.01557e+06	203392	0	0	0

STAR

RN7SL2	268186	96	268091

There is a huge difference in reads.

Samtools:

perseo:tpmcalculator> samtools view GF_10Aligned.sortedByCoord.out.bam "chr14:49862550-49862848" | wc -l
4287979

As you can see, samtools is also counting over 4M reads for that gene.

Finally, I look at the gene on IGV and the maximum coverage is over 800000. Therefore, there should be millions of reads at that gene.

In some way, STAR is not counting all reads in this gene. I don't know how STAR counts the reads per gene but this example shows that the number reported by STAR does not agree with the reads in the BAM file.

from tpmcalculator.

calpan commented on May 27, 2024

STAR uses the same algorithm as htseq-count ( https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf, section 7), which is described here: https://htseq.readthedocs.io/en/release_0.11.1/count.html. Do the assumptions used in the "union" model line up with the ones used in TPMCalculator?

…

On Tue, Apr 2, 2019 at 1:35 PM Roberto Vera Alvarez < ***@***.***> wrote: Hi, I aligned the BAM file using your STAR options. There is not big differences in the resulting alignment. However, I look at gene RN7SL2 which has a big differences in read count and I found something interesting. TPMCalculator Gene_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength IntronReads IntronTPM UniqueLegth UniqueReads UniqueTPM UniqueExonLength UniqueExonReads UniqueExonTPM UniqueIntronLength UniqueIntronReads UniqueIntronTPM RN7SL2 chr14 49862550 49862848 299 4.01557e+06 91435.8 299 4.01557e+06 82218.8 0 0 0 299 4.01557e+06 268076 299 4.01557e+06 203392 0 0 0 STAR RN7SL2 268186 96 268091 There is a huge difference in reads. Samtools: perseo:tpmcalculator> samtools view GF_10Aligned.sortedByCoord.out.bam "chr14:49862550-49862848" | wc -l 4287979 As you can see, samtools is also counting over 4M reads for that gene. Finally, I look at the gene on IGV and the maximum coverage is over 800000. Therefore, there should be millions of reads at that gene. [image: Screen Shot 2019-04-02 at 4 30 38 PM] <https://user-images.githubusercontent.com/8643256/55434164-ed6dd700-5564-11e9-81f3-43a0d8f4821f.png> In some way, STAR is not counting all reads in this gene. I don't know how STAR counts the reads per gene but this example shows that the number reported by STAR does not agree with the reads in the BAM file. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#41 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAxOGMbXoxVcSQLZ3ohSqXxaKDbbIANCks5vc771gaJpZM4cW7pV> .

from tpmcalculator.

r78v10a07 commented on May 27, 2024

We did correlation with htseq-count for our paper using UCSC and Gencode genomes annotations.

Check this: https://github.com/ncbi/TPMCalculator/wiki/Validation#validation-using-gencode-v25-genome-annotation

This is the correlation with htseq-count for more than 1000 samples. As you can see, there is a high correlation which is totally different to the correlation we see now between STAR and TPMCalculator.

from tpmcalculator.

r78v10a07 commented on May 27, 2024

Hi calpan,
Alex Dobin answered my question in the STAR Github issue alexdobin/STAR#603 (comment)

Basically, STAR counts only reads in the same strand, therefore, it should be approximately half of the properly paired reads counted by TPMCalculator.

This is what you are seeing in your results. You can use either approach, just choose the one which is more convenient for your data analysis.
Thanks,
Roberto

from tpmcalculator.

Raw counts from paired-end alignments seem to be inflated by a factor of 2 about tpmcalculator HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent