Giter Site home page Giter Site logo

Comments (6)

r78v10a07 avatar r78v10a07 commented on May 27, 2024

Hi,
Thanks for reporting this.
I did some extra checks and TPMCalculator seems to be working fine. We count paired reads independently. If the read and its mate are in the same gene we will count it as 2 reads. Maybe this is the reason of the duplication.

I tried to run STAR with the --quantMode GeneCounts option but the output was not giving me any logical counts.
This was my STAR command line:

STAR --alignEndsType Local --alignSJDBoverhangMin 1 --alignSJoverhangMin 15 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 --outFilterMatchNminOverLread 0 --outFilterMismatchNmax 33 --outFilterMismatchNoverLmax 0.3 --outFilterMultimapNmax 100 --outFilterScoreMinOverLread 0.3 --outFilterType BySJout --outSAMtype BAM Unsorted --outSAMunmapped Within --quantMode GeneCounts --readFilesCommand zcat --seedSearchStartLmax 12 --runThreadN 16 --twopassMode Basic --winAnchorMultimapNmax 50 --genomeDir ./STAR --outStd Log --readFilesIn 299_7_1.fastq.gz 299_7_2.fastq.gz --outFileNamePrefix 299_7
perseo:tpmcalculator> head 299_7ReadsPerGene.out.tab
N_unmapped	808563	808563	808563
N_multimapping	8128187	8128187	8128187
N_noFeature	8270340	33065941	8437845
N_ambiguous	550362	5759	253315
DDX11L1	0	0	0
WASH7P	24	0	24
MIR6859-3	0	0	0
MIR6859-2	0	0	0
MIR6859-1	0	0	0
MIR6859-4	0	0	0

The TPMCalculator output:

Gene_Id	Chr	Start	End	Length	Reads	TPM	ExonLength	ExonReads	ExonTPM	IntronLength	IntronReads	IntronTPM	UniqueLegth	UniqueReads	UniqueTPM	UniqueExonLength	UniqueExonReads	UniqueExonTPM	UniqueIntronLength	UniqueIntronReads	UniqueIntronTPM
DDX11L1	chr1	11873	14408	2536	121	0.284233	1652	73	0.239589	884	50	48.3515	2488	119	0.708173	1604	71	0.529718	884	50	62.9121
WASH7P	chr1	14361	29369	15009	10961	4.35047	1769	5126	15.7111	13240	5903	381.133	14893	10944	10.8802	1721	5124	35.6303	13172	5885	496.949
MIR6859-3#1	chr1	17368	17435	68	26	2.27773	68	26	2.07309	0	0	0	0	0	0	0	0	0	0	0	0
MIR6859-4#1	chr1	17368	17435	68	23	2.01492	68	23	1.83389	0	0	0	0	0	0	0	0	0	0	0	0
MIR6859-1#1	chr1	17368	17435	68	23	2.01492	68	23	1.83389	0	0	0	0	0	0	0	0	0	0	0	0
MIR6859-2#1	chr1	17368	17435	68	23	2.01492	68	23	1.83389	0	0	0	0	0	0	0	0	0	0	0	0
FAM138A	chr1	34610	36080	1471	4	0.0161989	1130	3	0.0143945	341	1	2.5069	0	0	0	0	0	0	0	0	0
FAM138C	chr1	34610	36080	1471	4	0.0161989	1130	3	0.0143945	341	1	2.5069	0	0	0	0	0	0	0	0	0
FAM138F	chr1	34610	36080	1471	4	0.0161989	1130	3	0.0143945	341	1	2.5069	0	0	0	0	0	0	0	0	0

We see for the first gene DDX11L1 73 reads whereas STAR report 0 reads. I tried counting reads with samtools directly on each exon and intron of the gene:

perseo:tpmcalculator> samtools view -bh -f2 299_7_sorted.bam > 299_7_sorted_pp.bam
perseo:tpmcalculator> samtools index 299_7_sorted_pp.bam
perseo:tpmcalculator> echo -e "Chr\tstart\tend\tTPMCalculator\tSamtools"; head 299_7_sorted_genes.ent | grep DDX11L1 | while read line; do s=`echo $line | awk '{print $5}'`; e=`echo $line | awk '{print $6}'`; c=`echo $line | awk '{print $2}'`; r=`samtools view 299_7_sorted_pp.bam $c:$s-$e | wc -l`; m=`echo $line | awk '{print $8}'`; echo -e "$c\t$s\t$e\t$m\t$r"; done
Chr	start	end	TPMCalculator	Samtools
chr1	11873	12226	27	29
chr1	12227	12611	29	29
chr1	12612	12720	1	1
chr1	12721	13219	21	21
chr1	13220	14408	45	46

TPMCalculator counts are very similar to the samtools reads, I could dig deeper but the one or two different reads should be due to boundaries limitations set by TPMCalculator.

Can you send me the STAR command you used for aligning the reads to see if I can reproduce the error, please?
Best,
Roberto

from tpmcalculator.

calpan avatar calpan commented on May 27, 2024

The STAR command I used was the following:

STAR --runThreadN 4 --genomeDir ./GRCm38.p6_Ensembl95_GENCODE_M20/GenomeIndex/STAR_2.7.0e_index_soft_masked/ --readFilesIn GF_10_R1.fastq.gz GF_10_R2.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --alignIntronMin 20 --alignIntronMax 1000000 --outFileNamePrefix GF_10 --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMattrIHstart 0 --outSAMmapqUnique Integer0to255 --quantMode GeneCounts --outReadsUnmapped Fastx --twopassMode Basic

But it sounds like counting paired reads independently is the most likely explanation for the behavior I saw.

from tpmcalculator.

r78v10a07 avatar r78v10a07 commented on May 27, 2024

Hi,
I aligned the BAM file using your STAR options. There is not big differences in the resulting alignment. However, I look at gene RN7SL2 which has a big differences in read count and I found something interesting.

TPMCalculator

Gene_Id	Chr	Start	End	Length	Reads	TPM	ExonLength	ExonReads	ExonTPM	IntronLength	IntronReads	IntronTPM	UniqueLegth	UniqueReads	UniqueTPM	UniqueExonLength	UniqueExonReads	UniqueExonTPM	UniqueIntronLength	UniqueIntronReads	UniqueIntronTPM
RN7SL2	chr14	49862550	49862848	299	4.01557e+06	91435.8	299	4.01557e+06	82218.8	0	0	0	299	4.01557e+06	268076	299	4.01557e+06	203392	0	0	0

STAR

RN7SL2	268186	96	268091

There is a huge difference in reads.

Samtools:

perseo:tpmcalculator> samtools view GF_10Aligned.sortedByCoord.out.bam "chr14:49862550-49862848" | wc -l
4287979

As you can see, samtools is also counting over 4M reads for that gene.

Finally, I look at the gene on IGV and the maximum coverage is over 800000. Therefore, there should be millions of reads at that gene.

Screen Shot 2019-04-02 at 4 30 38 PM

In some way, STAR is not counting all reads in this gene. I don't know how STAR counts the reads per gene but this example shows that the number reported by STAR does not agree with the reads in the BAM file.

from tpmcalculator.

calpan avatar calpan commented on May 27, 2024

from tpmcalculator.

r78v10a07 avatar r78v10a07 commented on May 27, 2024

We did correlation with htseq-count for our paper using UCSC and Gencode genomes annotations.

Check this: https://github.com/ncbi/TPMCalculator/wiki/Validation#validation-using-gencode-v25-genome-annotation

This is the correlation with htseq-count for more than 1000 samples. As you can see, there is a high correlation which is totally different to the correlation we see now between STAR and TPMCalculator.

image

from tpmcalculator.

r78v10a07 avatar r78v10a07 commented on May 27, 2024

Hi calpan,
Alex Dobin answered my question in the STAR Github issue alexdobin/STAR#603 (comment)

Basically, STAR counts only reads in the same strand, therefore, it should be approximately half of the properly paired reads counted by TPMCalculator.

This is what you are seeing in your results. You can use either approach, just choose the one which is more convenient for your data analysis.
Thanks,
Roberto

from tpmcalculator.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.