Comments (18)
Hi,
Thank you for reporting this.
Could you, please, share a portion of your BAM file in SAM or BAM format?
Thanks,
Roberto
from tpmcalculator.
Hi,
I've reading your GTF and BAM file and it seems to be some incompatibilities.
The SAM format specify that in the third column should be reference name. Therefore, TPMCalculator expect to find in that column the chromosome name. Your BAM file has the transcript id instead of the chromosome name.
I haven't used subjunc ever but you should check if the aligner is writing the BAM file using the transcript id as reference instead of the chromosome.
I would recommend you to use STAR (https://github.com/alexdobin/STAR) for the alignments and you will get a 100% compatible BAM file to use in TPMCalculator.
Please, let me know if this help.
Best,
Roberto
from tpmcalculator.
Hi,
Thank you very much for your help. I tried STAR once but mapping no result ,maybe my annotation file is not good for STAR and TPMCalculator . So I will try other annotation file.
Thank you once again for your help.
Best,
zyzhang
from tpmcalculator.
Hi Roberto,
The TPMCalculator is only use GTF ,GFF3 is not working .
Right ?
Best,
zyzhang
from tpmcalculator.
Hi,
You're right. GFF3 is not implemented yet.
Best,
Roberto
from tpmcalculator.
Hi,
Thank you very much for your help. I tried STAR once but mapping no result ,maybe my annotation file is not good for STAR and TPMCalculator . So I will try other annotation file.
Thank you once again for your help.
Best,
zyzhang
Your GTF seems good to me but it will be good if you can try another annotations as well.
Best,
Roberto
from tpmcalculator.
Hi,
I've aligned the first 2 053 369 reads of the SRA accession: SRR3724613
Everything worked fine, see description below.
Please, send me a descrition of how did you get your BAM file to see if I can help to find the problem.
Best,
Roberto
I used the GTF you provided with the NCBI genome downloaded from:
Genome
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/433/935/GCA_001433935.1_IRGSP-1.0/GCA_001433935.1_IRGSP-1.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
I've renamed the header of the fasta files like in your GTF for each chromosome, ie.
>chr01
Subread indexes
subread-buildindex -o IRGSP-1.0 ../*.fna
Subjunc alignment
subjunc -T 16 -d 50 -D 600 -i /gfs/data/genomes/NCBI/IRGSP-1.0/subjunc/IRGSP-1.0 -r 1.fastq -R 2.fastq -o subjunc_results.bam -a /gfs/data/genomes/NCBI/IRGSP-1.0/IRGSP-1.0_representative_transcript_exon_2018-11-26.gtf -F GTF
Output
//============================= subjunc setting ==============================\\
|| ||
|| Function : Read alignment + Junction detection (RNA-Seq) ||
|| Input file 1 : 1.fastq ||
|| Input file 2 : 2.fastq ||
|| Output file : subjunc_results.bam (BAM) ||
|| Index name : IRGSP-1.0 ||
|| ||
|| ------------------------------------ ||
|| ||
|| Threads : 16 ||
|| Phred offset : 33 ||
|| # of extracted subreads : 14 ||
|| Min read1 vote : 1 ||
|| Min read2 vote : 1 ||
|| Max fragment size : 600 ||
|| Min fragment size : 50 ||
|| Max mismatches : 3 ||
|| Max indel length : 5 ||
|| Report multi-mapping reads : no ||
|| Max alignments per multi-mapping read : 1 ||
|| Annotations : IRGSP-1.0_representative_trans ... ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
...
//================================= Summary ==================================\\
|| ||
|| Total fragments : 2053369 ||
|| Mapped : 1816629 (88.5%) ||
|| Uniquely mapped : 1816629 ||
|| Multi-mapping : 0 ||
|| ||
|| Unmapped : 236740 ||
|| ||
|| Correctly paired : 1654461 ||
|| Not mapped in pairs : 162168 ||
|| Only one end mapped : 112105 ||
|| Multi-chromosomes : 9786 ||
|| Different strands : 945 ||
|| Not in PE distance : 19339 ||
|| Abnormal order : 19993 ||
|| ||
|| Junctions : 86895 ||
|| Indels : 19946 ||
|| ||
|| Running time : 1.1 minutes ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
BAM
perseo:data> samtools view subjunc_results.bam | head
SRR3724613.1 89 chr03 10563073 40 100M1S * 0 0 AAGTGAGCAGCGAGAAGCTTGGTGTGTGGTTTGGTTGCTGTCGGAGAATGGTTTTGTTTGACATGATGCTGTCAACACCGGCTATGTTATGTATCTATCTN ACCCACBBBBBCC@CC@BBB?B?BA;?BB?CCCBD@DEFIGEIHFFDBEGIHGGGDGGFBDGIIGIFC3EDGE?EDF8GIGHBGIGIIHHHFDD>DD?:1# HI:i:1 NH:i:1 NM:i:0
SRR3724613.1 165 * 0 0 * chr03 10563073 0 GTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN <<################################################################################################### HI:i:1 NH:i:1
SRR3724613.2 77 * 0 0 * * 0 0 NTTTGAAATTCCACACTCCAATACGTCATTCTAGTGGTTCAGACAAAGAGCATCAAAGCAAGAATACACAGCAACACAATTGGAATCACCTAAAACGATGC #073(()2)2<):=@())9=1)6:()2(.)2=>))1(2(:).3=(<).:)<)0))).91.7.6)6)).).((...33((:(.799=>1((--5;;28'(,(
SRR3724613.2 141 * 0 0 * * 0 0 GGCNNNNNNNNNNNNNNNNNTTANNNNNNNNNNNNNNNNGACNNNNNNNNNNNNNNNNANANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN (((#################(((################(0(################,#,########################################
SRR3724613.3 73 chr10 2756847 40 1S100M * 0 0 NCCTTCCACTGGCAGGCGATCGTGTTTTTCACAGGATTTCTCGTTACTCATGTCAGCATTCTCACTTCTGATATCTCCAGGTCTTGTCACCAAAAACCTTC #1=BDDDDDDDDDDEIAA>FDE3CDDDDDD9BD<D;9DDEI>;BC;C@7@CDE=;=CC7??CCDDDDDD:@@@;@@@A@A?AAA>AAAAAAA??????A>A HI:i:1 NH:i:1 NM:i:0
SRR3724613.3 133 * 0 0 * chr10 2756847 0 GANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN <<################################################################################################### HI:i:1 NH:i:1
SRR3724613.5 73 chr11 296266 40 1S100M * 0 0 NGGCCCTTCTCCTTGAGCCAGTCCAGAAGCTCCCTCAGCGTGATGTTGCCAGTTATGGTCCAGCGGTCCCAAACAGTCCAGGCCATGTCCTGGTGCTTGAT #1=DDFFFHHGHHIHIGIIIIICHDFHIHHBHIIIIIGIHFHGIIGGCGIIIHIIIIICHGGHIIGHEBDF@BCECCCCC>ABCCCCDCDCCCCCCCCABC HI:i:1 NH:i:1 NM:i:0
SRR3724613.5 133 * 0 0 * chr11 296266 0 AGCNNNNNNNNNNNNNNNNCGGGCNNNNNNNNNNNNNNTTCCTNNNNNNNANNNNNTGAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN <<<################22@@<##############00<??#######-#####,,9==######################################## HI:i:1 NH:i:1
SRR3724613.7 73 chr10 22333670 40 1S7M364N93M * 0 0 NTCCAAACCAGGTCATAGCCAAGTGGCAGAGCGAAATTTCCACCCACATTTTTGTAGACAGCAGCAACATGTGGTGGATGAATGCCAACATGAGCAATATC #1=DDFFFHHGHHIJIJIJJJJJHIJIJJJJIIJJJJJJJIGIJJJJIJJJJJIIGHJJJHHHGHFFFFEECEEC?@BDDCDEDDDDDDDCCDCDDDCDDE HI:i:1 NH:i:1 XS:A:- NM:i:0
SRR3724613.7 133 * 0 0 * chr10 22333670 0 CCTNNNNNNNNNNNNNNNNNTGTNNNNNNNNNNNNNNNNGCTNNNNNNNNNNNNNNNNCNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN <<<#################22@################00<################,#,######################################## HI:i:1 NH:i:1
perseo:data>
See the column 3 in the BAM output that is the chromosome name, the same in the GTF.
TPMCalculator
TPMCalculator -g /gfs/data/genomes/NCBI/IRGSP-1.0/IRGSP-1.0_representative_transcript_exon_2018-11-26.gtf -b subjunc_results.bam -e
Output
_gene.out
perseo:data> head subjunc_results_genes.out
Gene_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength IntronReads IntronTPM UniqueLegth UniqueReads UniqueTPM UniqueExonLength UniqueExonReads UniqueExonTPM UniqueIntronLength UniqueIntronReads UniqueIntronTPM
Os01g0100100 chr01 2982 10814 7833 115 8.68579 2935 115 14.0137 4898 0 0 7833 115 8.65548 2935 115 15.0465 4898 0 0
Os01g0100200 chr01 11217 12434 1218 2 0.971454 1127 2 0.6347 91 0 0 305 0 0 305 0 0 0 0 0
Os01g0100300 chr01 11371 12283 913 2 1.29598 810 2 0.883095 103 0 0 0 0 0 0 0 0 0 0 0
Os01g0100400 chr01 12720 15684 2965 25 4.98833 2161 25 4.13759 804 0 0 1794 19 6.24385 1082 19 6.7433 712 0 0
Os01g0100466 chr01 12807 13977 1171 6 3.03133 1074 6 1.99806 97 2 123.241 0 0 0 0 0 0 0 0 0
Os01g0100500 chr01 16398 20143 3746 136 21.4788 2222 129 20.7639 1524 12 47.0646 3746 136 21.4039 2222 129 22.2942 1524 12 98.2904
Os01g0100600 chr01 22840 26891 4052 48 7.00828 1960 46 8.39391 2092 3 8.57151 3488 33 5.57775 1418 31 8.39521 2070 3 18.0911
Os01g0100650 chr01 25860 26423 564 18 18.8813 564 18 11.4145 0 0 0 0 0 0 0 0 0 0 0 0
Os01g0100700 chr01 27142 28643 1502 420 165.432 906 420 165.8 596 2 20.0577 1502 420 164.854 906 420 178.019 596 2 41.8889
perseo:data>
_transcripts.out
perseo:data> head subjunc_results_transcripts.out
Gene_Id Transcript_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength IntronReads IntronTPM
Os01g0100100 Os01t0100100-01 chr01 2982 10814 7833 115 5.8889 2935 115 9.67839 4898 0 0
Os01g0100200 Os01t0100200-01 chr01 11217 12434 1218 2 0.658639 1127 2 0.438348 91 0 0
Os01g0100300 Os01t0100300-00 chr01 11371 12283 913 2 0.878666 810 2 0.609899 103 0 0
Os01g0100400 Os01t0100400-01 chr01 12720 15684 2965 25 3.38205 2161 25 2.85758 804 0 0
Os01g0100466 Os01t0100466-00 chr01 12807 13977 1171 6 2.05522 1074 6 1.37994 97 2 71.9577
Os01g0100500 Os01t0100500-01 chr01 16398 20143 3746 136 14.5625 2222 129 14.3403 1524 12 27.4799
Os01g0100600 Os01t0100600-01 chr01 22840 26891 4052 48 4.75156 1960 46 5.79716 2092 3 5.00471
Os01g0100650 Os01t0100650-00 chr01 25860 26423 564 18 12.8014 564 18 7.88327 0 0 0
Os01g0100700 Os01t0100700-01 chr01 27142 28643 1502 420 112.161 906 420 114.508 596 2 11.7112
perseo:data>
from tpmcalculator.
Wow ~
Thanks,
This is my subjunc code :
subread-buildindex -o rice /ho me/zyzhang/RNA-seq/data/Reference/IRGSP-1.0_gene_2018-11-26.fasta subjunc -T 15 -i Reference/index/index_subjunc/rice -r 6914_1.fastq.gz -R 6914_2.fastq.gz -o result/aligned_subjunc/6914.bam featureCounts -p -T 64 -a /home/zyzhang/RNA-seq/data/Reference/IRGSP-1.0_representative_transcript_exon_2018-11-26.gtf -o 6914_featureCounts.out 6914.bam TPMCalculator -g /home/zyzhang/RNA-seq/data/Reference/IRGSP-1.0_representative_transcript_exon_2018-11-26.gtf -b 6914.bam
I will try with your procdure.Thank you for helping me patiently.
Best,
zyzhang
from tpmcalculator.
Hi,
It looks good to me. I don't see anything weird here.
You should use the GTF with subjunc adding these options:
-a IRGSP-1.0_representative_transcript_exon_2018-11-26.gtf -F GTF
Can you check your fasta IRGSP-1.0_gene_2018-11-26.fasta
header?
It should have all chromosomes independently:
>chr01
>chr02
...
Best,
Roberto
from tpmcalculator.
Hi ,
You are right .
The IRGSP-1.0_gene_2018-11-26.fasta
header is
It doesn't match my GTF .I will try with your Genome .
Thank you for your patience.
Best,
zyzhang
from tpmcalculator.
Hi,
Your fasta doesn't seem to be a Genome, it looks to me as the genetic code of proteins. You should use the genome for the alignment and the GTF for the annotation.
Please, remember to change the header of the chromosomes fasta files before creating the indexes.
Best,
Roberto
from tpmcalculator.
Hi,
I got it . I have got good output files by TPMCalculaor.
-rw-rw-r-- 1 zyzhang zyzhang 9423565861 3月 13 22:55 subjunc_2_results.bam
-rw-rw-r-- 1 zyzhang zyzhang 7500797 3月 13 22:55 subjunc_2_results.bam.indel.vcf
-rw-rw-r-- 1 zyzhang zyzhang 10649766 3月 13 22:55 subjunc_2_results.bam.junction.bed
-rw-rw-r-- 1 zyzhang zyzhang 15689097 3月 13 23:30 subjunc_2_results_genes.ent
-rw-rw-r-- 1 zyzhang zyzhang 4608259 3月 13 23:30 subjunc_2_results_genes.out
-rw-rw-r-- 1 zyzhang zyzhang 13312048 3月 13 23:30 subjunc_2_results_genes.uni
-rw-rw-r-- 1 zyzhang zyzhang 443 3月 14 09:55 subjunc_2_results.stat
-rw-rw-r-- 1 zyzhang zyzhang 28265992 3月 13 23:30 subjunc_2_results_transcripts.ent
-rw-rw-r-- 1 zyzhang zyzhang 4941586 3月 13 23:30 subjunc_2_results_transcripts.out
subjunc_2_results_genes.out
Gene_Id Chr Start End Length Reads TPM ExonLength ExonReads ExonTPM IntronLength
gene:Os01g0100100 1 2982 10814 7833 2421 6.64109 2935 2391 12.1541 4898 72
gene:Os01g0100200 1 11217 12434 1218 18 0.31754 1127 18 0.238286 91
gene:Os01g0100300 1 11371 12283 913 18 0.423618 810 18 0.331542
gene:Os01g0100400 1 12720 15684 2965 285 2.06535 2161 285 1.96762 804 0
gene:Os01g0100466 1 12807 13977 1171 135 2.47713 1074 135 1.87534 97 4
gene:EPlOSAG00000007463 1 16182 16269 88 22 5.37171 88 22 3.72984 0 0
gene:Os01g0100500 1 16398 20143 3746 3300 18.9286 2222 3270 21.956 1524 158
gene:Os01g0100600 1 22840 26891 4052 3206 17.0007 1960 3127 23.8025 2092 206
gene:Os01g0100650 1 25860 26423 564 1080 41.145 564 1080 28.569 0 0
gene:Os01g0100700 1 27142 28643 1502 1558 22.288 906 1527 25.1456 596 51
So the problem is that the Genome is wrong . And have to usu the head of the Genome is the same as GTF.
Thank you for your patience these days.
Wish you send more paper.
Best,
zyzhang
from tpmcalculator.
Hi Zyzhang,
Thank you so much for letting us now.
Please, remember to cite our TPMCalculator paper once you publish your research.
Best,
Roberto
from tpmcalculator.
Hi Roberto,
If I publish a paper, I'll quote it affirmatively.
Best,
zyzhang
from tpmcalculator.
Dear @r78v10a07,
I have the same issue with TMPcalculator
- all output files are empty.
I looked into my GTF and BAM files and cannot find anything suspicious.
My set-up
GTF: The original file is from NCBI: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/325/GCF_000007325.1_ASM732v1. I decompressed the file and removed features without a gene ID using grep -v 'gene_id ""'
(these were exon features of tRNAs and rRNAs).
BAM: The BAM file was created using STAR
(version 2.7.3a
). However, I used the original GTF file here, i.e. the features w/o a gene ID were not removed because this has not been an issue until now. The BAM file was processed afterwards using samtools
(version 1.10
):
samtools sort -@ {threads} -n -o {output.sorted} {input} &&
samtools fixmate -@ {threads} -m {output.sorted} {output.fm} &&
samtools sort -@ {threads} -o {output.sc} {output.fm} &&
samtools markdup -@ {threads} -r {output.sc} {output.rm} &&
samtools sort -@ {threads} -n -o {output.srmdup} {output.rm}
The file {output.srmdup}
is the one used as input for TMPCalculator
. Here are the first three lines:
NB501252:178:HJCCMBGXB:1:11101:1059:13206 99 NC_003454.1 39988 255 74M = 40035 122 GCTGCTGTTGAATTTCCACTTAAATAGAANTTAAGGGAATCTGGGTTT
TCTACATTTATTTTTGAATTTGACCC AAAAAEEEEEEEEEEEEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE NH:i:1 HI:i:1 AS:i:146 nM:i:0 MQ:i:255 M
C:Z:75M ms:i:2573
NB501252:178:HJCCMBGXB:1:11101:1059:13206 147 NC_003454.1 40035 255 75M = 39988 -122 TTCTACATTTATTTTTGAATTTGACCCTGTTGAATATAAAGGAACTGC
ATCTTTTACAAAATTACTTGTAGCCTT A6EEEEEEEEAEEEEEEEEEEEEEEEEAEEEEEEEEEE/EEEEEEEEEE/EEEAEEEEEEEEEEEAEEEEAAAAA NH:i:1 HI:i:1 AS:i:146 nM:i:0 MQ:i:255 M
C:Z:74M ms:i:2604
NB501252:178:HJCCMBGXB:1:11101:1060:11521 99 NC_003454.1 46237 255 75M = 46698 536 CTCCATAATCAAAGCATTTTTTTGCTGCTTCATCTAAATTTTCTCCAA
TTAAAACTGCTATAACTTTCTCATTAT AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE//AEAEA<AE NH:i:1 HI:i:1 AS:i:146 nM:i:1 MQ:i:255 M
C:Z:75M ms:i:2298
TMPCalculator
I installed the tool via conda
using this YAML:
channels:
- conda-forge
- bioconda
dependencies:
- tpmcalculator==0.0.3
The tool was called as follows:
TPMCalculator -g ATCC_25586_norna.gtf -b Fn2611_in_S24_Aligned.out.sorted.dedup.bam -p 2> err.log
The stderr
output was:
Reading GTF file ...
Done in 0.04 seconds
Parsing sample: Fn2611_in_S24_Aligned.out.sorted.dedup
13188282 reads processed in 118.29 seconds
Printing results
Total time: 118.33 seconds
Thank you in advance!
from tpmcalculator.
Hi @VGalata ,
I've checked your GTF and it does not include the exons. TPMCalculator use the "exon" keyword in the third column to create the gene model.
In your specific case, you could replace the keyword CDS with exon if what would you like to do is quantify the expression in the coding regions. That will make TPMCalculator able to create the gene model and quantify the data.
Let me know if you need anything else.
from tpmcalculator.
Hi @r78v10a07,
Thank you very much for the fast reply and the advice!
Replacing the keyword "CDS" with "exon" worked and the files *_genes.(ent|out|uni)
are not empty anymore.
from tpmcalculator.
Hi @VGalata ,
I'm glad to know that it worked for you.
Please, let me know if you need anything else.
Best,
Roberto
from tpmcalculator.
Related Issues (20)
- Gene number in input GTF differs from the TPMCalculator output HOT 3
- symbol lookup error HOT 9
- Sets the name of the output HOT 2
- Add a new option to use a directory as output destination HOT 1
- For the paired end reads, is it recommended to use option -p or should i go with default without it? HOT 1
- Output file description HOT 2
- /usr/bin/ld: cannot find -lbamtools HOT 2
- *_genes.out duplicate genes HOT 3
- Compilation error: collect2: error: ld returned 1 exit status HOT 2
- Is the read counting strand-specific? HOT 2
- Key ID for gene name was not found on GTF line HOT 3
- No TPM values, no reads processed HOT 9
- Chromosome with name: ENST.... does not exist HOT 10
- Output files desctiption HOT 1
- Installation without docker HOT 1
- Meaning of "UniqueReads" in genes.out file? HOT 2
- Possible to use gff3 in TPMCalculator v 0.4? HOT 1
- Build problems on Ubuntu and MacOSX HOT 1
- Help me please HOT 1
- After installing version 0.0.4, -version still prints 0.0.3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tpmcalculator.