comprna / suppa

SUPPA: Fast quantification of splicing and differential splicing

License: MIT License

Python 93.53% R 3.21% Perl 3.26%

psi-calculation splicing-quantification differential-splicing-analysis transcript-isoform psi-values clustered-events inclusion-level

suppa's Issues

multipleFieldSelection.py bugs

With Python 3.5.1, line 111 should be changed to:
for key, value in [(x, y) for x, y in dictionary.items() if x != "header"]:

Error in reading the tpm file

Hello,

I am trying to estimate PSI per local events and I get the following error:

ERROR:lib.tools:180733, in line 180734. Skipping line...
ERROR:lib.tools:180734, in line 180735. Skipping line...
ERROR:lib.tools:180735, in line 180736. Skipping line...
INFO:lib.tools:File /transcript_abundances_per_sample_tpm_normalised.txt closed.
ERROR:psiCalculator:No expression values have been buffered.
ERROR:psiCalculator:Unknown error: 1

The exact same expression file works perfectly when I use it for PSI per transcript isoform estimation. As far as I understand, this is the same issue as #14; apologies if I am repeating it, but that issue was closed without any solution being made publicly available.

Thank you in advance!

Adding an option (-v) for getting the tool version

Dear developers,
I would like to have a parameter for getting the version of the tool.

Many thanks!
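A minimal sketch of how such a flag could be wired up with Python's standard argparse module. The program name and the version string below are placeholders chosen for illustration; they are not taken from the SUPPA source.

```python
import argparse

# Placeholder for illustration; a real tool would read this from its package metadata.
HYPOTHETICAL_VERSION = "0.0.0"


def build_parser():
    """Build a parser that answers -v/--version and then exits."""
    parser = argparse.ArgumentParser(prog="example_tool")
    # argparse's built-in "version" action prints the string and exits with status 0.
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s " + HYPOTHETICAL_VERSION,
    )
    return parser
```

Calling `build_parser().parse_args(["-v"])` prints the version line and raises `SystemExit`, which is the usual behaviour for a version flag.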

Anaconda package with SUPPA?

Dear Authors,
I would like to use SUPPA2 in a bigger pipeline I am building. Every step is executed in a conda environment with specified libraries and packages, to ease re-running the tool and avoid installation/dependency problems. Could you please consider building a SUPPA2 conda package and uploading it to Anaconda Cloud?
described here

This would certainly help not only me but also anyone who would like to use your, my, or anyone else's tool that also runs SUPPA.

Can't compute PSI whereas MAJIQ can do

Hi,

The main problem in SUPPA is that if one or more transcripts of an event do not appear in the expression file, PSI can't be computed, whereas MAJIQ can still compute it.
For example, my gene of interest has 4 transcripts and Salmon quantifies only 3 of them, so SUPPA can't compute PSI for my gene!
Is there any option that permits the computation even if one transcript isn't quantified?

Thank you
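One possible workaround, sketched below, is to add zero-valued rows to the expression table for transcripts that the events reference but the quantifier did not report. This is only a sketch under a strong assumption: that a missing transcript can be treated as unexpressed. Whether that holds depends on why the transcript is missing. The function name and the toy data are hypothetical.

```python
def pad_missing_transcripts(expression, required_ids, n_samples):
    """Return a copy of `expression` with zero rows for IDs it lacks.

    expression: dict mapping transcript ID -> list of per-sample values.
    required_ids: transcript IDs referenced by the events.
    """
    padded = dict(expression)
    for transcript_id in required_ids:
        if transcript_id not in padded:
            # Assumption: an unquantified transcript is treated as unexpressed.
            padded[transcript_id] = [0.0] * n_samples
    return padded


# Toy data: three quantified transcripts, one missing (tx4).
tpm = {"tx1": [5.0, 3.0], "tx2": [1.0, 0.0], "tx3": [2.0, 2.0]}
padded = pad_missing_transcripts(tpm, ["tx1", "tx2", "tx3", "tx4"], n_samples=2)
```

The original table is left untouched, so the padded and unpadded results can be compared before deciding whether the zero-filling is acceptable.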

transcript not found

Hello SUPPA crew!

I've been having some issues with getting the IDs to match up between the abundance file and my ioe file. Here's what I'm working with...

GCF_000337935.1_Cliv_1.0_genomic.gtf

NW_004973172.1  Gnomon  exon    1051    1236    .       +       .       gene_id "102083463"; transcript_id "XM_005497964.1";
NW_004973172.1  Gnomon  exon    2168    2472    .       +       .       gene_id "102083463"; transcript_id "XM_005497964.1";
NW_004973172.1  Gnomon  exon    8449    8571    .       +       .       gene_id "102083654"; transcript_id "XM_005497965.2";
NW_004973172.1  Gnomon  exon    23499   23757   .       +       .       gene_id "102083654"; transcript_id "XM_005497965.2";
NW_004973172.1  Gnomon  exon    51968   52206   .       +       .       gene_id "102083654"; transcript_id "XM_005497965.2";
NW_004973173.1  Gnomon  exon    4340    4527    .       +       .       gene_id "102084630"; transcript_id "XM_013372080.1";
NW_004973173.1  Gnomon  exon    11701   11835   .       +       .       gene_id "102084630"; transcript_id "XM_013372080.1";
NW_004973173.1  Gnomon  exon    12814   12938   .       +       .       gene_id "102084630"; transcript_id "XM_013372080.1";
NW_004973173.1  Gnomon  exon    13939   13952   .       +       .       gene_id "102084630"; transcript_id "XM_013372080.1";
NW_004973173.1  Gnomon  exon    14071   14155   .       +       .       gene_id "102084630"; transcript_id "XM_013372080.1";

ColLiv.allevents.ioe

NW_004973175.1  102087983       102087983;A3:NW_004973175.1:737370-738407:737370-738412:+       XM_005497995.2  XM_005497995.2,XM_005497992.2,XM_005497991.2
NW_004973178.1  102096347       102096347;A3:NW_004973178.1:37267-37489:37262-37489:-   XM_005498042.2  XM_005498042.2,XM_005498043.1
NW_004973178.1  102085005       102085005;A3:NW_004973178.1:52168-52854:52162-52854:-   XM_013368009.1  XM_013368009.1,XM_013368029.1
NW_004973179.1  102089602       102089602;A3:NW_004973179.1:1338172-1340967:1337719-1340967:-   XM_005498092.2  XM_005498092.2,XM_005498095.2,XM_005498097.2,XM_005498094.2,XM_005498096.2,XM_005498093.2,XM_005498091.2
NW_004973180.1  102097913       102097913;A3:NW_004973180.1:168779-169727:168776-169727:-       XM_005498138.1  XM_005498138.1,XM_005498139.1
NW_004973180.1  102083465       102083465;A3:NW_004973180.1:778255-780588:777670-780588:-       XM_013371569.1,XM_013371543.1,XM_013371398.1,XM_013371513.1,XM_013371496.1,XM_013371472.1,XM_013371443.1,XM_013371419.1,XM_013371368.1  XM_013371569.1,
NW_004973180.1  102083465       102083465;A3:NW_004973180.1:762515-775475:762512-775475:-       XM_013371496.1,XM_005498145.1   XM_013371496.1,XM_005498145.1,XM_013371513.1
NW_004973182.1  102084883       102084883;A3:NW_004973182.1:3522841-3523368:3522841-3523377:+   XM_005498239.2,XM_005498238.2,XM_013366718.1    XM_005498239.2,XM_005498238.2,XM_013366718.1,XM_005498237.2,XM_005498240.2
NW_004973182.1  102092693       102092693;A3:NW_004973182.1:6460037-6462036:6460037-6462039:+   XM_005498283.1,XM_005498284.1   XM_005498283.1,XM_005498284.1,XM_013366681.1
NW_004973182.1  102089790       102089790;A3:NW_004973182.1:643756-643982:643698-643982:-       XM_013366666.1,XM_005498179.1   XM_013366666.1,XM_005498179.1,XM_013366668.1
NW_004973182.1  102098451       102098451;A3:NW_004973182.1:2226212-2226688:2226209-2226688:-   XM_013366754.1  XM_013366754.1,XM_013366751.1

v1_iso_TPM.txt

blu7-x_female_gonad_stress_v1   blu7-x_female_hypothalamus_stress_v1    blu7-x_female_pituitary_stress_v1       blu-o-x-ATLAS_female_gonad_control_v1   blu-o-x-ATLAS_female_hypothalamus_control_v1    blu-o-x-ATLAS_female_pituitary_control_v1       g105-x_male_gonad_stress_v1     g105-x_male_hypothalamus_stress_v1      g105-x_male_pituitary_stress_v1 L-Blu13_male_gonad_control_v1   L-Blu13_male_hypothalamus_control_v1    L-Blu13_male_pituitary_control_v1       L-G107_male_gonad_control_v1    L-G107_male_hypothalamus_control_v1     L-G107_male_pituitary_control_v1        L-O116_male_gonad_stress_v1     L-O116_male_hypothalamus_stress_v1      L-O116_male_pituitary_stress_v1 L-R2_male_gonad_stress_v1       L-R2_male_hypothalamus_stress_v1        L-R2_male_pituitary_stress_v1   L-R3_male_gonad_control_v1      L-R3_male_hypothalamus_control_v1       L-R3_male_pituitary_control_v1  r6-x_female_gonad_control_v1    r6-x_female_hypothalamus_control_v1     r6-x_female_pituitary_control_v1        R-Blu12_female_gonad_stress_v1  R-Blu12_female_hypothalamus_stress_v1   R-Blu12_female_pituitary_stress_v1      R-W44_female_gonad_control_v1   R-W44_female_hypothalamus_control_v1    R-W44_female_pituitary_control_v1       R-W7_female_gonad_stress_v1     R-W7_female_hypothalamus_stress_v1      R-W7_female_pituitary_stress_v1
NM_001282808.1  0       1.57162 0       0.66475 0       0.220831        0       0.907605        0.906405        0       2.25501 0       0.415875        0       0       0       1.19995 0.990853        0.524462        0.535496        0       1.87365 0       0       2.10042 0       0       0.259625        1.2192  0       0.204883        0.381553        0       1.1309  0       0
NM_001282809.1  1.29073 1.2379  50.0972 0.268697        0       35.302  0.331408        0.926947        83.7728 0.735261        0       10.1895 0.663845        1.67197 24.4242 0       0.237119        20.4936 1.1076  0.162229        17.0305 0.993135        0.315786        55.6423 0       0       44.9443 0.388061        0.369715        29.4024 0.611911        0.327666        15.2097 0.107806        0       17.1811
NM_001282810.1  722.347 972.971 935.241 477.179 1102.49 762.617 107.541 774.884 1348.99 227.704 644.564 977.259 152.474 716.733 749.67  119.783 1572.66 947.73  162.487 750.441 933.566 85.087  684.716 957.027 1297.46 1000.88 1102.45 838.527 902.41  1045.09 1263.36 920.583 1164.87 1300.6  862.808 863.928
NM_001282811.1  33.2807 1328.19 3.86107 19.8139 223.623 0.508051        48.2987 1200.57 0       33.2902 660.107 0       27.1279 482.677 4.00297 0       1949.45 2.13059 27.23   547.382 6.60312 15.0452 1412.1  7.20179 14.7763 973.031 6.07347 26.9111 1047.95 8.98875 50.1585 1045.98 3.11808 5.75796 814.954 0
NM_001282812.1  7.96707 25.3233 5.50359 7.48515 5.63888 2.94712 0.201178        9.43736 4.35983 0.252396        24.0326 2.63242 0       10.897  2.74107 0       0.104697        10.692  0.242864        15.461  2.14724 0       6.72224 3.53525 12.185  16.1811 18.9998 4.93922 27.7648 6.04042 2.90592 3.12687 6.31379 1.9938  0.810314        6.18845
NM_001282813.1  0.442975        0.690355        13.2517 1.24213 8.18473 5.1655  9.87322 0.810658        11.7691 11.3146 0       19.5109 4.20998 0       5.36099 6.19408 0.370103        9.448   2.29165 0.81493 6.26402 11.0944 0       19.2161 0       0       31.2819 0.251072        0.897543        6.12253 0.375611        0.179238        14.0786 0.546195        0       5.30271
NM_001282814.1  4.00726 36.2563 2.89471 8.17849 20.6584 2.45488 17.4784 32.924  5.04972 11.8574 111.021 3.83462 9.76278 67.4373 3.44322 25.2416 4.01029 6.59979 13.9706 33.8739 4.68959 22.5148 37.5436 3.45249 3.30783 55.7678 6.49054 4.11995 40.5615 5.45875 4.35934 43.936  4.20431 3.48934 28.8335 1.01808
NM_001282815.1  17.325  12.2492 27.7845 12.2727 16.8283 13.7342 4.5869  23.9672 67.7658 4.84483 16.408  64.423  3.13273 19.7156 68.3625 2.56462 1.76188 62.4702 3.7879  29.167  58.7276 3.30422 7.85101 62.3308 6.99702 5.24579 38.0989 12.5421 10.4174 21.7266 10.1487 17.4345 19.2334 10.1115 12.1205 27.7903
NM_001282816.1  68.2551 121.745 105.4   68.7224 129.645 62.9095 92.6033 116.025 119.952 100.927 124.965 105.603 79.0448 133.861 82.9592 75.1333 113.585 110.076 83.6091 105.415 109.217 75.4153 113.578 119.154 45.5431 108.188 79.2438 57.7413 113.294 77.7973 32.4269 119.164 66.4777 69.8054 104.139 83.1719
NM_001282817.1  8.34595 1.65055 1.18683 2.67117 0       1.36009 3.03645 2.83289 1.61971 1.80269 0       1.3186  2.9495  0.807237        2.19773 1.75392 0       2.64997 2.87399 1.53462 0       1.48808 1.30741 1.13709 8.7492  8.86501 2.80919 4.3131  1.79954 2.42465 9.72328 1.2851  0       12.8626 0.553145        0
NM_001282818.1  0       0       0       0       0       0       0       0       0.301729        0       0       0       0.548696        1.98546 0.321877        0.365359        0.158749        0       0       0       2.17415 1.23488 0       0       0       0       0       0       0       0       0       0       0       0       0       0.94762
NM_001282819.1  12.3368 0       0       3.89595 0       0       1.54146 0.409267        0       1.71691 0       0       1.32878 0       0       1.50435 0       0.39388 1.47546 2.00523 0       2.49115 1.55422 0       50.1139 0       0       16.1181 0.679577        0       25.9584 0.242765        0       0       0.804466        0
NM_001282820.1  61.4997 2.77024 2.84495 34.1643 11.9293 0.470096        3.21776 12.8966 0       1.32484 11.4826 4.07571 2.00991 4.50486 2.57927 1.54119 5.85746 3.52275 1.98168 8.55367 1.75213 0       2.5772  6.44071 85.5255 1.36833 4.33828 192.328 4.22576 1.06496 184.302 16.1617 0.830061        67.7664 9.59892 2.28533
NM_001282821.1  0       0.547123        0       0       3.41916 0       15.4926 0.55032 0.63027 13.8689 3.90285 0       17.0333 0.819537        0       22.2942 0       0.268618        25.0199 0.281742        0       17.3363 0       0       0       0       0       0       1.56766 0       0.129165        0       0       0       0       0
NM_001282822.1  0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       3.62285 0       0       0       0       0       0       0       0.263681        0       0       0       0
NM_001282823.1  19.3793 1.87017 0.497018        5.07065 2.11047 1.42105 57.4257 1.40738 0       61.1133 0       1.1345  26.2796 0       3.27942 38.255  0.000425725     0.672164        48.4739 1.12974 1.20478 60.7452 0.608354        2.44581 6.85487 0       1.60292 9.12418 0.637802        0.971116        8.1099  1.53185 1.23369 16.67   0.364868        0.511185
NM_001282824.1  0.74587 0       0.567711        0       0       0       4.90536 0       0       0       0       0       0.695432        0       0       4.39999 0       0       1.39846 0       0       0       0       0       0       0       0       0       0       0       0       0.222912        0       0.363655        0       0
NM_001282825.1  0       0       3.28149 0       0.572221        0       0       0       0       0.591541        0       3.76795 0       0       2.05539 0       0.173325        0       0.241558        0       4.18911 0       0       3.39584 0       0       0       0.232903        0.73725 2.90115 0.457461        0.0897866       1.87047 0.318047        0       4.38529

When I run the psiPerEvent function of SUPPA, I get numerous errors, all indicating that a transcript ID was not found in the "expression file".

INFO:psiCalculator:Buffering transcript expression levels.
INFO:lib.tools:File /pylon2/mc3bg6p/al2025/isoform/SUPPA/v1_iso_TPM.txt closed.
INFO:lib.tools:File /pylon2/mc3bg6p/al2025/isoform/SUPPA/ColLiv_V1_events/ColLiv.allevents.ioe opened in reading mode.
INFO:psiCalculator:Calculating PSI from the ioe file.
ERROR:psiCalculator:transcript XM_005497995.2 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event 102087983;A3:NW_004973175.1:737370-738407:737370-738412:+.
ERROR:psiCalculator:transcript XM_005498042.2 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event 102096347;A3:NW_004973178.1:37267-37489:37262-37489:-.
ERROR:psiCalculator:transcript XM_013368009.1 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event 102085005;A3:NW_004973178.1:52168-52854:52162-52854:-.
ERROR:psiCalculator:transcript XM_005498092.2 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event 102089602;A3:NW_004973179.1:1338172-1340967:1337719-1340967:-.
ERROR:psiCalculator:transcript XM_005498138.1 not found in the "expression file".
ERROR:psiCalculator:PSI not calculated for event 102097913;A3:NW_004973180.1:168779-169727:168776-169727:-.
ERROR:psiCalculator:transcript XM_013371569.1 not found in the "expression file".

I did invoke the -p flag when generating events, as these are RefSeq annotations. Is there something funky going on here? Or is it as simple, yet perhaps bizarre, as a subset of transcripts (~3.5k of ~20K) are present in the gtf, yet absent from the transcriptome? Any thoughts?
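In the excerpts above, the expression rows shown all start with NM_ while the events shown reference XM_ IDs, so listing exactly which IDs fail to match may help narrow this down. The sketch below assumes tab-delimited files, an ioe header starting with `seqname`, and an expression header that holds sample names only; the function names and toy data are hypothetical.

```python
def ioe_transcript_ids(ioe_lines):
    """Collect every transcript ID listed in the last two columns of ioe rows."""
    ids = set()
    for line in ioe_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] == "seqname":
            continue  # skip the header and malformed rows
        for column in fields[3:5]:
            ids.update(t for t in column.split(",") if t)
    return ids


def expression_ids(expression_lines):
    """Collect the row labels of an expression file, skipping its header line."""
    lines = iter(expression_lines)
    next(lines, None)  # the header holds sample names only
    return {line.split("\t", 1)[0].strip() for line in lines if line.strip()}


def missing_ids(ioe_lines, expression_lines):
    """IDs referenced by events but absent from the expression file."""
    return ioe_transcript_ids(ioe_lines) - expression_ids(expression_lines)


# Toy data for illustration only.
ioe = [
    "seqname\tgene_id\tevent_id\talternative_transcripts\ttotal_transcripts\n",
    "chrA\tg1\tg1;A3:chrA:1-2:1-3:+\ttx1\ttx1,tx2\n",
]
expr = ["s1\ts2\n", "tx1\t1.0\t2.0\n"]
print(sorted(missing_ids(ioe, expr)))
```

With real files, the same functions can be fed open file handles, since they only iterate over lines.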

Adding the shebang #!/usr/bin/env python / R etc

Dear developers,
I noticed that the scripts are often missing the shebang (or have a hardcoded one). What do you think about using #!/usr/bin/env python/R etc.?

Many thanks!

Luca

diffSplice between paralogs

Hi there,

I've been using SUPPA for a while now, and I recently realized there will be problems if I want to analyze differences in AS events between paralogs (namely, given how tied the ioe file seems to be to its corresponding gene). Do you think there's a workaround for this? I could assume the paralogous genes have the same length and filter the .psi output based on equivalent junctions/events, but I'm not sure the results would be meaningful.

If you have any input I'd appreciate it!
Thanks!

how to produce ioi

generateEvents seems to have changed. Following the tutorial, when I specify -f ioi the software keeps complaining: "suppa.py generateEvents: error: the following arguments are required: -e/--event-type".

Transcript expression file

Dear team

I am analysing AS events in Arabidopsis thaliana using SUPPA. This is the first time I am working with RNA-Seq data. I predicted AS events using the generateEvents option. Calculating PSI requires a transcript expression file, but I do not know where I can get one for my sample. Can anyone help me with this issue? Thank you in advance.
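One common route is to quantify transcripts with a tool such as Salmon and then merge the per-sample TPM values into a single table. The sketch below assumes Salmon's `quant.sf` layout (tab-delimited, with `Name` and `TPM` columns) and the table layout described elsewhere on this page (a header of sample names only, then one row per transcript). The function names are hypothetical, and filling absent transcripts with 0.0 is an assumption.

```python
import csv


def read_salmon_tpm(quant_lines):
    """Map transcript ID -> TPM from the lines of one Salmon quant.sf file."""
    reader = csv.DictReader(quant_lines, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def build_expression_table(samples):
    """Merge per-sample TPM dicts into tab-delimited lines.

    samples: dict mapping sample name -> {transcript ID: TPM}.
    The header lists sample names only; each row starts with the transcript ID.
    """
    names = list(samples)
    transcript_ids = sorted(set().union(*(samples[n] for n in names)))
    lines = ["\t".join(names)]
    for tid in transcript_ids:
        # Assumption: a transcript absent from a sample is written as 0.0.
        values = (str(samples[n].get(tid, 0.0)) for n in names)
        lines.append("\t".join([tid, *values]))
    return lines
```

In practice, `read_salmon_tpm` would be called once per sample with an open `quant.sf` file, and the resulting lines written out with a newline after each.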

can't find generate_boxplot_event.py script

Hi,

In the tutorial you use a script named generate_boxplot_event.py, but it is not available on your GitHub.
Can you tell me whether it's available somewhere?

Thank you

custom gtf file for SUPPA

Hi, I wanted to create MiniSOX9, a truncated version of the SOX9 isoform, so I am trying to create a custom GTF file to generate events using SUPPA. But my isoform "MiniSOX9" is not showing up in the .ioe output file. In my initial analysis I did not find SOX9 in the SUPPA output either. I would appreciate it if you could let me know what I am doing wrong or why this specific isoform is not being generated. I am using the hg19 GTF file. Below is the GTF file.

chr17	unknown	exon	70117161	70117963	.	+	.	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	CDS	70117533	70117963	.	+	0	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	start_codon	70117533	70117535	.	+	.	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	CDS	70118860	70119113	.	+	1	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	exon	70118860	70119113	.	+	.	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	CDS	70119684	70120525	.	+	2	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	exon	70119684	70122560	.	+	.	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	stop_codon	70120526	70120528	.	+	.	gene_id "SOX9"; gene_name "SOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";

chr17	unknown	exon	70117161	70117963	.	+	.	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	CDS	70117533	70117963	.	+	0	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	start_codon	70117533	70117535	.	+	.	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	CDS	70118860	70119683	.	+	1	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	exon	70118860	70119113	.	+	.	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	CDS	70119684	70120525	.	+	2	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	exon	70119684	70122560	.	+	.	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";
chr17	unknown	stop_codon	70120526	70120528	.	+	.	gene_id "miniSOX9"; gene_name "miniSOX9"; p_id "P11849"; transcript_id "NM_000346"; tss_id "TSS28824";

My code is below:
python3.5 /home/nje17/.local/lib/python3.5/site-packages/SUPPA/suppa.py generateEvents -i /projectsp/splicing_events/hg19.gtf -o /splicing_events/ioe_files -e SE SS MX RI FL

9th column of a GTF version and leading white space handling

Some versions of GTF for example GRCh37 v67 have leading white space in the 9th column values.

gtf_store.py line 201 doesn't handle this issue. I made a quick fix by popping the first element of the first list of attribute list and replacing its remaining string 'gene_id "ENST..."' with 'gene_id', '"ENST..."'
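A more general alternative to popping the first element is to strip whitespace from every attribute before splitting it. The sketch below is a hypothetical helper, not the code in gtf_store.py. It splits naively on `;`, so it assumes that quoted values contain no semicolons.

```python
def parse_gtf_attributes(attribute_field):
    """Parse the 9th GTF column into a dict, tolerating stray whitespace."""
    attributes = {}
    for item in attribute_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # trailing ';' or an empty segment
        # The key is everything before the first space; the rest is the value.
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes
```

For example, `parse_gtf_attributes(' gene_id "G1"; transcript_id "T1";')` yields the same dictionary with or without the leading space.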

DTU

While trying to run the psiPerIsoform, I found that I was receiving a file where all isoform events are either marked as 1.0 or NaN. I am using a Refseq annotation file for this command, and was wondering if this had to do with the Refseq annotation?

generateEvents error

This issue has been resolved by reinstalling them

-v b and s: which is best for Iso-Seq?

Images for README

pip install SUPPA==2.2.1

The following command cannot be used to install the newest version.
pip install SUPPA==2.2.1

What is the meaning of PSI event values 0 and 1?

Extract TPM from Salmon dictionary.iteritems()

Hello,
I'm trying to follow the tutorial but I found a problem with your script to extract TPM from Salmon output.
It seems to call dictionary.iteritems(), which fails.
In fact, iteritems() was replaced by items() in Python 3.
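A small sketch of the Python 3 compatible form. The "header" key mirrors the fix proposed for multipleFieldSelection.py elsewhere on this page; the function name is hypothetical.

```python
def iter_non_header(dictionary):
    """Yield (key, value) pairs, skipping the "header" entry."""
    # dict.items() works in both Python 2 and 3; dict.iteritems() is Python 2 only.
    for key, value in dictionary.items():
        if key != "header":
            yield key, value
```

Because `items()` exists in both major versions, this change should not break the script for anyone still running it under Python 2.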

Warning about imp module

Hi,
When I ran
python3 /home/zhechen/.local/lib/python3.6/site-packages/SUPPA/suppa.py generateEvents
it printed:
~/.local/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Will this affect the results?

Calculating intron retention

Hi all,

Sorry if this is not the correct channel for this, but I have a question about how the PSI is calculated for intron retention: which isoforms go into the numerator and which into the denominator of the psiPerEvent formula in this case? Could you give an example, as in the exon inclusion case?
My doubt arises mainly because, in the "normal" transcriptome I'm using for quantification, the intron sequence is not included in the isoform sequences, so reads that would be assigned there are not counted.
I'm sure I'm confusing something here, because otherwise it would make no sense to calculate an intron retention event if the intron sequence is not part of the transcriptome I'm quantifying against or, in the case of genome mapping, if reads mapped to these intronic regions are not counted.

Thanks in advance.
Ignacio
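As far as the ioe layout suggests (an `alternative_transcripts` column and a `total_transcripts` column), the event PSI can be sketched as the summed abundance of the alternative transcripts divided by the summed abundance of all transcripts listed for the event. Under that reading, an intron retention event would rely on annotated transcripts that contain the retained intron rather than on reads counted in intronic regions. This is an interpretation, not a statement taken from the SUPPA source, and the function name and toy values are hypothetical.

```python
def psi_for_event(alternative_ids, total_ids, tpm):
    """Fraction of the event's total abundance carried by the alternative transcripts.

    tpm: dict mapping transcript ID -> abundance in one sample.
    Raises KeyError if a listed transcript has no abundance value.
    """
    alternative = sum(tpm[t] for t in alternative_ids)
    total = sum(tpm[t] for t in total_ids)
    if total == 0:
        return float("nan")  # no expression at all: PSI is undefined
    return alternative / total


# Toy example: one intron-retaining transcript out of two annotated transcripts.
tpm = {"tx_retained": 1.0, "tx_spliced": 3.0}
psi = psi_for_event(["tx_retained"], ["tx_retained", "tx_spliced"], tpm)
```

In this toy case the retaining transcript carries a quarter of the total abundance, so the PSI is 0.25.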

dpsi file for Cluster analysis

Here is the example of a dpsi file for cluster analysis:
Cond1_Cond2_dPSI Cond1_Cond2_p-val Cond2_Cond3_dPSI Cond2_Cond3_p-val
event1 <dpsi_value> <dpsi_value> <dpsi_value> <dpsi_value>
event2 <dpsi_value> <dpsi_value> <dpsi_value> <dpsi_value>
event3 <dpsi_value> <dpsi_value> <dpsi_value> <dpsi_value>
The header lists both dPSI and p-value columns, but the matrix only contains <dpsi_value>. Could you tell me whether the p-values are needed? And why is Cond1_Cond3_dPSI not included?

I have 4 conditions, and I would also like to know how to order the columns of the dpsi file for cluster analysis.

split_file.R vs psiPerEvent number of fields expectation

Dear Devs,

split_file.R expects a file formatted as follows (note the column label Name, which is generated by salmon quantmerge):

Name    hum54-1 hum54-2 hum54-3 hum54-4 hum56-2 hum56-3 hum56-4 hum57-1 hum57-2 hum57-3 hum57-4 mbe4-2  mbe4-3  mbe4-4
ENST00000640864       0       0       0       0       0       0       0       0       0       0       0       0       0       0
ENST00000638534       0       0       0       0       0       0       0       0       0       0       0       0       0       0

and psiPerEvent expects without the Name column name:

hum54-1 hum54-2 hum54-3 hum54-4 hum56-2 hum56-3 hum56-4 hum57-1 hum57-2 hum57-3 hum57-4 mbe4-2  mbe4-3  mbe4-4
ENST00000640864 0       0       0       0       0       0       0       0       0       0       0       0       0       0
ENST00000638534 0       0       0       0       0       0       0       0       0       0       0       0       0       0

With Name, I get:

ERROR:suppa.tools:35426, in line 35427. Unexpeced number of fields. 16 expected, 15 given. Skipping line...

Thanks,
Anthony.
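A small sketch for converting between the two layouts by removing the leading `Name` label from the header line, so that the header has one field fewer than the data rows. The function name is hypothetical, and tab-delimited input is assumed.

```python
def drop_header_label(lines, label="Name"):
    """Remove a leading column label from the header line, if present."""
    lines = list(lines)
    if not lines:
        return lines
    header = lines[0].rstrip("\n").split("\t")
    if header and header[0] == label:
        # Keep only the sample names; data rows are left untouched.
        lines[0] = "\t".join(header[1:]) + "\n"
    return lines
```

The function is safe to apply twice: a header that does not start with the label is returned unchanged.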

-f ioe

I found that SUPPA==2.2.1 doesn't recognize "-f ioe" when running generateEvents.

Formatting salmon output to match GTF

In the tutorial there is this part:

For the calculation of the events, SUPPA2 requires a GTF file. Here we provide one from Ensembl. The transcripts ids in our iso_tpm file should be equal as the ids in the GTF. We have a small script in R for formatting this ids

~/Rscript ~/scripts/format_Ensembl_ids.R iso_tpm.txt

Here is the file already formatted with the TPM of the isoforms.

The R script should then change the transcript IDs to match the GTF file, so I tried it on my own Salmon output, but it gives me this error:

Rscript ensembl_ids.R ~/data/suppa_output/cemwt_all_tpm.txt
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
Calls: rownames<- -> row.names<- -> row.names<-.data.frame
In addition: Warning message:
non-unique values when setting 'row.names':
Execution halted

Is there another way to change the file to match the GTF? I'm not even sure what the R script does exactly.

regards,

I. Muller
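The "duplicate 'row.names'" message suggests that two or more rows end up with the same ID once the IDs are shortened. The sketch below lists such IDs. It assumes, without having checked the R script, that formatting means keeping the part of each ID before the first `|`; the function names and that convention are hypothetical, and the separator can be changed.

```python
from collections import Counter


def shorten_id(raw_id, separator="|"):
    """Keep the part of an ID before the first separator (an assumed convention)."""
    return raw_id.split(separator, 1)[0]


def duplicated_ids(expression_lines, normalise=shorten_id):
    """Return IDs that occur more than once after normalisation, skipping the header."""
    rows = [line for line in list(expression_lines)[1:] if line.strip()]
    counts = Counter(normalise(line.split("\t", 1)[0]) for line in rows)
    return sorted(i for i, n in counts.items() if n > 1)
```

If this returns a non-empty list for a given TPM file, those rows would need to be merged or renamed before the IDs can serve as unique row names.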

No reading tpm file

Hey,

After following the steps in the manual, I seem to have an error reading my TPM file, though it's in the correct format.
Command is: python3 /usr/local/lib/python3.5/dist-packages/SUPPA/suppa.py psiPerEvent -e tpm/WT-T.txt -i gtf/merged_RI_strict.ioe -o psi/WT-T_RI

ERROR:lib.tools:87303, in line 87304. Skipping line...
ERROR:lib.tools:87304, in line 87305. Skipping line...
ERROR:psiCalculator:No expression values have been buffered.
ERROR:psiCalculator:Unknown error: 1

Could this be due to an incorrect installation of a certain package?

Thanks in advance

psiPerEvent error

#transcript expression file: isoforms.FPKM.txt (TAB-delimited)
cat isoforms.FPKM.txt
condition1 condition2 condition3 condition4
TU265 5.76298 5.20137 4.39812 3.88459
TU267 1.82729 2.33866 2.36421 2.35131
TU268 3.03788 1.32347 1.13227 0.801828
TU269 3.06905 1.82759 1.03018 1.15169
TU275 0.681639 0.533497 1.31494 0.850208
TU277 0.000166195 0.000167452 0.0193253 0.0075806
TU278 6.53042 6.89793 6.2919 4.32661
TU279 0 0 1.90249 41.9227

#ioe-file
head -3 pe.events.ioe
seqname gene_id event_id alternative_transcripts total_transcripts
ch02 G183 G183;A3:ch02:4299306-4299385:4299306-4299443:+ TU269,TU267 TU265,TU269,TU267
ch02 G183 G183;A3:ch02:4314937-4315322:4314937-4315325:+ TU267,TU275,TU268 TU268,TU275,TU265,TU274,TU273,TU267

conda list |grep suppa
suppa                     2.3                      py36_0    bioconda

#the command I ran:
suppa.py psiPerEvent -i pe.events.ioe -e isoforms.FPKM.txt -o test_events

but the result is

INFO:lib.tools:File isoforms.FPKM.txt opened in reading mode.
INFO:psiCalculator:Buffering transcript expression levels.
ERROR:lib.tools:1, in line 2. Skipping line...
ERROR:lib.tools:2, in line 3. Skipping line...
ERROR:lib.tools:3, in line 4. Skipping line...
.....................
ERROR:lib.tools:56490, in line 56491. Skipping line...
ERROR:lib.tools:56491, in line 56492. Skipping line...
ERROR:lib.tools:56492, in line 56493. Skipping line...
ERROR:lib.tools:56493, in line 56494. Skipping line...
ERROR:lib.tools:56494, in line 56495. Skipping line...
ERROR:lib.tools:56495, in line 56496. Skipping line...
ERROR:lib.tools:56496, in line 56497. Skipping line...
INFO:lib.tools:File isoforms.FPKM.txt closed.
ERROR:psiCalculator:No expression values have been buffered.
ERROR:psiCalculator:Unknown error: 1

I have checked that the transcript IDs in the ioe file are present in isoforms.FPKM.txt. Although similar issues have been raised before, they did not resolve my problem.

Is there another way to solve this problem?

Thank you in advance!
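Since every data line is skipped, it may be worth checking the field counts directly. Judging from the "16 expected, 15 given" message quoted elsewhere on this page, each data row is expected to have exactly one field more than the header. The sketch below reports rows that break that rule; the function name is hypothetical, and the tab delimiter is an assumption that can be changed.

```python
def check_expression_lines(lines, delimiter="\t"):
    """Return (line_number, n_fields) for rows whose field count is not header + 1."""
    lines = [line.rstrip("\n") for line in lines]
    if not lines:
        return []
    # A data row holds the transcript ID plus one value per sample in the header.
    expected = len(lines[0].split(delimiter)) + 1
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        n_fields = len(line.split(delimiter))
        if n_fields != expected:
            problems.append((number, n_fields))
    return problems
```

A file that looks tab-delimited on screen but actually contains spaces would show up here with every row reported as having a single field.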

diffSplice returns an error for a parameter that is already set

I got the following error message when I tried to use the diffSplice function for isoforms. I tried it with -nan, with --nan-threshold, and without either, but I still get the same error message.

python3 ../SUPPA-2.3/suppa.py diffSplice -m classical -p mmIsoform_isoform.psi normalIsoform_isoform.psi -e mmTPM.txt normalTPM.txt -i ../GencodeV24/gencode.v24.ioi -o IsoformDiffSplice -nan 0.01 -th 0.5 -mo DEBUG
Calculating differential analysis between conditions: mmIsoform_isoform and normalIsoform_isoform
/usr/local/lib/python3.6/site-packages/scipy/stats/stats.py:4911: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
ERROR:main:Unknown error: (<class 'TypeError'>, TypeError("calculate_delta_psi() missing 1 required positional argument: 'nan_th'",), <traceback object at 0x16ff15888>)

AS locus info from miscellaneous AS types (AF and AL) present simultaneously in one specific ioe file

Dear developers,

I found this when I was trying to extract the locus information from the ioe files generated for different event types. As the figure shows, the ioe file specific to the "AF" type (red and blue rectangles) contains information that does not belong to this type (green rectangle). It seems that this information belongs to the "AL" type rather than the "AF" type. Likewise, information from the "AF" type was found in the ioe file specific to the "AL" type.

I want to know whether this was set deliberately or not. Thanks!

Sincerely,

Hao Zhang

Generate the events from the GTF file argument -f

The command line to generate local AS events will be of the form:

python3.4 suppa.py generateEvents -i <input-file.gtf> -o -f ioe -e
The command to generate the transcript "events" would be of the form:

python3.4 suppa.py generateEvents -i <input-file.gtf> -o -f ioi

It seems the -f argument is no longer supported or needed in the current release.
Running the local AS events command failed with msg:
suppa.py: error: unrecognized arguments: -f ioe
And in the tutorial this argument is not used.

delta PSI p-value distribution

I'm using SUPPA's Differential splicing analysis on one of the datasets I'm working on and the final BH corrected delta PSI p-value distribution turns out to be roughly between 0.2 - 0.6.
I've tried using a smaller set with more reads on different parameters, but the p-value distribution remains the same. This may be due to the transcript quantification step, but wanted to touch base and see what your opinion was and if this distribution/behavior could be expected.

Difference between .psi file and .psivec file?

Hi,

I am reading through README.md and I am not sure what the difference between the content in .psi file and .psivec file is. It seems to me that both of them are showing PSI for each event in each sample. I also don't understand why there are two sample 1 and sample 2 columns in the example .psivec file.

Thank you in advance!

The use of ioi file in PSI calculation for transcripts

Hi! In the beginning of your explanations for the PSI calculation, you say:
"An ioi/ioe file and a "transcript expression file" are required as input."

But then in the example command, you give:
python3.4 suppa.py psiPerIsoform -g -e -o

Where is ioi file used in PSI calculations? I tried giving it instead of a GTF and it doesn't work. Also, the flag -i is not recognized for additionally giving an ioi file.

That being said, reading the GTF takes a few minutes at the beginning of execution. Can this somehow be reduced (by creating said ioi file, for example)?

Thanks!

generateEvents misses splicing events

Dear SUPPA developers,

We want to analyse data from Arabidopsis with your nice tool.
The gene AT1G45474 has splicing events, but they are not detected by SUPPA.

Do you have any idea why ?

Thanks a lot for your help.

Our call:
python3 /bin/suppa/suppa.py generateEvents -i SUPPA.gtf -o generateEvents.gtf --event-type SE SS MX RI FL --exon-length 5 --boundary V --threshold 1

SUPPA.gtf ( only the rows for AT1G45474)

1	TAIR10	exon	17179302	17179537	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.2"
1	TAIR10	exon	17179610	17179742	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.2"
1	TAIR10	exon	17179828	17179940	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.2"
1	TAIR10	exon	17180021	17180217	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.2"
1	TAIR10	exon	17180297	17180439	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.2"
1	TAIR10	exon	17180701	17180806	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.2"
1	TAIR10	exon	17179302	17179537	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.1"
1	TAIR10	exon	17179610	17179742	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.1"
1	TAIR10	exon	17179828	17179940	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.1"
1	TAIR10	exon	17180021	17180217	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.1"
1	TAIR10	exon	17180297	17180530	0	+	.	gene_id "AT1G45474"; transcript_id "AT1G45474.1"
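One way to see what distinguishes the two transcripts is to group the exon rows by transcript and compare the resulting sets. The sketch below is only a diagnostic aid and is not part of SUPPA: the helper names are hypothetical, and it assumes tab-separated GTF rows with `transcript_id "..."` in the ninth column, as in the excerpt above. Only the last exons from the excerpt are included to keep it short.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')


def exons_by_transcript(gtf_lines):
    """Group (start, end) exon coordinates by transcript ID."""
    exons = defaultdict(set)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # skip malformed rows and non-exon features
        match = TRANSCRIPT_RE.search(fields[8])
        if match:
            exons[match.group(1)].add((int(fields[3]), int(fields[4])))
    return exons


def exon_differences(exons, tx_a, tx_b):
    """Return the exons found only in tx_a and only in tx_b."""
    only_a = sorted(exons[tx_a] - exons[tx_b])
    only_b = sorted(exons[tx_b] - exons[tx_a])
    return only_a, only_b


# A subset of the rows from the excerpt above.
rows = [
    '1\tTAIR10\texon\t17180021\t17180217\t0\t+\t.\tgene_id "AT1G45474"; transcript_id "AT1G45474.2"',
    '1\tTAIR10\texon\t17180297\t17180439\t0\t+\t.\tgene_id "AT1G45474"; transcript_id "AT1G45474.2"',
    '1\tTAIR10\texon\t17180701\t17180806\t0\t+\t.\tgene_id "AT1G45474"; transcript_id "AT1G45474.2"',
    '1\tTAIR10\texon\t17180021\t17180217\t0\t+\t.\tgene_id "AT1G45474"; transcript_id "AT1G45474.1"',
    '1\tTAIR10\texon\t17180297\t17180530\t0\t+\t.\tgene_id "AT1G45474"; transcript_id "AT1G45474.1"',
]

exons = exons_by_transcript(rows)
only_in_2, only_in_1 = exon_differences(exons, "AT1G45474.2", "AT1G45474.1")
print("only in AT1G45474.2:", only_in_2)
print("only in AT1G45474.1:", only_in_1)
```

For the rows listed above, the two transcripts share their first four exons and differ only at the 3' end. A difference of that kind may not match any of the requested event definitions, but that is a guess to be checked against the event descriptions in the README.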

suppa joinFiles lacks compatibility

Dear Authors,

I work with SUPPA 2.3 installed from the Anaconda repository:
https://anaconda.org/bioconda/suppa
I just noticed that there was an update on Anaconda 3 days ago. I had not run my pipeline with SUPPA for 2 weeks, and when I tried today it raised an error. Presumably suppa.py joinFiles now works differently and no longer writes a header to the merged output file, which causes all the later file processing to crash.
Could you please add the headers again? I would also suggest marking every change with a new version number, such as 2.3.1. That way we can all avoid compatibility problems.

Kind regards,
Maciek

psi-values

Dear Eduardo:
Thank you so much for providing this alternative splicing analysis tool.
While using it to analyze my data, I ran into some questions, and I hope you can give me some advice.

First, why is the PSI value for several events equal to 1.0000000000000002?
For example:
MSTRG.10015;A5:chr4:21724525-21724660:21724522-21724660:+ 1.0000000000000002
For local events, I expected the PSI values to be normalized between 0 and 1.
Second, the SUPPA manual gives the following example of a psivec file:
sample1 sample2 sample1 sample2
ENSG00000000003;A5:chrX:99890743-99891188:99890743-99891605:- 0.14343855447180462 0.02929736320730957 0.12495621749266525 -1.0
ENSG00000000003;A5:chrX:99890743-99891605:99890743-99891790:- 1.0 1.0 1.0 -1.0
Why are some of the PSI values equal to -1?

Thank you so much in advance for your help!
Respectfully,
YanXiaomin.
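A value such as 1.0000000000000002 is the smallest double-precision number above 1, which is what floating-point rounding typically leaves behind when a ratio of summed expression values should be exactly 1. The sketch below shows one way to tidy such values when post-processing a psi file. It assumes that -1.0 is a placeholder meaning "no PSI could be computed" rather than a real inclusion level; that reading should be checked against the README of the installed version. `normalize_psi` and its parameters are hypothetical and not part of SUPPA.

```python
import math


def normalize_psi(value, tolerance=1e-9, missing_sentinel=-1.0):
    """Return a PSI clamped to [0, 1], or None if the value is missing.

    Assumption: `missing_sentinel` (-1.0) marks events without a PSI.
    """
    if math.isnan(value) or value == missing_sentinel:
        return None
    if -tolerance <= value <= 1.0 + tolerance:
        # Remove rounding noise just outside the valid interval.
        return min(1.0, max(0.0, value))
    raise ValueError(f"PSI outside [0, 1]: {value!r}")


print(normalize_psi(1.0000000000000002))  # 1.0
print(normalize_psi(-1.0))                # None
print(normalize_psi(0.14343855447180462))
```

Values further outside the interval than the tolerance raise an error instead of being clamped silently, so that genuine problems in the input are not hidden.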

Benchmarking SUPPA & rMATS

Dear Eduardo,
Thank you so much for this extremely useful program, SUPPA2.

I have been running your program on my RNA-seq dataset (2 conditions in triplicate) to analyze alternative splicing changes. Out of curiosity, I wanted to compare the outputs of SUPPA2 and rMATS, at least for the significant skipped-exon events that are defined identically by both tools. To do so, I think I need to pre-process the output of each algorithm into a common data structure, because the two output formats differ.

I was planning to make something like a coverage plot of the spliced exonic regions to see whether the two algorithms are consistent.
Is there an easier and more efficient way to do this?

Thank you so much in advance for your help!
Respectfully,
Jamal.
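A common data structure could start from the coordinates embedded in the SUPPA event ID. The sketch below assumes the skipped-exon layout `gene;SE:chrom:e1-s2:e2-s3:strand`, where `e1` is the end of the upstream exon, `s2` and `e2` are the start and end of the skipped exon, and `s3` is the start of the downstream exon; this should be confirmed against the README. The class, the function and the example ID are hypothetical.

```python
from typing import NamedTuple


class SkippedExon(NamedTuple):
    gene_id: str
    chrom: str
    upstream_exon_end: int      # e1
    exon_start: int             # s2
    exon_end: int               # e2
    downstream_exon_start: int  # s3
    strand: str


def parse_se_event(event_id):
    """Split an SE event ID into its coordinates (assumed layout)."""
    gene_id, _, event = event_id.partition(";")
    # Unpacking raises ValueError if the ID does not have five parts.
    event_type, chrom, left, right, strand = event.split(":")
    if event_type != "SE":
        raise ValueError(f"not a skipped-exon event: {event_id!r}")
    e1, s2 = (int(x) for x in left.split("-"))
    e2, s3 = (int(x) for x in right.split("-"))
    return SkippedExon(gene_id, chrom, e1, s2, e2, s3, strand)


# Made-up event ID, for illustration only.
event = parse_se_event("GENE1;SE:chr1:1000-1200:1350-1600:+")
print(event.chrom, event.exon_start, event.exon_end, event.strand)
```

The resulting tuple can serve as a join key against the rMATS table. rMATS names its start columns with a `_0base` suffix (for example `exonStart_0base`), so an offset of one may be needed on start coordinates; it is worth verifying this on a few known events before comparing the full lists.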

problem with diffSplice

Dear developers,
I'm having some trouble running suppa.py diffSplice.

I got the following error:

ERROR:main:Unknown error: (<class 'UnboundLocalError'>, UnboundLocalError("local variable 'i' referenced before assignment",), <traceback object at 0x2ae6f3c69d48>)

Can you give me some hints for solving it, please?

Luca

Differential isoform usage

Dear SUPPA developers,

first of all: thanks a lot for providing your nice tool set!

I have one feature suggestion though:
Since you have now included a script for the differential analysis of splicing events, and since SUPPA also reports PSIs per isoform, it would be nice to provide an analogous tool for differential transcript/isoform usage analysis, i.e. a tool that identifies robust isoform switches by making use of replicates and expression levels/abundances similar to diffSplice.

Any chance of getting that anytime soon? :-)

Thanks and best,
Alex

psiPerEvent error

Hi,
I am trying to calculate psiPerEvent using SUPPA. My TPM (transcripts per million) file looks as follows:

sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10 sample11 sample12 sample13 sample14 sample15 sample16 sample17 sample18 sample19 sample20 sample21 sample22 sample23 sample24 sample25 sample26 sample27 sample28 sample29 sample30 sample31 sample32 sample33 sample34 sample35 sample36 sample37 sample38 sample39 sample40 sample41 sample42 sample43 sample44 sample45 sample46 sample47 sample48 sample49 sample50 sample51 sample52 sample53 sample54 sample55 sample56 sample57 sample58 sample59 sample60 sample61 sample62 sample63 sample64 sample65 sample66 sample67 sample68 sample69 sample70 sample71 sample72 sample73 sample74 sample75 sample76 sample77 sample78 sample79 sample80
NR_046018 0.146696232 0 0.02246929 0 0.043484372 0 0.064345526 0 0 0.641447528 0.070457715 0 0 0 0.288446756 0 0.018291742 0.098839876 0 0 0 0.060880505 0.125044033 0 0 0.162726933 0.056250345 0 0.04533607 0.111419096 0.25858868 0 0.10616014 0.081480545 0 0.320725657 0.24226526 0.091504489 0.528826249 0.194352761 0.046725318 0.151982368 0 0 0.226697819 0.084641889 0 0.566318677 0.1484086 0 0.268734574 0.220563037 0 0.109770725 0.05704479 0 NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NR_024540 19.19284277 53.37251412 32.7337806 74.98841145 11.89824843 33.80215601 10.03499221 45.43118253 83.04926565 22.62463001 47.85138289 15.54743843 3.433399326 39.67863222 46.35756323 27.33616404 16.91112386 54.79703587 31.89242283 21.97120496 0 21.57606474 13.35599624 0 4.456391694 28.82763272 66.00394482 32.31343683 11.47348555 52.37874052 57.88761215 11.796335 31.4022174 20.15156518 23.12214838 31.36330696 67.39185448 32.57448068 45.41363924 23.18642724 41.86773395 131.0017562 64.54765994 83.05975972 40.50607761 25.76826385 7.357259576 54.52854758 35.24881828 12.15573145 38.36113881 36.43888614 36.29433057 84.0586856 29.09977577 106.9131704 NA NA 78.02340573 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

I have attached my complete file.
samples_file.txt

And my ioe file looks as follows:
seqname gene_id event_id alternative_transcripts total_transcripts
chr17 MIR22HG MIR22HG;A3:chr17:1616189-1616997:1615693-1616997:- NR_028503 NR_028503,NR_028504,NR_028505
chr17 FAM104A FAM104A;A3:chr17:71223403-71228225:71223321-71228225:- NM_032837,NM_001098832 NM_032837,NM_001098832,NM_001289411

However, I am getting an error:
INFO:lib.tools:samples_file.txt opened in reading mode.
INFO:psiCalculator:Buffering transcript expression levels.
ERROR:lib.tools:1, in line 2. Skipping line...
ERROR:lib.tools:2, in line 3. Skipping line...
ERROR:lib.tools:3, in line 4. Skipping line...

I looked for leading/trailing spaces but could not find any.
Is there a way to solve this issue?

Thanks in advance!!!
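A quick way to narrow down which lines are being skipped is to check the expression file independently of SUPPA. The sketch below assumes the layout shown in the excerpt: a tab-separated header holding only sample names, followed by rows with a transcript ID and one numeric value per sample. It flags rows with the wrong number of values or with non-numeric entries such as `NA`. Whether these are the conditions that make SUPPA skip a line is an assumption, not something confirmed in this thread. `check_expression_file` and the transcript ID `TX_SHORT` are hypothetical.

```python
def check_expression_file(lines):
    """Return (line_number, problem) tuples for suspicious rows."""
    if not lines:
        return [(0, "file is empty")]
    problems = []
    # The header is assumed to contain sample names only (no ID column).
    n_samples = len(lines[0].rstrip("\n").split("\t"))
    for number, line in enumerate(lines[1:], start=2):
        values = line.rstrip("\n").split("\t")[1:]  # drop the transcript ID
        if len(values) != n_samples:
            problems.append(
                (number, f"expected {n_samples} values, found {len(values)}")
            )
            continue
        for value in values:
            try:
                float(value)
            except ValueError:
                problems.append((number, f"non-numeric value {value!r}"))
                break
    return problems


example = [
    "sample1\tsample2\n",
    "NR_046018\t0.146696232\t0\n",
    "NR_024540\t19.19284277\tNA\n",
    "TX_SHORT\t1.5\n",
]
for line_number, problem in check_expression_file(example):
    print(f"line {line_number}: {problem}")
```

On a real file, the same function can be called with `open(path).readlines()`. If every data row is reported, the separator or the header layout is the more likely culprit than individual values.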

diffSplice finishes with *.dpsi.temp.0

hi,

I'm running the following command

python suppa.py diffSplice -m empirical -gc -i isoforms.ioi \
  -p group1_isoform.psi group2_isoform.psi \
  -e group1.tpm group2.tpm -o filename

which finishes successfully, but the output file keeps the .temp.0 ending. It has the correct number of rows (same as the number of the TPM and psi files).

Python 3.6 and SUPPA 2.3

No shebang lines

The scripts do not run directly from the command line (as suggested by the README.md) because they lack a shebang line. It is not that simple, though, because adding a shebang line does not solve the problem. I have to run with:

python3 suppa.py

error with diffSplice

Hi, I am trying to run diffSplice, but getting the error below:

$ python ~/miniconda3/pkgs/suppa-2.3-py36_1/bin/suppa.py \
> diffSplice --method empirical \
> --input ../suppa_transcript_events_mm10.ioi \
> --psi NS_control_transcript_PSI_isoform.psi NS_GMTcDay1_transcript_PSI_isoform.psi \
> --tpm NS_control_merged.tpm NS_GMTcDay1_merged.tpm \
> --area 1000 \
> --lower-bound 0.05 \
> -gc \
> -o diff_transcript
Calculating differential analysis between conditions: NS_control_transcript_PSI_isoform and NS_GMTcDay1_transcript_PSI_isoform
ERROR:__main__:Unknown error: (<class 'UnboundLocalError'>, UnboundLocalError("local variable 'i' referenced before assignment",), <traceback object at 0x1a1f421448>)

PSI local events -1 value

Hello,

I have a quick question regarding the PSI values generated for local events. Some of my PSIs are -1.0. What does a negative value for PSI (not dPSI) mean?

Thanks

Meaning of the pvalue column after diffSplice

hi,

What is the meaning of the pvalue column after running diffSplice? Is this a BH "adjusted p-value" (similar to a q-value with pi0 = 1)? How is alpha used?

__main__: Unknown error: 'Namespace' object has no attribute 'which'

Hi,
I installed SUPPA v2.0 according to the instructions, but an error occurs when I run:

python3 ./suppa.py

It returned:
ERROR:main:Unknown error: (<class 'AttributeError'>, AttributeError("'Namespace' object has no attribute 'which'",), <traceback object at 0x7fc7fb352d48>)

How can I fix it?
Thank you very much.

Retrieve SUPPA version

How can I determine the version of SUPPA that I'm running on my machine?

Memory leak in diffSplice / significanceCalculator.py

Hi again,

I'm currently running hundreds of individual diffSplice comparisons and I have realized that for big datasets, i.e. comparisons with many replicates in one or both of the conditions, diffSplice is killed because it runs out of memory.

For most comparisons, I have less than 100 replicates in total, and 3 Gb seem to be sufficient for these cases. But as soon as the total number of replicates is higher than ~100-150, memory requirements start going through the roof. For comparisons involving a condition with 414 replicates (single cell dataset), I needed close to 100Gb of memory (and a looooot of patience for empirical P value calculation ;-).

This was on human data with ~110k transcripts and analyzing ~20k skipped exon events. The total combined size of the input files (tpm, psi & ioe files) is <500 Mb. Not surprisingly, when analyzing fewer events and/or fewer transcripts, the problem appears less dramatic, although I haven't really tried to look at it systematically. In any case, it appears to me that the way in which memory requirements scale with replicate number/transcripts/events is exponential, and I cannot really see why that would need to be the case...

If you like, I could provide you with some input files in order to replicate the issue, although I would imagine that you should not have any problem replicating the issue with your own super-sized expression and psi tables (perhaps randomly merged).

Best,
Alex

about psiPerEvent operation

Dear Eduardo,
I have a question about the psiPerEvent function. After using the generateEvents function on a GTF file to obtain 5 ioe files (SE, SS, MX, RI, FL), should I pass these ioe files to psiPerEvent separately, or merge the 5 ioe files into 1 before running psiPerEvent? I am not sure whether SUPPA treats the different alternative splicing patterns as unrelated events.
If so, then when using diffSplice, the p-value and adjusted p-value should also be calculated separately for each of the 5 patterns. The number of significant local events differs between patterns, so the multiple-testing-corrected p-values should differ too. For example, an ES event may differ significantly between 2 conditions after multiple-testing correction within the ES pattern, but when the events of all 5 patterns are merged, the same ES event may no longer be significant, or may have a different adjusted p-value, because the total number of events has changed.
I also noticed that in your paper "Large-scale analysis of genome and transcriptome alterations in multiple tumors unveils novel cancer-relevant splicing networks" (Genome Research, 2016), the BH method was used to correct for multiple testing. So, should I use diffSplice only to generate the raw p-values for each of the 5 patterns, and then merge the events of all 5 patterns and calculate the adjusted p-values with the BH correction myself?

Thank you very much!
Kind regards,
Elias
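The dependence on the number of tests described above can be made concrete with the standard Benjamini-Hochberg (BH) step-up adjustment. The sketch below is a plain implementation of that textbook procedure, not SUPPA's own code, and it does not say which of the two strategies the authors recommend. The p-values are invented for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented raw p-values for two event types.
se_pvalues = [0.01, 0.04]
a3_pvalues = [0.03, 0.005]

print("SE alone:", bh_adjust(se_pvalues))
print("A3 alone:", bh_adjust(a3_pvalues))
print("merged:  ", bh_adjust(se_pvalues + a3_pvalues))
```

In this toy example the second A3 event has an adjusted p-value of about 0.01 when A3 events are corrected alone and about 0.02 after merging, which is the effect the question describes: the same raw p-value yields a different adjusted value once the number of tests changes.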
