
Keep Me Around: Intron Retention Detection

License: GNU General Public License v2.0

R 37.15% Python 62.85%

kma's Introduction

Keep Me Around (`kma`): Intron Retention Detection

kma is an R package that performs intron retention estimation and detection using biological replicates and resampling. Updated code can always be found at https://github.com/pachterlab/kma

Installation

To install, first ensure you have the required packages:

required_packages <- c("devtools", "data.table", "reshape2", "dplyr")
install.packages(required_packages)

You can then install the package using devtools:

devtools::install_github("pachterlab/kma")

Assuming all goes well, load kma:

library("kma")

Tutorial

After it has been installed, please see the vignette in R:

vignette("kma")

Bugs and feature requests

Please file these on Github.

Future work

- Additional exploratory analysis plotting tools
- Provide differential intron usage analysis between experimental conditions
  - We currently have some ideas on how to do this and will likely be implementing it soon
- Provide time series analysis

Authors

Software was developed by Harold Pimentel. Methods were developed with Lior Pachter and John Conboy.

Related open source tools

Below you will find a list of related tools and how they differ from kma.

DEXSeq

DEXSeq tests for differential usage across genic regions. As a result, it does not determine whether an intron is being "used" (relative to transcript expression), only whether it is being "differentially used."

MISO

MISO can calculate the intronic percent spliced in (PSI), though it currently requires a modified annotation from their website. kma can currently work with any annotation, as the annotation is processed during the pre-processing step. Also, MISO does not currently provide built-in support for replicates.

kma's Issues

Error in "newIntronRetention"

Hi, when I run "newIntronRetention", I get the following error:

`summarise_each()` is deprecated.
Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
To map `funs` over a selection of variables, use `summarise_at()`
Error in summarise_impl(.data, dots) : 
  Evaluation error: invalid 'type' (character) of argument

I am running R 3.4.1 on macOS.
Thanks for your attention!

error in hypothesis testing step in kma

Hi,
After running the earlier steps successfully, I am trying to do the hypothesis testing, where kma uses a null distribution through the function retention_test.

I am getting this error:

Error in dim(ordered) <- ns : 
  dims [product 1] do not match the length of object [0]

How can I fix this?

Thanks

Question about --extend parameter

I'm trying to understand what the extend parameter is doing. Does it affect the intron sequences that are outputted into intron.fa during the preprocessing step?
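
A minimal sketch of one way to read `--extend`, assuming it widens each intron interval by N bases on both sides so that the extracted intron sequence overlaps the flanking exons. The function name `extend_interval`, the clamping at zero, and the coordinates are illustrative assumptions and are not taken from generate_introns.py.

```python
def extend_interval(start, stop, n):
    """Widen the interval [start, stop) by n bases on each side.

    The start is clamped at 0 so the interval never runs off the
    beginning of the reference sequence.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return max(0, start - n), stop + n


# Hypothetical coordinates, for illustration only.
print(extend_interval(1000, 1500, 20))
print(extend_interval(5, 50, 20))
```

If this reading is right, a larger N changes the sequences written for each intron, which would explain why the value is tied to the read length.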

Problem in R: computing denominator

Hello,
I tried to use kma to detect intron retention in my RNA-seq data. The problem is that the denominator is not computed in R for my data. Does anyone know what might have gone wrong?

Thanks in advance!

head(ir$flat)
                    intron sample   numerator denominator retention condition unique_counts
1 chr9:130579764-130595482 ortho1 178608.1000          NA        NA     ortho          2003
2 chr9:130595588-130609953 ortho1 106730.6000          NA        NA     ortho          1033
3 chr9:130609987-130612455 ortho1   1039.8910          NA        NA     ortho             2
4 chr9:130612505-130612955 ortho1      0.0000          NA        NA     ortho             0
5 chr9:130613027-130614287 ortho1      0.0000          NA        NA     ortho             0
6 chr9:130614345-130616354 ortho1    648.3534          NA        NA     ortho             1

generate_introns.py different number of introns in introns.bed and intron_to_transcripts.txt

I noticed that the introns.bed and introns.fa files contain more introns than the intron_to_transcripts.txt file.

Is this expected?

The problem is that I tried to run RSEM, but the FASTA file has more entries than the annotation (which is generated from the intron_to_transcripts.txt file).

What can kma do?

I have a question: can kma detect IR in a single sample?
For example, I have four samples: 1.xprs, 2.xprs, 3.xprs, 4.xprs.
Can I use kma to analyze IR for each of these four files separately?

Does kma support kallisto?

Hi,

Does kma support kallisto? If so, could you please provide a few commands showing how to process the kallisto output files?

inner_join in newIntronRetention

Dear all,

I had an issue with this function ('newIntronRetention') as I encountered the following error:
Error in Ops.data.frame(numExp, denomExp) :
‘/’ only defined for equally-sized data frames
which seems to be caused by 'retentionExp <- numExp/denomExp' in the function.
The data.tables numExp and denomExp are not of the same dimension. As numExp is derived from denomExp, I traced it back to the use of an 'inner_join' in the creation of numExp:
numExp <- select(denomExp, intron, intron_extension) %>%
    inner_join(tmp_targExpression, by = c("intron_extension")) %>%
    arrange(intron_extension)

Replacing the 'inner_join' with a 'left_join' solves the issue, but is this correct?
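
To make the difference concrete, here is a small Python sketch (not kma's R code) of why the two joins give tables of different sizes. The helper functions and the toy rows are hypothetical; only the column name `intron_extension` comes from the snippet above.

```python
def inner_join(left, right, key):
    """Keep only the left rows whose key also appears in right."""
    lookup = {row[key]: row for row in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]


def left_join(left, right, key, fill=None):
    """Keep every left row; unmatched rows get `fill` in the right-hand columns."""
    lookup = {row[key]: row for row in right}
    right_cols = sorted({c for row in right for c in row if c != key})
    out = []
    for l in left:
        match = lookup.get(l[key], {c: fill for c in right_cols})
        out.append({**l, **match})
    return out


# Toy data: intron "b" has no expression entry.
denom = [{"intron_extension": "a"}, {"intron_extension": "b"}, {"intron_extension": "c"}]
expr = [{"intron_extension": "a", "tpm": 1.0}, {"intron_extension": "c", "tpm": 3.0}]

print(len(inner_join(denom, expr, "intron_extension")))  # rows with a match only
print(len(left_join(denom, expr, "intron_extension")))   # same row count as denom
```

With a left join, the numerator table keeps the same number of rows as the denominator table, and introns with no match carry a missing value instead of being dropped. Whether a missing retention value is the intended behaviour for such introns is a question for the maintainers.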

with kind regards,
Aldo

Install error on linux

I had an issue installing kma on CentOS running R version 3.1.0 (2014-04-10) using RStudio.

The installation raised the following error:
Error: /tmp/Rtmp519pGn/R.INSTALL17f366f68149/kma-master/man/check_groupings.Rd:16: Bad \link text

and then failed to install. Not sure why this was a fatal error since this is just the manual text.

So I downloaded the repository from GitHub and changed line 16 from:
\link{\code{dplyr::group_by}}
to:
\code{\link{dplyr::group_by}}

This resulted in a successful install with just a warning:
Rd warning: /tmp/RtmpKJbQaL/R.INSTALL188737d8a723/kma-master/man/check_groupings.Rd:16: missing link ‘dplyr::group_by’

eXpress runs out of memory

Hi,

Does anyone know how much memory eXpress normally requires?
I run out of memory after a few minutes when I run:

express -o 2monthsCTX/WT/WT1/xprs_out annotation/mm10_and_introns.fa 2monthsCTX/WT/WT1/mm10.bam

My mm10_and_introns.fa is 3.6 GB
and my mm10.bam file is 9.2 GB.

I have 42 GB of RAM allocated on my Linux virtual machine.
Does anyone know how much I would need to run at least one sample?

documentation 404

I found a small issue in the documentation.

The line:

eXpress against augmented

in kma/vignettes/kma.Rmd is broken, and gives a 404.

numerator && denominator

Hi,
We used the kma package to detect RI in our data. The results contain numerator and denominator columns. Could you tell us what the numerator and denominator mean?

Thanks !
Best wishes!
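
As a rough sketch of how the two columns relate: another issue on this page quotes the line `retentionExp <- numExp/denomExp` from newIntronRetention, which suggests that retention is the numerator divided by the denominator. The interpretation in the comments below (intron expression over the summed expression of the transcripts associated with that intron) is an assumption drawn from the tracebacks on this page, not a statement from the kma authors, and the function name is hypothetical.

```python
def retention(numerator, denominator):
    """Return numerator / denominator, or None when it cannot be computed.

    Assumed meaning (not confirmed by the kma documentation):
      numerator   -- expression attributed to the intron itself
      denominator -- summed expression of the transcripts associated with it
    """
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


# Hypothetical values, for illustration only.
print(retention(2.0, 8.0))
print(retention(2.0, None))
```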

Error 'KeyError:3'

Please see my error trying to preprocess using kma

lynnyi⟫ python2.7 /home/lynnyi/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py --genome ~/genomes/Mus_musculus.GRCm38.dna.alt.fa --gtf ~/genomes/Mus_musculus.GRCm38.91.gtf --extend 20 --out ~/introns
INFO: Reading in GTF: /home/lynnyi/genomes/Mus_musculus.GRCm38.91.gtf
INFO: Grouping transcripts by gene
INFO: Writing intron BED file: /home/lynnyi/introns/introns.bed
INFO: Computing intron-to-transcript compatability
INFO: Opening FASTA: /home/lynnyi/genomes/Mus_musculus.GRCm38.dna.alt.fa
INFO: Note: will take a while the first time it is opened.
Traceback (most recent call last):
  File "/home/lynnyi/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py", line 154, in <module>
    main()
  File "/home/lynnyi/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py", line 151, in main
    bed_to_introns(bed_out, args.genome, introns_out)
  File "/home/lynnyi/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py", line 55, in bed_to_introns
    seq = fasta[ref][int(start):int(stop)]
  File "/usr/local/lib/python2.7/dist-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '3'
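
A `KeyError` raised at `self.index[i]` means the sequence name taken from the annotation is not a key in the FASTA index. One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that the index is keyed by the full FASTA header line (for example `3 dna:chromosome ...`) while the GTF refers to the sequence as `3`. The sketch below lists GTF sequence names that have no exactly matching FASTA header. The function names and the sample lines are hypothetical.

```python
def fasta_keys(fasta_lines):
    """Return the full header text (without '>') of every FASTA record."""
    return {line[1:].rstrip("\n") for line in fasta_lines if line.startswith(">")}


def gtf_seqnames(gtf_lines):
    """Return the values of the first GTF column, skipping comments and blanks."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(gtf_lines, fasta_lines):
    """Sequence names used by the GTF that are not exact FASTA header keys."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_keys(fasta_lines))


# Hypothetical in-memory stand-ins for the real files.
fasta = [">3 dna:chromosome example header\n", "ACGT\n"]
gtf = ["#!comment line\n", '3\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g";\n']

print(missing_from_fasta(gtf, fasta))
```

On real data the same functions can be fed open file handles. A non-empty result points at a naming mismatch between the two files instead of a problem inside kma.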

Another "newIntronRetention" error

Everything works up to:

> ir <- newIntronRetention( xprs$tpm, intron_to_trans, xprs$condition, xprs$uniq_counts )
'melting' unique counts
computing denominator
Error in sum(strand) : invalid 'type' (character) of argument

My "intron_to_trans" looks like:

> head( intron_to_trans )
                      intron         target_id               gene
1:  chr7:140534672-140549910 ENST00000288602.6  ENSG00000157764.8
2:  chr7:140534672-140549910 ENST00000469930.1  ENSG00000157764.8
3:  chr7:140534672-140549910 ENST00000497784.1  ENSG00000157764.8
4: chr12:121881596-121881814 ENST00000536437.1 ENSG00000089094.12
5: chr12:121881596-121881814 ENST00000377071.4 ENSG00000089094.12
6: chr12:121881596-121881814 ENST00000377069.4 ENSG00000089094.12
            intron_extension strand
1:  chr7:140534651-140549931      -
2:  chr7:140534651-140549931      -
3:  chr7:140534651-140549931      -
4: chr12:121881575-121881835      -
5: chr12:121881575-121881835      -
6: chr12:121881575-121881835      -

"strand" is "+" or "-" ...

> table( intron_to_trans$strand )

     -      + 
477309 491376

On a whim I made "strand" a factor with levels c("+","-"), but that failed similarly:

> ir <- newIntronRetention( xprs$tpm, intron_to_trans, xprs$condition, xprs$uniq_counts )
'melting' unique counts
computing denominator
Error in Summary.factor(c(NA_integer_, NA_integer_), na.rm = FALSE) : 
  ‘sum’ not meaningful for factors

Any advice would be appreciated ...
Thanks!

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/ubuntu/KMA/R-3.6.1/lib/libRblas.so
LAPACK: /home/ubuntu/KMA/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kma_0.1.0   dplyr_0.8.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        crayon_1.3.4      assertthat_0.2.1  R6_2.4.0         
 [5] plyr_1.8.4        magrittr_1.5      pillar_1.4.2      stringi_1.4.3    
 [9] rlang_0.4.0       reshape2_1.4.3    data.table_1.12.4 tools_3.6.1      
[13] stringr_1.4.0     glue_1.3.1        purrr_0.3.2       compiler_3.6.1   
[17] pkgconfig_2.0.3   tidyselect_0.2.5  tibble_2.1.3
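
The message `Error in sum(strand) : invalid 'type' (character) of argument` suggests that the summarisation step tries to sum the character `strand` column together with the expression columns. A general remedy is to sum only the numeric columns within each group. The sketch below shows that idea in Python for illustration. It is not kma's code, the helper name is hypothetical, and the values are made up. Only the column names `intron` and `strand` come from the output above.

```python
from collections import defaultdict
from numbers import Number


def sum_numeric_by(rows, key):
    """Group rows by `key` and sum only the numeric columns, ignoring the rest."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        for col, val in row.items():
            # bool is a Number subclass; exclude it so flags are never summed.
            if col != key and isinstance(val, Number) and not isinstance(val, bool):
                totals[row[key]][col] += val
    return {k: dict(v) for k, v in totals.items()}


# Made-up rows: 'strand' is a string and must not reach the sum.
rows = [
    {"intron": "i1", "strand": "-", "sample1": 1.5, "sample2": 2.0},
    {"intron": "i1", "strand": "-", "sample1": 0.5, "sample2": 1.0},
    {"intron": "i2", "strand": "+", "sample1": 4.0, "sample2": 0.0},
]

print(sum_numeric_by(rows, "intron"))
```

In R, the equivalent would be to drop or exclude `strand` before the summarise step. Whether that matches the authors' intent for this column is not stated on this page.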

newIntronRetention error

computing denominator
Error in summarise_impl(.data, dots): invalid 'type' (character) of argument
Traceback:

newIntronRetention(xprs$tpm, intron_to_trans, xprs$condition)
left_join(intronToUnion, targetExpression, by = "target_id") %>%
. group_by(intron) %>% select(-(target_id)) %>% summarise_each(funs(sum),
. -matches("gene"), -matches("intron_extension")) %>% arrange(intron) %>%
. left_join(select(intronToUnion, intron, intron_extension) %>%
. distinct(), by = c("intron"))
withVisible(eval(quote(_fseq(_lhs)), env, env))
eval(quote(_fseq(_lhs)), env, env)
eval(expr, envir, enclos)
_fseq(_lhs)
freduce(value, _function_list)
function_list[i]
summarise_each(., funs(sum), -matches("gene"), -matches("intron_extension"))
summarise_each_(tbl, funs, lazyeval::lazy_dots(...))
summarise_(tbl, .dots = vars)
summarise_.tbl_df(tbl, .dots = vars)
summarise_impl(.data, dots)

error "object 'pvalue' not found" when using all_zc() R function

When computing all_zc as detailed in the vignette, I get:

"Error in eval(substitute(expr), envir, enclos) :
object 'pvalue' not found"

This part of the vignette:
zc_fnames <- Sys.glob(file.path(base_dir, "experiment/*/*/zero_coverage.txt"))
zc_samples <- sub(file.path(base_dir, "experiment/[a-z]+/"), "", zc_fnames) %>%
  sub("zero_coverage.txt", "", .) %>%
  gsub("/", "", .)
zc_conditions <- sub("[0-9]+", "", zc_samples)
all_zc <- get_batch_intron_zc(zc_fnames, zc_samples, zc_conditions)
head(all_zc)

This was fixed for me by setting the .keep_all parameter of dplyr's distinct() function to TRUE, i.e. distinct(.keep_all = TRUE), within the zero_coverage.R file.

Rupert

Error in generate_introns.py.

When I run this command:

python $PRE/generate_introns.py --genome $path_fa --gtf $path_gtf --extend 25 --out results/kma

I get the following error:
Traceback (most recent call last):
  File "/home/zhengjt/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py", line 154, in <module>
    main()
  File "/home/zhengjt/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py", line 151, in main
    bed_to_introns(bed_out, args.genome, introns_out)
  File "/home/zhengjt/R/x86_64-pc-linux-gnu-library/3.4/kma/pre-process/generate_introns.py", line 55, in bed_to_introns
    seq = fasta[ref][int(start):int(stop)]
  File "/home/zhengjt/.local/lib/python2.7/site-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: 'MG4136_PATCH'

My FASTA file is Mus_musculus.GRCm38.75.dna.SORTED.fa.
Part of the file:

>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Please help.
Thanks in advance!
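
If the lookup fails because the FASTA index is keyed by the whole header line while the annotation uses only the short sequence name (an assumption, not something confirmed in this thread), one workaround is to rewrite the FASTA so each header keeps only its first whitespace-delimited token. The sketch below does that with the standard library. The function name and the file names in the usage comment are hypothetical.

```python
def trim_fasta_headers(lines):
    """Yield FASTA lines with each header cut to its first whitespace-delimited token."""
    for line in lines:
        if line.startswith(">"):
            # '>1 dna:chromosome chromosome:GRCm38:...' becomes '>1'
            yield line.split()[0] + "\n"
        else:
            yield line


# Usage with hypothetical file names:
# with open("genome.fa") as src, open("genome.short_names.fa", "w") as dst:
#     dst.writelines(trim_fasta_headers(src))

sample = [">1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF\n", "NNNN\n"]
print(list(trim_fasta_headers(sample)))
```

If the sequence named in the error (here `MG4136_PATCH`) is absent from the FASTA altogether, trimming headers will not help. In that case the annotation and the genome would need to come from matching releases, or the annotation would need to be restricted to sequences present in the FASTA.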

Installation failed

Installation failed: An unknown option was passed in to libcurl.
R version 3.4.4, CentOS7
Please help !

couldn't see vignette after installation

Hi, I installed kma on my MacBook, running R 3.2.0. Everything seems fine until I try to view the vignette. Here's the error:

library("kma")
vignette("kma")
Warning message:
vignette ‘kma’ not found

Could you please direct me to a solution?

Thanks
Lin

Error 11 in running generate_introns.py

Hi, I encountered an error when running generate_introns.py:

(base) [xfu1@sjgenappprdn01 hg38]$ python2 /export/home/xfu1/Profiler/Profiler/kma/inst/pre-process/generate_introns.py --genome /export/home/xfu1/hg38/Homo_sapiens.GRCh38.dna.alt.fa --gtf /export/home/xfu1/hg38/Homo_sapiens.GRCh38.108.gtf --extend 20 --out /export/home/xfu1/Profiler/Profiler/kma/out_dir
INFO: Reading in GTF: /export/home/xfu1/hg38/Homo_sapiens.GRCh38.108.gtf
INFO: Grouping transcripts by gene
INFO: Writing intron BED file: /export/home/xfu1/Profiler/Profiler/kma/out_dir/introns.bed
INFO: Computing intron-to-transcript compatability
INFO: Opening FASTA: /export/home/xfu1/hg38/Homo_sapiens.GRCh38.dna.alt.fa
INFO: Note: will take a while the first time it is opened.
Traceback (most recent call last):
  File "/export/home/xfu1/Profiler/Profiler/kma/inst/pre-process/generate_introns.py", line 154, in <module>
    main()
  File "/export/home/xfu1/Profiler/Profiler/kma/inst/pre-process/generate_introns.py", line 151, in main
    bed_to_introns(bed_out, args.genome, introns_out)
  File "/export/home/xfu1/Profiler/Profiler/kma/inst/pre-process/generate_introns.py", line 55, in bed_to_introns
    seq = fasta[ref][int(start):int(stop)]
  File "/export/home/xfu1/.local/lib/python2.7/site-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '11'

I used the human reference genome downloaded from Ensembl. Can someone help me fix this? Thanks in advance!
-Eddie

--extend

What does the following mean: "N is the number of bases to overlap into the intron. Note, this should be at most read_length - 1, but we suggest making it smaller."?

Pre-processing with generate_introns.py failing

When generate_introns.py begins grouping transcripts by gene I receive the following:

Traceback (most recent call last):
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 154, in <module>
    main()
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/generate_introns.py", line 104, in main
    g2t = intron_ops.reduce_to_gene(gtf_list)
  File "/home/kasowitz/R/x86_64-redhat-linux-gnu-library/3.2/kma/pre-process/intron_ops.py", line 45, in reduce_to_gene
    cur_gene = trans.gene_id
AttributeError: 'NoneType' object has no attribute 'gene_id'

I am using the genome and annotations from Ensembl. Looking back at the earlier output of this processing step, I see an occasional line reading
'NoneType' object has no attribute 'add_exon'
which I assume are the same objects causing the intron_ops exit.

Output of a new GTF file with IR events?

This tool looks very useful, but from what I can see it does not currently output a GTF of the transcriptome that includes the novel IR events it identifies. Is this true, and is it possible to add such a feature? I imagine it would be tricky to know which existing transcripts to duplicate before adding an IR event to them, but if it were done for all transcripts that could contain that IR event, I think it could work.

HISAT2 - cufflinks

I would like to use KMA, but I use HISAT2 and cufflinks for the alignment and quantification. How can I implement KMA in this pipeline for intron retention detection?

Thanks in advance!

Can quantification by kallisto be supported?

Hi,

Since eXpress is no longer maintained and kallisto is among the state-of-the-art transcript quantification tools, can kallisto output (quantified on transcript.fa) be supported by KMA? Thanks.

kma-kallisto

Hi!

I can see that a zip file was added for 'kma-kallisto' a few years back - I've downloaded this, but the included vignette still describes quantification with eXpress. I've also tried to install as an R package, but without success (this may be my own error).

Could I check if kallisto is now supported, and if this is the correct route by which to analyse kallisto output with kma?

Best wishes,
Kevin