andymckenzie / dgca Goto Github PK

Differential Gene Correlation Analysis

R 57.79% HTML 42.21%

correlation r cran rna-seq differential-correlation gene-expression

dgca's Introduction

DGCA

The goal of DGCA is to calculate differential correlations across conditions.

It simplifies the process of seeing whether two correlations are different without having to rely solely on parametric assumptions by leveraging non-parametric permutation tests and adjusting the resulting empirical p-values for multiple corrections using the qvalue R package.

It also has several other options including calculating the average differential correlation between groups of genes, gene ontology enrichment analyses of the results, and differential correlation network identification via integration with MEGENA.

Installation

You can install DGCA from CRAN with:

install.packages("DGCA")

You can install the development version of DGCA from github with:

# install.packages("devtools")
devtools::install_github("andymckenzie/DGCA")

Basic Example

library(DGCA)
data(darmanis); data(design_mat)
ddcor_res = ddcorAll(inputMat = darmanis, design = design_mat, compare = c("oligodendrocyte", "neuron"))
head(ddcor_res, 3)
#   Gene1  Gene2 oligodendrocyte_cor oligodendrocyte_pVal neuron_cor neuron_pVal
# 1 CACYBP   NACA        -0.070261455           0.67509118  0.9567267           0
# 2 CACYBP    SSB        -0.055290516           0.74162636  0.9578999           0
# 3 NDUFB9    SSB        -0.009668455           0.95405875  0.9491904           0
#   zScoreDiff     pValDiff     empPVals pValDiff_adj Classes
# 1  10.256977 1.100991e-24 1.040991e-05    0.6404514     0/+
# 2  10.251847 1.161031e-24 1.040991e-05    0.6404514     0/+
# 3   9.515191 1.813802e-21 2.265685e-05    0.6404514     0/+

Vignettes

There are three vignettes available in order to help you learn how to use the package:

DGCA Basic: This will get you going quickly.
DGCA: This is a more extended version that explains a bit about how the package works and shows several of the options available in the package.
DGCA Modules: This will show you how to use the package to perform module-based and network-based analyses.

The second two vignettes can be found in inst/doc.

Applications

You can view the manuscript describing DGCA in detail as well as several applications here:

http://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-016-0349-1

Material for associated simulations and networks created from MEGENA can be found here:

https://github.com/andymckenzie/dgca_manuscript

dgca's People

Contributors

Stargazers

Watchers

Forkers

genomicsnx rintukutum datasciencebio nbahti choijamtsmunkhzul eugeniaradulescu dongfang1021 qindan2008 jalilsharif shaohua9511

dgca's Issues

Suggest dCorMats truncate correlation above corr_cutoff and below -corr_cutoff

Hi,

atanh(1) = NaN and atanh(-1) = -Inf. -1 correlation will lead to instable or wrong result in the downstream calculation.

dCorMats <- function (matA, nmatA, matB, nmatB, corr_cutoff = 0.99, corrType = "pearson", 
    secondMat = FALSE, signType = "none") 
{
    if (!identical(dim(matA), dim(nmatA))) {
        stop("Matrices are not the same size.")
    }
    if (!identical(dim(matA), dim(matB))) {
        stop("Matrices are not the same size.")
    }
    if (!identical(dim(matA), dim(nmatB))) {
        stop("Matrices are not the same size.")
    }
    n_col = ncol(matA)
    n_row = nrow(matA)
    
    ## need to set cutoff on both -1 and 1 end to have stable result
    ## atanh(-1) = -Inf, dz -> Inf/-Inf
    
    ## added lines of code
    ##corr_cutoff <- abs(corr_cutoff)
    matA[matA > corr_cutoff] = corr_cutoff
    ##matA[matA < -corr_cutoff] = -corr_cutoff
    matB[matB > corr_cutoff] = corr_cutoff
    ##matB[matB < -corr_cutoff] = -corr_cutoff
    
    
    if (signType == "positive") {
        matA[matA < 0] = 0
        matB[matB < 0] = 0
    }
    if (signType == "negative") {
        matA[matA > 0] = 0
        matB[matB > 0] = 0
    }
    if (!secondMat) {
        matA = matA[upper.tri(matA)]
        nmatA = nmatA[upper.tri(nmatA)]
        matB = matB[upper.tri(matB)]
        nmatB = nmatB[upper.tri(nmatB)]
        dcorr_res = DGCA::dCorrs(as.vector(matA), as.vector(nmatA), 
            as.vector(matB), as.vector(nmatB), corrType = corrType)
        diffs = matrix(NA, ncol = n_col, nrow = n_row)
        diffs[upper.tri(diffs)] = dcorr_res
        rownames(diffs) = rownames(matA)
        colnames(diffs) = colnames(matA)
        pvals_res = 2 * (pnorm(abs(dcorr_res), mean = 0, sd = 1, 
            lower.tail = FALSE, log.p = FALSE))
        pvals = matrix(NA, ncol = n_col, nrow = n_row)
        pvals[upper.tri(pvals)] = pvals_res
        rownames(pvals) = rownames(matA)
        colnames(pvals) = colnames(matA)
    }
    else {
        dcorr_res = DGCA::dCorrs(as.vector(matA), as.vector(nmatA), 
            as.vector(matB), as.vector(nmatB), corrType = corrType)
        diffs = matrix(dcorr_res, ncol = n_col, nrow = n_row)
        rownames(diffs) = rownames(matA)
        colnames(diffs) = colnames(matA)
        pvals = 2 * (pnorm(abs(dcorr_res), mean = 0, sd = 1, 
            lower.tail = FALSE, log.p = FALSE))
        pvals = matrix(pvals, ncol = n_col, nrow = n_row)
        rownames(pvals) = rownames(matA)
        colnames(pvals) = colnames(matA)
    }
    return(list(diffs = diffs, pvals = pvals))
}

differential correlations among many conditions

Hello,

I was wondering if the DGCA tool is suitable for comparing the most differential correlations among 11 conditions. In particular, I would like to focus on the differences of correlation of cell types frequences among these 11 conditions.

Thanks a lot in advance.

Exact p values from moduleDC()

Hello,
Is there a way to return non rounded empirical p-values from the moduleDC() function? The values returned appear to be rounded to two decimal places, which is particularly unhelpful for low values which are rounded to zero. I am. hoping to get something more like the differential correlation pvals returned ddcorAll() when running the pairwise DGCA pipeline.

Any guidance much appreciated, thank you!

dcTopPairs removes NA and NaN values leading to missing rows in the result

Hi,

sort(pValDiff, index.return = 1) will remove NA and NaN and the resulted values are not of the same length as the upper.tri part.

We would also request to have nsample for condition A and B in the topPairs table.

Thank you.

dcTopPairs <- function (dcObject, nPairs, adjust = "none", plotFdr = FALSE, 
    classify = TRUE, sigThresh = 1, corSigThresh = 0.05, zScorePerm = NULL, 
    verbose = FALSE, compare = NULL, secondMat = FALSE) 
{
    SAF = getOption("stringsAsFactors", FALSE)
    on.exit(options(stringsAsFactors = SAF))
    options(stringsAsFactors = FALSE)
    if (!is(dcObject, "dcPair")) 
        stop("Input to dcTopPairs must be a dcPair object.")
    pValDiff = slot(dcObject, "PValDiff")

    ## NaN and NA are excluded by sorted
    sorted = sort(pValDiff, index.return = 1)
    sorted_pvals = sorted$x
    sorted_indx = sorted$ix[1:nPairs]
    if (!secondMat) {
        indx = which(upper.tri(pValDiff, diag = FALSE), arr.ind = TRUE)
        corA = slot(dcObject, "corA")
        gene1 = rownames(corA)[indx[, 1]][sorted_indx]
        gene2 = colnames(corA)[indx[, 2]][sorted_indx]
    }
    else {
        dIndx = arrayInd(sorted_indx, dim(pValDiff))
        corA = slot(dcObject, "corA")
        gene1 = rownames(corA)[dIndx[, 1]]
        gene2 = colnames(corA)[dIndx[, 2]]
    }
    extractAndOrderSlots <- function(extractSlot) {
        tmp = slot(dcObject, extractSlot)
        if (!secondMat) 
            tmp = tmp[upper.tri(tmp)]
        tmp = tmp[sorted_indx]
        return(tmp)
    }
    corA = extractAndOrderSlots("corA")
    corApVal = extractAndOrderSlots("corPvalA")
    corB = extractAndOrderSlots("corB")
    corBpVal = extractAndOrderSlots("corPvalB")
    zScoreDiff = extractAndOrderSlots("ZDiff")
    pValDiff_unadj = extractAndOrderSlots("PValDiff")
    if (is.null(compare)) {
        colnames_topPairs = c("Gene1", "Gene2", "corA", "corA_pVal", 
            "corB", "corB_pVal", "zScoreDiff", "pValDiff")
    }
    else {
        colnames_topPairs = c("Gene1", "Gene2", paste0(compare[1], 
            "_cor"), paste0(compare[1], "_pVal"), paste0(compare[2], 
            "_cor"), paste0(compare[2], "_pVal"), "zScoreDiff", 
            "pValDiff")
    }
    topPairs = data.frame(gene1, gene2, corA, corApVal, corB, 
        corBpVal, zScoreDiff, pValDiff_unadj, stringsAsFactors = FALSE)
    if (!adjust == "none") {
        if (!adjust == "perm") {
            pValDiffAdj_sorted = adjustPVals(pValDiff_unadj, 
                adjust = adjust)
            topPairs = data.frame(topPairs, pValDiffAdj_sorted, 
                stringsAsFactors = FALSE)
            colnames_topPairs = c(colnames_topPairs, "pValDiff_adj")
        }
        else {
            pValDiffAdj = permQValue(dcObject, zScorePerm, secondMat = secondMat, 
                testSlot = "ZDiff", verbose = verbose, plotFdr = plotFdr)
            empPVals_sorted = pValDiffAdj[["empPVals"]][sorted_indx]
            pValDiffAdj_sorted = pValDiffAdj[["pValDiffAdj"]][sorted_indx]
            topPairs = data.frame(topPairs, empPVals_sorted, 
                pValDiffAdj_sorted, stringsAsFactors = FALSE)
            colnames_topPairs = c(colnames_topPairs, "empPVals", 
                "pValDiff_adj")
        }
    }
    message("Classifying the differential correlation calls.")
    if (classify == TRUE) {
        if (adjust == "none") {
            class_list = dCorClass(corA, corApVal, corB, corBpVal, 
                pValDiff_unadj, sigThresh = sigThresh, corSigThresh = corSigThresh)
        }
        else {
            class_list = dCorClass(corA, corApVal, corB, corBpVal, 
                pValDiffAdj_sorted, sigThresh = sigThresh, corSigThresh = corSigThresh)
        }
        class_factor = factor(class_list, levels = c(0, 1, 2, 
            3, 4, 5, 6, 7, 8, 9), labels = c("NonSig", "+/+", 
            "+/0", "+/-", "0/+", "0/0", "0/-", "-/+", "-/0", 
            "-/-"))
        topPairs = data.frame(topPairs, class_factor, stringsAsFactors = FALSE)
        colnames_topPairs = c(colnames_topPairs, "Classes")
    }
    colnames(topPairs) = colnames_topPairs
    return(topPairs)
}

Here is my quick fix. There are probably better solutions.

  ## fix: sort omits NA and NaN
  pValDiff_impute <- pValDiff
  pValDiff_impute[upper.tri(pValDiff) & is.na(pValDiff)] <- Inf

  sorted = sort(pValDiff_impute, index.return = 1)

DGCA::matCorSig wrong indices used for updating pvals

Hi,

corrs is a vector, with a length same as the values in the upper.tri(pvals). abs(corrs) == 1 is not the correct indices for pvals.

DGCA::matCorSig
function (corrs, nsamp, secondMat = FALSE) 
{
    if (!secondMat) {
        pvals = matrix(NA, nrow = nrow(corrs), ncol = ncol(corrs))
        corrs = corrs[upper.tri(corrs)]
        nsamp = nsamp[upper.tri(nsamp)]
        df = nsamp - 2
        pvals_upper = 2 * (1 - pt(abs(corrs) * sqrt(df)/sqrt(1 - 
            corrs^2), df))
        pvals[upper.tri(pvals)] = pvals_upper

        ## corrs is a vector, length is the same as the upper.tri
        pvals[abs(corrs) == 1] = 0

        ## suggested fix
        ## pvals[which(upper.tri(pvals))[abs(corrs) == 1]] = 0
    }
    else {
        df = nsamp - 2
        pvals = matrix(2 * (1 - pt(abs(corrs) * sqrt(df)/sqrt(1 - 
            corrs^2), df)), nrow = nrow(corrs))
        pvals[abs(corrs) == 1] = 0
    }
    return(pvals)
}

Take actual correlation values of gene pairs

Hi Andy and many thanks for your useful package.
I have a gene pairs dataframe (mydf):

Psma4 Mcpr1
9430016H08Rik 1110034B05Rik
LOC433494 Itgb4bp
4933427E11Rik Jrk
Mfhas1 LOC434299
LOC436084 Eef1a1
LOC434064 Gm461

I can very easily plot the correlation of expression values between condition A and B for the first gene pair for example:

plotCors(inputMat = mydf, design = design_mat, compare = c("conditionA", "conditionB"), geneA = "Psma4", geneB = "Mcpr1")

but i was wondering whether there is any function implemented in your package that could give me the actual correlation value for every gene pair in this df.

Thanks,
Dimitris

dCorrs need to check n1>3 and n2>3

Hi,

dCorrs need to check n1>3 and n2>3

dCorrs <- function (rho1, n1, rho2, n2, corrType = "pearson") 
{
    if (!(all.equal(length(rho1), length(n1), length(rho2), length(n2)))) 
        stop("All of the input vectors must be the same length.")
    zr1 = atanh(rho1)
    zr2 = atanh(rho2)
    if (corrType == "pearson") {
        diff12 = (zr2 - zr1)/sqrt((1/(n1 - 3)) + (1/(n2 - 3)))
    }
    if (corrType == "spearman") {
        diff12 = (zr2 - zr1)/sqrt((1.06/(n1 - 3)) + (1.06/(n2 - 
            3)))
    }
    return(diff12)
}

Here is my quick fix.

DGCA_dCorrs <- function (rho1, n1, rho2, n2, corrType = "pearson")
{
  if (!(all.equal(length(rho1), length(n1), length(rho2), length(n2))))
    stop("All of the input vectors must be the same length.")
  zr1 = atanh(rho1)
  zr2 = atanh(rho2)
  if (corrType == "pearson") {
    diff12 = (zr2 - zr1)/sqrt((1/(n1 - 3)) + (1/(n2 - 3)))
  }
  if (corrType == "spearman") {
    diff12 = (zr2 - zr1)/sqrt((1.06/(n1 - 3)) + (1.06/(n2 -
                                                         3)))
  }
  ## NaN is more appropriate than NA
  diff12[n1 <= 3 | n2 <= 3] <- NaN
  return(diff12)
}

Hello, I'm having an issue with ddMEGENA function.

The error I am getting is this:

`####### PFN Calculation commences ########

Error in serial.PFN(sortedEdge = ijw, Ng = N, maxENum = maxENum) :
The input PFG has exceeding number of edges.
`

This function works great on all my datasets except one and I do not know why?
I don't know if this is the write place to ask. Im using ddcorAll() results and feeding them
into ddMEGENA(). I don't why this error is happening!

I am having issues using the parallelized version of ddMEGENA?

I know ddMEGENA is a wrapper for MEGENA. But I know you discuss how to use ddMEGENA in your DGCA Modules vingette. When I am running the parallelized version of ddMEGENA I am getting a warning:
1: In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed, :scheduled cores 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 16, 17, 19 did not deliver results, all values of the jobs will be affected

Do you have a vignette/tips on how go use the parallelized version of ddMEGENA. * I already have installed the doMC package *

Thanks,
Ashay

Running ddcorrAll with nPairs does not result in the same top correlations as running without

Hi,

Great package! Just managed to run into an issue while using it.

Reading the manuals (basic, advanced, and pdf-manual) I find the following information about using nPairs for ddcorrAll:

To run the full differential correlation analysis and extract all of the top differentially correlated pairs...

and

nPairs
Either a number, specifying the number of top differentially correlated identifier pairs to display in the resulting table, or a the string "all" specifying that all of the pairs should be returned. If splitSet is specified, this is reset to the number of non-splitSet identifiers in the input matrix, and therefore will not be evaluated

So it sounds like the ddcorAll should still perform the full analysis, but only return the top nPairs correlations. But when I actually run ddcorAll I get completely different top hits when using nPairs vs not using nPairs, and the analysis runs faster with nPairs specified. This makes it feel like something is wrong and that the parameter simply limits the number of samples analyzed and not the size of the output table.

I don't know if it's a bug or if I have just misunderstood how it's supposed to work and it's actually doing what it's supposed to be doing. Googling this gave me no clarity 😬

Thank you! 😅

Error in moduleDc

Hi Andy,

I am trying to compare the expression of modules generated by WGCNA between two genotypes.

I am facing an error in moduleDC:

moduleDC_res = moduleDC(inputMat = data.frame(t(vstTissue)), design = dmat, compare = c("WT_DEL", "DEL"), genes = modules$values, labels = modules$ind, nPerm = 1, number_DC_genes = 3,dCorAvgMethod = "median")
Calculating MDC for module #1, which is called Module_23
Calculating permutation number 1.
Classifying the differential correlation calls.
Calculating the differential correlation average.
Error in matrixStats::colMedians(zdiff_perm_gene_nonself) : 
  Argument 'dim' must be an integer vector of length two.

I checked the structures of different objects. Do you have insight into the issue.

Thanks in advance.

Issue with permutation test setting for ddcorAll

I am having an issue at the moment running a large number of permutations using the ddcorAll.
Not unexpectedly, running these permutations requires a lot of memory. However, it seems to be prohibitively the case if I attempt to run 1000 permutations, even if I submit this job to our HPC with additional resources.
Is it possible to run this function on 1000 permutations? Was it never intended to run so many?

Any advice/guidance/tips to make this possible would be very much appreciated.

Kind regards,
Sarah

sorting heatmaps

Hi, I have a question about the sortby option in the ddcorall function. The manual only gives "zScoreDiff" as an option. I'd like to leave the order of the genes as they are in the original matrix. Is this a possible option? What would be the character string? thanks.

Differential significance across differential correlation classes?

Hi @andymckenzie,

Thank you for developing this amazing resource!

I am working on translating DGCA to a similar use case, but instead correlating the expression of genes, I am computing correlations between gene expression and peak accessibility (as measured by open chromatin profiling techniques such as ATAC-seq). So far, my results look promising.

However, I have noticed that the "+/-" and "-/+" differential correlation classes typically have higher differential correlation Z scores and are typically "more statistically significant" relative to the other differential correlation classes:

classes <- dCorClass(corsB = p2g.2$Correlation,
                           corsA = p2g.1$Correlation,
                           pvalsB = p2g.2$FDR,
                           pvalsA = p2g.1$FDR,
                           dCorPVals = pvals_res_adj,
                           sigThresh = 1e-2,
                           corSigThresh = 1e-2,
                           convertClasses = T)

Based on my interpretation of my code above, I used 0.01 as the threshold for calling the correlation in each condition statistically significant and as the threshold for calling a differential correlation. Based on how a peak-gene pair passes these thresholds, it is sorted into one of the nine differential correlation classes or is labeled "NonSig" if the differential correlation test FDR p-value >0.01.

My interpretation of the "+/-" and "-/+" differential correlation classes typically having higher differential correlation Z scores and lower p-values, is that this may arise naturally due to the natural difference in magnitude in a "+/-" or "-/+" scenario relative to a "+/0" or "-/0" scenario. In other words, having a positive correlation in condition A with a negative correlation in condition B would more often lead to large differential correlation Z scores simply b/c they are on both sides of 0. Whereas the "+/0" scenario would require a more drastic difference in Z score to achieve similar values to those observed in "+/-" or "-/+".

Please feel free to provide feedback or correct any misunderstandings that I may have. If you could suggest some ideas or changes to my analysis, that would be greatly appreciated!