myles-lewis / glmmseq Goto Github PK

View Code? Open in Web Editor NEW

18.0 5.0 10.0 14.17 MB

Gene-level general linear mixed model

Home Page: https://myles-lewis.github.io/glmmSeq/

License: Other

R 98.08% HTML 1.92%

bioinformatics differential-gene-expression gene-expression transcriptomics cran glmm mixed-models

glmmseq's Introduction

glmmSeq

This R package is designed to model gene expression with a general linear mixed model (GLMM). This allows us to include random effects as well as fixed effects. For the purpose of the package we use the glmer function from the lme4 package which fits a GLMM.

This package focuses in particular on changes in genes expression between different response or treatment groups over time.

Loading the package

From CRAN

install.packages("glmmSeq")

From Github

devtools::install_github("myles-lewis/glmmSeq")

Locally

You can also download the source directory and load the functions individually:

functions = list.files("./R", full.names = TRUE)
invisible(lapply(functions, source))

But you will need to load in the additional libraries then:

# Install CRAN packages
invisible(lapply(c("MASS", "car", "ggplot2", "ggpubr", "lme4", 
                   "lmerTest", "methods", "parallel", "plotly", 
                   "pbapply", "pbmcapply"),
                 function(p){
                   if(! p %in% rownames(installed.packages())) {
                     install.packages(p)
                   }
                   library(p, character.only=TRUE)
                 }))

# Install BioConductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
invisible(lapply(c("qvalue"), function(p){
  if(! p %in% rownames(installed.packages())) BiocManager::install(p)
  library(p, character.only=TRUE)
}))

Example script

For examples see the vignette.

Reference

glmmSeq was developed by the bioinformatics team at the Experimental Medicine & Rheumatology department and Centre for Translational Bioinformatics at Queen Mary University London.

If you use this package please cite as:

citation("glmmSeq")

## To cite package ‘glmmSeq’ in publications use:
##
##  Myles Lewis, Katriona Goldmann, Elisabetta Sciacca, Cankut Cubuk and Anna Surace (2021). 
##  glmmSeq: General Linear Mixed Models for Gene-level Differential Expression. 
##  R package version 0.5.4. https://github.com/myles-lewis/glmmSeq
##
## A BibTeX entry for LaTeX users is
##
##  @Manual{,
##    title = {glmmSeq: General Linear Mixed Models for Gene-level Differential Expression},
##    author = {Myles Lewis and Katriona Goldmann and Elisabetta Sciacca and Cankut Cubuk and Anna Surace},
##    year = {2022},
##    note = {R package version 0.5.4},
##    url = {https://github.com/myles-lewis/glmmSeq},
##  }

glmmseq's People

Contributors

Stargazers

Watchers

Forkers

zktuong kew24 genomicsnx rnaimehaom shicheng-guo jespergrud katrionagoldmann denvern3 stemangiola

glmmseq's Issues

vector memory exhausted issue

Hi,

Thanks for developing this package!

I'm following the vignette and trying to run glmmSeq on a relatively small dataset (26 samples * 20k genes). The 26 samples are 13 pairs as a random effects variable of (1|individual). If I'm using the model ~disease+(1|condition)+covar1+covar2+covar3+covar4, R will give me Error: cannot allocate vector of size 6223.5 Gb. It runs ok if I remove 1 fixed effect variable. It wouldn't run on an HPC either, and I suppose no cores can handle a vector of this size.

Order of x1values argument in fcPlot() and maPlot() only switches axis label, does not alter y-axis values

Hello, I am using the function of fcPlot and maPlot to compare 3 treatments (Veh, cort10, and cort1000) in control and patient.
I notice that the y-axis values in fcPlot or maPlot remain the same if I compare cort1000 vs. veh or veh vs. cort1000. I would expect that the y-axis values would change in respective order. Do you have any suggestion here?
Thank you in advance for your advice.

Create glmmGene function to run model on one gene alone and return fit

sizeFactors instructions in vignette

hi,

in your vignette under size factors,

when using:

sizeFactors <- calcNormFactors(counts, method="TMM")

the sizeFactor should have an additional step like:

sizeFactors <- dgelist$samples$lib.size * dgelist$samples$norm.factors

because in the calcNormFactors details, they explicitly state:

This function computes scaling factors to convert observed library sizes into effective library sizes. The effective library sizes for use in downstream analysis are lib.size * norm.factors where lib.size contains the original library sizes and norm.factors is the vector of scaling factors computed by this function.

https://rdrr.io/bioc/edgeR/man/calcNormFactors.html

related to #20

plotCutoff parameter in fcPlot

hi Kat,

I don't know if it's me, but the role of the plotCutoff option in the fcPlot function doesn't look to be very clear .
I think it would be good if you can explain better what it is meant for.
I was trying to remove all the dots that are below my significance threshold (q < 0.05), I thought this option was able to help me but maybe it's not the case.

Dispersion memory issue

Do you have any recommendations on how to calculate the dispersion in a memory-efficient manner?

I've tried both:

disp <- apply(tpm, 1, function(x){
(var(x, na.rm=TRUE)-mean(x, na.rm=TRUE))/(mean(x, na.rm=TRUE)**2)
})
disp <- setNames(edgeR::estimateDisp(tpm)$tagwise.dispersion, rownames(tpm))

Normally I use edgeR::filterByExpr; however, then it won't work for the downstream glmmSeq it seems since some dispersion values are missing.

colors in base plot

Something is still going wrong with colors :(
if I do:

  pairedPlot(glmmResult=fitted.model,
                geneName = gene,
                x1Label = "Time",
                x2Label="Treatment",
                xTitle="Time",
                IDColumn = "Patient.ID",
                graphics = "base",
                colours = c("DMF" = "khaki", "Vehicle" = "slategray2"),
                lineColour = c("DMF" = "khaki", "Vehicle" = "slategray2"),
                modelSize = 3,
                modelColour = c("DMF" = "gold3", "Vehicle" = "darkblue"),
                modelLineColour = c("DMF" = "gold3", "Vehicle" = "darkblue"),
                fontSize=1,
                x2Offset = 6,
                logTransform=TRUE,
                addViolin = FALSE, addBox = FALSE,
                pairedOnly = FALSE)

I get:

while what I wanted was:

(this sencond plot is done with the code of my latest push)
Do let me know if you need my data to test it :)

Create modelPlot function

mention size factors in the vignette

Although it's optional, the vignette could mention that we can specify size factors as offsets to be included in the linear predictor during fitting.
Three options to get size factors are:

Manually:

sizeFactors <- colSums(tpmdata)  #total no of reads 
sizeFactors <- sizeFactors2 / mean(sizeFactors2)  #scale so that mean=1 (normalise)

edgeR:
sizeFactors <- calcNormFactors(counts, method="TMM")
DESeq2:
sizeFactors <- estimateSizeFactorsForMatrix(tpmdata)

Allow users to input their own family in lme4::glmer()

Feature requested by email

Add progress bar

Add progress bar to glmmSeq. It is currently difficult to tell how long each run will take. This will require the pbmcapply package.

maPlot function has the same name as the one in the 'edge' package

The maPlot function has the same name as the one in the edge package. So if the user has loaded edge first, our function will be masked and this of course will generate errors to an inattentive user.

Output summary

Could the model output table be explained?
Also is there a journal article for the methods used in creating this package?

Carrie

add a glmmSeq function which doesnt use glmer but glm (without the random effects)

separate glmSeq function from glmmSeq
check which Wald test can be used for glm (check first version of DESeq)

GSEA Support

Thanks for the outstanding package!

I am trying to formulate an analogous GSEA metric for the output of glmmSeq. Using other methods (edgeR, limma, DESeq2, wilcoxon tests) the test statistic is signed. The statistic is used for ranking the genes for differential expression, and the sign conveniently indicates the direction of enrichment. In the case with using a chi-squared statistic, I am wondering if there can be an analog. So far I have been doing the following for the contrast implied by b and the statistic stat:

sign(res@stats$coef[,b])*res@stats$Chisq[,stat]

Any thoughts on this would be much appreciated!

add zeroCount to offset the zero in the count data with log scaled

Can glmmseq be used in non-longitudinal data?

Hi,

I'm trying to use glmmseq to analyze some dataset in which each patient contributed multiple samples. I want to consider patient ID as random factor and the diagnosis as fixed factor. So my formula would be like:

glmmSeq(~ diagnosis + (1 | patient_ID),
countdata = OTU.table,
metadata = Sample.table,
dispersion = disp,
progress = TRUE)

As you can see, I don't have a "Timepoint" variable in this model.

Based on glmmseq manual, glmmseq is "Designed for longitudinal analysis..." If I don't have Timepoint can I still use this model?

Thanks very much!
Leran

Using TPM normalised vs raw count expression data

Hello,

Great package, many thanks for making this! I had a question regarding the example in the vignette provided. The example uses TPM normalised expression data, and with this data it calculates size factors, dispersions and fits it to a NB.

I was wondering if this is the recommended way of using this package? A negative binominal distribution can be unsuitable for TPM normalised data according to some (see for example https://biostars.org/p/471335/ ), and I'm confused why it would still be useful to calculate size factors on normalised data?

Thanks in advance.
Best wishes, Rik

fontSize option does not work

We have to fix the fontSize option.
If I try with:

fontSize =12,

I get the following error:

Error in pairedPlot(glmmResult = fitted.model, geneName = gene, x1Label = "VisitCode",  : 
  formal argument "fontSize" matched by multiple actual arguments

pairedPlot error using GlmmSeq object with covariates

Thanks for your work on this package!

When using an object of class GlmmSeq in pairedPlot(), if additional covariates are included the model, the pairedPlot() function does not work due to the code shown below. The vignette shows an example with an interaction and a random effect ( ~ Timepoint * EULAR_6m + (1 | PATID)). However, if other covariates (e.g., gender) are in the model, the following error occurs.

Is there a reason why you've chosen to throw an error here (e.g. plotting difficulty, statistical messiness), and is there a workaround if other covariates are in the model?

relevant code: lines 133-136 of pairedPlots.R

if(ncol(glmmResult@modelData) != 2){
  stop(
    paste("These plots only work for interactions between two variable.",
          "Therefore nrow(glmmResult@modelData) should be 2."))

Thanks!

About turing off annotationPosition

Hi Kat,

I have a minor issue.
I don't know if you had a particular idea in mind when you set up annotationPosition to TRUE in the fcPlot (line 212, fcPlot.R).
If there's no specif reason, I would turn it off by default as I find it very annoying when I try to move gene labels and I move the arrow instead by mistake. This makes me loose the original position of the label and it's difficult to put it back to the original dot.
From the related plotly code:

annotationPosition determines if the main anchor of the annotation is editable. The main anchor corresponds to the text (if no arrow) or the arrow (which drags the whole thing leaving the arrow length & direction unchanged).

If you turn it off the label movement will still be assured by annotationText = TRUE and annotationTail = TRUE.

do I need to add age into the model?

Hi,
Thanks for developing this useful package. I am wondering about the age correction. Do I need to add it to the model? Or other confounders?

Best,

Add '...' to fcPlot