nhejazi / methyvim Goto Github PK

:package: :microscope: R/methyvim: Targeted, Robust, and Model-free Differential Methylation Analysis

Home Page: https://code.nimahejazi.org/methyvim/

License: MIT License

R 43.29% TeX 56.16% Makefile 0.54%

statistics machine-learning bioinformatics biostatistics causal-inference methylation-microarrays variable-importance dna-methylation computational-biology targeted-learning

methyvim's Introduction

hello, i'm Nima

I'm an academic (bio)statistician working at the interface of causal inference, machine learning, and non- and semi-parametric statistics. I'm passionate about building open-source software tools to improve the accessibility of modern, model-agnostic and assumption-lean methods for statistical inference and causal machine learning, and I'm especially excited by the applications of statistics to the biomedical and public health sciences.

Are you looking for open source software for targeted causal machine learning? Maybe you should check out the tlverse project and browse our free open-source handbook!

methyvim's People

Contributors

Stargazers

Watchers

methyvim's Issues

Parameter estimation using tmle-npvi

The R package tmle.npvi provides a nonparametric approach to assessing variable importance in the Targeted Learning framework. It could provide several notable improvements to the algorithm currently available in this R package -- specifically,

Ability to operate on a continuous exposure (A), rather than relying on discretization. This would be particularly useful when estimating the effect of DNA methylation on an outcome -- that is, it would no longer be necessary to binarize m-values.
The current algorithm, implemented for a discrete exposure, could be relied on in cases where there is a good reason to discretize exposure (e.g., well characterized effective levels of a toxin), with the intent to assess the effects of such exposures on DNA methylation as an outcome (Y)

parallelization: foreach -> BiocParallel

In switching from a standard parallelized loop using foreach to the use of the same backend through BiocParallel, an odd (apparently low-level) error arises. In particular, it would be ideal to implement the parallelized loop using a combination of future, doFuture, and BiocParallel (see lines 205-218 of the methyvim function as of ef5332f).

When this loop is invoked, the resultant error and traceback() appear as follows:

R >     methy_vim_out <- BiocParallel::bplapply(X = methy_tmle_ind,
...                                             FUN = methyvim_tmle,
...                                             methytmle_screened = methy_tmle,
...                                             var_of_interest = var_of_interest,
...                                             type = type,
...                                             corr = corr_max,
...                                             obs_per_covar = obs_per_covar,
...                                             target_param = "ate",
...                                             family = tmle_args$family,
...                                             g_lib = tmle_args$g_lib,
...                                             Q_lib = tmle_args$Q_lib,
...                                             return_ic = return_ic
...                                            )
Error in updt[[as.integer(i)]] <- .error(msg) :
  attempt to select more than one element in integerOneIndex
In addition: Warning message:
In updt[[as.integer(i)]] <- .error(msg) : NAs introduced by coercion
R >
R > traceback()
8: value[[3L]](cond)
7: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
5: tryCatch({
       foreach::foreach(X = X, .errorhandling = handle) %dopar%
           FUN(X, ...)
   }, error = function(e) {
       txt <- "'DoparParam()' does not support partial results"
       updt <- rep(list(.error_not_available(txt)), length(X))
       msg <- conditionMessage(e)
       i <- sub("task ([[:digit:]]+).*", "\\1", msg)
       updt[[as.integer(i)]] <- .error(msg)
       updt
   })
4: bplapply(X, FUN, ..., BPREDO = BPREDO, BPPARAM = BPPARAM)
3: bplapply(X, FUN, ..., BPREDO = BPREDO, BPPARAM = BPPARAM)
2: BiocParallel::bplapply(X = methy_tmle_ind, FUN = methyvim_tmle,
       methytmle_screened = methy_tmle, var_of_interest = var_of_interest,
       type = type, corr = corr_max, obs_per_covar = obs_per_covar,
       target_param = "ate", family = tmle_args$family, g_lib = tmle_args$g_lib,
       Q_lib = tmle_args$Q_lib, return_ic = return_ic)
1: BiocParallel::bplapply(X = methy_tmle_ind, FUN = methyvim_tmle,
       methytmle_screened = methy_tmle, var_of_interest = var_of_interest,
       type = type, corr = corr_max, obs_per_covar = obs_per_covar,
       target_param = "ate", family = tmle_args$family, g_lib = tmle_args$g_lib,
       Q_lib = tmle_args$Q_lib, return_ic = return_ic)

BioC3.9 Windows builds

Windows builds are broken, as reported by Bioconductor. See https://bioconductor.org/checkResults/3.9/bioc-LATEST/methyvim/tokay2-checksrc.html

supporting marginal structural models

The package should implement a method to estimate the parameters of interest via a standard (parametric) MSM approach

Bioc Build Failure

Current (v3.10) Bioconductor builds are failing on all platforms. See https://bioconductor.org/checkResults/devel/bioc-LATEST/methyvim/. This needs to be resolved as soon as possible in order to avoid package deprecation in the next release (v3.11).

[Wishlist] Bioc3.8 Release

Working feature list for future releases:

Additional Options

~~add support for continuous treatments using tmle.npvi~~
add support for doubly robust TMLEs using drtmle
add support for influence function estimates and shrinkage with drtmle + limma
add support for filtering/pre-screening using nonparametric methods (e.g., adaptest)

Big Picture

support for finding/evaluating differentially methylated regions (DMR) in a flexible manner as per #35
flexible definition of CpG clusters using information about the design of Illumina assays (e.g., using the "probe lasso" approach)

installation error due to tmle.npvi

When installing methyvim via devtools::install_github("nhejazi/methyvim)", without having previously installed tmle.npvi, tmle, and limma, the following error occurs

Error : object ‘getSic’ is not exported by 'namespace:tmle.npvi'
ERROR: lazy loading failed for package ‘methyvim’
* removing ‘/home/USER/R/x86_64-pc-linux-gnu-library/3.4/methyvim’
Installation failed: Command failed (1)

If tmle.npvi, limma, and tmle are installed before installing methyvim, the error does not appear.

Originally reported by @rachaelvphillips

resources

ACIC 2017 poster

screening procedure: tmle.npvi

The non-parametric variable importance measure implemented in the tmle.npvi package could be used as a screening procedure for identifying sites to be tested more rigorously.

The general screening algorithm would works as follows:

For each CpG site measured via a methylation assay, construct an input matrix for tmle.npvi using the unscaled outcome (Y), the methylation measurements of the given site (A), and any background covariates at the subject-level. The matrix would thus be of dimension = n x (2 + #(baseline covars)), where n denotes the sample size (in subjects).
Using the input matrix described above, the tmle.npvi procedure could then be run with its default settings, using the "learning" flavor. This will essentially result in a GLM being fit to each CpG site.
Using the resultant p-values, a subset of the original CpG sites could be selected to being more thoroughly assessed via either tmle or tmle.npvi, based on the type of exposure in question.

succinctly convey analysis results

The Bioconductor developer guidelines recommend implementing a method for show (in this case show.methytmle) to better convey the information presented by the output object. Currently, methytmle objects are sub-classed from the standard SummarizedExperiment class, so the standard call to print/show only displays the relevant SummarizedExperiment information, while the analysis results are hidden (though a tutorial on how to extract them is treated in the vignette). This should be corrected ASAP.

TODO: Unit Tests and Examples

While the package currently builds, it does not pass R CMD BiocCheck. Examples need to be added to all functions, especially those that are user-facing.
Such examples can then rather easily be turned into unit tests for each function within the testthat framework.

[Wishlist] DMR Detection Methodology

We should support the finding of differentially methylated regions (DMRs) in two ways:

Design of a nonparametric method for assessing whether a region may be labeled as a DMR --- this should build upon work in the detection of sparse signals, using the test statistic for the TMLE of the target parameter of choice as a score, over which such techniques may operate (e.g., see the work of T. Cai, X. Lin). This idea stemmed from helpful conversations with Rajarshi Mukherjee.
Integration with already existing techniques/software for the detection of DMRs. Algorithms and software like bumphunter generate p-values from test statistics derived from parametric models; however, the DMR detection methods themselves are nonparametric usually (e.g., permutation tests). It should be possible --- even relatively trivial --- to use the p-values derived from the TMLE procedures provided as input to the DMR-finding algorithms of other Bioconductor packages. This would amount to a very useful integration with existing software. This idea stemmed from helpful conversations with Alan Hubbard.

Dimension reduction on W

The varImpact package appears to use Mark vdL's and Katie Pollard's Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) algorithm (and the associated hopach Bioconductor package) to perform dimension reduction on the set of variables that ought be included in the adjustment set (W). Using HOPACH in this way seems to make sense for this use-case, especially given that the CpG sites that would end up in W would be neighbors of the "treatment" site. (One might even be able to include baseline -- that is, phenotype-level -- covariates in the adjustment set, provided a sufficient degree of dimension reduction were to be performed.)

If nothing else, this is certainly worth exploring further, given that MvdlL's heuristic is including no more than 1 covariate in the adjustment set for every 20 subjects...

TODO: fix R CMD BiocCheck

As part of the Bioconductor submission process, the packages needs to pass R CMD BiocCheck, without any warnings or errors.

Currently (as of 0ef6641), this means that the following need to be resolved/completed:

WARNING: Add non-empty \value sections to the following man pages: man/cluster_sites.Rd, man/fdr_msa.Rd, man/methyvim_tmle.Rd, man/plot.methytmle.Rd, man/set_parallel.Rd
ERROR: At least 80% of man pages documenting exported objects must have runnable examples. The following pages do not: fdr_msa.Rd, methytmle-class.Rd, methyvim.Rd

Below is the full output of this build process on Bioconductor's "veracruz1" server, available here:

* using log directory /Users/pkgbuild/packagebuilder/workers/jobs/464/0ef6641/methyvim.Rcheck
* using R version 3.4.1 (2017-06-30)
* using platform: x86_64-apple-darwin15.6.0 (64-bit)
* using session charset: UTF-8
* using option --no-vignettes
* checking for file methyvim/DESCRIPTION ... OK
* this is package methyvim version 0.99.0
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package methyvim can be installed ... [12s/12s] OK
* checking installed package size ... OK
* checking package directory ... OK
* checking build directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking installed files from inst/doc ... OK
* checking files in vignettes ... OK
* checking examples ... NONE
* checking for unstated dependencies in tests ... OK
* checking tests ...
  Running testthat.R [6s/6s]
 [6s/6s] OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in inst/doc ... OK
* checking running R code from vignettes ... SKIPPED
* checking re-building of vignette outputs ... SKIPPED
* checking PDF version of manual ... OK
* DONE

Status: OK

This is BiocCheck version 1.13.1. BiocCheck is a work in progress.
Output and severity of issues may change. Installing package...
* Checking for version number mismatch...
* Checking if other packages can import this one...
* Checking to see if we understand object initialization...
* Checking vignette directory...
    This is a software package
    # of chunks: 19, # of eval=FALSE: 2 (10%)
* Checking whether vignette is built with 'R CMD build'...
* Checking version number...
* Checking new package version number...
* Checking version number validity...
    Package version 0.99.0; pre-release
* Checking R Version dependency...
* Checking biocViews...
* Checking that biocViews are present...
* Checking for non-trivial biocViews...
* Checking that biocViews come from the same category...
* Checking biocViews validity...
* Checking for recommended biocViews...
* Checking build system compatibility...
* Checking for blank lines in DESCRIPTION...
* Checking for whitespace in DESCRIPTION field names...
* Checking that Package field matches directory/tarball name...
* Checking for Version field...
* Checking for valid maintainer...
* Checking unit tests...
* Checking native routine registration...
* Checking for deprecated package usage...
* Checking parsed R code in R directory, examples, vignettes...
* Checking for direct slot access...
* Checking for T...
* Checking for F...
* Checking for browser()...
* Checking for <<-...
* Checking for library/require of methyvim...
* Checking DESCRIPTION/NAMESPACE consistency...
* Checking function lengths.........
    The longest function is 159 lines long
    The longest 5 functions are:
    methyvim() (R/methyvim.R, line 86): 159 lines
    methyvim_tmle() (R/tmle_classic.R, line 45): 151 lines
    plot.methytmle() (R/plots.R, line 47): 123 lines
    set_parallel() (R/utils.R, line 80): 29 lines
    force_positivity() (R/utils.R, line 135): 27 lines
* Checking man pages...
    * WARNING: Add non-empty \value sections to the following man
      pages: man/cluster_sites.Rd, man/fdr_msa.Rd,
      man/methyvim_tmle.Rd, man/plot.methytmle.Rd, man/set_parallel.Rd
* Checking exported objects have runnable examples...
    * ERROR: At least 80% of man pages documenting exported objects
      must have runnable examples. The following pages do not:
      fdr_msa.Rd, methytmle-class.Rd, methyvim.Rd
* Checking package NEWS...
* Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source,
  and vignette source...
    * NOTE: Consider shorter lines; 2 lines (0%) are > 80 characters
      long.
    First 2 lines:
      R/zzz.R:2   packageStartupMessage("methyvim: Nonparametric Differential...
      vignettes/exp_methyvim.Rmd:9   %\VignetteIndexEntry{Targeted Learning o...
    * NOTE: Consider multiples of 4 spaces for line indents, 301
      lines(20%) are not.
    First 6 lines:
      R/methytmle.R:12        Class = "methytmle",
      R/methytmle.R:13        slots = list(call = "call",
      R/methytmle.R:20        contains = "GenomicRatioSet"
      R/methyvim.R:87                      sites_comp = 10,
      R/methyvim.R:88                      var_int = 1,
      R/methyvim.R:89                      vim = c("ate", "rr", "npvi"),
    See http://bioconductor.org/developers/how-to/coding-style/
* Checking for canned comments in man pages...
* Checking if package already exists in CRAN...
* Checking if new package already exists in Bioconductor...
* Checking for bioc-devel mailing list subscription...
    Maintainer is subscribed to bioc-devel.
* Checking for support site registration...
    Maintainer is registered at support site.


Summary:
ERROR count: 1
WARNING count: 1
NOTE count: 2
For detailed information about these checks, see the BiocCheck
vignette, available at
https://bioconductor.org/packages/3.6/bioc/vignettes/BiocCheck/inst/doc/BiocCheck.html#interpreting-bioccheck-output
BiocCheck FAILED.

Timeline and Goals

New/revised project concept: essentially, this R package, and the associated software and applied papers (I guess Bioinformatics might be a good target journal) will represent a collection of tools to facilitate performing variable importance analysis with causal parameters, specifically targeted towards CpG sites and DMRs. It will act as a collection of tools from various papers by MvdL et al. (similar in spirit to stremr, which ought to be used as a model), where wrappers to tmle.npvi and tmle (and, eventually, gentmle/gentmle2) are made available for convenient use in the Bioconductor landscape.

GOALS:

complete internals of R package (e.g., S4 class) and prepare data for testing by mid July (note: it's probably better if the S4 class sub-classes the w/e the core class introduced by minfi is...)
complete wrappers for estimation procedures (tmle, tmle.npvi) by mid-July
resolve known issue with inflated variance from use of the tmle.npvi procedure by mid-July (contact Antoine as necessary -- he said something about this being due to positivity violations, as the variance is computed using X|W, rather than just X, but I don't remember exact details...)
look into the use of HOPACH for performing dimension reduction on the set of neighbors associated with a "treatment" CpG site, as well as for including phenotype-level covariates in the adjustment set (W); Sandrine's notes on this from PH240F (Spring 2017) were pretty good...

DELIVERABLES:

Bioconductor 3.6 will be released this Fall (exact timeline not yet available); should try to target a minimally working package by 15 August with revisions throughout September
The software paper for this package will be targeted toward the Bioconductor channel of the open (access/review) journal F1000Research

A (working) list of tools/methods to be streamlined here include

~~MvdL and K. Pollard's HOPACH, called specifically for dimension reduction over CpG sites and baseline (phenotype-level) covariates, implemented in the eponymous hopach package~~
MvdL and C. Tuglus's FDR-MSA for multiple testing corrections (already implemented as part of this package)
A. Chambaz's TMLE-NPVI (the NPVI parameter appears to be some kind of linear projection of the dose-response curve), implemented in the tmle.npvi package
A. Hubbard's TMLE for the ATE in genomic settings, applied to DNA methylation data (and bespoke R code, already made available)
Screening procedures, including those based on limma, DA.Test, and tmle.npvi

Bioconductor Submission

Submission of this package and the associated data package methyvimData is in progress. View the contributions issue here.

add TMLE for risk ratio

TMLE for the ratio (similar to what's used for ATE, but requires a delta method) -- this should already be implemented in the tmle::tmle function that we wrap around

supporting doubly robust inference

we should also support David's very general drtmle package, which is on CRAN and on GitHub. This can come in a subsequent release on Bioconductor.

This has mostly been implemented on the branch drtmle. There remain a few (possibly) outstanding issues that need to be dealt with

drtmle (sensibly) does not support the setting where W = 1 uniformly. There are suggestions on how to handle this setting here
A recent update of drtmle on CRAN (v1.0.2) returns the observed data after application of the efficient influence function transform. This makes it possible to port the variance shrinkage approach (based on limma and implemented in biotmle) to this setting.