Hi,
I ran into the same plink2R
installation issue as @jluyapan did at gabraham/plink2R#6 (comment) using R 3.6.1. While this issue could be solved by updating the package a little bit, it seems from gabraham/plink2R#7 (comment) that the plink2R
project has been abandoned (@anadon also wanted to use it for FUSION TWAS). The author of plink2R
(@gabraham) recommends switching to BEDMatrix
at https://cran.r-project.org/web/packages/BEDMatrix/index.html by @agrueneberg.
From https://github.com/gusevlab/fusion_twas/search?q=plink2R&unscoped_q=plink2R I know that you use plink2R
in three scripts. From https://github.com/gusevlab/fusion_twas/blob/0ab190e740f4631bb137bd48e5a54ccb2ddda7c9/FUSION.compute_weights.R I can see that you read BED files using impute = 'avg'
and use all 3 components returned by plink2R::read_plink()
.
Can BEDMatrix
be used?
I asked myself if BEDMatrix
could be used to replace plink2R
for FUSION TWAS and it looks like it can, mostly as shown below using R 3.5.3 (where plink2R
can be installed without issues).
library('BEDMatrix')
path <- system.file("extdata", "example.bed",
package = "BEDMatrix")
m <- BEDMatrix(path)
library('plink2R')
genos <- read_plink(gsub('\\.bed', '', path), impute = 'avg')
lapply(genos, head)
identical(
ncol(m),
nrow(genos$bim)
)
identical(
gsub('_.*', '', colnames(m)),
genos$bim[, 2]
)
sum(is.na(m2))
sum(is.na(genos$bed))
table(as.vector(m2 - genos$bed), useNA = 'ifany')
## Impute using the average
## equivalent to
## https://github.com/gabraham/plink2R/blob/master/plink2R/src/data.cpp#L153-L192
means <- colMeans(m2, na.rm = TRUE)
m3 <- m2
m3[is.na(m2)] <- rep(means, colSums(is.na(m2)))
## Force the names to be the same
## Could also use the bim and fam objects as in
## https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R#L20-L21
## to avoid any regex issues
colnames(m3) <- gsub('_.*', '', colnames(m3))
rownames(m3) <- gsub('_', ':', rownames(m3))
identical(
m3,
genos$bed
)
## BEDMatrix doesn't read the full fam file as you can see at
## https://github.com/QuantGen/BEDMatrix/blob/master/R/BEDMatrix.R#L30-L61
## plink2R does (although with the slower utils::read.table function)
## https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R#L17
famfile <- gsub('\\.bed', '\\.fam', path)
fam <- read.table(famfile, header=FALSE, sep="", stringsAsFactors=FALSE)
## For https://github.com/gusevlab/fusion_twas/blob/0ab190e740f4631bb137bd48e5a54ccb2ddda7c9/FUSION.compute_weights.R#L256
identical(
fam[, c(1,2,6)],
genos$fam[, c(1,2,6)]
)
## BEDMatrix doesn't read the full bim file as you can see at
## https://github.com/QuantGen/BEDMatrix/blob/master/R/BEDMatrix.R#L75-L108
## plink2R does (although with the slower utils::read.table function)
## https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R#L16
bimfile <- gsub('\\.bed', '\\.bim', path)
bim <- read.table(bimfile, header=FALSE, sep="", stringsAsFactors=FALSE)
## For https://github.com/gusevlab/fusion_twas/blob/0ab190e740f4631bb137bd48e5a54ccb2ddda7c9/FUSION.compute_weights.R#L374
identical(
bim,
genos$bim
)
Initial BEDMatrix
dependent function
I wrote a read_plink_custom
function that used BEDMatrix
to read in the genotype matrix and that can handle the impute ='avg
scenario from plink2R
. I didn't implement the impute = 'random'
scenario since I don't think that you use it (from https://github.com/gabraham/plink2R/blob/master/plink2R/src/data.cpp#L194-L255).
read_plink_custom <- function(root, impute = c('none', 'avg', 'random')) {
if(impute == 'random') {
stop("The 'impute' random option has not been implemented.", call. = FALSE)
}
## structure from https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R
proot <- path.expand(root)
bedfile <- paste(proot, ".bed", sep="")
famfile <- paste(proot, ".fam", sep="")
bimfile <- paste(proot, ".bim", sep="")
## Could change this code to use data.table
bim <- read.table(bimfile, header=FALSE, sep="", stringsAsFactors=FALSE)
fam <- read.table(famfile, header=FALSE, sep="", stringsAsFactors=FALSE)
## Set the dimensions
geno <- BEDMatrix::BEDMatrix(bedfile, n = nrow(fam), p = nrow(bim))
## Convert to a matrix
geno <- as.matrix(geno)
if(impute == 'avg') {
## Check if any are missing
geno_na <- is.na(geno)
if(any(geno_na)) {
means <- colMeans(geno, na.rm = TRUE)
geno[geno_na] <- rep(means, colSums(geno_na))
}
}
colnames(geno) <- bim[,2]
rownames(geno) <- paste(fam[,1], fam[, 2], sep=":")
list(bed=geno, fam=fam, bim=bim)
}
genos_custom <- read_plink_custom(gsub('\\.bed', '', path), impute = 'avg')
identical(
genos_custom,
genos
)
data.table flavor
I then wrote a version of the above function that uses data.table
which in theory should be faster.
read_plink_custom_fast <- function(root, impute = c('none', 'avg', 'random')) {
if(impute == 'random') {
stop("The 'impute' random option has not been implemented.", call. = FALSE)
}
if(!requireNamespace("data.table", quietly = TRUE)) {
stop("Please install the 'data.table' package using install.packages('data.table').", call. = FALSE)
}
## structure from https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R
proot <- path.expand(root)
bedfile <- paste(proot, ".bed", sep="")
famfile <- paste(proot, ".fam", sep="")
bimfile <- paste(proot, ".bim", sep="")
## Could change this code to use data.table
bim <- data.table::fread(
bimfile,
colClasses = list('character' = c(2, 5, 6), 'integer' = c(1, 3, 4)),
showProgress = FALSE,
data.table = FALSE
)
fam <- data.table::fread(famfile,
colClasses = list('character' = 1:2, 'integer' = 3:6),
showProgress = FALSE,
data.table = FALSE
)
## Set the dimensions
geno <- BEDMatrix::BEDMatrix(bedfile, n = nrow(fam), p = nrow(bim))
## Convert to a matrix
geno <- as.matrix(geno)
if(impute == 'avg') {
## Check if any are missing
geno_na <- is.na(geno)
if(any(geno_na)) {
means <- colMeans(geno, na.rm = TRUE)
geno[geno_na] <- rep(means, colSums(geno_na))
}
}
## Extract data using the data.table syntax
## in case fread(data.table = TRUE) was used (which is the default)
# colnames(geno) <- bim[[2]]
# rownames(geno) <- paste(fam[[1]], fam[[2]], sep=":")
colnames(geno) <- bim[,2]
rownames(geno) <- paste(fam[,1], fam[, 2], sep=":")
## If you used fread(data.table = TRUE) then you need to cast
## the objects using as.data.frame()
list(bed=geno, fam=fam, bim=bim)
}
genos_custom_fast <- read_plink_custom_fast(gsub('\\.bed', '', path), impute = 'avg')
identical(
genos_custom_fast,
genos
)
However, the microbenchmark results show that that's not the case. At least with the BEDMatrix
example BED
file, though it could play a role in much larger BED files.
library('microbenchmark')
micro <- microbenchmark(
read_plink_custom(gsub('\\.bed', '', path), impute = 'avg'),
read_plink_custom_fast(gsub('\\.bed', '', path), impute = 'avg')
)
options(width = 150)
summary(micro)
# expr min lq mean median uq max neval cld
# 1 read_plink_custom(gsub("\\\\.bed", "", path), impute = "avg") 6.009474 6.538338 7.91712 7.738728 8.242563 15.08098 100 a
# 2 read_plink_custom_fast(gsub("\\\\.bed", "", path), impute = "avg") 9.096066 10.006155 10.94152 10.580656 11.017670 17.00687 100 b
In any case both functions work. So yes, BEDMatrix
can be used to read in BED
files with either the 'impute' = 'none'
or 'impute' = 'avg'
methods from plink2R
.
## Check the 'none' scenario
genos_none <- read_plink(gsub('\\.bed', '', path), impute = 'none')
genos_custom_none <- read_plink_custom(gsub('\\.bed', '', path), impute = 'none')
genos_custom_fast_none <- read_plink_custom_fast(gsub('\\.bed', '', path), impute = 'none')
testthat::expect_equivalent(
genos_custom_none,
genos_none
)
testthat::expect_equivalent(
genos_custom_fast_none,
genos_none
)
Moving forward
Moving forward, I guess that it would be ideal if BEDMatrix
incorporated a version of the read_plink_custom()
/read_plink_custom_fast()
function such that you would only need to change library('plink2R')
to library('BEDMatrix')
in your scripts and to enable others to keep using what plink2R
provided in a package that is maintained and available from CRAN. The only part missing would be the 'impute' = 'random'
part of https://github.com/gabraham/plink2R/blob/master/plink2R/src/data.cpp#L194-L255 though I don't know what drand48()
returns (I imagine that it could be replaced with some R code). This is likely better than creating another GitHub R package with this function, as it could likely end up being in "abandonware" mode as plink2R
did.
Let me know what you think.
Best,
Leo
Reproducibility
library('sessioninfo')
session_info()
# ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# setting value
# version R version 3.5.3 Patched (2019-03-11 r76311)
# os Red Hat Enterprise Linux Server release 6.9 (Santiago)
# system x86_64, linux-gnu
# ui X11
# language (EN)
# collate en_US.UTF-8
# ctype en_US.UTF-8
# tz US/Eastern
# date 2019-08-22
#
# ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# package * version date lib source
# assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.5.1)
# BEDMatrix * 1.6.1 2019-06-21 [1] CRAN (R 3.5.3)
# cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.3)
# codetools 0.2-16 2018-12-24 [3] CRAN (R 3.5.3)
# colorout * 1.2-0 2018-05-02 [1] Github (jalvesaq/colorout@c42088d)
# colorspace 1.4-1 2019-03-18 [2] CRAN (R 3.5.1)
# crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
# crochet 2.2.0 2019-06-13 [1] CRAN (R 3.5.3)
# data.table 1.12.2 2019-04-07 [1] CRAN (R 3.5.3)
# digest 0.6.20 2019-07-04 [1] CRAN (R 3.5.3)
# dplyr 0.8.3 2019-07-04 [1] CRAN (R 3.5.3)
# ggplot2 3.2.1 2019-08-10 [1] CRAN (R 3.5.3)
# glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.1)
# gtable 0.3.0 2019-03-25 [2] CRAN (R 3.5.1)
# htmltools 0.3.6 2017-04-28 [2] CRAN (R 3.5.0)
# htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.1)
# httpuv 1.5.1 2019-04-05 [2] CRAN (R 3.5.3)
# jsonlite 1.6 2018-12-07 [2] CRAN (R 3.5.1)
# later 0.8.0 2019-02-11 [2] CRAN (R 3.5.1)
# lattice 0.20-38 2018-11-04 [3] CRAN (R 3.5.3)
# lazyeval 0.2.2 2019-03-15 [2] CRAN (R 3.5.1)
# magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.0)
# MASS 7.3-51.1 2018-11-01 [3] CRAN (R 3.5.3)
# Matrix 1.2-15 2018-11-01 [3] CRAN (R 3.5.3)
# microbenchmark * 1.4-6 2018-10-18 [1] CRAN (R 3.5.3)
# multcomp 1.4-10 2019-03-05 [2] CRAN (R 3.5.1)
# munsell 0.5.0 2018-06-12 [2] CRAN (R 3.5.1)
# mvtnorm 1.0-11 2019-06-19 [2] CRAN (R 3.5.3)
# pillar 1.4.2 2019-06-29 [1] CRAN (R 3.5.3)
# pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
# plink2R * 1.1 2018-12-06 [1] Github (gabraham/plink2R@d74be01)
# png 0.1-7 2013-12-03 [2] CRAN (R 3.5.0)
# promises 1.0.1 2018-04-13 [2] CRAN (R 3.5.0)
# purrr 0.3.2 2019-03-15 [2] CRAN (R 3.5.1)
# R6 2.4.0 2019-02-14 [2] CRAN (R 3.5.1)
# Rcpp * 1.0.2 2019-07-25 [1] CRAN (R 3.5.3)
# RcppEigen * 0.3.3.5.0 2018-11-24 [1] CRAN (R 3.5.1)
# rlang 0.4.0 2019-06-25 [1] CRAN (R 3.5.3)
# rmote * 0.3.4 2018-05-02 [1] deltarho (R 3.5.0)
# sandwich 2.5-1 2019-04-06 [2] CRAN (R 3.5.3)
# scales 1.0.0 2018-08-09 [2] CRAN (R 3.5.1)
# servr 0.15 2019-08-07 [1] CRAN (R 3.5.3)
# sessioninfo * 1.1.1 2018-11-05 [1] CRAN (R 3.5.1)
# survival 2.43-3 2018-11-26 [3] CRAN (R 3.5.3)
# testthat 2.2.1 2019-07-25 [1] CRAN (R 3.5.3)
# TH.data 1.0-10 2019-01-21 [2] CRAN (R 3.5.1)
# tibble 2.1.3 2019-06-06 [1] CRAN (R 3.5.3)
# tidyselect 0.2.5 2018-10-11 [2] CRAN (R 3.5.1)
# withr 2.1.2 2018-03-15 [2] CRAN (R 3.5.0)
# xfun 0.9 2019-08-21 [1] CRAN (R 3.5.3)
# zoo 1.8-6 2019-05-28 [2] CRAN (R 3.5.3)
#
# [1] /users/lcollado/R/x86_64-pc-linux-gnu-library/3.5.x
# [2] /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.5.x/R/3.5.x/lib64/R/site-library
# [3] /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.5.x/R/3.5.x/lib64/R/library
system('plink --version')
# PLINK v1.90b6.6 64-bit (10 Oct 2018)