gusevlab / fusion_twas Goto Github PK

methods for functional summary-based imputation

Home Page: http://gusevlab.org/projects/fusion/

License: GNU General Public License v3.0

R 100.00%

fusion_twas's Introduction

Functional Summary-based Imputation

FUSION is a suite of tools for performing transcriptome-wide and regulome-wide association studies (TWAS and RWAS). FUSION builds predictive models of the genetic component of a functional/molecular phenotype and predicts and tests that component for association with disease using GWAS summary statistics.

Please see http://gusevlab.org/projects/fusion/ for documentation.

fusion_twas's People

Contributors

Stargazers

Watchers

Forkers

phuwanat xtmgah tfwillems drveera xuqiang9042 chongwu-biostat quattro dongjt0727 veramonlux zrcjessica lieberinstitute geneticresources xhyuo wangdi2014 jangehlen xinhe-lab chuanj kris69w maegsul hsun3163 kathalexknuts powellgenomicslab sankarastone wbs-whuer sudhanshu1610 stoikumo yangchuhua hanpengluo xiangyuye saramonteiromartins biolyn biocko yrm2021 hengbingao jujiaokang skresearcher forget999 echolee0925 ahang6000 ar-benjamin huizhong1993 chris1221

fusion_twas's Issues

Error in post-process FUSION

Hi,

For some of the work that I am doing I have been running conditional analysis for 1 of my significant TWAS locus. For some reason I get this warning for 1 of the 2 loci :

Warning : snp.cond.se = diag( sqrt( 1 - snp.ge.ld %% ( cur.dinv ) %% t(snp.ge.ld) ) ) : NaNs produced.
I get a really high Chi-squared conditioned value like in the thousand range. I am not sure to completely understand what is wrong. Is it because the GWAS signal is so important that is driving most of the association ?

Many thanks for your help and your patience !

Best,

Salim.

QTL scores values to FUSION model format

Hello, I am starting in the xWAS (PWAS, TWAS, MWAS) field, at first I have only been using the models published by you on my patients summary data with predixcan (https://github.com/hakyimlab/PrediXcan).
Despite this, I am trying to use more available information, for this purpose, I want to use QTL score information (for example: https://www.omicspred.org/) to create models in the predixcan format so I can apply them to my data.
The main problem is that I do not find information about how to use the QTL score information published by multiple authors. I would be very grateful if someone could give me some advice on this area.

Best model specified as enet, yet all values in wgt.matrix are reported as 0

Hello,

While parsing some of the weights obtained from http://gusevlab.org/projects/fusion/ (GTEx v7), I have noticed that the model with the highest R^2 is enet, yet the reported values in the wgt.matrix for enet are all 0.

Example files are: Brain_Caudate_basal_ganglia.ENSG00000162368.9.wgt.RDat, Brain_Caudate_basal_ganglia.ENSG00000230092.3.wgt.RDat

I am attempting to generate a list of SNPs from the best model of each weight, based on the model's specifications, but am unable to do so when all SNPs have a reported coefficient of 0.

problematic_weights.zip

Thank you

GRCh38

Are there any plans to migrate fusion_twas to use GRCh38 and version 8 of GTEx?

GEUV.compute_weights.sh -- $PRE_GEXP.ID

There is no file or directory for : $PRE_GEXP.ID

On line 50 of GEUV.compute_weights.sh

how to impute Z-scores for the missing SNPs?

Hello,

I was running fusion via:

Rscript FUSION.assoc_test.R
--sumstats chr17.txt
--weights ./weight/retina.2pos
--weights_dir ./weight/
--ref_ld_chr ./LDREF/1000G.EUR.
--chr 17
--out CHR.17.dat

and I got:

WARNING : retina/retina.ENSG00000279879.wgt.RDat RP11-1072C15.6 17 44689858 44693340 had 90 overlapping SNPs, but none with non-zero expression weights, try more SNPS or a different model
WARNING : retina/retina.ENSG00000279880.wgt.RDat RP11-855A2.2 17 67973934 67976072 had 173 overlapping SNPs, but none with non-zero expression weights, try more SNPS or a different model
WARNING : retina/retina.ENSG00000280046.wgt.RDat RP11-1099M24.6 17 7858943 7866083 had 159 overlapping SNPs, but none with non-zero expression weights, try more SNPS or a different model
WARNING : retina/retina.ENSG00000280136.wgt.RDat RP11-1228E12.2 17 88641 118578 had 93 overlapping SNPs, but none with non-zero expression weights, try more SNPS or a different model
WARNING : retina/retina.ENSG00000280279.wgt.RDat RP11-1228E12.1 17 64099 76866 had 88 overlapping SNPs, but none with non-zero expression weights, try more SNPS or a different model
Analysis completed.
NOTE: 608 / 1069 genes were skipped
If a large number of genes were skipped, verify that your GWAS Z-scores, expression weights, and LDREF data use the same SNPs (or nearly)
Or consider pre-imputing your summary statistics to the LDREF markers using summary-imputation software such as [http://bogdan.bioinformatics.ucla.edu/software/impg/]

This website http://bogdan.bioinformatics.ucla.edu/software/impg/ doesn't exist. Can you please point me to where the code is located?

Thanks
Ana

Problem of Fusion "coloc"

Rscript FUSION.assoc_test.R \

--sumstats PGC.UKB.MDD.for.fusion.sumstats \

--weights ./WEIGHTS/CMC.BRAIN.RNASEQ.pos \

--weights_dir ./WEIGHTS/ \

--ref_ld_chr ./LDREF/1000G.EUR. \

--chr 7 \

--coloc_P 0.05 \

--GWASN 1234567 \

--PANELN CMC.BRAIN.RNASEQ.profile $2 \

--out PGC.BPM2018.CMC.DLPFC.chr7.dat

ERROR : 'N' field needed in weights file or 'PANEL' column and --PANELN flag required for COLOC analysis

I'm not sure, what does PANELN need me to type in here? I have looked through the manual carefully and still can't solve it. Can anyone help me？Thanks!

Best,
CC

Add covariates when computing weigths

Hello,

Concerning the covariates, what is the best way to deal with missing value when adding covariates with the --covar flag?
Do we need to use the .cov format from plink with 2+C columns?

Thank you in advance.
Best

Pierre

What is the best way to normalise RNA-seq data when running FUSION.compute_weights.R?

Dear developer,

Can you please let me know the preferred normalisation method for RNA-seq data when running FUSION.compute_weights.R? logCPM or other method? I want to run TWAS for both gene level and transcript level but got confused about the data normalisation methods and do not know whether there is a recommended approach. It is quite important and very looking forward for your reply. Will appreciate that very much

many thanks,

Fei

Is beta T or F in the model top1 for calculating gene weights?

The beta is T in the CROSSVALIDATION ANALYSES. But the beta is F in the calcuation of final model weights. Why?

FUSION.assoc_test.R: wgtlist does not have an N column

Hi Sasha,

I've been trying to call coloc from FUSION.assoc_test.R and have noticed that the script tries to get wgtlist$N[w] for the expression data sample size for gene w: (line 330)

# perform COLOC test
	if ( !is.na(opt$coloc_P) && !is.na(out.tbl$TWAS.Z[w]) && out.tbl$TWAS.P[w] < opt$coloc_P ) {
		b1 = wgt.matrix[,eqtlmod] / sqrt(wgtlist$N[w])
		b2 = cur.Z / sqrt(opt$GWASN)

		vb1 = rep(1/wgtlist$N[w],length(b1))
		vb2 = `rep(1/opt$GWASN,length(b2)

However, wgtlist (from [something].pos) does not have an N column. Am I missing something? Is there any reason not to manually add an N column to the .pos file?

Cheers,
William Gordon (Lindstroem Group)

No overlapping GWAS Z-score

Hello!
When I do the TWAS with the weights, my summary and LDREF I have no problem. When I add --coloc_P, it tells me that there is no overlapping GWAS Z-scores. I can't find the bug and I don't understand why there is no problem when I don't have the colocalization flag.
Thank you very much!

error in FUSION.assoc_test.R step (Error in out.tbl$EQTL.ID[w] <- names(topeqtl) : replacement has length zero)

Hi Alexander Gusev,

Did you figure out this issue (Error in out.tbl$EQTL.ID[w] <- names(topeqtl) : replacement has length zero) raised here (#5) ?
I came across the same issue when using my own weights computed for my leafcutter calculated splicing intron usage, and it only happened for some chromosomes, not for all chromosomes. Pretty weird, can you give me any insight? I already checked that my SNP lables in wgt.matrix and GWAS summary statistics and LDREF1000 are consistent. I used LDSC munge_stats.py to format my GWAS and 1,189,408/1,217,311 of GWAS SNPs are also in my LDREF panel.

Another issue is when using my own LD reference panel, almost all genes skipped, but when I used the LDREF1000 provided in the tutorial, it works. My own LD reference data has more SNPs, but the bim file third column is 0 where there is a value for this column in LDREF1000. But according the the plink (https://www.cog-genomics.org/plink/1.9/formats), the third column--"Position in morgans or centimorgans (safe to use dummy value of '0') " is okay to be 0. Can you please give any insights as well? That will be much appreciated

many thanks,

Fei

Personal functional weights Computing

Hi developers!
I am trying to run the Rscript to compute my own gene weight. I have made the preparation (the PATH, Plink bfile or R etc), but how could i generate right format of the $TMP file from my gene data? There was little information on the Fusion Official site. Kindly, could developers give some suggestions ?
Rscript FUSION.compute_weights.R
--bfile $INP
*--tmp $TMP *
--out $OUT
--models top1,blup,bslmm,lasso,enet

Move away from plink2R?

Hi,

I ran into the same plink2R installation issue as @jluyapan did at gabraham/plink2R#6 (comment) using R 3.6.1. While this issue could be solved by updating the package a little bit, it seems from gabraham/plink2R#7 (comment) that the plink2R project has been abandoned (@anadon also wanted to use it for FUSION TWAS). The author of plink2R (@gabraham) recommends switching to BEDMatrix at https://cran.r-project.org/web/packages/BEDMatrix/index.html by @agrueneberg.

From https://github.com/gusevlab/fusion_twas/search?q=plink2R&unscoped_q=plink2R I know that you use plink2R in three scripts. From https://github.com/gusevlab/fusion_twas/blob/0ab190e740f4631bb137bd48e5a54ccb2ddda7c9/FUSION.compute_weights.R I can see that you read BED files using impute = 'avg' and use all 3 components returned by plink2R::read_plink().

Can `BEDMatrix` be used?

I asked myself if BEDMatrix could be used to replace plink2R for FUSION TWAS and it looks like it can, mostly as shown below using R 3.5.3 (where plink2R can be installed without issues).

library('BEDMatrix')
path <- system.file("extdata", "example.bed", 
    package = "BEDMatrix")
m <- BEDMatrix(path)

library('plink2R')
genos <- read_plink(gsub('\\.bed', '', path), impute = 'avg')
lapply(genos, head)


identical(
    ncol(m),
    nrow(genos$bim)
)
identical(
    gsub('_.*', '', colnames(m)),
    genos$bim[, 2]
)

sum(is.na(m2))
sum(is.na(genos$bed))
table(as.vector(m2 - genos$bed), useNA = 'ifany')

## Impute using the average
## equivalent to 
## https://github.com/gabraham/plink2R/blob/master/plink2R/src/data.cpp#L153-L192
means <- colMeans(m2, na.rm = TRUE)
m3 <- m2
m3[is.na(m2)] <- rep(means, colSums(is.na(m2)))

## Force the names to be the same
## Could also use the bim and fam objects as in
## https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R#L20-L21
## to avoid any regex issues
colnames(m3) <- gsub('_.*', '', colnames(m3))
rownames(m3) <- gsub('_', ':', rownames(m3))
identical(
    m3,
    genos$bed
)


## BEDMatrix doesn't read the full fam file as you can see at
## https://github.com/QuantGen/BEDMatrix/blob/master/R/BEDMatrix.R#L30-L61

## plink2R does (although with the slower utils::read.table function)
## https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R#L17
famfile <- gsub('\\.bed', '\\.fam', path)
fam <- read.table(famfile, header=FALSE, sep="", stringsAsFactors=FALSE)

## For https://github.com/gusevlab/fusion_twas/blob/0ab190e740f4631bb137bd48e5a54ccb2ddda7c9/FUSION.compute_weights.R#L256
identical(
    fam[, c(1,2,6)],
    genos$fam[, c(1,2,6)]
)

## BEDMatrix doesn't read the full bim file as you can see at
## https://github.com/QuantGen/BEDMatrix/blob/master/R/BEDMatrix.R#L75-L108

## plink2R does (although with the slower utils::read.table function)
## https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R#L16
bimfile <- gsub('\\.bed', '\\.bim', path)
bim <- read.table(bimfile, header=FALSE, sep="", stringsAsFactors=FALSE)
## For https://github.com/gusevlab/fusion_twas/blob/0ab190e740f4631bb137bd48e5a54ccb2ddda7c9/FUSION.compute_weights.R#L374
identical(
    bim,
    genos$bim
)

Initial `BEDMatrix` dependent function

I wrote a read_plink_custom function that used BEDMatrix to read in the genotype matrix and that can handle the impute ='avg scenario from plink2R. I didn't implement the impute = 'random' scenario since I don't think that you use it (from https://github.com/gabraham/plink2R/blob/master/plink2R/src/data.cpp#L194-L255).

read_plink_custom <- function(root, impute = c('none', 'avg', 'random')) {
    if(impute == 'random') {
        stop("The 'impute' random option has not been implemented.", call. = FALSE)
    }
    
    ## structure from https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R
    proot <- path.expand(root)
    
    bedfile <- paste(proot, ".bed", sep="")
    famfile <- paste(proot, ".fam", sep="")
    bimfile <- paste(proot, ".bim", sep="")
    
    ## Could change this code to use data.table
    bim <- read.table(bimfile, header=FALSE, sep="", stringsAsFactors=FALSE)
    fam <- read.table(famfile, header=FALSE, sep="", stringsAsFactors=FALSE)
    ## Set the dimensions
    geno <- BEDMatrix::BEDMatrix(bedfile, n = nrow(fam), p = nrow(bim))
    
    ## Convert to a matrix
    geno <- as.matrix(geno)
    if(impute == 'avg') {
        ## Check if any are missing
        geno_na <- is.na(geno)
        if(any(geno_na)) {
            means <- colMeans(geno, na.rm = TRUE)
            geno[geno_na] <- rep(means, colSums(geno_na))
        }
    }
    colnames(geno) <- bim[,2]
    rownames(geno) <- paste(fam[,1], fam[, 2], sep=":")

    list(bed=geno, fam=fam, bim=bim)    
}


genos_custom <- read_plink_custom(gsub('\\.bed', '', path), impute = 'avg')
identical(
    genos_custom,
    genos
)

data.table flavor

I then wrote a version of the above function that uses data.table which in theory should be faster.

read_plink_custom_fast <- function(root, impute = c('none', 'avg', 'random')) {
    if(impute == 'random') {
        stop("The 'impute' random option has not been implemented.", call. = FALSE)
    }
    if(!requireNamespace("data.table", quietly = TRUE)) {
        stop("Please install the 'data.table' package using install.packages('data.table').", call. = FALSE)
    }
    
    ## structure from https://github.com/gabraham/plink2R/blob/master/plink2R/R/plink2R.R
    proot <- path.expand(root)
    
    bedfile <- paste(proot, ".bed", sep="")
    famfile <- paste(proot, ".fam", sep="")
    bimfile <- paste(proot, ".bim", sep="")
    
    ## Could change this code to use data.table
    bim <- data.table::fread(
        bimfile,
        colClasses = list('character' = c(2, 5, 6), 'integer' = c(1, 3, 4)),
        showProgress = FALSE,
        data.table = FALSE
    )
    fam <- data.table::fread(famfile,
        colClasses = list('character' = 1:2, 'integer' = 3:6),
        showProgress = FALSE,
        data.table = FALSE
    )
    ## Set the dimensions
    geno <- BEDMatrix::BEDMatrix(bedfile, n = nrow(fam), p = nrow(bim))
    
    ## Convert to a matrix
    geno <- as.matrix(geno)
    if(impute == 'avg') {
        ## Check if any are missing
        geno_na <- is.na(geno)
        if(any(geno_na)) {
            means <- colMeans(geno, na.rm = TRUE)
            geno[geno_na] <- rep(means, colSums(geno_na))
        }
    }
    ## Extract data using the data.table syntax
    ## in case fread(data.table = TRUE) was used (which is the default)
    # colnames(geno) <- bim[[2]]
    # rownames(geno) <- paste(fam[[1]], fam[[2]], sep=":")
    colnames(geno) <- bim[,2]
    rownames(geno) <- paste(fam[,1], fam[, 2], sep=":")

    ## If you used fread(data.table = TRUE) then you need to cast
    ## the objects using as.data.frame()
    list(bed=geno, fam=fam, bim=bim)
}

genos_custom_fast <- read_plink_custom_fast(gsub('\\.bed', '', path), impute = 'avg')
identical(
    genos_custom_fast,
    genos
)

However, the microbenchmark results show that that's not the case. At least with the BEDMatrix example BED file, though it could play a role in much larger BED files.

library('microbenchmark')
micro <- microbenchmark(
    read_plink_custom(gsub('\\.bed', '', path), impute = 'avg'),
    read_plink_custom_fast(gsub('\\.bed', '', path), impute = 'avg')
)
options(width = 150)
summary(micro)
#                                                                 expr      min        lq     mean    median        uq      max neval cld
# 1      read_plink_custom(gsub("\\\\.bed", "", path), impute = "avg") 6.009474  6.538338  7.91712  7.738728  8.242563 15.08098   100  a
# 2 read_plink_custom_fast(gsub("\\\\.bed", "", path), impute = "avg") 9.096066 10.006155 10.94152 10.580656 11.017670 17.00687   100   b

In any case both functions work. So yes, BEDMatrix can be used to read in BED files with either the 'impute' = 'none' or 'impute' = 'avg' methods from plink2R.

## Check the 'none' scenario
genos_none <- read_plink(gsub('\\.bed', '', path), impute = 'none')
genos_custom_none <- read_plink_custom(gsub('\\.bed', '', path), impute = 'none')
genos_custom_fast_none <- read_plink_custom_fast(gsub('\\.bed', '', path), impute = 'none')
testthat::expect_equivalent(
    genos_custom_none,
    genos_none
)
testthat::expect_equivalent(
    genos_custom_fast_none,
    genos_none
)

Moving forward

Moving forward, I guess that it would be ideal if BEDMatrix incorporated a version of the read_plink_custom()/read_plink_custom_fast() function such that you would only need to change library('plink2R') to library('BEDMatrix') in your scripts and to enable others to keep using what plink2R provided in a package that is maintained and available from CRAN. The only part missing would be the 'impute' = 'random' part of https://github.com/gabraham/plink2R/blob/master/plink2R/src/data.cpp#L194-L255 though I don't know what drand48() returns (I imagine that it could be replaced with some R code). This is likely better than creating another GitHub R package with this function, as it could likely end up being in "abandonware" mode as plink2R did.

Let me know what you think.

Best,
Leo

Reproducibility

library('sessioninfo')
session_info()
# ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#  setting  value
#  version  R version 3.5.3 Patched (2019-03-11 r76311)
#  os       Red Hat Enterprise Linux Server release 6.9 (Santiago)
#  system   x86_64, linux-gnu
#  ui       X11
#  language (EN)
#  collate  en_US.UTF-8
#  ctype    en_US.UTF-8
#  tz       US/Eastern
#  date     2019-08-22
#
# ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#  package        * version   date       lib source
#  assertthat       0.2.1     2019-03-21 [2] CRAN (R 3.5.1)
#  BEDMatrix      * 1.6.1     2019-06-21 [1] CRAN (R 3.5.3)
#  cli              1.1.0     2019-03-19 [1] CRAN (R 3.5.3)
#  codetools        0.2-16    2018-12-24 [3] CRAN (R 3.5.3)
#  colorout       * 1.2-0     2018-05-02 [1] Github (jalvesaq/colorout@c42088d)
#  colorspace       1.4-1     2019-03-18 [2] CRAN (R 3.5.1)
#  crayon           1.3.4     2017-09-16 [1] CRAN (R 3.5.0)
#  crochet          2.2.0     2019-06-13 [1] CRAN (R 3.5.3)
#  data.table       1.12.2    2019-04-07 [1] CRAN (R 3.5.3)
#  digest           0.6.20    2019-07-04 [1] CRAN (R 3.5.3)
#  dplyr            0.8.3     2019-07-04 [1] CRAN (R 3.5.3)
#  ggplot2          3.2.1     2019-08-10 [1] CRAN (R 3.5.3)
#  glue             1.3.1     2019-03-12 [1] CRAN (R 3.5.1)
#  gtable           0.3.0     2019-03-25 [2] CRAN (R 3.5.1)
#  htmltools        0.3.6     2017-04-28 [2] CRAN (R 3.5.0)
#  htmlwidgets      1.3       2018-09-30 [1] CRAN (R 3.5.1)
#  httpuv           1.5.1     2019-04-05 [2] CRAN (R 3.5.3)
#  jsonlite         1.6       2018-12-07 [2] CRAN (R 3.5.1)
#  later            0.8.0     2019-02-11 [2] CRAN (R 3.5.1)
#  lattice          0.20-38   2018-11-04 [3] CRAN (R 3.5.3)
#  lazyeval         0.2.2     2019-03-15 [2] CRAN (R 3.5.1)
#  magrittr         1.5       2014-11-22 [1] CRAN (R 3.5.0)
#  MASS             7.3-51.1  2018-11-01 [3] CRAN (R 3.5.3)
#  Matrix           1.2-15    2018-11-01 [3] CRAN (R 3.5.3)
#  microbenchmark * 1.4-6     2018-10-18 [1] CRAN (R 3.5.3)
#  multcomp         1.4-10    2019-03-05 [2] CRAN (R 3.5.1)
#  munsell          0.5.0     2018-06-12 [2] CRAN (R 3.5.1)
#  mvtnorm          1.0-11    2019-06-19 [2] CRAN (R 3.5.3)
#  pillar           1.4.2     2019-06-29 [1] CRAN (R 3.5.3)
#  pkgconfig        2.0.2     2018-08-16 [1] CRAN (R 3.5.1)
#  plink2R        * 1.1       2018-12-06 [1] Github (gabraham/plink2R@d74be01)
#  png              0.1-7     2013-12-03 [2] CRAN (R 3.5.0)
#  promises         1.0.1     2018-04-13 [2] CRAN (R 3.5.0)
#  purrr            0.3.2     2019-03-15 [2] CRAN (R 3.5.1)
#  R6               2.4.0     2019-02-14 [2] CRAN (R 3.5.1)
#  Rcpp           * 1.0.2     2019-07-25 [1] CRAN (R 3.5.3)
#  RcppEigen      * 0.3.3.5.0 2018-11-24 [1] CRAN (R 3.5.1)
#  rlang            0.4.0     2019-06-25 [1] CRAN (R 3.5.3)
#  rmote          * 0.3.4     2018-05-02 [1] deltarho (R 3.5.0)
#  sandwich         2.5-1     2019-04-06 [2] CRAN (R 3.5.3)
#  scales           1.0.0     2018-08-09 [2] CRAN (R 3.5.1)
#  servr            0.15      2019-08-07 [1] CRAN (R 3.5.3)
#  sessioninfo    * 1.1.1     2018-11-05 [1] CRAN (R 3.5.1)
#  survival         2.43-3    2018-11-26 [3] CRAN (R 3.5.3)
#  testthat         2.2.1     2019-07-25 [1] CRAN (R 3.5.3)
#  TH.data          1.0-10    2019-01-21 [2] CRAN (R 3.5.1)
#  tibble           2.1.3     2019-06-06 [1] CRAN (R 3.5.3)
#  tidyselect       0.2.5     2018-10-11 [2] CRAN (R 3.5.1)
#  withr            2.1.2     2018-03-15 [2] CRAN (R 3.5.0)
#  xfun             0.9       2019-08-21 [1] CRAN (R 3.5.3)
#  zoo              1.8-6     2019-05-28 [2] CRAN (R 3.5.3)
#
# [1] /users/lcollado/R/x86_64-pc-linux-gnu-library/3.5.x
# [2] /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.5.x/R/3.5.x/lib64/R/site-library
# [3] /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.5.x/R/3.5.x/lib64/R/library

system('plink --version')
# PLINK v1.90b6.6 64-bit (10 Oct 2018)

perform TWAS using weights of GTEx(version 8)

Hi
I'm performing the TWAS analysis using weights of GTEx(version 8).
The following datasets are based on hg19:

The LD reference data (https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2),
The glist-hg19 (https://github.com/gusevlab/fusion_twas/blob/master/glist-hg19),
My own GWAS summary data.

But the weights of GTEx(version 8) were based on the hg38.
Therefore, to update the LD reference data, glist-hg19 ,and the GWAS data to hg38, I do the following steps:

Download the LD reference data (https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/GRCh38/plink_files.tgz) from https://alkesgroup.broadinstitute.org/LDSCORE/GRCh38/;
Download the glist-hg38 (https://www.cog-genomics.org/static/bin/plink/glist-hg38) from https://www.cog-genomics.org/plink/1.9/resources;
Using the liftover tool (http://hgdownload.cse.ucsc.edu/downloads.html) to update the GWAS data to hg38.

After finishing the steps, I perform the TWAS and further analysis (Joint/conditional tests, permutation test, and so on).
I'm wondering the above steps are correct or not?

Thanks!

Training Weights Error

Hi,

I'm trying to train weights and I'm getting an error on a few genes.
The error seems to happen after using LASSO regression

tmp/1_148/<FILENAME>.cv.LASSO
Error in reg$coef[2, 4] : subscript out of bounds
Execution halted

GCTA version 1.21 (Oct 2013) vs 1.923beta (Aug 2019): is it safe to upgrade?

Hi,

I just realized today that there's a much newer version of GCTA (version 1.92.3beta; Aug 20, 2019) available from https://cnsgenomics.com/software/gcta/#Download than version 1.21 (Oct 16, 2013). Do you expect to run into any issues when using the latest version of GCTA compared to 1.21? I'm not familiar with GTCA and maybe you can identify if any of the changelog updates are relevant.

I found about this newer GCTA version because I was trying to find information on whether GCTA is multi-threaded or not, which I see at https://cnsgenomics.com/software/gcta/#Multi-threadcomputing that it is (and has been since version 1.1; though version 1.91.4beta added control through the OMP_NUM_THREADS environment variable).

Best,
Leo

What is the best way to create Plink files for FUSION.compute_weights.R?

Hello,

I plan to analyze my RNAseq data (paired end dataset) and to convert it to Plink format in order to use it with: FUSION.compute_weights.R

Can you please let me know is the following set of commands optimal for that purpose, if not can you please let me knw what you would change?

#alignement file (done with STAR) sorted
samtools sort $OUTPUT.bam -o $OUTPUT.sorted.bam

#Index the genome assembly
samtools faidx Homo_sapiens.GRCh38.dna.primary_assembly.fa

#Run 'mpileup' to generate VCF format
samtools mpileup -g -f Homo_sapiens.GRCh38.dna.primary_assembly.fa $OUTPUT.sorted.bam > my-raw.bcf

#call SNPs
bcftools view -bvcg my-raw.bcf > my-var.bcf

#filter SNPs
bcftools view my.var.bcf | vcfutils.pl varFilter - > my.var-final.vcf

#convert VCF to Plink format
vcftools --vcf my.var-final.vcf --out my_data --plink

Thank you so much!

All genes skipped when computing weights

I have gene expression data from 11,535 microarray probes in 75 individuals and a covariates file containing demographic information, 3 principle components, and 10 PEER factors. I've been able to successfully run the compute_weights R script, but all of the genes ended up being skipped, so there was no output. Approximately 4500 were returned with 'GCTA could not converge' and the remaining genes did not pass the default heritability cutoff. Do you have any ideas on why this may have happened and how to proceed?

Using FUSION with R 3.6.1

Since upgrading to R 3.6.1, I have been using the following packages to get FUSION to run for me. This may work for others as a stopgap until the packages upgrade on CRAN. Of note, plink2R sounds as though they don't plan to support R 3.6.1, though I have opened a pull request with the trivial changes needed to fix the issue. (Also noted under #13 .)

plink2R: devtools::install_github("carbocation/plink2R/plink2R", ref="carbocation-permit-r361")

optparse: remotes::install_github("trevorld/r-optparse")

coloc: devtools::install_github("chr1swallace/coloc")

Uncertainty estimate?

Running FUSION.assoc_test.R yields a Z score and a P value, but aside from the directionality yielded by the Z score, I interpret these as effectively yielding the same information. Is it possible to obtain either a standard error or a beta in the output (either of which, with a Z score, would permit calculation of the other)?

Error in layout(lay.gwas): layout matrix must contain at least one reference to each of the values {1 ... 2} Execution halted

Hi all, has any of you ever come across the following error? I am obtaining it when running the conditional analysis using custom-made TWAS weights. Considering there were many steps to generate my own weights, I imagine there can be a bug somewhere in an earlier step of my code causing this. However, what I don't understand is why it would be specific to a single chromosome (all other chromosomes work). Have you ever come across it?

Rscript ${fusion_software_dir}/FUSION.post_process.R \
        --input ${significant_file} \
        --sumstats ${sumstats} \
        --ref_ld_chr ${ref_ld} \
        --out ${significant_file}.PostProc.${CHR} \
        --chr ${CHR} \
        --save_loci \
        --plot \
        --locus_win 100000

6 strictly non-overlapping loci
consolidated to 6 non-overlapping loci with 100000 bp buffer
locus 1 best GWAS Chisq 33.1
locus 1 best GWAS Chisq conditioned     8.3
locus 1 best conditioned Chisq  21.1
locus 2 best GWAS Chisq 25.4
locus 2 best GWAS Chisq conditioned     8.42
locus 2 best conditioned Chisq  9.34
locus 3 best GWAS Chisq 33.9
locus 3 best GWAS Chisq conditioned     2.57
locus 3 best conditioned Chisq  50.5
Error in layout(lay.gwas) :
  layout matrix must contain at least one reference
to each of the values {1 ... 2}
Execution halted

[update]
I was able to run the conditional analysis without generating the plot (by removing --plot --locus_win 100000).

Unable to read BSLMM output files

Hi,

I have noticed that the gemma output, when used bslmm option, is written to a folder "output/". But the FUSION.compute_weights.R trying to read from the current directory instead of output folder. This was explained in gemma manual (section 4.3.3)

The exact error I get is:

- Crossval fold 1 
[1] "/rds/general/user/gatla/home/softs/Gemma_0.98.1/gemma -miss 1 -maf 0 -r2 1 -rpace 1000 -wpace 1000 -bfile ENSG00000134256_tmp_file.cv -bslmm 1 -o ENSG00000134256_tmp_file.cv.BSLMM"

Error in file(file, "rt") : cannot open the connection
Calls: weights.bslmm -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'ENSG00000134256_tmp_file.cv.BSLMM.param.txt': No such file or directory
Execution halted

Because the path to file is "output/ENSG00000134256_tmp_file.cv.BSLMM.param.txt". Could you let me know how can I fix this ? Can I just edit the FUSION.compute_weights.R script and append "output/" ? Not sure if it affects other parts of the analysis.

the beta of model top1

Marginal Z-scores (used for top1)

weights.marginal = function( genos , pheno , beta=F ) {
if ( beta ) eff.wgt = t( genos ) %% (pheno) / ( nrow(pheno) - 1)
else eff.wgt = t( genos ) %% (pheno) / sqrt( nrow(pheno) - 1 )
return( eff.wgt )
}

For above code, I want to know I can how to select the value of beta : T or F? If T, not need sqrt.

Error when there is a single SNP in wgt.matrix

I found a bug in FUSION.assoc_test.R when there is a single SNP in the wgt.matrix. The names() function does not get the correct variant ID and causes the script to exit.

Changing the following line:
out.tbl$ETQL.ID[w] = names( topeqtl )
to
out.tbl$EQTL.ID[w] = rownames(wgt.matrix)[topeqtl]

appears to fix the issue.

Questions for Custom LD Reference Data

Hi,

First of all, thanks once again @sashagusev for developing & maintaining this great tool, I have been using it especially past year quite a lot. Lately I stopped using the main LD reference data available here on FUSION website and decided to create a custom LD reference data for following reasons:

I've switched to hg38/GRCh38 for QTL/TWAS analyses (provided LD reference data is on hg19/GRCh37).
Provided LD reference data contains n=489 samples, however n=99 of these are Finnish samples and I plan to exclude these (as a side note here, actually in 1KG there are n=14 additional EUR non-Finnish samples that are missing in this n=489 provided LD reference data).
My QTL mapping is based on a very large mainly EU-ancestry WGS dataset, and a total of ~14M variants (all MAF > 1% and genotype & variant QCed) are actually tested. The available LD reference data only contains 1,190,321 variants (98.8% of these variants are matching & tested in my QTL catalogues), so I want to account for the remaining tested (and for TWAS -> testable) variants.

I looked into different scenarios for creating a custom LD reference data, I also followed a few issues/comments already mentioned in this repository, however I still have several questions.

On 1KG ftp server, for the same 1KG samples, you can find one "old style" GRCh38 1KG VCF and one "new style" GRCh38 1KG VCF. If I am not wrong, the former is based on old low-coverage WGS + WES data (guess provided LD reference data is based on this as well?) and the latter is recent PCR-free high-coverage WGS (resequenced by NYGC).

The old style 1KG VCF have only phased (with shapeit2) genotypes and contains only GT information in the FORMAT field. The new style 1KG VCF have all modern FORMAT field values: GT, DP, AB, AD, GQ; therefore it is possible to do genotype QC filters such as assigning genotypes missing if GQ < 20 or if allele reads are not balanced.

The new style 1KG VCF overlaps much better with my WGS data in terms of list of variants (97.68%) , though the old style VCF is also still good (87.16%).

My questions here:

For custom LD reference data (generated only for TWAS), is it recommended to prefer phased data over unphased data? In this case, as a reminder, the old style VCF is actually phased. However, I think for TWAS this does not matter as it accepts PLINK binary files and at this level phase information of genotypes are already lost. Is this true?
Do you recommend doing genotype & variant QC on the LD reference data? In my case I only filtered out the variants tagged by GATK VQSR pipeline for the new style 1KG VCF. However, I also consider genotype filters mentioned above, and maybe also HWE (as you indicated in this comment, however I don't understand why HapMap3 restriction was applied, can you please elaborate more on this decision? Maybe this is not required for the new PCR-free high-coverage WGS VCF, as no imputation needed different than last time?)
Should the genotyping rate of LD reference data be always 100%, meaning that none of the genotypes are missing? This is the case in the provided LD reference data, and also for the old style 1KG VCF I've mentioned, but not for the new one. I am not sure if FUSION is compatible with these missing genotypes in LD reference data? As a side note, the new VCF currently has only a small portion of missingness (total genotyping rate 99.96%).
Should I remove monomorphic sites in the LD reference data? (even if they match with my list of tested QTL variants). Of note, currently about 10% of these variants matching between LD reference data <-> my testable QTL variant list are with MAC = 0.

Thanks in advance!
Fahri

issue with FUSION.post_process.R --omnibus

Dear prof：
I want to use FUSION.post_process.R –omnibus, namely Testing for effect in multiple reference panels (omnibus test). But I'm having some trouble.
Firstly,
I use Official data-PGC2.SCZ.sumstats and two reference panels(GTEx.Whole_Blood.pos and YFS.BLOOD.RNAARR.pos). The script is as follows:

Rscript FUSION.assoc_test.R
--sumstats PGC2.SCZ.sumstats
--weights ./WEIGHTS/GTEx.Whole_Blood.pos
--weights_dir ./WEIGHTS/
--ref_ld_chr ./LDREF/1000G.EUR.
--chr 22
--out 1.dat

Rscript FUSION.assoc_test.R
--sumstats PGC2.SCZ.sumstats
--weights ./WEIGHTS/YFS.BLOOD.RNAARR.pos
--weights_dir ./WEIGHTS/
--ref_ld_chr ./LDREF/1000G.EUR.
--chr 22
--out 2.dat

Secondly,
To maximize the retention of meaningful genes, we set the standard at 0.5. The script is as follows:
cat 1.dat | awk 'NR == 1 || $NF < 0.5' > 1.top
cat 2.dat | awk 'NR == 1 || $NF < 0.5' > 2.top
We note that the official website indicate The flag looks for duplicate entries of the ID field in the --input file to select which predictors get combined into a single omnibus test, so we check. The script is as follows:
cat 1.top 2.top|awk '{print $3}'|sort|uniq -d |wc -l ##20

Thirdly, we execute these script
Rscript FUSION.post_process.R --omnibus
--sumstats PGC2.SCZ.sumstats
--input 1.top --input 2.top
--out 12.top.analysis
--ref_ld_chr ./LDREF/1000G.EUR.
--chr 22
--plot --locus_win 100000

I got two files 12.top.analysis.CORR.dat and 12.top.analysis.omnibus.pv.
But these files are empty. I suspect this script have some problem, but did not know how can change, can you help me?

GCTA could not converge

Hello

We previously used fusion for our RNA-seq data for TWAS analysis and it work pretty well. However, we want to extend the TWAS analysis to EWAS analysis by using replacing EPIC array (450k) data. We tried to calculate the functional weight for each probe based on the normalized beta value (0-1). However, we fund it's alway skipped the calculation due to the following error:
"test_cg19923810 does not exist, likely GCTA could not converge, skipping gene".

Then we did some analysis and found we found there is aways a error during the gcta program:

gcta_nr_robust --grm test_cg19923810 --pheno test_cg19923810.pheno --qcovar mQTL_methylation.combined_covariates.txt --out test_cg19923810 --reml --reml-no-constrain --reml-lrt 1

Genome-wide Complex Trait Analysis (GCTA)
version 1.21
(C) 2010 Jian Yang, Hong Lee, Michael Goddard and Peter Visscher
The University of Queensland

Analysis started: Mon Jun 3 13:44:11 2019

Options:
--grm test_cg19923810
--pheno test_cg19923810.pheno
--qcovar mQTL_methylation.combined_covariates.txt
--out test_cg19923810
--reml
--reml-no-constrain
--reml-lrt 1

Note: This is a multi-thread program. You could specify the number of threads by the --thread-num option to speed up the computation if there are multiple processors in your machine.

Reading IDs of the GRM from [test_cg19923810.grm.id].
108 IDs read from [test_cg19923810.grm.id].
Reading the GRM from [test_cg19923810.grm.bin].
Reading the number of SNPs for the GRM from [test_cg19923810.grm.N.bin].
Pairwise genetic relationships between 108 individuals are included from [test_cg19923810.grm.bin].
Reading phenotypes from [test_cg19923810.pheno].
Nonmissing phenotypes of 108 individuals are included from [test_cg19923810.pheno].
Reading quantative covariates from [mQTL_methylation.combined_covariates.txt].
18 quantative covarites of 109 individuals read from [mQTL_methylation.combined_covariates.txt].

18 quantitative variable(s) included as covariate(s).
108 individuals are in common in these files.

Performing REML analysis ... (Note: may take hours depending on sample size).
108 observations, 19 fixed effect(s), and 2 variance component(s)(including residual variance).
Calculating prior values of variance components by EM-REML ...

Error: the X^t * V^-1 * X matrix is not invertible. Please check the covariate(s) and/or the environmental factor(s).

Analysis finished: Mon Jun 3 13:44:11 2019
Computational time: 0:0:0

I attached this example data and see if you can help us to figure out why? Thanks so much and looking forward to your reply.
test.tar.gz

File format of GTEx.v7.meta.id and GTEx.v7.meta.covar

Hello,

I am attempting to match the FUSION GTEx v7 pipeline (GTEX_v7.sh). What are the file formats of GTEx.v7.meta.id and GTEx.v7.meta.covar, and how were these files generated/obtained?

Thank you

Perform TWAS with GWAS data from non-European ancestry

Hi,

Recently, we plan to perform a TWAS using a GWAS summary statistics (about cancer research) from a population of non-European ancestry (e.g., Asian population), I was wondering whether I can use a LD reference data from Asian population, and is it appropriate to conduct a TWAS using non-European population GWAS data?

Thank you, hope you have a nice day!

Baike

Error in wgt.matrix[qc$flip, ] : incorrect number of dimensions (a solution)

A error was reported as following if there is no TRUE in qc$flip
Error in wgt.matrix[qc$flip, ] : incorrect number of dimensions

If you also meet this error message, try to modify the row
wgt.matrix[qc$flip,] = -1 * wgt.matrix[qc$flip,]

if(sum(qc$flip) > 0){
wgt.matrix[qc$flip,] = -1 * wgt.matrix[qc$flip,]
}

.pos File For Splicing Molecular Phenotypes

Hi,

First of all, thank you developing TWAS/FUSION. I am running it also for integrating GWAS and sQTL catalogues (including CMC splicing weights shared here: http://gusevlab.org/projects/fusion/#reference-functional-data ; along with the splicing weights I've calculated from other datasets publicly available or generated in-house).

My question is about the .pos file. As far as I understand, P0 & P1 columns there actually indicate TSS and TES sites of the gene for expression weights. Similarly, there is always one gene name on the second column as expression phenotypes are straightforward usually.

However, for splicing molecular phenotypes this is a bit different: as we calculate splicing weights for a X junction in a Y gene (that, let's say, has 5 other junctions in the same splicing cluster or in different splicing cluster), should we assign P0 and P1 now based on the coordinates of X junction? Or P0 and P1 should be TSS and TES of Y gene again?

I realized that .pos file of CMC weights still has TSS & TES. This is as it should be?

I think for splicing weights P0 and P1 in the .pos file should be coordinates of the exact splice junction for which we calculated RDat, but can you please comment on that if this is correct? (and then, should CMC .pos file be modified? Or this does not have any affect on the results?)

Similarly, for complex splicing events where Leafcutter (or any other splice analysis tools) annotates 2 genes for the respective cluster, is it safe to annotate gene names in the .pos file as "HERPUD2,AC007551.3"? Also some clusters have "NA" gene annotation.

Thanks in advance!

Error with LDREF

Hi there,

I keep getting an error when following the sample code for some reason.

Rscript FUSION.assoc_test.R --sumstats psych-vs-bd_2018.sumstats.sumstats.gz --weights final.psych.encode.genes.rdat.list --weights_dir ~/fusion_twas-master/WEIGHTS/ --ref_ld_chr ./LDREF/1000G.EUR. --chr 22 --out psych.dat
Error in file(file, "rt") : cannot open the connection
Calls: read_plink -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/Users/CL/fusion_twas-master/LDREF/1000G.EUR..bim': No such file or directory
Execution halted

I'm unsure why this may be occurring. I would appreciate any insight.

I also get this error if I try putting the 22 ./LDREF/1000G.EUR.22

Error in load(wgt.file) : restore file may be empty -- no data loaded
In addition: Warning messages:
1: In readChar(con, 5L, useBytes = TRUE) : error reading the file
2: file ‘WEIGHTS’ has magic number ''
Use of save versions prior to 2 is deprecated
Execution halted

Thank you so much.

issue with genes skipped

Hi Alexander,

I was running fusion via:

Rscript FUSION.assoc_test.R
--sumstats c7.txt
--weights ./weight/retina.2pos
--weights_dir ./weight/
--ref_ld_chr ./LDREF/1000G.EUR.
--chr 7
--out CHR.7.dat

my c7.txt looks like this (Z scores come from my UKB GWAS done in plink2):
SNP A1 A2 Z
rs62429403 C A -0.317767
rs62429404 C G -0.0968181
rs73260517 C G 0.362803
rs56339393 C T 0.361902

and my weights were precomputed from this paper
https://www.nature.com/articles/s41588-019-0351-9#Sec21

but when I run this I am getting for chr 7 and the other ones a lot of missing genes, for example for chr 7
NOTE: 836 / 916 genes were skipped
If a large number of genes were skipped, verify that your GWAS Z-scores, expression weights, and LDREF data use the same SNPs (or nearly)
Or consider pre-imputing your summary statistics to the LDREF markers using summary-imputation software such as [http://bogdan.bioinformatics.ucla.edu/software/impg/]

If I check how many SNPs overlap between c7.txt and 1000G.EUR.7.bim
I am getting 30047 SNPs out of 231525 which can be found in c7.txt.

Also this website is not available anymore http://bogdan.bioinformatics.ucla.edu/software/impg/

Can you please advise what to do in this case?

Thanks
Ana

Code for checking that the model inputs are valid for the compute weights step

Hi,

Adding the following code after https://github.com/gusevlab/fusion_twas/blob/master/FUSION.compute_weights.R#L54 would be useful for making sure that the user specified correct model options.

stopifnot(all(models %in% c('blup', 'lasso', 'top1', 'enet', 'bslmm')))

I just realized that in a TWAS run I specified blump instead of blup and no errors were reported because of this.

Best,
Leo

Error in check_dataset(d = dataset1, 1) :

		Rscript ${pw_scr}/2.FUSION.assoc_test.R \
		--sumstats ${pw_result}/${disease_name}.sumstats.gz \
		--weights ${pw_tissue}/${tissue}.P01.pos \
		--weights_dir ${pw_tissue} \
		--ref_ld_chr LDREF/1000G.EUR. \
		--chr ${i} \
		--coloc_P 0.05 \
		--PANELN PANELN.file  \
		--GWASN 353388 \
		--out ${disease_name}.${tissue}.${i}.dat

PANELN.file:
PANEL N
OA_synovium_v1.WH_fusion 202
OA_cartilage_v1.WH_fusion 204

Error in check_dataset(d = dataset1, 1) :
dataset 1: missing required element(s) snp
Calls: suppressMessages ... capture.output -> withVisible -> coloc.abf -> check_dataset
Execution halted

We used the top1 ，lasso and enet models to modeling, so fewer SNPs were used，before I added coloc analysis，although some snps which calling：(WARNING : OA_synovium_v1 OA_synovium_v1.WH_fusion/ENSG00000136731.8.fusion_model.wgt.RDat UGGT1 2 128848774 128953251 202 had 472 overlapping SNPs, but none with non-zero expression weights, try more SNPS or a different model ) have no results, the results of fusion analysis : All 22 chromosomes have their own result .dat file .

But when I add coloc, there are only one chromosome have its .dat file, and the values of COLOC.PP0 COLOC.PP1 COLOC.PP2 COLOC.PP3 COLOC.PP4 columns are presented as "NA".

I wonder how can I solve this problem？Can anyone help me？Thanks!

Best,
miao

Using Genotype Dosages for Weight Calculation

Hi,

Is it possible to use genotype dosages (for instance, "DS" values in an imputed VCF) for TWAS weight calculations? I try to refrain from converting these to genotypes in PLINK file format that seems to be the required input of FUSION.compute_weights.R.

Thanks in advance!

Getting weights for all genes

Hi!
When I try to train weights for genes, some of them will be skipped. I understand there maybe some genes with low heritability and cannot pass the p-value threshold, but I'd like to get the results for all genes without skipping. How should I set the options?
Thanks a lot!

Self-generated LDREF does not work

Hi!

Thank for providing this wonderful tool. However, I had some difficulty running FUSION.assoc_test.R with self-generated LDREF data.

I download .pgen, pvar, and psam files from https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3. Then I used plink to select EAS samples only. As a control step, I did the same thing for EUR as well.

Using EUR gives me similar results as your example. When using EAS, I got warnings saying that 72/72 genes were skipped, because those genes had missing GWAS Z-scores that could not be imputed.

So my questions are:

The sample size between EAS and EUR are similar, the SNP list is exactly same, but the TWAS results are so much different. Have you ever tested your example files using LD reference data of other populations? Is it simply because EUR and EAS are genetically different? If it's not the reason,
Could you please give me a snapshot of how your LDREF_EUR data was generated?

Thank you so much in advance!
Best,
Hu

linux or window system to run FUSION.test.R?

I guess that you use linux system to run FUSION.test.R. Do you have a window version?

Finally, we run FUSION.test.R using this data on chromosome 22:

Rscript FUSION.assoc_test.R
--sumstats PGC2.SCZ.sumstats
--weights ./WEIGHTS/GTEx.Whole_Blood.pos
--weights_dir ./WEIGHTS/
--ref_ld_chr ./LDREF/1000G.EUR.
--chr 22
--out PGC2.SCZ.22.dat

Unable to run an sCCA+ACAT TWAS

I want to perform an sCCA+ACAT TWAS, but the cross-tissue (sCCA) weights provided in your page were calculated using GTeX versions 8 or 6, while the individual tissue weights provided were calculated using GTeX 7. Do you think there should be any major problems with the version differences? For example, I noticed that in the sCCA weights the gene symbol is not included in the .pos files.

I haven't been successful at running this analysis yet (see fenghelian/sCCA-ACAT_TWAS#1), not even with the tutorial on the sCCA-ACAT_TWAS github page, so I am trying to narrow down what could be causing the problem. (Although, in that instance, I think it's something to do with the spread() function used in compute_acatP.r.) Nevertheless, I posted this as a separate issue because I think it raises an interesting point?

MHC locus selection

Hello,

Just a heads up regarding the MHC locus.

I had a couple of hits on Chr 6 that weren't saved separately on ".MHC". I realized that since P0 and P1 are character vectors, R wasn't correctly selecting MHC.
I solved my problem by simply adding as.numeric:
mhc = out.tbl$CHR == 6 & as.numeric(out.tbl$P0) > 26e6 & as.numeric(out.tbl$P1) < 34e6

Dummy example in R version 4.0.5:

"28000000"<34e6
[1] TRUE
"31615212"<34e6
[1] FALSE

Best wishes,
Sara

No *.RDat output generate for compute weights script

Dear all,

I am a new user for FUSION. I want compute weights using my own expressional dataset, and I am try to repeat the example script. There is only .hsq files (no .RDat) generated after I run the commend line 59.

fusion_twas/examples/GEUV.compute_weights.sh

Line 59 in ef9c475

    
           Rscript FUSION.compute_weights.R --bfile $OUT --tmp $OUT.$pop.tmp --out $FINAL_OUT --verbose 0 --save_hsq --PATH_gcta $GCTA --PATH_gemma $GEMMA --models blup,lasso,top1,enet

However, the manual mentioned "Assuming $TMP, $INP, and $OUT are defined, this will take as input the $INP.bed/bim/fam file and generate a $OUT.RDat file which contains the expression weights and performance statistics for this gene."

In addition, I did not figure out how to generate $TMP file in the script.

Is any important steps were missing?
The software version are below (tested all software work well)
GEMMA: 0.98.1
PLINK: 1.90
gcta: https://github.com/gusevlab/fusion_twas/blob/master/gcta_nr_robust

Thank you very much!

Best,
Bochao Lin

Explicit control of the number of threads used

Hi,

I typically use fusion_twas on a compute cluster managed by Son of Grid Engine (SGE), and thus need to be careful about the amount of resources I request such as memory and the number of threads per job. The FUSION.compute_weights.R script here has several dependencies that can use more than 1 thread. However, currently there's no way to set them in the FUSION.compute_weights.R script.

Threads use in dependencies

Here's what I've found about the threads in the dependencies:

glmnet: I see that glmnet::cv.glmnet() has the argument parallel set to FALSE by default. Could leave as is to keep it simple.
plink: you can specify the number of threads using the --threads option which I manually set in January to use 1 thread LieberInstitute@f677aaf#diff-a62928a27aae00e040d6f53f1ed7f256 in our fork of the fusion_twas repository. More information at https://www.cog-genomics.org/plink/1.9/other#threads. Note that there is a known limitation of --threads which you get a warning for (using verbose=2) when running the lasso model at https://github.com/gusevlab/fusion_twas/blob/master/FUSION.compute_weights.R#L84 if you set --threads 1. Since it's an OpenBLAS threads option, setting the environment variable OPT_NUM_THREADS could resolve this.
gcta version 1.21 https://github.com/gusevlab/fusion_twas/blob/master/FUSION.compute_weights.R#L217-L219 which I see is quite outdated from looking at https://cnsgenomics.com/software/gcta/#Overview. From the changelog history, I can see that --thread-num existed prior to version 1.21 though I'm not sure how it gets the defaults https://cnsgenomics.com/software/gcta/#Download. As of version 1.91.4 it uses the environment variable OMP_NUM_THREADS to detect the number of threads if the option --threads is not specified https://cnsgenomics.com/software/gcta/#Multi-threadcomputing. From https://cnsgenomics.com/software/gcta/pre_gcta/ I downloaded the source for version 1.21 https://cnsgenomics.com/software/gcta/pre_gcta/gcta_1.21_src.zip and I see at lines 834-841 from option.ccp that it sets the OMP_NUM_THREADS variable (it never reads it in version 1.21).
gemma version 0.98.1 (2018-12-10) (version I'm using): I haven't found any options for controlling the number of threads it uses. The closest thing I found is genetics-statistics/GEMMA#203 where they mention linking to single-threaded C OpenBLAS libraries and https://groups.google.com/forum/#!searchin/gemma-discussion/threads%7Csort:date/gemma-discussion/YCm25_ZJLuU/sv233pRmBgAJ where a user requested a way to control the number of threads (though setting OMP_NUM_THREADS seems to help).

Sooo.... GCTA, Gemma and plink for the lasso model all have multi-thread capabilities that are set automatically and maybe can be controlled with OMP_NUM_THREADS plus using --threads for plink.

Suggestion

I suggest adding something like:

make_option("--threads", action="store_true", default=1,
              help="The number of threads to use (affects plink, GCTA, Gemma) [default: %default]"),

Then, in the plink calls, explicitly pass that argument. For example change https://github.com/gusevlab/fusion_twas/blob/master/FUSION.compute_weights.R#L84 from

arg = paste( opt$PATH_plink , " --allow-no-sex --bfile " , input , " --lasso " , hsq , " --out " , out , sep='' )

arg = paste( opt$PATH_plink , " --allow-no-sex --bfile " , input , " --lasso " , hsq , " --out " , out , "--threads", opt$threads, sep='' )

(Actually, something could be similarly done for controlling the plink memory as in https://github.com/LieberInstitute/fusion_twas/blob/jhpce/FUSION.compute_weights.R#L85)

Then also set the OPT_NUM_THREADS environment variable which should work while the process is alive based on ?Sys.setenv() and https://stackoverflow.com/questions/20647493/accessing-environment-variables-set-in-r-session-from-shell.

$ Rscript -e "Sys.getenv('OPT_NUM_THREADS'); opt <- list(threads = 1); Sys.setenv('OPT_NUM_THREADS' = opt$threads); Sys.getenv('OPT_NUM_THREADS')"
[1] ""
[1] "1"
$ Rscript -e "Sys.getenv('OPT_NUM_THREADS')"
[1] ""

What do you think? Also, if you prefer, I can submit a PR with these changes.

Best,
Leo

Error with locus_win setting.

I am getting this error, when running conditional analysis. Can you please explain what this error means and how one should set locu_win argument according to the results ?

14 strictly non-overlapping loci
consolidated to 1 non-overlapping loci with 100000 bp buffer
Error: dims [product 501] do not match the length of object [0]
Execution halted

Issue with post_process.R that gets resolved by forcing the code to run through a while loop at least once

Hi,

While running FUSION.post_process.R I encountered the following error:

Error in solve.default(ge_g.ld[joint.keep, joint.keep]) : 'a' is 0-diml

After running it line by line, I noticed that the error was prompted by https://github.com/gusevlab/fusion_twas/blob/master/FUSION.post_process.R#L432

That's because sum(cond.z^2 > zthresh^2) is equal to 0 and thus the code inside the while loop at https://github.com/gusevlab/fusion_twas/blob/master/FUSION.post_process.R#L414 doesn't run. Thus joint.keep is all set to FALSE by https://github.com/gusevlab/fusion_twas/blob/master/FUSION.post_process.R#L412 and the error at https://github.com/gusevlab/fusion_twas/blob/master/FUSION.post_process.R#L432 is really caused by running solve() on a 0 by 0 matrix

> ge_g.ld[joint.keep,joint.keep]
<0 x 0 matrix>

I was able to bypass this error by making the following edit:

initiate_flag <- TRUE
while ( sum(cond.z^2 > zthresh^2) != 0 || initiate_flag) {
    initiate_flag <- FALSE

at the neighboring lines to (currently) line 414 at https://github.com/gusevlab/fusion_twas/blob/master/FUSION.post_process.R#L414

Does this edit seems reasonable to you? Basically, I'm telling it to add the most conditional feature and use that one only if all else fails. In my case, sum(cond.z^2 > zthresh^2) was still equal to 0 after running through the code in the while loop once.

Best,
Leonardo

Reproducibility information

> library('sessioninfo')
> options(width = 120)
> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.5.1 Patched (2018-10-29 r75535)
 os       Red Hat Enterprise Linux Server release 6.9 (Santiago)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       US/Eastern
 date     2019-01-22

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package      * version   date       lib source
 assertthat     0.2.0     2017-04-11 [2] CRAN (R 3.5.0)
 bindr          0.1.1     2018-03-13 [1] CRAN (R 3.5.0)
 bindrcpp       0.2.2     2018-03-29 [1] CRAN (R 3.5.0)
 cli            1.0.1     2018-09-25 [1] CRAN (R 3.5.1)
 colorout     * 1.2-0     2018-05-02 [1] Github (jalvesaq/colorout@c42088d)
 colorspace     1.4-0     2019-01-13 [2] CRAN (R 3.5.1)
 corrplot     * 0.84      2017-10-16 [2] CRAN (R 3.5.0)
 crayon         1.3.4     2017-09-16 [1] CRAN (R 3.5.0)
 digest         0.6.18    2018-10-10 [1] CRAN (R 3.5.1)
 dplyr          0.7.8     2018-11-10 [1] CRAN (R 3.5.1)
 getopt         1.20.2    2018-02-16 [1] CRAN (R 3.5.0)
 ggplot2        3.1.0     2018-10-25 [1] CRAN (R 3.5.1)
 glue           1.3.0     2018-07-17 [1] CRAN (R 3.5.1)
 gtable         0.2.0     2016-02-26 [2] CRAN (R 3.5.0)
 htmltools      0.3.6     2017-04-28 [2] CRAN (R 3.5.0)
 htmlwidgets    1.3       2018-09-30 [1] CRAN (R 3.5.1)
 httpuv         1.4.5.1   2018-12-18 [2] CRAN (R 3.5.1)
 later          0.7.5     2018-09-18 [2] CRAN (R 3.5.1)
 lattice        0.20-38   2018-11-04 [3] CRAN (R 3.5.1)
 lazyeval       0.2.1     2017-10-29 [2] CRAN (R 3.5.0)
 magrittr       1.5       2014-11-22 [1] CRAN (R 3.5.0)
 Matrix         1.2-15    2018-11-01 [3] CRAN (R 3.5.1)
 munsell        0.5.0     2018-06-12 [2] CRAN (R 3.5.0)
 optparse     * 1.6.1     2019-01-15 [2] CRAN (R 3.5.1)
 pillar         1.3.1     2018-12-15 [1] CRAN (R 3.5.1)
 pkgconfig      2.0.2     2018-08-16 [1] CRAN (R 3.5.1)
 plink2R      * 1.1       2018-12-06 [1] Github (gabraham/plink2R@d74be01)
 plyr           1.8.4     2016-06-08 [2] CRAN (R 3.5.0)
 png            0.1-7     2013-12-03 [2] CRAN (R 3.5.0)
 promises       1.0.1     2018-04-13 [2] CRAN (R 3.5.0)
 purrr          0.2.5     2018-05-29 [2] CRAN (R 3.5.0)
 R6             2.3.0     2018-10-04 [2] CRAN (R 3.5.1)
 RColorBrewer * 1.1-2     2014-12-07 [2] CRAN (R 3.5.0)
 Rcpp         * 1.0.0     2018-11-07 [1] CRAN (R 3.5.1)
 RcppEigen    * 0.3.3.5.0 2018-11-24 [1] CRAN (R 3.5.1)
 rlang          0.3.1     2019-01-08 [1] CRAN (R 3.5.1)
 rmote        * 0.3.4     2018-05-02 [1] deltarho (R 3.5.0)
 scales         1.0.0     2018-08-09 [2] CRAN (R 3.5.1)
 servr          0.11      2018-10-23 [1] CRAN (R 3.5.1)
 sessioninfo  * 1.1.1     2018-11-05 [1] CRAN (R 3.5.1)
 tibble         2.0.1     2019-01-12 [1] CRAN (R 3.5.1)
 tidyselect     0.2.5     2018-10-11 [2] CRAN (R 3.5.1)
 withr          2.1.2     2018-03-15 [2] CRAN (R 3.5.0)
 xfun           0.4       2018-10-23 [1] CRAN (R 3.5.1)

[1] /users/lcollado/R/x86_64-pc-linux-gnu-library/3.5.x
[2] /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.5.x/R/3.5.x/lib64/R/site-library
[3] /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.5.x/R/3.5.x/lib64/R/library
>

Understanding top1 output in wgt.matrix

Hello,

I am seeking clarification regarding the values represented in the wgt.matrix of a gene weight for the top1 model. Specifically, how is the top SNP chosen for the gene-level model (assuming top1 is the best model), in regards to the data given in this matrix? Are you simply taking the SNP with the largest absolute value of the top1 column?

Thank you.

LASSO output did not exist

Hi,
I am a new user for fusion. I want to compute weights using my own expressional dataset. But I cannot find lasso output. The exact error I get is:

`### Performing 5 fold cross-validation

Crossval fold 1
${path}/temporary/NM_000287:PEX6:chr6:-.txt.tmp.cv.LASSO.lasso LASSO output did not exist
Crossval fold 2
${path}/temporary/NM_000287:PEX6:chr6:-.txt.tmp.cv.LASSO.lasso LASSO output did not exist
Crossval fold 3
${path}/temporary/NM_000287:PEX6:chr6:-.txt.tmp.cv.LASSO.lasso LASSO output did not exist
Crossval fold 4
${path}/temporary/NM_000287:PEX6:chr6:-.txt.tmp.cv.LASSO.lasso LASSO output did not exist
Crossval fold 5
${path}/temporary/NM_000287:PEX6:chr6:-.txt.tmp.cv.LASSO.lasso LASSO output did not exist
lasso top1
rsq NA 0.41930171599824
pval NA 2.41386094625509e-15
Computing full-sample weights
${path}/temporary/NM_000287:PEX6:chr6:-.txt.tmp.lasso LASSO output did not exist
Cleaning up`

Did I do something wrong?

Best,
Maud

Imputing Individual Sample Expression Values

I was wondering what the best way to export imputed gene expression values for each genotype into a matrix to analyze. Is that something I can change FUSION.assoc_test.R to do this?

gusevlab / fusion_twas Goto Github PK

fusion_twas's Introduction

Functional Summary-based Imputation

fusion_twas's People

Contributors

Stargazers

Watchers

Forkers

fusion_twas's Issues

Can BEDMatrix be used?

Initial BEDMatrix dependent function

data.table flavor

Moving forward

Reproducibility

Marginal Z-scores (used for top1)

Threads use in dependencies

Suggestion

Reproducibility information

Recommend Projects

Recommend Topics

Recommend Org

Can `BEDMatrix` be used?

Initial `BEDMatrix` dependent function