The matrixeqtl from andreyshabalin

`findInterval` error in certain workflows

We've only seen this error when running MatrixEQTL on the exon-exon splice junction matrix. All of the junctions are sorted


2018-01-25 14:16:09 running MatrixEQTL
Matching data files and location files
4360453of4360453 genes matched
58109566of58109566 SNPs matched

Task finished in 36.62 seconds
Reordering SNPs

Task finished in 286.049 seconds
Reordering genes

Task finished in 33.687 seconds
Processing covariates
Task finished in 0.00199999999995271 seconds
Processing gene expression data (imputation, residualization)
Task finished in 7.25300000000004 seconds
Creating output file(s)
Error in findInterval(sn.l, ge.r + cisDist + 1) : 
  'vec' must be sorted non-decreasingly and not contain NAs
Calls: Matrix_eQTL_main -> findInterval
Execution halted

We've seen this a few times for different datasets, and usually removing the more lowly expressed junctions will allow the code to run but we'd like to better troubleshoot this issue.

Help understanding option 'noFDRsaveMemory' in Matrix eQTL

Dear Dr.Shabalin,

I am writing to seek your help in understanding the 'noFDRsaveMemory' option in Matrix eQTL function. Using this option as FALSE and TRUE, results in outputs with significantly different raw p-values. For example, setting this option as FALSE, returns p-values as low as e-300, while TRUE returns p-values ~e-15

I've read the paper and looked at the GitHub page, but couldn't get a better understanding of this behaviour. I would really appreciate it if you could shed some light on this, or point me to literature that can help to explain this behaviour.

Thank you

`noFDRsaveMemory = TRUE` no longer works with `output_file_name.cis`

Minimal reproducible example:

First, load the package example data into the appropriate data structures:

library(MatrixEQTL)

# Toy data locations
base.dir = find.package('MatrixEQTL');
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Load toy data as per MatrixEQTL vignette
snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0) {
  cvrt$LoadFile(covariates_file_name);
}

Matrix_eQTL_main parameters that cause the error:

eQTLs <- Matrix_eQTL_main(
    snps = snps,
    gene = gene,
    cvrt = cvrt,
    pvOutputThreshold = 0, # ignore Trans eQTLs 
    pvOutputThreshold.cis = 1, # but keep all cis summary stats
    snpspos = snpspos,
    genepos = genepos,
    output_file_name=NULL,
    output_file_name.cis="test.txt",
    noFDRsaveMemory=TRUE
)

Matching data files and location files
10of10 genes matched
15of15 SNPs matched

Task finished in 0.00499999999999545 seconds
Processing covariates
Task finished in 0.00200000000006639 seconds
Processing gene expression data (imputation, residualization)
Task finished in 0.000999999999976353 seconds
Creating output file(s)
Task finished in 0.0040000000000191 seconds
Performing eQTL analysis
100.00% done, 100 cis-eQTLs
Error in saver.cis$getResults(gene, snps, n.tests.cis) : 
  unused arguments (gene, snps, n.tests.cis)

The final error message is not returned if we set noFDRsaveMemory = FALSE.

p-values very small

Why do extremely small p-values appear in the Matrix results? Are these due to calculation issues, and can they be used directly?
head cis_eqtlout
SNP gene beta t-stat p-value FDR
Chr1_1025645 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1025645 MSTRG.343.1 4.09 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1307252 MSTRG.341.1 4.35666666666667 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1307252 MSTRG.343.1 1.36333333333333 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1505533 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1516057 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1505533 MSTRG.343.1 4.09 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1516057 MSTRG.343.1 4.09 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1573233 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304

Data set too large?

Hello,

I am trying to use MatrixEQTL to detect DNA methylation QTL. I am currently using a window size of 1 kb to detect cis-QTL.

I am suspicious that I may have more data than MatrixEQTL is designed to handle. I am currently using only chr1, but have 1.1 million CpG sites I am trying to test. During QTL mapping I can only progress to 7.71% done, not 100%. This is my last line:

7.71% done, 4,185,210 cis-eQTLs
7.71% done, 4,185,817 cis-eQTLs
Task finished in 489.067 seconds

Do you have any insight if this could be due to the large number of CpG sites or could it be another issue? Thanks!

FDR: how is it calculated by the package?

hi there devs,

Thanks for this awesome package. I'm using ~120 genes and ~44K SNPs to perform cis eqtl analysis. I use 2Mb as cis eqtl boundary.

pvOutputThreshold_cis = 0.1;
pvOutputThreshold_tra = 0.1;
pvOutputThreshold = 0.1;

me = Matrix_eQTL_main(
    snps=snps,    gene=gene, 
    output_file_name = NULL,
    pvOutputThreshold=pvOutputThreshold,
    useModel=useModel,
    errorCovariance=errorCovariance,
    verbose=TRUE,pvalue.hist=FALSE,min.pv.by.genesnp = FALSE,noFDRsaveMemory = FALSE,
    snpspos=snpspos,cisDist=2e+06,genepos=genepos,pvOutputThreshold.cis=pvOutputThreshold_cis);

df_eqtl<-me$cis$eqtls

After analysis, I extract cis eqtls. However, I'd like to replicate the FDr generated by the matrix eqtl function. First, I use p.adjust method FDR in the output the p-value are little off (better).

df_eqtl_jointed$TEST_FDR<-p.adjust(p=df_eqtl$pvalue,n=nrow(df_eqtl),method="fdr")

snps	gene	chr	pos	statistic	pvalue	FDR	beta	gene_type	TEST_FDR
16:64975370:G:A	ENSG00000166592	16	64975370	8.694963	6.32E-15	7.87E-09	17.23002	protein_coding	8.17E-10
16:65320793:G:C	ENSG00000166592	16	65320793	7.939682	4.80E-13	1.79E-08	10.07585	protein_coding	1.85E-09
16:65321249:C:T	ENSG00000166592	16	65321249	7.939682	4.80E-13	1.79E-08	10.07585	protein_coding	1.85E-09
16:65321591:T:G	ENSG00000166592	16	65321591	7.939682	4.80E-13	1.79E-08	10.07585	protein_coding	1.85E-09
16:65322806:T:C	ENSG00000166592	16	65322806	7.939682	4.80E-13	1.79E-08	10.07585	protein_coding	1.85E-09

Number of rows in the cis eqtl output= 129196

If I use the number of tests when calculating FDR, head(me$param)

$ntests
[1] 30385434

p.adjust(p=df_eqtl_jointed$pvalue,n=30385434,method="fdr")

The output looks like:

snps	gene	start_hg19	end_hg19	pos	statistic	pvalue	FDR	beta	chrom	TEST_FDR_NTEST
16:64975370:G:A	ENSG00000166592	66955582	66959547	64975370	8.694963	6.32E-15	7.87E-09	17.23002	chr16	1.92E-07
16:65320793:G:C	ENSG00000166592	66955582	66959547	65320793	7.939682	4.80E-13	1.79E-08	10.07585	chr16	4.36E-07
16:65321249:C:T	ENSG00000166592	66955582	66959547	65321249	7.939682	4.80E-13	1.79E-08	10.07585	chr16	4.36E-07
16:65321591:T:G	ENSG00000166592	66955582	66959547	65321591	7.939682	4.80E-13	1.79E-08	10.07585	chr16	4.36E-07

I'm interested to know on FDR because I'd subset genes from the analysis. So, knowing correct method would help me to calculate FDR for them separately.

Also, in the output, is FDR calculated for each gene individually? Just curious.

Thank you.

Thousands of results are NA

Hi,

Using MatrixEQTL version 2.2 I'm running into a weird result that might be resolved by lowering the p-value threshold (as in #1). Still, maybe you've seen this before and have some idea on what could be done.

The issue is that many of the results in me$cis$eqtls (I saved the output of Matrix_eQTL_main() to the object me) are NA, including the snp and gene ids. So I can't check if either the snp or the gene id has some weird property.

Weird results

> table(!is.na(me$cis$eqtls$snps))

  FALSE    TRUE
9431253   23828
> head(me$cis$eqtls[is.na(me$cis$eqtls$snps), ])
      snps gene statistic pvalue FDR beta
23829 <NA> <NA>        NA     NA  NA   NA
23830 <NA> <NA>        NA     NA  NA   NA
23831 <NA> <NA>        NA     NA  NA   NA
23832 <NA> <NA>        NA     NA  NA   NA
23833 <NA> <NA>        NA     NA  NA   NA
23834 <NA> <NA>        NA     NA  NA   NA

Call used:

## Call used:
me <- Matrix_eQTL_main(snps = meth, gene = exprinfo, 
    output_file_name.cis = paste0('.', cpg, '_', opt$feature,
        '.txt'), # invis file, temporary
    pvOutputThreshold = 0, pvOutputThreshold.cis = 5e-4, 
	useModel = modelLINEAR,
	snpspos = methpos, genepos = exprpos, cisDist = 1000)

File checks

> table(is.na(exprpos$spliceid))

FALSE
60252
> table(is.na(exprpos$chr))

FALSE
60252
> sapply(exprpos, class)
   spliceid         chr       start         end
"character" "character"   "integer"   "integer"

> table(is.na(methpos$cname))

   FALSE
18664892
> table(is.na(methpos$chr))

   FALSE
18664892
> sapply(methpos, class)
      cname         chr         pos
"character" "character"   "integer"

> identical(methpos$cname, rownames(meth))
[1] TRUE
> identical(exprpos$spliceid, rownames(exprinfo))
[1] TRUE

The code works as expected with one methylation data set but not in a second one: both use the same gene information and have the same number of samples. The second one has more non-zero entries in the meth object and we had to use a lower file slice size (meth$fileSliceSize <- 300 instead of 2000) to keep the memory from blowing up. That's why I'm guessing that lowering the p-value threshold could help. If needed, I can see if this happens with the data from a small chromosome and send you the files via email (like we did for #3).

Best,
Leonardo

Your thoughts on some unusual cross-tissue results - possibly due to dummy covariates?

Hi!
Thanks for this great tool - It's super easy to use and I've had a fun time learning how to apply it on my data. I've run an eQTL analysis I'm pretty happy with on two tissues separately, lets call them tissue 1 and tissue 2. For the most part, most ~73% high-confidence eQTLs are common to both tissue 1 and tissue 2. I've plotted the results and they generally look very nice. Interestingly, a sizeable chunk of the remaining cis-eQTLs were specific to tissue 2. This was very exciting. However, when I've plotted those results in both tissues I am seeing that the expression patterning looks almost identical in both tissues, and isn't tissue 2-specific at all - however tissue 2 is nonetheless highly significant whereas tissue 1 for some reason was not significant at all. For example, the image below is what basically all the "tissue 2-specific" eQTLs look like;

Stats for tissue 1 are NS/NA because they were never saved in the eQTL output to begin with, due to not passing my specified p value threshold (p < 1e-8). My only guess is that the differential significance (despite very similar expression) is due to my covariates, most likely sequencing batch effect, due to the others (sex, BMI, age) being identical across tissues 1 and 2. Furthermore, I wonder if this is happening because I coded sequencing batch using dummy variables, which I've never done before. I just wanted to know if you think the differences in significance arise from the batch effect or if there is something else I ought to consider.

Thanks again!
D

Is this P-value comparable to that of GWAS analysis?

Thanks for your useful software. The p values I get by using Matrix_eQTL are very small, and some of them can reach p=2.2250738585072e-308, but the p values of GWAS or eGWAS are generally not more than 1e-10. Are the two methods the same? Can the p values obtained by the two methods be calculated together? How should the significance threshold of Matrix_eQTL be determined?

Looking forward to your reply

Wrong number of samples in the matrix of covariates

The number of samples is right, but is error I have 19 samples in covariates

Why pvOutputThreshold affect p value?

Hi,

Some strange things happen when I change pvOutputThreshold.

firt run

library(MatrixEQTL)
useModel = modelLINEAR
SNP_file_name = "GoodGene.exp.geno.numeric.xls"
expression_file_name = "GoodGene.exp.ra.exp.xls"
covariates_file_name = "GoodGene.exp.cov2.xls"
output_file_name = "GoodGene.exp.0.05.results.xls"
pvOutputThreshold = 1.43771984533585e-08

snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)

gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)

cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name)
}

me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
verbose = TRUE,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)

pdf(file = paste0(output_file_name, ".pdf"), width = 8, height = 8)
print(me$param$dfFull)

##second run
library(MatrixEQTL)
useModel = modelLINEAR
SNP_file_name = "GoodGene.exp.geno.numeric.xls"
expression_file_name = "GoodGene.exp.ra.exp.xls"
covariates_file_name = "GoodGene.exp.cov2.xls"
output_file_name = "GoodGene.exp.0.01.results.xls"
pvOutputThreshold = 2.8754396906717e-09

snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)

gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)

cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name)
}

me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
verbose = TRUE,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)

pdf(file = paste0(output_file_name, ".pdf"), width = 8, height = 8)
print(me$param$dfFull)

The strange thing is that there are 503951 eQTL in GoodGene.exp.0.05.results.xls (and 294781 eQTL' p-value <= 2.8754396906717e-09), but there are only 1577 eQTL in GoodGene.exp.0.01.results.xls. Why pvOutputThreshold affect p value, is this a bug of matrixEQTL?
GoodGene.exp.0.01.results.xls.gz
GoodGene.exp.0.05.results.xls.gz

Best,
Kun

Excluding NA instead of imputing

Hi, if I understand correctly, missing values are imputed using the mean. In my data, the same genes are not measured for each person, so I have a lot of missing data. I do not want to impute this data, I simply want to exclude the people that don't have this gene measured. Right now, my only solution to this is to run MatrixEQTL separately per gene, so that I can exclude specific people that were not measured for each gene. By doing this, however, I kind of lose the benefit of how fast MatrixEQTL is for large data, and it makes my whole pipeline take much longer.

Understanding "Colinear or zero covariates detected" error message

Hi @andreyshabalin,

Thank you for an amazing resource!

I am troubleshooting what I think is an environment-related issue when running Matrix_eQTL_Main(). In one environment (generic set of R libraries on an HPC cluster), the code runs as expected without errors, but when I run the same code and data in a different environment (conda or singularity container), I receive: "Colinear or zero covariates detected"

I believe this happens here under Covariates processing:

MatrixEQTL/R/Matrix_eQTL_engine.R

Lines 1502 to 1527 in 7e5c1cb

    
           ########################## Covariates processing ########################## 
        
           { 
        
               status("Processing covariates"); 
        
               if( useModel == modelLINEAR_CROSS ){ 
        
                   last.covariate = as.vector(tail( cvrt$getSlice(cvrt$nSlices()), 1)); 
        
               }         
        
               if( cvrt$nRows()>0 ){ 
        
                   cvrt$SetNanRowMean(); 
        
                   cvrt$CombineInOneSlice(); 
        
                   cvrt = rbind(matrix(1,1,snps$nCols()), cvrt$getSlice(1)); 
        
               } else { 
        
                   cvrt = matrix(1,1,snps$nCols()); 
        
               } 
        
               # Correct for the error covariance structure 
        
               if( length(correctionMatrix)>0 ){ 
        
                   cvrt = cvrt %*% correctionMatrix; 
        
               } 
        
               # Orthonormalize covariates 
        
               # status("Orthonormalizing covariates"); 
        
               q = qr(t(cvrt)); 
        
               if( min(abs(diag(qr.R(q)))) < .Machine$double.eps * snps$nCols() ){ 
        
                   stop("Colinear or zero covariates detected"); 
        
               } 
        
               cvrt = t( qr.Q(q) ); 
        
               rm(q); 
        
           }

My .Machine$double.eps is the same across environments and it looks like the qr functions are referencing the same namespace.

I am not sure where to go or what to try next so if you could provide additional insights or suggestions, that would be greatly appreciated!

Matrix_eQTL_main() outputs error when `output_file_name` is not provided

Minimal reproducible example:

First, load the package example data into the appropriate data structures:

library(MatrixEQTL)

# Toy data locations
base.dir = find.package('MatrixEQTL');
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Load toy data as per MatrixEQTL vignette
snps = SlicedData$new();
snps$fileDelimiter = "\t";      # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1;          # one row of column labels
snps$fileSkipColumns = 1;       # one column of row labels
snps$fileSliceSize = 2000;      # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);

gene = SlicedData$new();
gene$fileDelimiter = "\t";      # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1;          # one row of column labels
gene$fileSkipColumns = 1;       # one column of row labels
gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);

cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t";      # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1;          # one row of column labels
cvrt$fileSkipColumns = 1;       # one column of row labels
if(length(covariates_file_name)>0) {
  cvrt$LoadFile(covariates_file_name);
}

Run Matrix_eQTL_main() with the data but otherwise default arguments:

eQTLs <- Matrix_eQTL_main(snps = snps, gene = gene, cvrt = cvrt)

Error in Matrix_eQTL_main(snps = snps, gene = gene, cvrt = cvrt) : 
  nzchar(output_file_name) is not TRUE

It looks like the default for the argument, has either changed recently to "", or it has always been output_file_name = "" but this is no longer interpreted internally the same as output_file_name = NULL.

Preventing imputation of missing snps?

Thanks for the great package!

I would like to run MatrixEQTL on only samples that are heterozygous for the respective snp of interest.
For this I input a vcf file in which all homozygous snps are marked as missing values. However, since Matrix_eQTL_main() uses SetNanRowMean() to impute missing values during the preprocessing of the snps file this approach doesn't really work.

Is there any way to prevent the imputation of missing snp values in the vcf file or to make MatrixEQTL selectively only compare heterozygous samples for each snp-gene combination tested?
Thanks for your help!

Non-parametric linear regression

Hello Andreyshabalin,
The MatrixEQTL is very classical and robust in our eQTL analysis. My question here is more like a request instead of a issue. Since I am using single cell sequencing data for QTL analysis, is there a way to add useModel = "non-parametric linear model" to your options to fit it to single cell data distribution?
Thanks,
Yijun

Order of Matrix eQTL output

Hi Andrey,
Thanks for this amazing tool! In the output of Matrix eQTL, the most significant association ranks the first. When I was using it to map eQTLs in my simulated datasets, I found that some associations had different t-statistics but the same p-value (I think they may reach the minimum that can be calculated/displayed by R: 2.225074e-308). For these associations (or SNPs, they were for the same eGene), they are ordered by genomic locations. I was wondering whether it would make more sense if these SNPs could be sorted by t-statstics, which indicate significance levels.
Best,
Qinqin

pvOutputThreshold

if pvOutputThreshold and pvOutputThreshold.cis are not set, what will be the outcome？
What are the default values of pvOutputThreshold and pvOutputThreshold.cis?

Significant trans eQTLs are no longer significant when cis window increases

Hi there,

I'm currently running MatrixEQTL to do methQTLs and I noticed something peculiar in my results.

Last week, I ran MatrixEQTL with separate cis and trans p-value thresholds. I used a cis p-value threshold of 1e-8 and a trans p-value threshold of 1e-14. I originally set a cis window of 10 kB. When I got my results back, there were 40,000 hits for trans, but zero hits for cis. I noticed that for a few trans hits, the associated SNP and methylation site were less than 50 kB apart. For one trans hit, they were 32 kB apart, for another, 22 kB, etc.

I reran MatrixEQTL with the same p-values, but I changed the cis window to 50 kB. I still have 0 cis hits. I am confused because my trans hits that were within 50 kB should now be cis hits, given I increased the cis window and the trans p-value is more stringent than the cis p-value.

I was wondering if you could help me understand why this might be occurring. Furthermore, if I ran MatrixEQTL without specifying a cis window, do you think this problem would be solved? Thank you in advance for your help!

beta value of MatrixEQTL results

Hi
Using MatrixEQTL R package, I am performing both cis and trans eQTL analyses in order to find some SNPs associated with some genes of interest. I try to understand the Beat value in the results, What does "Effect size estimate" mean here? I am wondering whether it can represent a trend in changes in gene expression levels in different genotypes.

trans-QTL was defined as cis-QTL

Dear Andrey Shabalin

I have met a strange problem while using MatrixEQTL. In my study, some results that should have been defined as trans QTL were defined as cis QTLs. In other words, some cis-QTLs have a distance over 1e6 bp between SNP and its gene. The distance is the minimum value of the absolute value of the distance between SNP and both ends of the gene.

    distance = min( abs( snp_position - gene_start ), abs( snp_position - gene_end )  )

there are some of the wrong QTLs:

qtl_type	SNP_ID	SNP_chr	SNP_loc	ref	alt	gene_ID	gene_chr	gene_start	gene_end	gene_strand	beta	t_stat	p	distance
cis	rs7088262	chr10	49836108	A	G	BMS1P5	chr10	46737615	46762892	-;chr10	-0.23	-3.5	4.88E-04	3.073216
cis	rs61838734	chr10	49681118	T	C	BMS1P5	chr10	46737615	46762892	-;chr10	0.14	3.74	1.91E-04	2.918226
cis	rs72783091	chr10	49680879	A	C	BMS1P5	chr10	46737615	46762892	-;chr10	0.13	3.63	2.92E-04	2.917987
cis	rs200554992	chr10	49680859	TA	T	BMS1P5	chr10	46737615	46762892	-;chr10	0.13	3.47	5.34E-04	2.917967
cis	rs56342922	chr10	49680623	A	G	BMS1P5	chr10	46737615	46762892	-;chr10	0.14	3.68	2.46E-04	2.917731
cis	rs7093726	chr10	49678147	T	G	BMS1P5	chr10	46737615	46762892	-;chr10	0.14	3.68	2.46E-04	2.915255
cis	rs7069220	chr10	49678060	C	T	BMS1P5	chr10	46737615	46762892	-;chr10	0.13	3.51	4.59E-04	2.915168
cis	rs3906889	chr10	49677182	G	C	BMS1P5	chr10	46737615	46762892	-;chr10	0.13	3.63	3.02E-04	2.91429

some of the QTLs even have SNP and gene in different chromosomes:

cis	rs2869546	chr15	78907345	C	T	SNORA63	chr3	186505088	186505222	+	-0.15	-3.82	1.41E-04
cis	rs7182583	chr15	78899210	G	C	SNORA63	chr3	186505088	186505222	+	0.14	3.57	3.77E-04
cis	rs9920189	chr15	78666311	G	T	SNORA63	chr3	186505088	186505222	+	0.14	3.62	3.07E-04

Do you know why does it happen and how to solve it?

Long Vector Error when compiling results

Hi,

I've ran MatrixEQTL (SNPs = 561,963, Expression = 422,070, No of Samples = 87). At the last stage, after the 100% completion message, I get the following error:

Error in pmin.int(x, val) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138

I've tried in Rmd and R scripts, to see if that made the difference.

Any suggestions?

R Version: 3.3.2 (Sincere Pumpkin Patch)
RAM on machine: 256GB
OS: linux-gnu (x86_64) - Ubuntu 16.04.1 LTS

sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rcpp_0.12.9      ggplot2_2.2.1    dplyr_0.5.0      readr_1.0.0      MatrixEQTL_2.1.1

loaded via a namespace (and not attached):
 [1] assertthat_0.1   grid_3.3.2       R6_2.2.0         plyr_1.8.4       gtable_0.2.0     DBI_0.5-1        magrittr_1.5     scales_0.4.1     lazyeval_0.2.0   tools_3.3.2      munsell_0.4.3    colorspace_1.2-6 tibble_1.2

Error in findInterval

Hello, I am encountering this error when running Matrix_eQTL_main:

Error in findInterval(x = stats.for.hist, vec = statbins1) :
'vec' must be sorted non-decreasingly and not contain NAs
Calls: Matrix_eQTL_main -> -> tabulate -> findInterval

Another dataset I generated using the same script works without any issue when I run Matrix_eQTL_main. Basically, I splitted patients and controls samples from the main data files to run the eQTL separately. The patient group files run correctly, but not the controls.
I cannot understand the problem.
Can I please share by email my data and script? I was trying here but probably the files are too big to be shared.
Thanks
Ignazio

calculation only cis-eQTL

Is there any option to calculate cis-eQTL ?
currently is does both cis and trans at the same time, but in many case we need to do only one of them.
Thanks!

significant p value

Hi,
I am currently using matrix-eQTL for eQTL analysis of my wheat dataset. Is there a way to determine the best threshold for identifying cis and trans eQTLs? For example, how can I choose the value for pvOutputThreshold_cis and pvOutputThreshold_trans? I used 10-5 in both cases, but it provided many cis and trans eqtl, which may not be true. Also, I used the min.pv.by.genesnp = =T option, it provided me the p value number for each gene and snp, but I am confused about which one I should choose for the threshold. maybe I am missing something. Please guide me in this regard.
Thank you.

how to add kinship matrix

In the matirxEQTL tutorial, this process does not add a kinship matrix to run the eQTL analysis. However, My data is family-level, and I want to add the kinship to run the eQTL analysis, How to do it? Thanks~

Why are the FDR and P-value values the same for many SNP loci?

hello,teacher,my result that have a lot SNP loci which the FDR and P-value values are same,i think what one reason is that the genotyping results for some SNPs are consistent，while some SNPs with inconsistent genotyping results also have the same FDR and P-value values.thanks

'long vectors not supported yet' on larger dataset

Hello,

I've had a lot of success running MatrixEQTL on fairly large datasets in the past. Currently, I'm attempting to run "Matrix_eQTL_engine" for ~600k SNPs and ~30k genes with 500 subjects (CentoOS with 256GB RAM). I did read your thread with @AndrewSkelton, and my first reaction was to lower the p-value. I tried lowering to 1e-4,1e-6, and 1e-9; however, I still get the following error: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138. I also reduced my dataset down to 100k SNPs and 20k genes and that ran with no issues. So, I'm really not sure how to proceed.

Please let me know if I can provide any additional information. Any help is greatly appreciated.

Best,
John

Here is the sessionInfo:

Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] vegan_2.4-6 lattice_0.20-35
[3] permute_0.9-4 snpStats_1.26.0
[5] Matrix_1.2-9 survival_2.41-3
[7] plyr_1.8.4 VariantAnnotation_1.22.0
[9] Rsamtools_1.28.0 Biostrings_2.44.0
[11] XVector_0.16.0 SummarizedExperiment_1.6.1
[13] DelayedArray_0.2.0 matrixStats_0.53.1
[15] Biobase_2.36.1 GenomicRanges_1.28.1
[17] GenomeInfoDb_1.12.0 IRanges_2.10.0
[19] S4Vectors_0.14.0 BiocGenerics_0.22.0
[21] MatrixEQTL_2.2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 compiler_3.4.0 GenomicFeatures_1.28.0
[4] prettyunits_1.0.2 bitops_1.0-6 tools_3.4.0
[7] zlibbioc_1.22.0 progress_1.1.2 biomaRt_2.35.11
[10] digest_0.6.15 nlme_3.1-131 BSgenome_1.44.0
[13] RSQLite_1.1-2 memoise_1.1.0 mgcv_1.8-17
[16] DBI_0.6-1 GenomeInfoDbData_0.99.0 cluster_2.0.6
[19] rtracklayer_1.36.6 stringr_1.3.1 httr_1.3.1
[22] grid_3.4.0 R6_2.2.2 AnnotationDbi_1.38.2
[25] XML_3.98-1.10 BiocParallel_1.10.1 magrittr_1.5
[28] MASS_7.3-47 splines_3.4.0 GenomicAlignments_1.12.0
[31] assertthat_0.2.0 stringi_1.2.3 RCurl_1.95-4.10

Multiallelic variants

Dear Dr. Shabalin,

I amtrying to perform an eQTL analysis using your tool, using genotypes obtained for HLA genes. These genes have many different possible alleles, which makes annotating them with either 0,1 or 2 problematic.

Is there a way to input such multiallelic variants in MatrixEQTL?

Kind regards,
Tijs

Categorical covariates with >2 groups?

Hello!

We would like to use MatrixEQTL to test for the interaction between genotype and a categorical covariate with 4 groups. To do this, will we have to use multiple dummy variables, or is there an option to force Matrix EQTL to treat a covariate with >2 groups as categorical?
Thank you!
Elyssa

	########################## Covariates processing ##########################
	{
	status("Processing covariates");
	if( useModel == modelLINEAR_CROSS ){
	last.covariate = as.vector(tail( cvrt$getSlice(cvrt$nSlices()), 1));
	}
	if( cvrt$nRows()>0 ){
	cvrt$SetNanRowMean();
	cvrt$CombineInOneSlice();
	cvrt = rbind(matrix(1,1,snps$nCols()), cvrt$getSlice(1));
	} else {
	cvrt = matrix(1,1,snps$nCols());
	}
	# Correct for the error covariance structure
	if( length(correctionMatrix)>0 ){
	cvrt = cvrt %*% correctionMatrix;
	}
	# Orthonormalize covariates
	# status("Orthonormalizing covariates");
	q = qr(t(cvrt));
	if( min(abs(diag(qr.R(q)))) < .Machine$double.eps * snps$nCols() ){
	stop("Colinear or zero covariates detected");
	}
	cvrt = t( qr.Q(q) );
	rm(q);
	}

andreyshabalin / matrixeqtl Goto Github PK

matrixeqtl's People

Contributors

Stargazers

Watchers

Forkers

matrixeqtl's Issues

Weird results

Call used:

File checks

firt run

Recommend Projects

Recommend Topics

Recommend Org