andreyshabalin / matrixeqtl Goto Github PK
View Code? Open in Web Editor NEWMatrix eQTL: Ultra fast eQTL analysis via large matrix operations
Matrix eQTL: Ultra fast eQTL analysis via large matrix operations
We've only seen this error when running MatrixEQTL on the exon-exon splice junction matrix. All of the junctions are sorted
2018-01-25 14:16:09 running MatrixEQTL
Matching data files and location files
4360453of4360453 genes matched
58109566of58109566 SNPs matched
Task finished in 36.62 seconds
Reordering SNPs
Task finished in 286.049 seconds
Reordering genes
Task finished in 33.687 seconds
Processing covariates
Task finished in 0.00199999999995271 seconds
Processing gene expression data (imputation, residualization)
Task finished in 7.25300000000004 seconds
Creating output file(s)
Error in findInterval(sn.l, ge.r + cisDist + 1) :
'vec' must be sorted non-decreasingly and not contain NAs
Calls: Matrix_eQTL_main -> findInterval
Execution halted
We've seen this a few times for different datasets, and usually removing the more lowly expressed junctions will allow the code to run but we'd like to better troubleshoot this issue.
Dear Dr.Shabalin,
I am writing to seek your help in understanding the 'noFDRsaveMemory' option in Matrix eQTL function. Using this option as FALSE and TRUE, results in outputs with significantly different raw p-values. For example, setting this option as FALSE, returns p-values as low as e-300, while TRUE returns p-values ~e-15
I've read the paper and looked at the GitHub page, but couldn't get a better understanding of this behaviour. I would really appreciate it if you could shed some light on this, or point me to literature that can help to explain this behaviour.
Thank you
Minimal reproducible example:
First, load the package example data into the appropriate data structures:
library(MatrixEQTL)
# Toy data locations
base.dir = find.package('MatrixEQTL');
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
# Load toy data as per MatrixEQTL vignette
snps = SlicedData$new();
snps$fileDelimiter = "\t"; # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name);
}
Matrix_eQTL_main parameters that cause the error:
eQTLs <- Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
pvOutputThreshold = 0, # ignore Trans eQTLs
pvOutputThreshold.cis = 1, # but keep all cis summary stats
snpspos = snpspos,
genepos = genepos,
output_file_name=NULL,
output_file_name.cis="test.txt",
noFDRsaveMemory=TRUE
)
Matching data files and location files
10of10 genes matched
15of15 SNPs matched
Task finished in 0.00499999999999545 seconds
Processing covariates
Task finished in 0.00200000000006639 seconds
Processing gene expression data (imputation, residualization)
Task finished in 0.000999999999976353 seconds
Creating output file(s)
Task finished in 0.0040000000000191 seconds
Performing eQTL analysis
100.00% done, 100 cis-eQTLs
Error in saver.cis$getResults(gene, snps, n.tests.cis) :
unused arguments (gene, snps, n.tests.cis)
The final error message is not returned if we set noFDRsaveMemory = FALSE
.
Why do extremely small p-values appear in the Matrix results? Are these due to calculation issues, and can they be used directly?
head cis_eqtlout
SNP gene beta t-stat p-value FDR
Chr1_1025645 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1025645 MSTRG.343.1 4.09 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1307252 MSTRG.341.1 4.35666666666667 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1307252 MSTRG.343.1 1.36333333333333 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1505533 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1516057 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1505533 MSTRG.343.1 4.09 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1516057 MSTRG.343.1 4.09 Inf 2.2250738585072e-308 2.91956304784701e-304
Chr1_1573233 MSTRG.341.1 13.07 Inf 2.2250738585072e-308 2.91956304784701e-304
Hello,
I am trying to use MatrixEQTL to detect DNA methylation QTL. I am currently using a window size of 1 kb to detect cis-QTL.
I am suspicious that I may have more data than MatrixEQTL is designed to handle. I am currently using only chr1, but have 1.1 million CpG sites I am trying to test. During QTL mapping I can only progress to 7.71% done, not 100%. This is my last line:
7.71% done, 4,185,210 cis-eQTLs
7.71% done, 4,185,817 cis-eQTLs
Task finished in 489.067 seconds
Do you have any insight if this could be due to the large number of CpG sites or could it be another issue? Thanks!
hi there devs,
Thanks for this awesome package. I'm using ~120 genes and ~44K SNPs to perform cis eqtl analysis. I use 2Mb as cis eqtl boundary.
pvOutputThreshold_cis = 0.1;
pvOutputThreshold_tra = 0.1;
pvOutputThreshold = 0.1;
me = Matrix_eQTL_main(
snps=snps, gene=gene,
output_file_name = NULL,
pvOutputThreshold=pvOutputThreshold,
useModel=useModel,
errorCovariance=errorCovariance,
verbose=TRUE,pvalue.hist=FALSE,min.pv.by.genesnp = FALSE,noFDRsaveMemory = FALSE,
snpspos=snpspos,cisDist=2e+06,genepos=genepos,pvOutputThreshold.cis=pvOutputThreshold_cis);
df_eqtl<-me$cis$eqtls
After analysis, I extract cis eqtls. However, I'd like to replicate the FDr generated by the matrix eqtl function. First, I use p.adjust
method FDR
in the output the p-value are little off (better).
df_eqtl_jointed$TEST_FDR<-p.adjust(p=df_eqtl$pvalue,n=nrow(df_eqtl),method="fdr")
snps | gene | chr | pos | statistic | pvalue | FDR | beta | gene_type | TEST_FDR |
---|---|---|---|---|---|---|---|---|---|
16:64975370:G:A | ENSG00000166592 | 16 | 64975370 | 8.694963 | 6.32E-15 | 7.87E-09 | 17.23002 | protein_coding | 8.17E-10 |
16:65320793:G:C | ENSG00000166592 | 16 | 65320793 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | protein_coding | 1.85E-09 |
16:65321249:C:T | ENSG00000166592 | 16 | 65321249 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | protein_coding | 1.85E-09 |
16:65321591:T:G | ENSG00000166592 | 16 | 65321591 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | protein_coding | 1.85E-09 |
16:65322806:T:C | ENSG00000166592 | 16 | 65322806 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | protein_coding | 1.85E-09 |
Number of rows in the cis eqtl output= 129196
If I use the number of tests when calculating FDR, head(me$param)
$ntests
[1] 30385434
p.adjust(p=df_eqtl_jointed$pvalue,n=30385434,method="fdr")
The output looks like:
snps | gene | start_hg19 | end_hg19 | pos | statistic | pvalue | FDR | beta | chrom | TEST_FDR_NTEST |
---|---|---|---|---|---|---|---|---|---|---|
16:64975370:G:A | ENSG00000166592 | 66955582 | 66959547 | 64975370 | 8.694963 | 6.32E-15 | 7.87E-09 | 17.23002 | chr16 | 1.92E-07 |
16:65320793:G:C | ENSG00000166592 | 66955582 | 66959547 | 65320793 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | chr16 | 4.36E-07 |
16:65321249:C:T | ENSG00000166592 | 66955582 | 66959547 | 65321249 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | chr16 | 4.36E-07 |
16:65321591:T:G | ENSG00000166592 | 66955582 | 66959547 | 65321591 | 7.939682 | 4.80E-13 | 1.79E-08 | 10.07585 | chr16 | 4.36E-07 |
I'm interested to know on FDR because I'd subset genes from the analysis. So, knowing correct method would help me to calculate FDR for them separately.
Also, in the output, is FDR calculated for each gene individually? Just curious.
Thank you.
Hi,
Using MatrixEQTL
version 2.2 I'm running into a weird result that might be resolved by lowering the p-value threshold (as in #1). Still, maybe you've seen this before and have some idea on what could be done.
The issue is that many of the results in me$cis$eqtls
(I saved the output of Matrix_eQTL_main()
to the object me
) are NA
, including the snp and gene ids. So I can't check if either the snp or the gene id has some weird property.
> table(!is.na(me$cis$eqtls$snps))
FALSE TRUE
9431253 23828
> head(me$cis$eqtls[is.na(me$cis$eqtls$snps), ])
snps gene statistic pvalue FDR beta
23829 <NA> <NA> NA NA NA NA
23830 <NA> <NA> NA NA NA NA
23831 <NA> <NA> NA NA NA NA
23832 <NA> <NA> NA NA NA NA
23833 <NA> <NA> NA NA NA NA
23834 <NA> <NA> NA NA NA NA
## Call used:
me <- Matrix_eQTL_main(snps = meth, gene = exprinfo,
output_file_name.cis = paste0('.', cpg, '_', opt$feature,
'.txt'), # invis file, temporary
pvOutputThreshold = 0, pvOutputThreshold.cis = 5e-4,
useModel = modelLINEAR,
snpspos = methpos, genepos = exprpos, cisDist = 1000)
> table(is.na(exprpos$spliceid))
FALSE
60252
> table(is.na(exprpos$chr))
FALSE
60252
> sapply(exprpos, class)
spliceid chr start end
"character" "character" "integer" "integer"
> table(is.na(methpos$cname))
FALSE
18664892
> table(is.na(methpos$chr))
FALSE
18664892
> sapply(methpos, class)
cname chr pos
"character" "character" "integer"
> identical(methpos$cname, rownames(meth))
[1] TRUE
> identical(exprpos$spliceid, rownames(exprinfo))
[1] TRUE
The code works as expected with one methylation data set but not in a second one: both use the same gene information and have the same number of samples. The second one has more non-zero entries in the meth
object and we had to use a lower file slice size (meth$fileSliceSize <- 300
instead of 2000) to keep the memory from blowing up. That's why I'm guessing that lowering the p-value threshold could help. If needed, I can see if this happens with the data from a small chromosome and send you the files via email (like we did for #3).
Best,
Leonardo
Hi!
Thanks for this great tool - It's super easy to use and I've had a fun time learning how to apply it on my data. I've run an eQTL analysis I'm pretty happy with on two tissues separately, lets call them tissue 1 and tissue 2. For the most part, most ~73% high-confidence eQTLs are common to both tissue 1 and tissue 2. I've plotted the results and they generally look very nice. Interestingly, a sizeable chunk of the remaining cis-eQTLs were specific to tissue 2. This was very exciting. However, when I've plotted those results in both tissues I am seeing that the expression patterning looks almost identical in both tissues, and isn't tissue 2-specific at all - however tissue 2 is nonetheless highly significant whereas tissue 1 for some reason was not significant at all. For example, the image below is what basically all the "tissue 2-specific" eQTLs look like;
Stats for tissue 1 are NS/NA because they were never saved in the eQTL output to begin with, due to not passing my specified p value threshold (p < 1e-8). My only guess is that the differential significance (despite very similar expression) is due to my covariates, most likely sequencing batch effect, due to the others (sex, BMI, age) being identical across tissues 1 and 2. Furthermore, I wonder if this is happening because I coded sequencing batch using dummy variables, which I've never done before. I just wanted to know if you think the differences in significance arise from the batch effect or if there is something else I ought to consider.
Thanks again!
D
Thanks for your useful software. The p values I get by using Matrix_eQTL are very small, and some of them can reach p=2.2250738585072e-308, but the p values of GWAS or eGWAS are generally not more than 1e-10. Are the two methods the same? Can the p values obtained by the two methods be calculated together? How should the significance threshold of Matrix_eQTL be determined?
Looking forward to your reply
Hi,
Some strange things happen when I change pvOutputThreshold.
library(MatrixEQTL)
useModel = modelLINEAR
SNP_file_name = "GoodGene.exp.geno.numeric.xls"
expression_file_name = "GoodGene.exp.ra.exp.xls"
covariates_file_name = "GoodGene.exp.cov2.xls"
output_file_name = "GoodGene.exp.0.05.results.xls"
pvOutputThreshold = 1.43771984533585e-08
snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name)
}
me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
verbose = TRUE,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)
pdf(file = paste0(output_file_name, ".pdf"), width = 8, height = 8)
print(me$param$dfFull)
##second run
library(MatrixEQTL)
useModel = modelLINEAR
SNP_file_name = "GoodGene.exp.geno.numeric.xls"
expression_file_name = "GoodGene.exp.ra.exp.xls"
covariates_file_name = "GoodGene.exp.cov2.xls"
output_file_name = "GoodGene.exp.0.01.results.xls"
pvOutputThreshold = 2.8754396906717e-09
snps = SlicedData$new()
snps$fileDelimiter = "\t" # the TAB character
snps$fileOmitCharacters = "NA" # denote missing values;
snps$fileSkipRows = 1 # one row of column labels
snps$fileSkipColumns = 1 # one column of row labels
snps$fileSliceSize = 2000 # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name)
gene = SlicedData$new()
gene$fileDelimiter = "\t" # the TAB character
gene$fileOmitCharacters = "NA" # denote missing values;
gene$fileSkipRows = 1 # one row of column labels
gene$fileSkipColumns = 1 # one column of row labels
gene$fileSliceSize = 2000 # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name)
cvrt = SlicedData$new()
cvrt$fileDelimiter = "\t" # the TAB character
cvrt$fileOmitCharacters = "NA" # denote missing values;
cvrt$fileSkipRows = 1 # one row of column labels
cvrt$fileSkipColumns = 1 # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name)
}
me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
verbose = TRUE,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE)
pdf(file = paste0(output_file_name, ".pdf"), width = 8, height = 8)
print(me$param$dfFull)
The strange thing is that there are 503951 eQTL in GoodGene.exp.0.05.results.xls (and 294781 eQTL' p-value <= 2.8754396906717e-09), but there are only 1577 eQTL in GoodGene.exp.0.01.results.xls. Why pvOutputThreshold affect p value, is this a bug of matrixEQTL?
GoodGene.exp.0.01.results.xls.gz
GoodGene.exp.0.05.results.xls.gz
Best,
Kun
Hi, if I understand correctly, missing values are imputed using the mean. In my data, the same genes are not measured for each person, so I have a lot of missing data. I do not want to impute this data, I simply want to exclude the people that don't have this gene measured. Right now, my only solution to this is to run MatrixEQTL separately per gene, so that I can exclude specific people that were not measured for each gene. By doing this, however, I kind of lose the benefit of how fast MatrixEQTL is for large data, and it makes my whole pipeline take much longer.
Hi @andreyshabalin,
Thank you for an amazing resource!
I am troubleshooting what I think is an environment-related issue when running Matrix_eQTL_Main(). In one environment (generic set of R libraries on an HPC cluster), the code runs as expected without errors, but when I run the same code and data in a different environment (conda or singularity container), I receive: "Colinear or zero covariates detected"
I believe this happens here under Covariates processing:
MatrixEQTL/R/Matrix_eQTL_engine.R
Lines 1502 to 1527 in 7e5c1cb
My .Machine$double.eps is the same across environments and it looks like the qr functions are referencing the same namespace.
I am not sure where to go or what to try next so if you could provide additional insights or suggestions, that would be greatly appreciated!
Minimal reproducible example:
First, load the package example data into the appropriate data structures:
library(MatrixEQTL)
# Toy data locations
base.dir = find.package('MatrixEQTL');
SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");
expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");
covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
# Load toy data as per MatrixEQTL vignette
snps = SlicedData$new();
snps$fileDelimiter = "\t"; # the TAB character
snps$fileOmitCharacters = "NA"; # denote missing values;
snps$fileSkipRows = 1; # one row of column labels
snps$fileSkipColumns = 1; # one column of row labels
snps$fileSliceSize = 2000; # read file in slices of 2,000 rows
snps$LoadFile(SNP_file_name);
snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
gene = SlicedData$new();
gene$fileDelimiter = "\t"; # the TAB character
gene$fileOmitCharacters = "NA"; # denote missing values;
gene$fileSkipRows = 1; # one row of column labels
gene$fileSkipColumns = 1; # one column of row labels
gene$fileSliceSize = 2000; # read file in slices of 2,000 rows
gene$LoadFile(expression_file_name);
genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);
cvrt = SlicedData$new();
cvrt$fileDelimiter = "\t"; # the TAB character
cvrt$fileOmitCharacters = "NA"; # denote missing values;
cvrt$fileSkipRows = 1; # one row of column labels
cvrt$fileSkipColumns = 1; # one column of row labels
if(length(covariates_file_name)>0) {
cvrt$LoadFile(covariates_file_name);
}
Run Matrix_eQTL_main()
with the data but otherwise default arguments:
eQTLs <- Matrix_eQTL_main(snps = snps, gene = gene, cvrt = cvrt)
Error in Matrix_eQTL_main(snps = snps, gene = gene, cvrt = cvrt) :
nzchar(output_file_name) is not TRUE
It looks like the default for the argument, has either changed recently to ""
, or it has always been output_file_name = ""
but this is no longer interpreted internally the same as output_file_name = NULL
.
Thanks for the great package!
I would like to run MatrixEQTL on only samples that are heterozygous for the respective snp of interest.
For this I input a vcf file in which all homozygous snps are marked as missing values. However, since Matrix_eQTL_main() uses SetNanRowMean() to impute missing values during the preprocessing of the snps file this approach doesn't really work.
Is there any way to prevent the imputation of missing snp values in the vcf file or to make MatrixEQTL selectively only compare heterozygous samples for each snp-gene combination tested?
Thanks for your help!
Hello Andreyshabalin,
The MatrixEQTL is very classical and robust in our eQTL analysis. My question here is more like a request instead of a issue. Since I am using single cell sequencing data for QTL analysis, is there a way to add useModel = "non-parametric linear model" to your options to fit it to single cell data distribution?
Thanks,
Yijun
Hi Andrey,
Thanks for this amazing tool! In the output of Matrix eQTL, the most significant association ranks the first. When I was using it to map eQTLs in my simulated datasets, I found that some associations had different t-statistics but the same p-value (I think they may reach the minimum that can be calculated/displayed by R: 2.225074e-308). For these associations (or SNPs, they were for the same eGene), they are ordered by genomic locations. I was wondering whether it would make more sense if these SNPs could be sorted by t-statstics, which indicate significance levels.
Best,
Qinqin
if pvOutputThreshold and pvOutputThreshold.cis are not set, what will be the outcome?
What are the default values of pvOutputThreshold and pvOutputThreshold.cis?
Hi there,
I'm currently running MatrixEQTL to do methQTLs and I noticed something peculiar in my results.
Last week, I ran MatrixEQTL with separate cis and trans p-value thresholds. I used a cis p-value threshold of 1e-8 and a trans p-value threshold of 1e-14. I originally set a cis window of 10 kB. When I got my results back, there were 40,000 hits for trans, but zero hits for cis. I noticed that for a few trans hits, the associated SNP and methylation site were less than 50 kB apart. For one trans hit, they were 32 kB apart, for another, 22 kB, etc.
I reran MatrixEQTL with the same p-values, but I changed the cis window to 50 kB. I still have 0 cis hits. I am confused because my trans hits that were within 50 kB should now be cis hits, given I increased the cis window and the trans p-value is more stringent than the cis p-value.
I was wondering if you could help me understand why this might be occurring. Furthermore, if I ran MatrixEQTL without specifying a cis window, do you think this problem would be solved? Thank you in advance for your help!
Hi
Using MatrixEQTL R package, I am performing both cis and trans eQTL analyses in order to find some SNPs associated with some genes of interest. I try to understand the Beat value in the results, What does "Effect size estimate" mean here? I am wondering whether it can represent a trend in changes in gene expression levels in different genotypes.
Dear Andrey Shabalin
I have met a strange problem while using MatrixEQTL. In my study, some results that should have been defined as trans QTL were defined as cis QTLs. In other words, some cis-QTLs have a distance over 1e6 bp between SNP and its gene. The distance is the minimum value of the absolute value of the distance between SNP and both ends of the gene.
distance = min( abs( snp_position - gene_start ), abs( snp_position - gene_end ) )
there are some of the wrong QTLs:
qtl_type | SNP_ID | SNP_chr | SNP_loc | ref | alt | gene_ID | gene_chr | gene_start | gene_end | gene_strand | beta | t_stat | p | distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cis | rs7088262 | chr10 | 49836108 | A | G | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | -0.23 | -3.5 | 4.88E-04 | 3.073216 |
cis | rs61838734 | chr10 | 49681118 | T | C | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.14 | 3.74 | 1.91E-04 | 2.918226 |
cis | rs72783091 | chr10 | 49680879 | A | C | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.13 | 3.63 | 2.92E-04 | 2.917987 |
cis | rs200554992 | chr10 | 49680859 | TA | T | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.13 | 3.47 | 5.34E-04 | 2.917967 |
cis | rs56342922 | chr10 | 49680623 | A | G | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.14 | 3.68 | 2.46E-04 | 2.917731 |
cis | rs7093726 | chr10 | 49678147 | T | G | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.14 | 3.68 | 2.46E-04 | 2.915255 |
cis | rs7069220 | chr10 | 49678060 | C | T | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.13 | 3.51 | 4.59E-04 | 2.915168 |
cis | rs3906889 | chr10 | 49677182 | G | C | BMS1P5 | chr10 | 46737615 | 46762892 | -;chr10 | 0.13 | 3.63 | 3.02E-04 | 2.91429 |
some of the QTLs even have SNP and gene in different chromosomes:
cis | rs2869546 | chr15 | 78907345 | C | T | SNORA63 | chr3 | 186505088 | 186505222 | + | -0.15 | -3.82 | 1.41E-04 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cis | rs7182583 | chr15 | 78899210 | G | C | SNORA63 | chr3 | 186505088 | 186505222 | + | 0.14 | 3.57 | 3.77E-04 |
cis | rs9920189 | chr15 | 78666311 | G | T | SNORA63 | chr3 | 186505088 | 186505222 | + | 0.14 | 3.62 | 3.07E-04 |
Do you know why does it happen and how to solve it?
Hi,
I've ran MatrixEQTL (SNPs = 561,963, Expression = 422,070, No of Samples = 87). At the last stage, after the 100% completion message, I get the following error:
Error in pmin.int(x, val) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138
I've tried in Rmd and R scripts, to see if that made the difference.
Any suggestions?
R Version: 3.3.2 (Sincere Pumpkin Patch)
RAM on machine: 256GB
OS: linux-gnu (x86_64) - Ubuntu 16.04.1 LTS
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rcpp_0.12.9 ggplot2_2.2.1 dplyr_0.5.0 readr_1.0.0 MatrixEQTL_2.1.1
loaded via a namespace (and not attached):
[1] assertthat_0.1 grid_3.3.2 R6_2.2.0 plyr_1.8.4 gtable_0.2.0 DBI_0.5-1 magrittr_1.5 scales_0.4.1 lazyeval_0.2.0 tools_3.3.2 munsell_0.4.3 colorspace_1.2-6 tibble_1.2
Hello, I am encountering this error when running Matrix_eQTL_main:
Error in findInterval(x = stats.for.hist, vec = statbins1) :
'vec' must be sorted non-decreasingly and not contain NAs
Calls: Matrix_eQTL_main -> -> tabulate -> findInterval
Another dataset I generated using the same script works without any issue when I run Matrix_eQTL_main. Basically, I splitted patients and controls samples from the main data files to run the eQTL separately. The patient group files run correctly, but not the controls.
I cannot understand the problem.
Can I please share by email my data and script? I was trying here but probably the files are too big to be shared.
Thanks
Ignazio
Is there any option to calculate cis-eQTL ?
currently is does both cis and trans at the same time, but in many case we need to do only one of them.
Thanks!
Hi,
I am currently using matrix-eQTL for eQTL analysis of my wheat dataset. Is there a way to determine the best threshold for identifying cis and trans eQTLs? For example, how can I choose the value for pvOutputThreshold_cis and pvOutputThreshold_trans? I used 10-5 in both cases, but it provided many cis and trans eqtl, which may not be true. Also, I used the min.pv.by.genesnp = =T option, it provided me the p value number for each gene and snp, but I am confused about which one I should choose for the threshold. maybe I am missing something. Please guide me in this regard.
Thank you.
In the matirxEQTL tutorial, this process does not add a kinship matrix to run the eQTL analysis. However, My data is family-level, and I want to add the kinship to run the eQTL analysis, How to do it? Thanks~
hello,teacher,my result that have a lot SNP loci which the FDR and P-value values are same,i think what one reason is that the genotyping results for some SNPs are consistent,while some SNPs with inconsistent genotyping results also have the same FDR and P-value values.thanks
Hello,
I've had a lot of success running MatrixEQTL on fairly large datasets in the past. Currently, I'm attempting to run "Matrix_eQTL_engine" for ~600k SNPs and ~30k genes with 500 subjects (CentoOS with 256GB RAM). I did read your thread with @AndrewSkelton, and my first reaction was to lower the p-value. I tried lowering to 1e-4,1e-6, and 1e-9; however, I still get the following error: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138. I also reduced my dataset down to 100k SNPs and 20k genes and that ran with no issues. So, I'm really not sure how to proceed.
Please let me know if I can provide any additional information. Any help is greatly appreciated.
Best,
John
Here is the sessionInfo:
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] vegan_2.4-6 lattice_0.20-35
[3] permute_0.9-4 snpStats_1.26.0
[5] Matrix_1.2-9 survival_2.41-3
[7] plyr_1.8.4 VariantAnnotation_1.22.0
[9] Rsamtools_1.28.0 Biostrings_2.44.0
[11] XVector_0.16.0 SummarizedExperiment_1.6.1
[13] DelayedArray_0.2.0 matrixStats_0.53.1
[15] Biobase_2.36.1 GenomicRanges_1.28.1
[17] GenomeInfoDb_1.12.0 IRanges_2.10.0
[19] S4Vectors_0.14.0 BiocGenerics_0.22.0
[21] MatrixEQTL_2.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 compiler_3.4.0 GenomicFeatures_1.28.0
[4] prettyunits_1.0.2 bitops_1.0-6 tools_3.4.0
[7] zlibbioc_1.22.0 progress_1.1.2 biomaRt_2.35.11
[10] digest_0.6.15 nlme_3.1-131 BSgenome_1.44.0
[13] RSQLite_1.1-2 memoise_1.1.0 mgcv_1.8-17
[16] DBI_0.6-1 GenomeInfoDbData_0.99.0 cluster_2.0.6
[19] rtracklayer_1.36.6 stringr_1.3.1 httr_1.3.1
[22] grid_3.4.0 R6_2.2.2 AnnotationDbi_1.38.2
[25] XML_3.98-1.10 BiocParallel_1.10.1 magrittr_1.5
[28] MASS_7.3-47 splines_3.4.0 GenomicAlignments_1.12.0
[31] assertthat_0.2.0 stringi_1.2.3 RCurl_1.95-4.10
Dear Dr. Shabalin,
I amtrying to perform an eQTL analysis using your tool, using genotypes obtained for HLA genes. These genes have many different possible alleles, which makes annotating them with either 0,1 or 2 problematic.
Is there a way to input such multiallelic variants in MatrixEQTL?
Kind regards,
Tijs
Hello!
We would like to use MatrixEQTL to test for the interaction between genotype and a categorical covariate with 4 groups. To do this, will we have to use multiple dummy variables, or is there an option to force Matrix EQTL to treat a covariate with >2 groups as categorical?
Thank you!
Elyssa
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.