uw-gac / genesis Goto Github PK

GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness

Home Page: https://bioconductor.org/packages/GENESIS

R 99.76% C 0.24%

genesis's People

Contributors

Stargazers

Watchers

Forkers

terrytianyuzhang satoshio49 jiwonslee ravwojdyla sstadick ntduc11 tamartsi manigben dariushghasemi datagram pieterlukasse kevinluolk

genesis's Issues

pcrelateToMatrix slow on 50k samples

Hello, I am running pcrelateToMatrix on 5k samples and it has been running for several days.
Could you advise what I can do to run it faster? thresh=NULL doesn't seem to help. Thanks!

pcrelate parallel implementation

The help documentation for pcrelate() suggests that parallelization is possible using the component functions calcISAFBeta() and pcrelateSampBlock(). I hope I am not being completely daft, but I am struggling to set up a parallel implementation. How can this be done?

pcrelate issue for large sample size

I was running 76K samples for 33,765 SNPs, using PLINK bed/bim/fam files. PCrelate failed with the following error message:

Error in rbindlist(l, use.names, fill, idcol) : 
  Total rows in the list is 2165981000 which is larger than the maximum number of rows, currently 2147483647
Calls: pcrelate ... pcrelate -> .local -> .pcrelate -> rbind -> rbind -> rbindlist
In addition: Warning message:
In .pcrelate(gdsobj, pcs = pcs, scale = scale, ibd.probs = ibd.probs,  :
  small.samp.correct can only be used when all samples are analyzed in one block and `scale != none`
Execution halted
run_PCair_PCrelate_1st_iter_V1_largemem.R failed

It appears to hit the maximum row size (2^31-1) for a R data.table. Does this mean we cannot run pcrelate for more than ~65K samples? Any advice or insights? Thank you!

Issue with PCAir - Segmentation fault (core dumped)

Hi,

I am trying to run the pcair function, however when I do I get this error "Segmentation fault (core dumped)". Could you please advise me on how to troubleshoot the issue?

My R session is as below:

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

library(GENESIS)
library(GWASTools)
library(gdsfmt)
library(SNPRelate)
library(data.table)
library(dplyr)

KING <- fread("king.kin0")
KING <- as.data.frame(KING)
KINGmat <- matrix(
data = KING$Kinship,
nrow = length(KING$FID1),
ncol = length(KING$FID2),
byrow = TRUE,
dimnames = list(KING$FID1, KING$FID2)
)
geno <- GdsGenotypeReader(filename = "genotype.gds")
genoData <- GenotypeData(geno)
pruned <- fread("pruned.txt")
pruned <- as.data.frame(pruned)
pruned <- as.vector(pruned)
unrels <- fread("king_unrelated.txt")
unrels <- as.data.frame(unrels)
unrels <- as.vector(unrels)
mypcair <-pcair(genoData, kinobj= KINGmat,divobj = KINGmat,snp.include = pruned, verbose=T, eigen.cnt=10, unrel.set = unrels)
Using kinobj and divobj to partition samples into unrelated and related sets
Working with 107828 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Segmentation fault (core dumped)

Thank you for your help!

errors trying to use makeSparseMatrix

I'm using R version 4.0.4 with GENESIS_2.20.1. I am trying to make sure I can make a sparse kinship matrix for ~87K participants. I am starting with 30 participants, where only two are actually related. I am using the call makeSparseMatrix(kin.dat,thresh=NULL) where kin.dat is a data.frame containing the columns ID1, ID2 and value. Below are two of the things I tried and my diagnosis. My diagnosis may be incorrect, but I am hoping we can resolve the issue. I include a sample of what the input loolks like at the bottom. In addition, I get the following warning when I load the GENESIS library, which might be pertinent:

Warning message:
In .recacheSubclasses(def@className, def, env) :
  undefined subclass "numericVector" of class "Mnumeric"; definition not updated

When ID1 and ID2 are character, and value is numeric, I get the following error:

Error in submat + t(submat) : non-numeric argument to binary operator
Calls: makeSparseMatrix -> makeSparseMatrix -> .local -> .makeSparseMatrix_df

When ID1, ID2 and value are all numeric, I get the following error:

Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  :
  Incompatible join types: x.ID1 (double) and i.ID1 (character)
Calls: makeSparseMatrix ... .makeSparseMatrix_df -> [ -> [.data.table -> bmerge
Execution halted

I tried to debug, going through the code for .makeSparseMatrix_df(). It looks like the data.table() command makes all columns character when the ID columns are character, resulting in the error where submat + t(submat) cannot be calculated. .However, when I input all columns of the data.frame as numeric, the ids variable ( ids <- names(mem[mem == i])) seems to be character - resulting in the problematic merge shown in item 2 for the command sub <- x[ID1 %in% ids & ID2 %in% ids][allpairs, on = c("ID1", "ID2")]. I think that if you added lines makeing the ID columns character and the value column numeric, this might avoid these problems:

ID1 <- as.character(ID1)
ID2 <- as.character(ID2)
values <- as.numeric(value)

The input looks like the following data.frame, where the diagonal element comes first and the last record contain the related pair.
ID1 ID2 value
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
6 7 0.232122593288146

Error in assocTestSingle with Score.SPA test

Hi,
When running the association test with test="Score", it executed successfully. However, when switching to test="Score.SPA", an error message showed as:
Error in if (abs(q - m1)/sqrt(var1) < Cutoff) { :
missing value where TRUE/FALSE needed

I guess this happens because there are monomorphic SNPs or missing genotypes that leads to NA in the if statement. But I am not sure if it is due to other reasons and how can I bypass it.
(I also tried the SPA test on a small set of SNPs where there are no missing genotypes and no monomorphic SNPs, it executed successfully)

Issue with GxE interaction with Score test in assocTestSingle

When running GENESIS v. 2.13.11 and supplying an interactive term for the score test, I get the following error after all iterations are completed:

Error in match.names(clabs, names(xi)) :
names do not match previous names
Calls: source ... .local -> do.call -> -> rbind -> match.names
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Execution halted

It appears from the log, that Wald is done for the interaction test and Score for the genetic effect, but then at the end something happens when the rows are combined using rbind.

Here is my code:

 seqSetFilterChrom(SeqData, include = CHR)
    SeqIterator <- SeqVarBlockIterator(SeqData, variantBlock=5000)
    

    RequestedMixedModel = assocTestSingle(SeqIterator, 
                                          imputed = TRUE, 
                                          null.model = get(Ho), 
                                          # test = "Score",
                                          GxE = InteractionTerm
    )
    resetIterator(SeqIterator)

Ho represents the null model, which include PCs and a variable for maternal haplogroup. Outcome a binary variable, and I am modeling the effect of imputed variants.

Is this something that was addressed in more recent versions of GENESIS?

Error in assocTestSingle() results when covariate is confounded with SNP genotype.

Hello and thanks for your great package which I've been enjoying using!

I have encountered a bug that seems to result when a covariate provided to GENESIS is perfectly confounded with one or more 2-genotype snps in the dataset. This seems likely only to occur in small datasets then the confounding occurs by chance; in our case we have a small cohort of around 50 individuals. I found that SNPs which were confounded with sex (included as a covariate) were found to be highly significant (p values approaching 0) even when there are no true association. I think that perhaps this could be caused by handling of NAs that result when trying to fit a perfectly confounded covariate, though I haven't dug into the package's internal functions to try to find where this arises.

I've reproduced this problem here using data available from SNPRelate. I also compared GENESIS results to models fit using lm() to highlight where confounded SNPs are clearly estimated incorrectly.

Thanks and I hope this is helpful!

Max Number of samples for PC-AIR

Is there a max number of samples PC-AIR will work on? We find when the number of samples increases from 45K to 55K, we are getting NA results. With samples less than 45K, it runs fine. Any suggestions or workarounds?

What does correctKin do?

Hello! It looks like correctKin is only to be used when the sample size is small. What does it do and what does "small sample size" being True or False lead to? Specifically I'm looking at this #54.

As mentioned in a previous issue, I'm running a single query against all the other samples, so I have two blocks, one with one sample and one with many. I want to make sure the correctKin is something I can safely skip, and I want to make sure I'm setting the small sample size correctly.

Error when running the pcrelate function

Hi,

I have been trying run pcrelate(), however I am receiving this error:

Error in if (idx <= length(redo.index)) idx <- redo.index[idx] else idx <- idx - :
argument is of length zero
Calls: pcrelate ... .reducer_add -> .reducer_add -> .map_index -> .map_index
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
1 parallel job did not deliver a result

Do you have any advice on how to troubleshoot this issue?

Confusing error message in `assocTestSingle` when samples in null model are not in GDS file

Minor issue: I ran into this error when my null model had samples that weren't in the GDS file used for association testing:

> assoc <- assocTestSingle(iterator, nullModel,
+                          test=test,
+                          fast.score.SE=fast.score.SE,
+                          genome.build=build,
+                          BPPARAM=BPPARAM,
+                          geno.coding=geno.coding)
# of selected samples: 729
Using 1 CPU cores
Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : 
  'NA' indices are not (yet?) supported for sparse Matrices
Calls: assocTestSingle ... .bpiterate_serial -> ITER -> [ -> [ -> subCsp_rows -> intI
Execution halted

My null model had 5 extra samples not in the GDS file. Ideally there would be a check comparing the samples in the null model to the samples in the GDS, with more helpful error message.

`fitted.values` in non-mixed null models are not what we expect

The fitted.values element of null models is different for non-mixed models and mixed-models. This had caused errors in SPA p-values for models without variance components or where variance components were estimated to be 0 (see #84).

Consider a bigger fix to make fitted.values consistent for all types of null model (e.g., it is always the same thing) but that is a much more involved change and has some issues with backwards compatibility (e.g., using a version of the null model run with an older version of GENESIS to do Score.SPA association tests in a new version).

Originally posted by @amstilp in #84 (comment)

Estimate PVE

Hi,
I would like to know how the PVE is calculated. I have the data from the GMMAT package and I don't know how to estimate the PVE of the significant snps. I saw that with your package it can be calculated but I have had problems with the installation of the GENESIS and GWASTools packages. Thank you very much.

Best regards!!
Angélica

Error in setMethod - no existing definition for function ‘chromWithPAR’

Hi.

When I try to install the current version (Version: 2.13.3 Date: 2019-1-30), I am getting

Error in setMethod("chromWithPAR", "GenotypeData", function(gdsobj) { :
no existing definition for function ‘chromWithPAR’

coming from utils.R. Is one of my other packages out of date or something?

Thanks, David Duffy.

High Memory Consumption with Large Dataset in PC-Relate

Hello,

I'm currently using GENESIS for a large-scale genetic analysis involving approximately 500,000 samples. However, I've found that running PC-Relate with only 160,000 samples already exceeds my system's 1 TB memory capacity.

Given these memory demands, I'm wondering if there are any strategies or planned updates to reduce the memory footprint of PC-Relate. I'm also interested in any recommended approaches for handling such large datasets with GENESIS. Any advice or guidance would be greatly appreciated.
Thank you for your time and assistance.

Best Regards,
Erh-Chan

analysis pipeline for chromosome X

Can I include chromosome X for the GWAS using assocTestSingle()? If so, how is the chromosome X data handled?

change use of apply in makeSparseMatrix to GENESIS:::.apply

if one wants to use the makeSparseMatrix function with a Matrix, the call to 'apply' on line 72 will fail with 'cholMod error too large' in some cases. i propose using GENESIS:::.apply (need to add a 'selection' argument) which would handle all cases!

thanks :)

assocTestSingle.R() error when interacting with GenotypeIterator / GenotypeBlockIterator objects

I was unsure where to raise this issue since the error is specific to assocTestSingle.R() but stems from different method calls of getGenotypeSelection() from the GWASTools package.

In assocTestSingle.R(), line 83, getGenotypeSelection() is being called with order = "selection". This argument is undefined in the methods for getGenotypeSelection() extracting information from MatrixGenotypeReader class objects or GenotypeIterator / GenotypeBlockIterator objects constructed using MatrixGenotypeReader. This produces the error: "Error in .local(object, ...) : unused argument (order = "selection")".

Is the intent for the function to only work only with GenotypeIterator / GenotypeBlockIterator objects constructed from GdsGenotypeReader objects? If so, this needs to be specified in the documentation for assocTestSingle.R(). Alternatively, if the intent is for the function to also work with output constructed from MatrixGenotypeReader objects, then this is a valid issue for assocTestSingle.R().

fitnullmodel error for matrix with negatives

Trying to do run the fitnullmodel using a grm covariate matrix generated with pc-air. I get:

Error in asMethod(object) : not a positive definite matrix

This seem to stem from having negative values generated with king and pc-air in the matrix. Some issue with how cholesky decomposition is performed I guess, but why does this happen?

add check in varCompCI that family is gaussian

varCompCI should return an error when prop = TRUE and the null model family != "gaussian"

parallel implementation of pcrelate

The functions, correctKin, correctK0, correctK2, do not seem to work. I have been trying to follow the instructions provided in issue #38, to set up a parallel implementation of pcrelate. However, because these functions do not work, I am not sure how to do the post-processing. Also, the correctKin function is not actually used in the scripts referenced in the answer to that issue. Is it needed? Any guidance on the use of these functions to combine output of the calls to pcrelateSampBlock would be greatly appreciated. Thank you.

pcrelate REDUCE error while calculating individual AF betas

I have a sample of 49227 individuals, with 4626 of those individuals related to each other.

When I run pcrelate, I get the following output:

Using 1 CPU cores
49227 samples to be included in the analysis, split into 10 blocks...
Using 62 CPU cores
Betas for 2 PC(s) will be calculated using 44601 samples in training.set...
Calculating Indivdiual-Specific Allele Frequency betas for 375415 SNPs in 38 blocks...
Error in REDUCE(env[["value"]], env[[idx]][[i]]) : 
  Input is matrix but should be a plain list of items to be stacked

This is with this command:

pcrel_initial <- pcrelate(gdsobj = genoiter,
                          pcs = pcmat_initial[, 1:2],
                          training.set = parts$unrels$IID,
                          small.samp.correct = nrow(fam) < 5000,
                          ibd.probs = FALSE,
                          BPPARAM = BiocParallel::SerialParam())

genoiter looks like this:

> str(genoiter)
Formal class 'GenotypeBlockIterator' [package ""] with 6 slots
  ..@ snpBlock  : int 10000
  ..@ snpFilter :List of 38
  .. ..$ 1 : int [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ 2 : int [1:10000] 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
  .. ..$ 3 : int [1:10000] 20001 20002 20003 20004 20005 20006 20007 20008 20009 20010 ...
  .. ..$ 4 : int [1:10000] 30001 30002 30003 30004 30005 30006 30007 30008 30009 30010 ...
  .. ..$ 5 : int [1:10000] 40001 40002 40003 40004 40005 40006 40007 40008 40009 40010 ...
  .. ..$ 6 : int [1:10000] 50001 50002 50003 50004 50005 50006 50007 50008 50009 50010 ...
  .. ..$ 7 : int [1:10000] 60001 60002 60003 60004 60005 60006 60007 60008 60009 60010 ...
  .. ..$ 8 : int [1:10000] 70001 70002 70003 70004 70005 70006 70007 70008 70009 70010 ...
  .. ..$ 9 : int [1:10000] 80001 80002 80003 80004 80005 80006 80007 80008 80009 80010 ...
  .. ..$ 10: int [1:10000] 90001 90002 90003 90004 90005 90006 90007 90008 90009 90010 ...
  .. ..$ 11: int [1:10000] 100001 100002 100003 100004 100005 100006 100007 100008 100009 100010 ...
  .. ..$ 12: int [1:10000] 110001 110002 110003 110004 110005 110006 110007 110008 110009 110010 ...
  .. ..$ 13: int [1:10000] 120001 120002 120003 120004 120005 120006 120007 120008 120009 120010 ...
  .. ..$ 14: int [1:10000] 130001 130002 130003 130004 130005 130006 130007 130008 130009 130010 ...
  .. ..$ 15: int [1:10000] 140001 140002 140003 140004 140005 140006 140007 140008 140009 140010 ...
  .. ..$ 16: int [1:10000] 150001 150002 150003 150004 150005 150006 150007 150008 150009 150010 ...
  .. ..$ 17: int [1:10000] 160001 160002 160003 160004 160005 160006 160007 160008 160009 160010 ...
  .. ..$ 18: int [1:10000] 170001 170002 170003 170004 170005 170006 170007 170008 170009 170010 ...
  .. ..$ 19: int [1:10000] 180001 180002 180003 180004 180005 180006 180007 180008 180009 180010 ...
  .. ..$ 20: int [1:10000] 190001 190002 190003 190004 190005 190006 190007 190008 190009 190010 ...
  .. ..$ 21: int [1:10000] 200001 200002 200003 200004 200005 200006 200007 200008 200009 200010 ...
  .. ..$ 22: int [1:10000] 210001 210002 210003 210004 210005 210006 210007 210008 210009 210010 ...
  .. ..$ 23: int [1:10000] 220001 220002 220003 220004 220005 220006 220007 220008 220009 220010 ...
  .. ..$ 24: int [1:10000] 230001 230002 230003 230004 230005 230006 230007 230008 230009 230010 ...
  .. ..$ 25: int [1:10000] 240001 240002 240003 240004 240005 240006 240007 240008 240009 240010 ...
  .. ..$ 26: int [1:10000] 250001 250002 250003 250004 250005 250006 250007 250008 250009 250010 ...
  .. ..$ 27: int [1:10000] 260001 260002 260003 260004 260005 260006 260007 260008 260009 260010 ...
  .. ..$ 28: int [1:10000] 270001 270002 270003 270004 270005 270006 270007 270008 270009 270010 ...
  .. ..$ 29: int [1:10000] 280001 280002 280003 280004 280005 280006 280007 280008 280009 280010 ...
  .. ..$ 30: int [1:10000] 290001 290002 290003 290004 290005 290006 290007 290008 290009 290010 ...
  .. ..$ 31: int [1:10000] 300001 300002 300003 300004 300005 300006 300007 300008 300009 300010 ...
  .. ..$ 32: int [1:10000] 310001 310002 310003 310004 310005 310006 310007 310008 310009 310010 ...
  .. ..$ 33: int [1:10000] 320001 320002 320003 320004 320005 320006 320007 320008 320009 320010 ...
  .. ..$ 34: int [1:10000] 330001 330002 330003 330004 330005 330006 330007 330008 330009 330010 ...
  .. ..$ 35: int [1:10000] 340001 340002 340003 340004 340005 340006 340007 340008 340009 340010 ...
  .. ..$ 36: int [1:10000] 350001 350002 350003 350004 350005 350006 350007 350008 350009 350010 ...
  .. ..$ 37: int [1:10000] 360001 360002 360003 360004 360005 360006 360007 360008 360009 360010 ...
  .. ..$ 38: int [1:5415] 370001 370002 370003 370004 370005 370006 370007 370008 370009 370010 ...
  ..@ lastFilter:<environment: 0x56544778c8f0> 
  ..@ data      :Formal class 'GdsGenotypeReader' [package "GWASTools"] with 15 slots
  .. .. ..@ snpIDvar     : chr "snp.id"
  .. .. ..@ chromosomeVar: chr "snp.chromosome"
  .. .. ..@ positionVar  : chr "snp.position"
  .. .. ..@ scanIDvar    : chr "sample.id"
  .. .. ..@ genotypeVar  : chr "genotype"
  .. .. ..@ alleleVar    : chr "snp.allele"
  .. .. ..@ autosomeCode : int [1:22] 1 2 3 4 5 6 7 8 9 10 ...
  .. .. ..@ XchromCode   : int 23
  .. .. ..@ YchromCode   : int 25
  .. .. ..@ XYchromCode  : int 24
  .. .. ..@ MchromCode   : int 26
  .. .. ..@ genotypeDim  : chr "scan,snp"
  .. .. ..@ missingValue : int 3
  .. .. ..@ filename     : chr "results/intermediate/post-impute_filter/output/imputed_filtered_PCApre.gds"
  .. .. ..@ handler      :List of 5
  .. .. .. ..$ filename: chr "/sc/arion/projects/load/data-ext/ADGC/2021_freeze/processed/071822/results/intermediate/post-impute_filter/outp"| __truncated__
  .. .. .. ..$ id      : int 0
  .. .. .. ..$ ptr     :<externalptr> 
  .. .. .. ..$ root    : 'gdsn.class' raw [1:20] 00 00 00 00 ...
  .. .. .. ..$ readonly: logi TRUE
  .. .. .. ..- attr(*, "class")= chr "gds.class"
  ..@ snpAnnot  : NULL
  ..@ scanAnnot : NULL

pcmat_initial looks like this (censored):

> str(pcmat_initial)
 num [1:49227, 1:16] -0.0095 -0.00484 -0.0097 -0.00814 -0.01007 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:49227] "foo-AA--bar" "foo2-AA--bar2" "foo3-AA--bar3" "foo4-AA--bar4" ...
  ..$ : NULL

Finally, training.set is a character vector of IDs.

I get the same issue when just running calcISAFBeta.

The command is this:

calcISAFBeta(gdsobj = genoiter,
             pcs = pcmat_initial[, 1:2],
             sample.include = iids,
             training.set = parts$unrels$IID,
             BPPARAM = BiocParallel::SerialParam())

The output is this:

Using 1 CPU cores
Betas for 2 PC(s) will be calculated using 44601 samples in training.set...
Calculating Indivdiual-Specific Allele Frequency betas for 375415 SNPs in 38 blocks...
Error in REDUCE(env[["value"]], env[[idx]][[i]]) : 
  Input is matrix but should be a plain list of items to be stacked

Subsampling then running is sucessful.

> pcrel_initial_samp <- pcrelate(gdsobj = genoiter,
+                                sample.include = iids_samp,
+                                pcs = pcmat_initial[rownames(pcmat_initial) %in% iids_samp, 1:2],
+                                training.set = parts$unrels$IID[parts$unrels$IID %in% iids_samp],
+                                small.samp.correct = TRUE,
+                                ibd.probs = FALSE,
+                                BPPARAM = BiocParallel::SerialParam())
Using 1 CPU cores
4000 samples to be included in the analysis...
Betas for 2 PC(s) will be calculated using 3641 samples in training.set...
Running PC-Relate analysis for 4000 samples using 375415 SNPs in 38 blocks...
Performing Small Sample Correction...
>

Finally, setting the sample.block.size to 50000 seemed initially successful, but seems to have hung. At its maximum, it used 478.3 GiB of memory and is now down to 85.3 GiB, but it hasn't completed and CPU usage is at 0.

I will attach my sessioninfo when I get a chance.

SPA association tests are incorrect for non-mixed models

fitted.values in for glm null models already have expit applied, but we are treating them as if they do not in GENESIS. This leads to incorrect (and inflated) association test statistics.

varCompCI gives a heritability of 0

I met that varCompCI function gave heritability of 0 several times. Proportion is 0. Lower 95 and Upper 95 are NA. I used command: varCompCI(nullmod, prop = TRUE). My nullmod matrix seems no problem.
Is this reasonable?

My case is neither of the following two unsuitable situations as GENESIS manual mentions:
#####################################################################################
Note: varCompCI can not compute proportions of variance explained when heterogeneous residual variances are used in the null model (i.e. group.var is used in fitNullModel). Confidence intervals can still be computed for the variance component estimates on the original scale by setting prop = FALSE.

Note: variance component estimates are not interpretable for binary phenotypes when fit using the PQL method implemented in fitNullModel; proportions of variance explained should not be calculated for these models.

pcrelate: error with rbindlist()

Hi,

I am currently having an issue with running pcrelate() due to, potentially, having a large sample size of 140K as it shows this error:

Error in rbindlist(l, use.names, fill, idcol) :
Total rows in the list is 2203757913 which is larger than the maximum number of rows, currently 2147483647
Calls: pcrelate ... pcrelate -> .local -> .pcrelate -> rbind -> rbind -> rbindlist

I have tried to increase my sample block size, but is there a sample size limit to the pcrelate function?

Below is my code:
genoData_pruned <- GenotypeBlockIterator(genoData, snpInclude=pruned)
mypcrelate <- pcrelate(genoData_pruned, mypcair$vectors[,1:npca], training.set = mypcair$unrels, sample.block.size=10000, BPPARAM = BiocParallel::SerialParam())

Using 1 CPU cores
140831 samples to be included in the analysis, split into 15 blocks...
Using 1 CPU cores
Betas for 7 PC(s) will be calculated using 93272 samples in training.set...
Calculating Indivdiual-Specific Allele Frequency betas for 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,1)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,2)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,3)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,4)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,5)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,6)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,7)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,8)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,9)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,10)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,11)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,12)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,13)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,14)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (1,15)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,2)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,3)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,4)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,5)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,6)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,7)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,8)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,9)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,10)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,11)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Running PC-Relate analysis for sample block pair (2,12)
Using 1 CPU cores
Running PC-Relate analysis using 280707 SNPs in 29 blocks...
Error in rbindlist(l, use.names, fill, idcol) :
Total rows in the list is 2203757913 which is larger than the maximum number of rows, currently 2147483647
Calls: pcrelate ... pcrelate -> .local -> .pcrelate -> rbind -> rbind -> rbindlist
Execution halted

Thank you for your help!

Error in seqVCF2GDS(vcf.files, gds.file, verbose = FALSE)

Hi,
I get an error when I'm trying to convert from a vcf file to a gds file using the SeqArray package. Here is my code and the error message.

gds.file <- tempfile()
seqVCF2GDS(vcf.files, gds.file, verbose=FALSE)
Error in seqVCF2GDS(vcf.files, gds.file, verbose = FALSE) :
FORMAT ID 'QSS' (Number=A) should have 1 value(s) but receives 2, please consider revising it to 'Number=.'.
FILE: C:\Users\bwodu1\OneDrive - Johns Hopkins\Yegna Lab\u01\test_vcf\55488_60663_Cancer_vs_55488_60664_Benign_mutect2_passed.vcf
LINE: 66, COLUMN: 10, 0/1:23,14:0.381

fitNullModel fails if AnnotatedDataFrame contains a tibble

fitNullModel requires manipulation of rownames, which are not supported in tibbles. Add a coercion of tibble to base data.frame here: https://github.com/UW-GAC/GENESIS/blob/master/R/fitNullModel.R#L45

Bug with GxE when using GenotypeIterator

In assocTestSingle for GenotypeIterator it iterates over the SNP blocks. In each iteration GxE (the name of ithe interaction effects) is set to the model matrix.

if (!is.null(GxE)) GxE <- .modelMatrixColumns(null.model, GxE)

That works in the fist iteration when GxE is still a character but fails in the second iteration where GxE is already a model matrix. This should probably be moved outside the loop or changed to

if (!is.null(GxE)) GxE_m <- .modelMatrixColumns(null.model, GxE) 
assoc <- GENESIS:::testGenoSingleVar(null.model, G=geno, E=GxE_m, test=test)

(Though I can't really see a reson not to move it to before the start of the loop)

bug in assocTestSingle if there are monomorphic SNPs

In assocTestSingle monomorphic SNPs are filtered in this block (around line 101)

freq <- .alleleFreq(gdsobj, geno, sample.index=sample.index,
                                      male.diploid=male.diploid)
                  
# filter monomorphic variants
keep <- .filterMonomorphic(geno, count=n.obs, freq=freq$freq)
if (!all(keep)) {
  var.info <- var.info[keep,,drop=FALSE]
  geno <- geno[,keep,drop=FALSE]
  n.obs <- n.obs[keep]
  freq <- freq[keep,,drop=FALSE]
}

But freq is actually a list with the entries freq and MAC. So the last line should be

 freq$freq <- freq$freq[keep,,drop=FALSE]
 freq$MAC <- freq$MAC[keep,,drop=FALSE]

Issue: Running PCAir on 140K individuals

Hi,

I have been running the pcair() function on 140K individuals for 3 days and it doesn't look like it is making any progress. Could this be a sample size issue or a memory issue?

As you see below:
Reading in Kinship estimates from KING --kinship output...
Using 140831 samples in sample.include
Identifying clusters of relatives...
82438 relatives in 21768 clusters; largest cluster = 9653
Creating block matrices for clusters...
58393 samples with no relatives included
Putting all samples together into one block diagonal matrix
Using kinobj and divobj to partition samples into unrelated and related sets
Working with 140831 samples
Identifying relatives for each sample using kinship threshold 0.0441941738241592
Identifying pairs of divergent samples using divergence threshold -0.0441941738241592
Partitioning samples into unrelated and related sets...
...1000 samples added to related.set...
...2000 samples added to related.set...
...3000 samples added to related.set...
...4000 samples added to related.set...
...5000 samples added to related.set...
...6000 samples added to related.set...
...7000 samples added to related.set...
...8000 samples added to related.set...
...9000 samples added to related.set...
...10000 samples added to related.set...
...11000 samples added to related.set...
...12000 samples added to related.set...
...13000 samples added to related.set...
...14000 samples added to related.set...
...15000 samples added to related.set...
...16000 samples added to related.set...
...17000 samples added to related.set...
...18000 samples added to related.set...
...19000 samples added to related.set...
...20000 samples added to related.set...
...21000 samples added to related.set...
...22000 samples added to related.set...
...23000 samples added to related.set...
...24000 samples added to related.set...
...25000 samples added to related.set...
...26000 samples added to related.set...
...27000 samples added to related.set...
...28000 samples added to related.set...
...29000 samples added to related.set...
...30000 samples added to related.set...
...31000 samples added to related.set...
...32000 samples added to related.set...
...33000 samples added to related.set...
...34000 samples added to related.set...
...35000 samples added to related.set...
...36000 samples added to related.set...
...37000 samples added to related.set...
...38000 samples added to related.set...
...39000 samples added to related.set...
...40000 samples added to related.set...
...41000 samples added to related.set...
...42000 samples added to related.set...
...43000 samples added to related.set...
...44000 samples added to related.set...
...45000 samples added to related.set...
...46000 samples added to related.set...
...47000 samples added to related.set...
Unrelated Set: 93272 Samples
Related Set: 47559 Samples
Performing PCA on the Unrelated Set...
Principal Component Analysis (PCA) on genotypes:
Excluding 7,660,980 SNPs (non-autosomes or non-selection)
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# of samples: 93,272
# of SNPs: 280,707
using 1 thread
# of principal components: 20
PCA: the sum of all selected genotypes (0,1,2) = 4401574483
CPU capabilities: Double-Precision SSE2 AVX
Tue Jan 9 18:06:54 2024 (internal increment: 64)

Thank you for your help.

Outdated pcair() call in vignette

I believe this pcair() call in the vignette is outdated, as the S4 method for GenotypeData does not take arguments kinobj, etc.

This workflow from your test is working for me.

assocTestSingle SeqVarBlockIterator - problem handling multiallelic sites

Hi, I am having problems (that I will work around by removing the nuisance marker) running

assocTestSingle(iterator, null1, verbose=TRUE)
where iterator is a SeqVarBlockIterator

I am trying this on a test region where one locus has five alleles (and 5 are monomorphic):

var.info <- variantInfo(gdsobj, alleles = FALSE, expanded = TRUE)

has 106 rows, while

geno <- expandedAltDosage(gdsobj, use.names = FALSE, sparse = sparse)[sample.index,
, drop = FALSE]
dim(expandedAltDosage(gdsobj, use.names = FALSE, sparse = sparse))
[1] 2391 114

has 114 alleles, which seems right to me.

But as a result,

res[[i]] <- cbind(var.info, n.obs, freq, assoc)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 106, 114

I am presuming variantinfo(expanded=T) should replicate the positions for each allele?

TIA,
David Duffy

PCAir unable to proceed when there are no related individuals

Hello,

I am encountering following error when running PCAir on some of my cohorts, and it looks
like I get this error when there are no related individuals. Could you help me on this issue?

Thanks,
Seunghun

ERROR:unused arguments (two.stage = TRUE, norm.option = "all", rescale = "residSD")

Hi,
An error occurred when I ran fitNullModel():

obj_nullmodel_GENESIS <- GENESIS::fitNullModel(data_GENESIS,outcome="PLT",
                                               covars=c("age","age2", "pc1", "pc2", "pc3","pc4", "pc5","pc6","pc7","pc8","pc9","pc10"),
                                               cov.mat=sgrm,family="gaussian",
                                               two.stage=TRUE,norm.option="all",rescale="residSD",
                                               verbose=TRUE)

Error in .local(x, ...) : 
  unused arguments (two.stage = TRUE, norm.option = "all", rescale = "residSD")

These three arguments are mentioned in your reference manual. I am so puzzled about that. Hope you can provide to help me fix out this problem.

Best,
Damien

fitNullModel doesn't converge when provided a list of matrices with more than 1 matrix

fitNullModel doesn't converge when provided a list of matrices with more than 1 matrix. We tried providing one matrix, and it was fine, we tried providing a list of a single matrix, and it was fine (converged). It seems like there is an issue with more than one matrix. I haven't looked at the code.

HOW to FIX - PCAir excludes all my SNPs because there is no chromosome information

Hi, I am just trying now to run PCAir and PCRelate. However, I ran into a problem that I am not sure how to fix.

When I run pcair(), I get the following error (highlighted in bold):

The code I used is below:

GDS file obtained from a genlight object (dartR package) with the dartR::gl2gds" function. Then GDS file converted into gdsobj with the code below (from GDWAS tools package):

mydataset_geno <- GdsGenotypeReader(filename = "C:/documents/mydataset_GDS.gds")

class(mydataset_geno) # [1] "GdsGenotypeReader" attr(,"package") [1] "GWASTools"

object without annotation

mydataset_genoData <- GenotypeData(mydataset_geno)

str(mydataset_genoData)

Formal class 'GenotypeData' [package "GWASTools"] with 3 slots
..@ data :Formal class 'GdsGenotypeReader' [package "GWASTools"] with 15 slots
.. .. ..@ snpIDvar : chr "snp.id"
.. .. ..@ chromosomeVar: chr "snp.chromosome"
.. .. ..@ positionVar : chr "snp.position"
.. .. ..@ scanIDvar : chr "sample.id"
.. .. ..@ genotypeVar : chr "genotype"
.. .. ..@ alleleVar : chr "snp.allele"
.. .. ..@ autosomeCode : int [1:22] 1 2 3 4 5 6 7 8 9 10 ...
.. .. ..@ XchromCode : int 23
.. .. ..@ YchromCode : int 25
.. .. ..@ XYchromCode : int 24
.. .. ..@ MchromCode : int 26
.. .. ..@ genotypeDim : chr "snp,scan"
.. .. ..@ missingValue : int 3
.. .. ..@ filename : chr "C:/documents/mydataset"| truncated
.. .. ..@ handler :List of 5
.. .. .. ..$ filename: chr "C:/documents/mydataset_GDS.gds_LocusC"| truncated
.. .. .. ..$ id : int 1
.. .. .. ..$ ptr :
.. .. .. ..$ root : 'gdsn.class' raw [1:20] 08 00 00 00 ...
.. .. .. ..$ readonly: logi TRUE
.. .. .. ..- attr(*, "class")= chr "gds.class"
..@ snpAnnot : NULL
..@ scanAnnot: NULL

PCAir

mydataset_pca_PCAir <- pcair(
mydataset_genoData,
kinobj = mydataset_kingMat,
kin.thresh=2^(-9/2), # = 0.04 = 3rd degree kinship threshold, which corresponds to first cousins (actually 1stcousins kinship= 1/16 - Weir 2006 & 0.06>0.04)
divobj = mydataset_kingMat,
div.thresh=-2^(-9/2))

Using kinobj and divobj to partition samples into unrelated and related sets
Working with 425 samples
Identifying relatives for each sample using kinship threshold 0.0441941738241592
Identifying pairs of divergent samples using divergence threshold -0.0441941738241592
Partitioning samples into unrelated and related sets...
Unrelated Set: 375 Samples
Related Set: 50 Samples
Performing PCA on the Unrelated Set...
Principal Component Analysis (PCA) on genotypes:
Excluding 8,685 SNPs on non-autosomes
Error in .InitFile2(cmd = "Principal Component Analysis (PCA) on genotypes:", :
There is no SNP!

How to make sure that all SNPs are used in PCAir and not only Autosomes (as my SNPs have no chromosome position being a non-model species with no genome)?

Thank you for any help! Gabriella

sessionInfo( ): R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale: [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8 LC_MONETARY=English_Australia.utf8 [4] LC_NUMERIC=C LC_TIME=English_Australia.utf8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] GENESIS_2.26.0 GWASTools_1.42.1 Biobase_2.56.0 BiocGenerics_0.42.0 InRelate_0.1.0
[6] remotes_2.4.2 fstcore_0.9.12 radiator_1.2.8 OutFLANK_0.2 LEA_3.8.0
[11] vcfR_1.12.0 stockR_1.0.74 qvalue_2.28.0 SNPRelate_1.30.1 dartR_2.0.4
[16] adegenet_2.1.7 ade4_1.7-19 plotrix_3.8-2 forcats_0.5.1 stringr_1.4.1
[21] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
[26] tidyverse_1.3.1 ggplot2_3.4.0 plyr_1.8.7 SeqArray_1.36.3 gdsfmt_1.32.0

loaded via a namespace (and not attached): [1] utf8_1.2.2 R.utils_2.12.0 tidyselect_1.1.2 RSQLite_2.3.2
[5] BiocParallel_1.30.4 grid_4.2.1 combinat_0.0-8 StAMPP_1.6.3
[9] GWASExactHW_1.01 devtools_2.4.3 munsell_0.5.0 codetools_0.2-18
[13] withr_2.5.0 colorspace_2.0-3 fst_0.9.8 pegas_1.1
[17] knitr_1.39 rstudioapi_0.13 stats4_4.2.1 labeling_0.4.2
[21] RgoogleMaps_1.4.5.3 logistf_1.26.0 GenomeInfoDbData_1.2.8 bit64_4.0.5
[25] farver_2.1.1 gap.datasets_0.0.5 rprojroot_2.0.3 vctrs_0.5.1
[29] generics_0.1.3 xfun_0.31 R6_2.5.1 doParallel_1.0.17
[33] GenomeInfoDb_1.32.2 fields_14.0 bitops_1.0-7 cachem_1.0.6
[37] reshape_0.8.9 assertthat_0.2.1 promises_1.2.0.1 scales_1.2.0
[41] vroom_1.5.7 pinfsc50_1.2.0 gtable_0.3.0 formula.tools_1.7.1
[45] processx_3.7.0 spam_2.9-0 sandwich_3.0-2 MatrixModels_0.5-0
[49] rlang_1.0.6 calibrate_1.7.7 splines_4.2.1 rgdal_1.5-32
[53] hexbin_1.28.2 broom_1.0.0 BiocManager_1.30.18 reshape2_1.4.4
[57] modelr_0.1.8 backports_1.4.1 httpuv_1.6.5 tools_4.2.1
[61] usethis_2.1.6 ellipsis_0.3.2 raster_3.5-21 RColorBrewer_1.1-3
[65] DNAcopy_1.70.0 sessioninfo_1.2.2 Rcpp_1.0.9 zlibbioc_1.42.0
[69] RCurl_1.98-1.7 ps_1.7.1 prettyunits_1.1.1 viridis_0.6.2
[73] zoo_1.8-12 S4Vectors_0.34.0 haven_2.5.0 cluster_2.1.3
[77] fs_1.5.2 magrittr_2.0.3 SeqVarTools_1.34.0 data.table_1.14.2
[81] SparseM_1.81 genetics_1.3.8.1.3 lmtest_0.9-40 reprex_2.0.1
[85] mvtnorm_1.1-3 pkgload_1.3.0 hms_1.1.1 patchwork_1.1.1
[89] mime_0.12 xtable_1.8-4 readxl_1.4.0 IRanges_2.30.0
[93] gridExtra_2.3 compiler_4.2.1 mice_3.14.0 maps_3.4.0
[97] crayon_1.5.1 gdistance_1.3-6 R.oo_1.25.0 htmltools_0.5.2
[101] mgcv_1.8-40 later_1.3.0 tzdb_0.3.0 lubridate_1.8.0
[105] DBI_1.1.3 dbplyr_2.2.1 PopGenReport_3.0.7 MASS_7.3-57
[109] Matrix_1.4-1 permute_0.9-7 cli_3.4.1 R.methodsS3_1.8.2
[113] gdata_2.18.0.1 parallel_4.2.1 dotCall64_1.0-1 igraph_1.3.2
[117] GenomicRanges_1.48.0 pkgconfig_2.0.3 sp_1.5-0 terra_1.5-34
[121] versions_0.3 xml2_1.3.3 foreach_1.5.2 XVector_0.36.0
[125] rvest_1.0.2 quantsmooth_1.62.0 callr_3.7.1 digest_0.6.29
[129] vegan_2.6-2 Biostrings_2.64.0 cellranger_1.1.0 operator.tools_1.6.3
[133] curl_4.3.2 gap_1.2.3-6 quantreg_5.93 shiny_1.7.1
[137] gtools_3.9.3 lifecycle_1.0.3 nlme_3.1-157 dismo_1.3-5
[141] jsonlite_1.8.0 seqinr_4.2-16 viridisLite_0.4.0 fansi_1.0.3
[145] pillar_1.7.0 lattice_0.20-45 GGally_2.1.2 survival_3.3-1
[149] fastmap_1.1.0 httr_1.4.3 pkgbuild_1.3.1 glue_1.6.2
[153] mmod_1.3.3 png_0.1-7 iterators_1.0.14 bit_4.0.4
[157] stringi_1.7.8 blob_1.2.3 memoise_2.0.1 ape_5.6-2

assocTestSingle exhausts iterators

Block iterator methods in SeqVarTools and elsewhere tend to require a user-written loop to iterate through piecemeal subsets of data. The behavior of assocTestSingle methods conflicts a bit with this expectation, since its inner loop isn't escaped until the SeqVarIterator method returns FALSE, fully consuming the iterator.

While this behavior is efficient, and desirable in many settings, it could be worth adding the option to escape the while loop to allow users to interact with individual iterations, and prevent side effects. This could be as simple as an optional argument outfun=force to be called on res[[i]] after each iteration (like outfun=return, some fwrite closure, and so on)

Feel free to close if this contradicts the package's intended design!

assocTestSingle results: MAC, n.obs and freq don't match.

use "1-MAC/n.obs" to calculate SNP freq, but it doesn't match the freq given by assocTestSingle result.

row.id--variant.id--chr--pos--n.obs----freq----MAC----Score
486612 16055342 5 2648696 6387 0.9999198 1 -0.9808342238
32370 16055799 5 2665352 6387 0.9999153 1 -0.9951444874
243629 16142912 5 6575722 6387 0.9999915 0 -0.1023526695
457635 16175052 5 8172258 6387 0.9999654 0 -0.2336916838
466436 16175140 5 8176852 6387 0.9999654 0 -0.2336916838
188639 16192362 5 8995817 6387 0.9999416 1 -0.5098825422
131844 16211794 5 9978068 6387 0.9999090 1 -0.9181567810
3839139 16694315 5 34766540 6387 0.9999144 1 -0.9661750655
1469144 16716945 5 35962594 6387 0.9999154 1 -0.9703244084
1584144 16717060 5 35970516 6387 0.9999150 1 -0.9702497538
4554485 18455030 5 133261208 6387 0.9999555 1 -0.5108373173
4661758 18455137 5 133268025 6387 0.9999621 0 -0.4650721894
4289491 18464765 5 133842935 6387 0.9999313 1 -0.7213551748
4711802 18465187 5 133871022 6387 0.9999249 1 -0.8912234380
2261764 18472737 5 134354541 6387 0.9999988 0 -0.0123795277
4912514 18475388 5 134496277 6387 0.9999060 1 -0.9474345675
1979497 18492455 5 135394341 6387 0.9999146 1 -0.9584221596
2162511 18492638 5 135403773 6387 0.9999178 1 -0.9663698679
4565498 18505041 5 136104042 6387 0.9998915 1 -1.0145696174
2472517 18527948 5 137531525 6387 0.9992340 10 -1.4651801670
1762529 18567238 5 140169487 6387 0.9999941 0 -0.0702220428
859527 18641335 5 144213292 6387 0.9999209 1 -0.9724937863
2770522 18643246 5 144310206 6387 0.9999133 1 -0.9588007179

K0 values greater than 1

Hello!

I've been rereading the vignettes and paper and noticed that K0 is > 1 in the vignette: http://bioconductor.org/packages/devel/bioc/vignettes/GENESIS/inst/doc/pcair.html#output-from-pc-relate. Given that K0 is the probability of sharing 0 alleles, how can it be greater than 1? I've noticed this in my own data as well, even after applying the correction functions.

Is it possible to test individual samples against a built model?

Hello!

First off, I've been really impressed with how smooth it is to get GENESIS up and running even as a bit of a novice R user.

I have a large corpus of samples that I am generating a PC-Relate model for (millions of samples). I was wondering if it is possible to do the following things:

Add to the kinbtw model in batches? (over time, adding more samples)
Test an individual sample pairwise agains all samples in the model for kinship?

Looking at the code, I'm not seeing a way to do this. Am I missing anything built in? If not built in, do you think there is a path of adding it? (I'm happy be the implementor if needed).

Adding functions to perform meta-analysis

Any plan to add functions that would allow performing meta-analysis at the single variant and gene/variant aggregation unit levels?

Thanks!

Generate Dense Kinship matrix

Hello,

I trying to generate dense kinship matrix. How to return dense kinship matrix using (pcrelateToMatrix)?

Thank you,

correctK2() drops samples under specific conditions

Hi team,

I ran into a very unusual bug where calling correctK2() drops rows when called in a specific instance. I have an implementation based on your analysis pipeline, but because I replaced rbind() with dplyr::bind_rows() I ran into this issue

library(GENESIS)
sessionInfo()
packageVersion("dplyr")
# [1] ‘1.0.8’


b_kinSelf <- NULL
b_kinBtwn <- NULL
i <- 5
j <- 5
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- dplyr::bind_rows(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

i <- 4
j <- 4
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- dplyr::bind_rows(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

kinself_bind_rows <- b_kinSelf

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Stream 8

Matrix products: default
BLAS/LAPACK: /home/src/intel/compilers_and_libraries_2020.1.217/linux/mkl/li
b/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GENESIS_2.20.1 nvimcom_0.9-88

loaded via a namespace (and not attached):
 [1] quantsmooth_1.56.0     Rcpp_1.0.8.2           lattice_0.20-41
 [4] tidyr_1.2.0            Biostrings_2.58.0      formula.tools_1.7.1
 [7] zoo_1.8-9              assertthat_0.2.1       lmtest_0.9-38
[10] foreach_1.5.2          utf8_1.2.2             R6_2.5.1
[13] GenomeInfoDb_1.26.7    backports_1.4.1        MatrixModels_0.5-0
[16] gdsfmt_1.26.1          stats4_4.0.5           RSQLite_2.2.7
[19] pillar_1.7.0           zlibbioc_1.36.0        rlang_1.0.2
[22] data.table_1.14.0      SparseM_1.81           blob_1.2.1
[25] S4Vectors_0.28.1       Matrix_1.3-2           splines_4.0.5
[28] GWASExactHW_1.01       RCurl_1.98-1.6         bit_4.0.4
[31] broom_0.7.12           compiler_4.0.5         pkgconfig_2.0.3
[34] BiocGenerics_0.36.1    mgcv_1.8-34            tidyselect_1.1.2
[37] GenomeInfoDbData_1.2.4 tibble_3.1.6           DNAcopy_1.64.0
[40] IRanges_2.24.1         codetools_0.2-18       fansi_1.0.2
[43] crayon_1.5.0           dplyr_1.0.8            bitops_1.0-7
[46] grid_4.0.5             nlme_3.1-152           lifecycle_1.0.1
[49] DBI_1.1.2              magrittr_2.0.2         SeqVarTools_1.28.1
[52] cli_3.2.0              cachem_1.0.6           XVector_0.30.0
[55] SNPRelate_1.24.0       mice_3.13.0            ellipsis_0.3.2
[58] generics_0.1.2         vctrs_0.3.8            sandwich_3.0-1
[61] iterators_1.0.14       tools_4.0.5            bit64_4.0.5
[64] Biobase_2.50.0         glue_1.6.2             purrr_0.3.4
[67] parallel_4.0.5         fastmap_1.1.0          survival_3.2-10
[70] SeqArray_1.30.0        GenomicRanges_1.42.0   GWASTools_1.36.0
[73] operator.tools_1.6.3   memoise_2.0.1          logistf_1.24
[76] quantreg_5.88
[1] "block:"
[1] 5 5
[1] "before correction nrow res$kinBtwn:"
[1] 10335331
[1] "after correction nrow res$kinBtwn:"
[1] 10335331

[1] "block:"
[1] 4 4
[1] "before correction nrow res$kinBtwn:"
[1] 10330785
[1] "after correction nrow res$kinBtwn:"
[1] 2580151

I was able to track the problem down to lines 630 and 632 in the correctK2() function definition where it calls merge(); manually running equivalent code caused merge to behave in the same way. I think this is possibly a bug in the data.table method for merge(), but I'm not totally sure.

If I change only b_kinSelf <- dplyr::bind_rows(b_kinSelf, res$kinSelf) to b_kinSelf <- rbind(b_kinSelf, res$kinSelf), I do not trigger this bug:

# WORKING COUNTEREXAMPLE
b_kinSelf <- NULL
b_kinBtwn <- NULL
i <- 5
j <- 5
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- rbind(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

i <- 4
j <- 4
res <- readRDS(paste0("b_pcrelate_block_", i, "_", j, ".rds"))
print("block:")
print(c(i, j))
b_kinSelf <- rbind(b_kinSelf, res$kinSelf)
print("before correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
res$kinBtwn <- correctK2(kinBtwn = res$kinBtwn, kinSelf = b_kinSelf,
                         small.samp.correct = FALSE, pcs = NULL,
                         sample.include = NULL)
res$kinBtwn <- correctK0(kinBtwn = res$kinBtwn)
print("after correction nrow res$kinBtwn:")
print(nrow(res$kinBtwn))
b_kinBtwn <- dplyr::bind_rows(b_kinBtwn, res$kinBtwn)
rm(res)
cat("\n")

kinself_rbind <- b_kinSelf

[1] "block:"
[1] 5 5
[1] "before correction nrow res$kinBtwn:"
[1] 10335331
[1] "after correction nrow res$kinBtwn:"
[1] 10335331

[1] "block:"
[1] 4 4
[1] "before correction nrow res$kinBtwn:"
[1] 10330785
[1] "after correction nrow res$kinBtwn:"
[1] 10330785

The only difference I see between kinself_rbind and kinself_bind_rows is the latter has an extra attribute

> str(kinself_rbind)
Classes ‘data.table’ and 'data.frame':  9093 obs. of  3 variables:
 $ ID  : chr  [redacted]
 $ f   : num  0.000501 -0.016491 -0.0036 -0.00307 -0.008572 ...
 $ nsnp: num  197276 198380 198390 198323 198318 ...
 - attr(*, ".internal.selfref")=<externalptr>
> str(kinself_bind_rows)
Classes ‘data.table’ and 'data.frame':  9093 obs. of  3 variables:
 $ ID  : chr  [redacted]
 $ f   : num  0.000501 -0.016491 -0.0036 -0.00307 -0.008572 ...
 $ nsnp: num  197276 198380 198390 198323 198318 ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "ID"
> all.equal(kinself_rbind$ID, kinself_bind_rows$ID)
[1] TRUE
> all.equal(kinself_rbind$f, kinself_bind_rows$f)
[1] TRUE
> all.equal(kinself_rbind$nsnp, kinself_bind_rows$nsnp)
[1] TRUE

I know that's a lot of details to throw at you when the bug might be with a dependency, rather than your package, but I wanted to flag it in case there are circumstances where this might commonly crop up. Let me know if there's any other details I can give you and I have this example along with the example data on the Blue Lab servers if @smgogarten wants to take a look

Conditional analysis

Hi,

Is it possible to do conditional analysis in GENESIS without having to re-learn the null model from scratch each time a covariate is added?

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed

Hello,

I have had several issues that mimic the error (below) when trying to run GENESIS on a particular chromosome. It says there's some issue with an if statement, as stated below. In addition, it states some NA's were introduced.

Everything else seems to run fine and I have R compiled with Intel MKL libraries for matrix calculations, but for certain chromosomes / tests, I get the following error. It has been difficult to pinpoint the root of the error, but I am wondering if the authors or others could shed light on this error and how it may be avoided. I also wonder if the warning and the error are related.

...
Iteration 704 of 1086 completed
Iteration 715 of 1086 completed

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed
Calls: assocTestAggregate ... .local -> .meanImpute -> [<- -> [<- -> [<- -> [<- -> int2i

In addition: Warning message:
In int2i(as.integer(i), n) : NAs introduced by coercion to integer range
Execution halted

Thank you,
Vamsee

Error using assocTestAggregate

Hi,

I am using the GENESIS package to perform a SKAT-O analysis on every chromosome, variants being aggregated by genes.
I am using the same code for all of the chromosomes, and it worked perfectly until the chromosome 19.

Running the code

assoc <- assocTestAggregate(iterator,
+                             mod_1,
+                             test="SKATO",
+                             AF.max=0.01,
+                             weight.beta=c(1,25),
+                             verbose = FALSE)

I have the following message:
Error in seqBlockApply(gdsobj, "genotype", function(x) { :
Invalid position in CIndex.

I can provide more details if needed.
Thanks!

Loss of results when doing a block comparison with a single sample in one of the blocks.

Hello!

I'm running the pcrelateSampBock with a single sample in block1 and many samples in block2. When I run it this way, I get a block back that has nothing in it. I think it's because of this section:

GENESIS/R/pcrelate.R

Line 550 in 52d7c1b

estDT <- Reduce(merge, estDT)

where the data frames are reduced and since the nsnp df has sample names for ID1 and ID2 but the other metrics have just 1 for ID1, it reduces to nothing.

Stepping through the code, everything looks good up until the line referenced above, and then the data frame contains nothing.

Doing the same thing with 10 samples in block1 and the same samples in block2 yields results as expected.

I don't have a minimum reproducible example for this that I can share at the moment but I'll work on putting one together. I wanted to know if anyone else is running this in this way successfully?

error, unable to find an inherited method for function '.readSampleId' for signature

Hello,

I'd like to ask your help for solving an error. I got this error during the pcair.

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function '.readSampleId' for signature

Could you tell me how to resolve this error?

Thanks,
Jaeyoon

fn.gds<- "xxx.gds"
fn.out<- "xxx"
gds_obj <- snpgdsOpen(fn.gds) # from plink
#KING-----------------------------------------------------------------
king <- snpgdsIBDKING( gds_obj, verbose = FALSE)
kingMat <- king$kinship
dimnames(kingMat) <- list(king$sample.id, king$sample.id)
saveRDS(kingMat, paste(fn.out, ".rds", sep=""))
pcair_obj<- pcair( gdsobj=gds_obj, kinMat=kingMat, kin.thresh=2^(-9/2), divMat=kingMat, div.thresh=-2^(-9/2) )
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function '.readSampleId' for signature '"SNPGDSFileClass"'
gds_obj
File: xxx.gds (453.4M)

[ ] *
|--+ sample.id { Str8 5109 LZMA_ra(14.5%), 8.4K }
|--+ snp.id { Str8 370810 LZMA_ra(18.5%), 780.1K }
|--+ snp.position { Int32 370810 LZMA_ra(54.9%), 795.7K }
|--+ snp.chromosome { UInt8 370810 LZMA_ra(0.08%), 293B } *
|--+ snp.allele { Str8 370810 LZMA_ra(11.6%), 167.5K }
|--+ genotype { Bit2 5109x370810, 451.7M } *
--+ sample.annot [ data.frame ] *
|--+ sex { Str8 5109 LZMA_ra(2.15%), 117B }
--+ phenotype { Int32 5109 LZMA_ra(0.69%), 149B }

variable names with "SE" or "Est" crash

When running assocTestSingle with a GxE effct (though I suppose the problem would be the same without) I got the error message

Error in res[g, grep("SE", colnames(res))] <- sqrt(diag(Vbetas)) :
    number of items to replace is not a multiple of replacement length

along with

In addition: Warning message:
In grep(paste0("^", col.name), colnames(null.model$model.matrix)) :
  argument 'pattern' has length > 1 and only the first element will be used

Trurns out that in .testGenoSingleVarWaldGxE standard errors are set by res[g, grep("SE", colnames(res))] <- sqrt(diag(Vbetas)). If any of the names of covariates matches SE (e.g. SEX) the regex also matches Est.G.SEX which leads to incorrect number of values.

I don't think the regex thing is ideal but changing it to something like ^SE\\.G" should make it a bit more robust.

uw-gac / genesis Goto Github PK

genesis's People

Contributors

Stargazers

Watchers

Forkers

genesis's Issues

GDS file obtained from a genlight object (dartR package) with the dartR::gl2gds" function. Then GDS file converted into gdsobj with the code below (from GDWAS tools package):

object without annotation

PCAir

Recommend Projects

Recommend Topics

Recommend Org