Hello, I am running pcrelateToMatrix on 5k samples and it has been running for several

That seems unusual; it should not be taking that long. What ve

Thank you for your help! GENESIS_2.22.2 <

pcrelateToMatrix slow on 50k samples about genesis HOT 10 CLOSED

uw-gac commented on July 17, 2024

pcrelateToMatrix slow on 50k samples

from genesis.

Comments (10)

smgogarten commented on July 17, 2024 1

Yes, that's what I would suggest. With thresh=NULL, the algorithm will consider any kinship value >0 a related pair. Since you have so many relationships, I think that's why it's taking so long. Setting a different threshold will eliminate the pairs with very low kinship values, and give you a block-diagonal matrix where the samples in each block are more closely related. You can experiment with different threshold values.

from genesis.

smgogarten commented on July 17, 2024

That seems unusual; it should not be taking that long.

What version of GENESIS are you using?
How many rows are in the pcrelate kinBtwn data frame?
Have you tried setting the threshold to a higher value, so there are fewer samples in each block? Note that if you are using the default setting of scaleKin=2, the threshold applies to the rescaled values, so if you want to make blocks of 3rd-degree relatives and above, your threshold would be 2*2^(-9/2) = 0.088

from genesis.

byoo commented on July 17, 2024

Thank you for your help!

GENESIS_2.22.2

> nrow(pcrel$kinBtwn)
[1] 1249425066

No, I haven't. Just to be sure, do you suggest to run pcrelateToMatrix(pcrel, scaleKin=2, thresh=0.088)?

from genesis.

byoo commented on July 17, 2024

It made a progress! In the end, it encountered an error as below. And, it did finish successfully with a higher threshold. But I don't think I understood how to select a good threshold. Would it be 2*2^(-4/2) to make blocks of 2nd-degree relatives and above?

Creating block matrices for clusters...
Error in if (n > 0) c(NA_integer_, -n) else integer() :
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rep.fac * nx : NAs produced by integer overflow
2: In .set_row_names(as.integer(prod(d))) :
  NAs introduced by coercion to integer range

In addition, I encountered an error below while trying to build a null model using the GRM created from the high threshed in the step above. Could you help me understand and resolve the issues? Thanks much for your help!

Error in .local(x, ...) :
  internal_chm_factor: Cholesky factorization failed
In addition: Warning message:
In .local(x, ...) :
  Cholmod warning 'not positive definite' at file ../Cholesky/t_cholmod_rowfac.c, line 430

from genesis.

mconomos commented on July 17, 2024

Hello - going back a couple of messages:

nrow(pcrel$kinBtwn)
[1] 1249425066

Something seems off here. If you have 5K samples, then kinBtwn should only have 12,497,500 rows. Your number is off by a factor of 100. How many rows are in the kinSelf object?

from genesis.

byoo commented on July 17, 2024

Ah, I have 50k samples. Thanks for checking and sorry for the confusion!

from genesis.

byoo commented on July 17, 2024

After trying with different parameters, I was able to build a null model successfully. I ended up using thresh = 0.3. That's higher than what you recommended. What would be implications to use a high thresh in building GRM?

from genesis.

mconomos commented on July 17, 2024

I'm a bit concerned that there is an issue with your PC-Relate results -- have you tried making any plots of the estimates? For example, a histogram of estimated pairwise kinship coefficients, or a hexbin plot of kinship coefficients vs K0 (if you estimated the IBD sharing probabilities), or a histogram of the estimated inbreeding coefficients. A threshold of 0.3 is really high, so unless you have a ton of 1st and 2nd degree relatives in your sample, you really shouldn't need the threshold to be that high.

from genesis.

byoo commented on July 17, 2024

This dataset is a subset of UKB exome and kin vs K0 is plotted below (sampled 10M rows). How would downstream analyses (e.g. fitting null model and association tests) be affected by the high threshold? Random effects from distantly related pairs are ignored? GRM gets orders of magnitude larger even with small changes (e.g. threshold of 2.7) and I had difficulties fitting null model. Thank you.

from genesis.

mconomos commented on July 17, 2024

That kinship vs k0 plot is concerning. The maximum value k0 should take is 1 (it can be a little bit larger than 1 because the estimator is not truncated, but it certainly should not have values as large as you are seeing). I can think of a couple possibilities for why your relatedness estimates would be bad:

You are adjusting for either too many or not enough ancestry principal components in the PC-Relate analysis. How many PCs did you adjust for and how did you go about selecting them?
An issue with the genotype data in the GDS file. You might want to look at allele frequency and variant heterozygosity distributions to make sure those seem reasonable and that the genotype data is as expected.

from genesis.

pcrelateToMatrix slow on 50k samples about genesis HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent