
MemoryError · tensorqtl (closed, 8 comments)

broadinstitute commented on July 22, 2024
MemoryError


Comments (8)

francois-a commented on July 22, 2024

Hi,

How much memory does your instance have? Based on the error log, you're trying to load a VCF with ~78M variants and 445 samples, which will require at least ~35GB for the VCF alone. I recommend filtering the VCF for common variants before running tensorQTL, since this has to be done anyway and otherwise results in unnecessary I/O.
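That estimate roughly checks out if each genotype occupies one byte in memory (an assumption about the in-memory layout, used here only for the arithmetic):

# ~78M variants x 445 samples at an assumed 1 byte per genotype
n_variants, n_samples = 78_000_000, 445
print('{:.1f} GB'.format(n_variants * n_samples / 1e9))  # ~34.7 GB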

For the GPU issue, have you looked at the installation instructions here?

You can test your setup using the following commands:

python -c "import torch; print(torch.__version__)
python -c "import torch; print(torch.cuda.is_available())

After that, please try running the example notebook.


snesic commented on July 22, 2024

The instance has 64 GB. The idea is to be able to run it on any instance (as we do now with FastQTL; it just becomes slower). I believe your example works well, but we could generally have much larger VCFs, so it would be good to know how it scales.

Yes, I checked the GPU installation locally and it works. It would be excellent if that were already set up in the Docker image so I don't have to add additional scripts.

Thanks,


francois-a commented on July 22, 2024

The docker image is based on an nvidia image, which has some additional requirements (see https://hub.docker.com/r/nvidia/cuda/).

There is now an option to load each chromosome into memory individually for large VCFs (--load_split), but I do recommend using a VCF with common variants only, since others should not be included in mapping.


snesic commented on July 22, 2024

Right, it makes sense to keep only common variants. Would you suggest removing variants where we don't have all three genotypes (0/0, 0/1, 1/1)?

I would still like to make sure that tensorQTL doesn't fail when the VCF is large enough. Here we only had 445 samples; what would happen with 2k samples? I also tried --load_split, but it breaks before that, as it tries to load all variants into memory.

So far, I guess the safest way is to run TensorQTL per chromosome.
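With the tensorQTL Python API, a per-chromosome loop might look like this (a sketch only; it assumes the standard input dataframes are already in memory, and in practice the VCF itself would be split per chromosome first):

# Sketch: map each chromosome separately (dataframe names follow tensorQTL's API)
for chrom in variant_df['chrom'].unique():
    vix = variant_df['chrom'] == chrom       # variants on this chromosome
    pix = phenotype_pos_df['chr'] == chrom   # phenotypes on this chromosome
    cis.map_nominal(genotype_df[vix], variant_df[vix],
                    phenotype_df[pix], phenotype_pos_df[pix],
                    covariates_df, '{}.{}'.format(prefix, chrom),
                    output_dir=output_dir)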

The wrapper script is very convenient for running the analysis on an instance non-interactively. There are two issues I faced. First, the script fails if a gene's/transcript's expression values are all zero (or any constant value); many quantifiers return all genes even when their value is zero, so we can end up with a gene that is constant (zero) across samples. Second, it also fails when there is only one variant in a gene's cis region. I know this is very unlikely, but it could potentially ruin a big analysis.


francois-a commented on July 22, 2024

No, besides the usual QC steps (e.g., filtering out variants that fail HWE), I'd simply filter by MAF. You can have variants that pass these steps and don't have all three dosages.
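As an illustration (not the project's own filtering step), a simple MAF filter on an in-memory dosage matrix could look like this; the 0.05 cutoff is a placeholder:

import numpy as np

# Assumes genotype_df is variants x samples with dosage values 0/1/2
af = genotype_df.mean(axis=1) / 2     # alternate allele frequency
maf = np.minimum(af, 1 - af)          # minor allele frequency
keep = maf >= 0.05                    # placeholder threshold
genotype_df = genotype_df[keep]
variant_df = variant_df[keep]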

Can you please give specific and reproducible examples for each of these problems (and open a separate issue for them)?

I don't see any issues when the phenotype is constant for all samples (the p-values in the output are NaN, as expected). Same when there is only a single variant in the cis window. Here's the data and code I used:

import numpy as np
import pandas as pd
from tensorqtl import cis

prefix = 'test'
output_dir = '.'

# 1,000 variants x 1,000 samples with random 0/1/2 dosages
genotype_df = pd.DataFrame(np.random.choice([0, 1, 2], [1000, 1000]),
# genotype_df = pd.DataFrame(np.zeros([1000, 1000]),  # constant genotypes
                            index=['chr1_{}'.format(i) for i in range(1, 1001)],
                            columns=['P{}'.format(i) for i in range(1, 1001)])
variant_df = pd.DataFrame(index=genotype_df.index)
variant_df['chrom'] = 'chr1'
variant_df['pos'] = np.arange(1, 1001)

# 5 phenotypes x 1,000 samples
phenotype_df = pd.DataFrame(np.random.randn(5, 1000),
# phenotype_df = pd.DataFrame(np.ones([5, 1000]),  # constant phenotypes
                             index=['P{}'.format(i) for i in range(1, 6)],
                             columns=genotype_df.columns)
phenotype_pos_df = pd.DataFrame(index=phenotype_df.index)
phenotype_pos_df['chr'] = 'chr1'
phenotype_pos_df['tss'] = 500

covariates_df = pd.DataFrame(np.random.randn(1000, 3),
                             index=phenotype_df.columns,
                             columns=['C{}'.format(i) for i in range(1, 4)])

cis.map_nominal(genotype_df, variant_df,
                phenotype_df, phenotype_pos_df,
                covariates_df, prefix,
                output_dir=output_dir)

# or for a single variant:
# cis.map_nominal(genotype_df.loc[['chr1_5']], variant_df.loc[['chr1_5']],
#                 phenotype_df, phenotype_pos_df,
#                 covariates_df, prefix,
#                 output_dir=output_dir)

df = pd.read_parquet('test.cis_qtl_pairs.chr1.parquet')
df


snesic commented on July 22, 2024

Thanks for the instructions and code. It could be a library incompatibility, but when I run your code I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code is similar so I will open a new issue.

Just to be sure before I close this issue: there is no plan to make the wrapper script scale, i.e., work with any number of variants or samples?


francois-a commented on July 22, 2024

Thanks! Can you please clarify what you mean by scaling, and specifically what size limits you have in mind? There should be no problem running the current version on a VCF of ~2k samples and ~10M variants with ~50k phenotypes on 100GB of memory.
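For reference, assuming dosages are held as float32 (4 bytes each; an assumption about the internal representation), the genotype matrix for that case would be:

# 10M variants x 2,000 samples at 4 bytes (float32) per dosage
print('{:.0f} GB'.format(10_000_000 * 2_000 * 4 / 1e9))  # 80 GB, within 100 GB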

I can add an option similar to --load_split that simply processes the VCF in fixed-size chunks.
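A minimal sketch of such fixed-size chunking, assuming pysam for VCF parsing (this is an illustration, not an existing tensorQTL function):

import pysam  # assumption: pysam is available for VCF parsing

def iter_variant_chunks(vcf_path, chunk_size=500_000):
    """Yield lists of VCF records in fixed-size chunks (illustrative helper)."""
    with pysam.VariantFile(vcf_path) as vcf:
        chunk = []
        for record in vcf:
            chunk.append(record)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk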


snesic commented on July 22, 2024

By scaling I mean being able to run it successfully on any machine, no matter how large my VCF is. Now I remember that FastQTL had similar issues: the wrapper script could run chunks in parallel, but some chunks failed silently. They were killed due to lack of RAM, yet the script finished successfully. It took us quite some time to figure out why we were getting different-sized output each time.

Here, the error was, as you mentioned, that it couldn't initialize an array of ~78M variants. There will probably never be that many variants (after filtering), even if the number of samples is much larger than 450, but it would be nice not to have to think about it.

Maybe you could just use a try/except (instead of an additional parameter) and create chunks in case of a memory error, but I'm not sure this would help (it might kill Python rather than raise the error); I never had a chance to test it in more detail.
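That fallback, sketched with hypothetical helper names (with the caveat above: the OS OOM killer can terminate the process before a MemoryError is ever raised):

try:
    genotype_df, variant_df = load_genotypes(vcf_path)  # hypothetical full load
except MemoryError:
    # Hypothetical chunked fallback; may never run if the OS kills
    # the process before Python can raise MemoryError.
    for chunk in iter_variant_chunks(vcf_path):
        process_chunk(chunk)  # hypothetical per-chunk mapping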

