
MemoryError · tensorqtl (closed, 8 comments)

broadinstitute commented on July 22, 2024
MemoryError


Comments (8)

francois-a commented on July 22, 2024

Hi,

How much memory does your instance have? Based on the error log, you're trying to load a VCF with ~78M variants and 445 samples, which will require at least ~35GB for the VCF alone. I recommend filtering the VCF for common variants before running tensorQTL, since this has to be done anyway and otherwise results in unnecessary I/O.
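That estimate roughly checks out if each genotype occupies one byte in memory (an assumption about the in-memory layout, used here only for the arithmetic):

# ~78M variants x 445 samples at an assumed 1 byte per genotype
n_variants, n_samples = 78_000_000, 445
print('{:.1f} GB'.format(n_variants * n_samples / 1e9))  # ~34.7 GB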

For the GPU issue, have you looked at the installation instructions here?

You can test your setup using the following commands:

python -c "import torch; print(torch.__version__)
python -c "import torch; print(torch.cuda.is_available())

After that, please try running the example notebook.


snesic commented on July 22, 2024

The instance has 64 GB. The idea is to be able to run it on any instance (as we do now with FastQTL; it just becomes slower). I believe your example works well, but we could generally have much larger VCFs, so it would be good to know how it scales.

Yes, I checked the GPU installation locally and it works. It would be excellent if that were already set up in the Docker image so I don't have to add additional scripts.

Thanks,


francois-a commented on July 22, 2024

The docker image is based on an nvidia image, which has some additional requirements (see https://hub.docker.com/r/nvidia/cuda/).

There is now an option to load each chromosome into memory individually for large VCFs (--load_split), but I do recommend using a VCF with common variants only, since others should not be included in mapping.


snesic commented on July 22, 2024

Right, it makes sense to keep only common variants. Would you suggest removing variants where we don't have all three genotypes (0/0, 0/1, 1/1)?

I would still like to make sure that tensorQTL doesn't fail when the VCF is large enough. Here we only had 445 samples; what would happen with 2k samples? I also tried --load_split, but it breaks before that, as it tries to load all variants into memory.

So far, I guess the safest way is to run TensorQTL per chromosome.
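With the tensorQTL Python API, a per-chromosome loop might look like this (a sketch only; it assumes the standard input dataframes are already in memory, and in practice the VCF itself would be split per chromosome first):

# Sketch: map each chromosome separately (dataframe names follow tensorQTL's API)
for chrom in variant_df['chrom'].unique():
    vix = variant_df['chrom'] == chrom       # variants on this chromosome
    pix = phenotype_pos_df['chr'] == chrom   # phenotypes on this chromosome
    cis.map_nominal(genotype_df[vix], variant_df[vix],
                    phenotype_df[pix], phenotype_pos_df[pix],
                    covariates_df, '{}.{}'.format(prefix, chrom),
                    output_dir=output_dir)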

The wrapper script is very convenient for running the analysis on an instance non-interactively. There are two issues I faced. First, the script fails if a gene's/transcript's expression values are all zero (or any constant value); many quantifiers return all genes even when their value is zero, so we can end up with a gene that is constant (zero) across samples. Second, it also fails when there is only one variant in a gene's cis region. I know this is very unlikely, but it could potentially ruin a big analysis.


francois-a commented on July 22, 2024

No, besides the usual QC steps (e.g., filtering out variants that fail HWE), I'd simply filter by MAF. You can have variants that pass these steps and don't have all three dosages.
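As an illustration (not the project's own filtering step), a simple MAF filter on an in-memory dosage matrix could look like this; the 0.05 cutoff is a placeholder:

import numpy as np

# Assumes genotype_df is variants x samples with dosage values 0/1/2
af = genotype_df.mean(axis=1) / 2     # alternate allele frequency
maf = np.minimum(af, 1 - af)          # minor allele frequency
keep = maf >= 0.05                    # placeholder threshold
genotype_df = genotype_df[keep]
variant_df = variant_df[keep]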

Can you please give specific and reproducible examples for each of these problems (and open a separate issue for them)?

I don't see any issues when the phenotype is constant for all samples (the p-values in the output are NaN, as expected). Same when there is only a single variant in the cis window. Here's the data and code I used:

import numpy as np
import pandas as pd
from tensorqtl import cis

prefix = 'test'
output_dir = '.'

# 1,000 variants x 1,000 samples with random 0/1/2 dosages
genotype_df = pd.DataFrame(np.random.choice([0, 1, 2], [1000, 1000]),
# genotype_df = pd.DataFrame(np.zeros([1000, 1000]),  # constant genotypes
                            index=['chr1_{}'.format(i) for i in range(1, 1001)],
                            columns=['P{}'.format(i) for i in range(1, 1001)])
variant_df = pd.DataFrame(index=genotype_df.index)
variant_df['chrom'] = 'chr1'
variant_df['pos'] = np.arange(1, 1001)

# 5 phenotypes x 1,000 samples
phenotype_df = pd.DataFrame(np.random.randn(5, 1000),
# phenotype_df = pd.DataFrame(np.ones([5, 1000]),  # constant phenotypes
                             index=['P{}'.format(i) for i in range(1, 6)],
                             columns=genotype_df.columns)
phenotype_pos_df = pd.DataFrame(index=phenotype_df.index)
phenotype_pos_df['chr'] = 'chr1'
phenotype_pos_df['tss'] = 500

covariates_df = pd.DataFrame(np.random.randn(1000, 3),
                             index=phenotype_df.columns,
                             columns=['C{}'.format(i) for i in range(1, 4)])

cis.map_nominal(genotype_df, variant_df,
                phenotype_df, phenotype_pos_df,
                covariates_df, prefix,
                output_dir=output_dir)

# or for a single variant:
# cis.map_nominal(genotype_df.loc[['chr1_5']], variant_df.loc[['chr1_5']],
#                 phenotype_df, phenotype_pos_df,
#                 covariates_df, prefix,
#                 output_dir=output_dir)

df = pd.read_parquet('test.cis_qtl_pairs.chr1.parquet')
df


snesic commented on July 22, 2024

Thanks for the instructions and code. It could be a library incompatibility, but when I run your code I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code is similar so I will open a new issue.

Just to be sure before I close this issue: there is no plan to make the wrapper script scale, i.e., work with any number of variants or samples?


francois-a commented on July 22, 2024

Thanks! Can you please clarify what you mean by scaling, and specifically what size limits you have in mind? There should be no problem running the current version on a VCF of ~2k samples and ~10M variants with ~50k phenotypes on 100GB of memory.
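For reference, assuming dosages are held as float32 (4 bytes each; an assumption about the internal representation), the genotype matrix for that case would be:

# 10M variants x 2,000 samples at 4 bytes (float32) per dosage
print('{:.0f} GB'.format(10_000_000 * 2_000 * 4 / 1e9))  # 80 GB, within 100 GB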

I can add an option similar to --load_split that simply processes the VCF in fixed-size chunks.
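A minimal sketch of such fixed-size chunking, assuming pysam for VCF parsing (this is an illustration, not an existing tensorQTL function):

import pysam  # assumption: pysam is available for VCF parsing

def iter_variant_chunks(vcf_path, chunk_size=500_000):
    """Yield lists of VCF records in fixed-size chunks (illustrative helper)."""
    with pysam.VariantFile(vcf_path) as vcf:
        chunk = []
        for record in vcf:
            chunk.append(record)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk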


snesic commented on July 22, 2024

By scaling I mean being able to run it successfully on any machine, no matter how large my VCF is. Now I remember that FastQTL had similar issues: the wrapper script could run chunks in parallel, but some chunks failed silently. They were killed due to lack of RAM, yet the script finished successfully. It took us quite some time to figure out why we were getting different-sized output each time.

Here, the error was, as you mentioned, that it couldn't initialize an array of ~78M variants. There will probably never be that many variants (after filtering), even if the number of samples is much larger than 450, but it would be nice not to have to think about it.

Maybe you could just use a try/except (instead of an additional parameter) and create chunks in case of a memory error, but I'm not sure this would help (it might kill Python rather than raise the error); I never had a chance to test it in more detail.
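That fallback, sketched with hypothetical helper names (with the caveat above: the OS OOM killer can terminate the process before a MemoryError is ever raised):

try:
    genotype_df, variant_df = load_genotypes(vcf_path)  # hypothetical full load
except MemoryError:
    # Hypothetical chunked fallback; may never run if the OS kills
    # the process before Python can raise MemoryError.
    for chunk in iter_variant_chunks(vcf_path):
        process_chunk(chunk)  # hypothetical per-chunk mapping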

