Comments (8)
Hi,
How much memory does your instance have? Based on the error log, you're trying to load a VCF with ~78M variants and 445 samples, which will require at least ~35GB for the VCF alone. I recommend filtering the VCF for common variants before running tensorQTL: rare variants have to be excluded from mapping anyway, and loading them only adds unnecessary I/O.
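The ~35GB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes each genotype is stored as a single byte (e.g., int8), which is an assumption made here for illustration; the actual in-memory layout may differ.

```python
# Rough memory estimate for a dense genotype matrix.
# Assumption: one byte per genotype (e.g., int8); real usage may be higher.
def genotype_matrix_gb(n_variants, n_samples, bytes_per_genotype=1):
    """Return the approximate size of a dense genotype matrix in GB (1e9 bytes)."""
    return n_variants * n_samples * bytes_per_genotype / 1e9

estimate_gb = genotype_matrix_gb(78_000_000, 445)
print(f"{estimate_gb:.1f} GB")  # 34.7 GB
```

With 4-byte values the same matrix would be four times larger, so the one-byte estimate is a lower bound.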
For the GPU issue, have you looked at the installation instructions here?
You can test your setup using the following commands:
python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.cuda.is_available())"
After that, please try running the example notebook.
from tensorqtl.
The instance has 64 GB. The idea is to be able to run it on any instance, as we do now with FastQTL (it just becomes slower). I believe that your example works well, but we could generally have much larger VCFs, so it would be good to know how it scales.
Yes, I checked the GPU installation locally and it works. It would be excellent if that was already implemented in the docker image so I don't have to add additional scripts.
Thanks,
The docker image is based on an nvidia image, which has some additional requirements (see https://hub.docker.com/r/nvidia/cuda/).
There is now an option to load each chromosome into memory individually for large VCFs (--load_split), but I do recommend using a VCF with common variants only, since others should not be included in mapping.
Right, it makes sense to keep only common variants. Would you suggest removing variants for which we don't observe all three genotypes (0/0, 0/1, 1/1)?
I would still like to make sure that TensorQTL doesn't fail when the VCF is large. Here we only had 445 samples; what would happen with 2k samples? I also tried --load_split, but it fails earlier, because it still tries to load all variants into memory.
So far, I guess the safest way is to run TensorQTL per chromosome.
The wrapper script is very convenient for running the analysis on an instance non-interactively. I ran into two issues. First, the script fails if the expression values of a gene/transcript are all zero (or any other constant). Many quantifiers report all genes, even those with zero counts, so a gene can end up with a constant value (zero) across all samples. Second, it fails when there is only one variant in a gene's cis region. I know this is very unlikely, but it could ruin a big analysis.
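Independently of how the tool handles them, constant phenotypes can be dropped before mapping. The following is a minimal sketch, assuming a phenotypes-by-samples DataFrame; `drop_constant_phenotypes` is a hypothetical helper written for this example and is not part of tensorQTL.

```python
import numpy as np
import pandas as pd

def drop_constant_phenotypes(phenotype_df):
    """Remove phenotypes (rows) that take a single value across all samples."""
    is_constant = phenotype_df.nunique(axis=1) <= 1
    return phenotype_df.loc[~is_constant]

# Toy data: 3 phenotypes x 6 samples
rng = np.random.default_rng(0)
pheno = pd.DataFrame(rng.standard_normal((3, 6)),
                     index=['geneA', 'geneB', 'geneC'],
                     columns=[f'S{i}' for i in range(1, 7)])
pheno.loc['geneB'] = 0.0  # e.g., a gene quantified as zero in every sample

kept = drop_constant_phenotypes(pheno)
print(list(kept.index))  # ['geneA', 'geneC']
```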
No, besides the usual QC steps (e.g., filtering out variants that fail HWE), I'd simply filter by MAF. You can have variants that pass these steps and don't have all three dosages.
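As an illustration of filtering by MAF, here is a minimal sketch on a dosage matrix. It assumes dosages coded 0/1/2 with variants in rows and samples in columns; `filter_by_maf` and the 0.1 threshold are hypothetical choices for this toy example, not a recommendation from the thread.

```python
import numpy as np
import pandas as pd

def filter_by_maf(genotype_df, maf_threshold=0.01):
    """Keep variants (rows) whose minor allele frequency is >= maf_threshold.

    Assumes dosages coded 0/1/2; missing values (NaN) are skipped by mean().
    """
    alt_freq = genotype_df.mean(axis=1) / 2   # alternate allele frequency
    maf = np.minimum(alt_freq, 1 - alt_freq)  # fold to the minor allele
    return genotype_df.loc[maf >= maf_threshold]

# Toy data: 4 variants x 10 samples
toy = pd.DataFrame(
    [[0] * 10,         # monomorphic, MAF = 0
     [1] + [0] * 9,    # MAF = 0.05
     [0, 1] * 5,       # MAF = 0.25, no 1/1 genotype observed
     [2] * 9 + [1]],   # alt. allele frequency 0.95, MAF = 0.05
    index=['v1', 'v2', 'v3', 'v4'],
    columns=[f'S{i}' for i in range(1, 11)])

filtered = filter_by_maf(toy, maf_threshold=0.1)
print(list(filtered.index))  # ['v3']
```

Note that `v3` passes the filter even though only two of the three genotypes are observed.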
Can you please give specific and reproducible examples for each of these problems (and open a separate issue for them)?
I don't see any issues when the phenotype is constant for all samples (the p-values in the output are NaN, as expected). Same when there is only a single variant in the cis window. Here's the data and code I used:
import numpy as np
import pandas as pd
from tensorqtl import cis

prefix = 'test'
output_dir = '.'

# genotypes: 1000 variants (rows) x 1000 samples (columns)
genotype_df = pd.DataFrame(np.random.choice([0,1,2], [1000,1000]),
# genotype_df = pd.DataFrame(np.zeros([1000,1000]),
                           index=['chr1_{}'.format(i) for i in range(1,1001)],
                           columns=['P{}'.format(i) for i in range(1,1001)])
variant_df = pd.DataFrame(index=genotype_df.index)
variant_df['chrom'] = 'chr1'
variant_df['pos'] = np.arange(1,1001)

# phenotypes: 5 phenotypes (rows) x 1000 samples (columns)
phenotype_df = pd.DataFrame(np.random.randn(5, 1000),
# phenotype_df = pd.DataFrame(np.ones([5, 1000]),
                            index=['P{}'.format(i) for i in range(1,6)],
                            columns=genotype_df.columns)
phenotype_pos_df = pd.DataFrame(index=phenotype_df.index)
phenotype_pos_df['chr'] = 'chr1'
phenotype_pos_df['tss'] = 500

# covariates: 1000 samples (rows) x 3 covariates (columns)
covariates_df = pd.DataFrame(np.random.randn(1000,3), index=phenotype_df.columns,
                             columns=['C{}'.format(i) for i in range(1,4)])

cis.map_nominal(genotype_df, variant_df,
                phenotype_df, phenotype_pos_df,
                covariates_df, prefix,
                output_dir=output_dir)
# or for single variant:
# cis.map_nominal(genotype_df.loc[['chr1_5']], variant_df.loc[['chr1_5']],
#                 phenotype_df, phenotype_pos_df,
#                 covariates_df, prefix,
#                 output_dir=output_dir)

# inspect the results written by map_nominal
df = pd.read_parquet('test.cis_qtl_pairs.chr1.parquet')
df
Thanks for the instructions and code. It could be some library incompatibility but when I run your code I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
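For context, NumPy raises this error whenever an array with more than one element is used where a single boolean is expected. The sketch below reproduces the message in isolation; it does not identify which line of tensorQTL triggers it, which the thread leaves open.

```python
import numpy as np

mask = np.array([True, False, True])

try:
    if mask:  # ambiguous: should this mean any() or all()?
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Stating the intent explicitly avoids the error.
print(mask.any(), mask.all())  # True False
```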
My code is similar so I will open a new issue.
Just to be sure before I close this issue, there is no plan to make the wrapper script scale, i.e. work with any number of variants or samples?
Thanks! Can you please clarify what you mean by scaling, and specifically what size limits? There should be no problems running the current version on a VCF of ~2k samples and ~10M variants with ~50k phenotypes on 100GB of memory.
I can add an option similar to --load_split that simply processes the VCF in fixed-size chunks.
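Fixed-size chunking can be sketched generically as follows. This only illustrates the idea and is not how tensorQTL implements it; `iter_chunks` is a hypothetical helper.

```python
def iter_chunks(n_items, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_items) in fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        # the last chunk may be shorter than chunk_size
        yield start, min(start + chunk_size, n_items)

chunks = list(iter_chunks(n_items=10, chunk_size=4))
print(chunks)  # [(0, 4), (4, 8), (8, 10)]
```

Each (start, stop) pair would select the slice of variants to load and process, so peak memory depends on the chunk size and not on the total number of variants.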
By scaling I mean being able to run it successfully on any machine, no matter how large my VCF is. Now I remember that FastQTL had similar issues. Its wrapper script could run chunks in parallel, but some chunks failed silently: they were killed due to a lack of RAM, yet the script finished successfully. It took us quite some time to realize why the output size differed between runs.
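One guard against silently dropped chunks is to compare the chunks that were expected with those that actually produced output before merging. This is a minimal sketch; `missing_chunks` and the integer chunk identifiers are hypothetical and not taken from FastQTL or tensorQTL.

```python
def missing_chunks(expected_ids, completed_ids):
    """Return the sorted chunk identifiers that were expected but never completed."""
    return sorted(set(expected_ids) - set(completed_ids))

expected = range(1, 6)    # chunks 1..5 were scheduled
completed = [1, 2, 4, 5]  # chunk 3 was killed and wrote no output

missing = missing_chunks(expected, completed)
if missing:
    print(f"incomplete run, missing chunks: {missing}")
```

A wrapper could exit with a non-zero status at this point so that a partial result is never mistaken for a complete one.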
Here, as you mentioned, the error was that it couldn't initialize an array for ~78M variants. There will probably never be that many variants after filtering, even if the number of samples is much larger than 450, but it would be nice not to have to think about that.
Maybe you could just use try/except (instead of an additional parameter) and fall back to chunks on a memory error, but I am not sure this would help (the process might be killed rather than raise the error). I never had a chance to test it in more detail.
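The try/except idea could look roughly like the sketch below, which halves the chunk size each time a MemoryError is raised. The function names are hypothetical, and `fake_process` merely simulates a loader that fails above a size limit. The caveat raised above still applies: if the operating system kills the process for running out of memory, no exception reaches Python and the fallback never runs.

```python
def process_with_fallback(n_items, process, min_chunk=1):
    """Try to process everything at once; on MemoryError, halve the chunk size and retry."""
    chunk_size = n_items
    while chunk_size >= min_chunk:
        try:
            return [process(start, min(start + chunk_size, n_items))
                    for start in range(0, n_items, chunk_size)]
        except MemoryError:
            chunk_size //= 2  # retry from the beginning with smaller chunks
    raise MemoryError("could not process even the smallest chunk")

# Hypothetical stand-in for a loader that fails above a size limit.
def fake_process(start, stop, limit=300):
    if stop - start > limit:
        raise MemoryError
    return stop - start

sizes = process_with_fallback(1000, fake_process)
print(sizes)  # [250, 250, 250, 250]
```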