
polyfun's Introduction

PolyFun, PolyLoc and PolyPred

PolyFun (POLYgenic FUNctionally-informed fine-mapping)
PolyLoc (POLYgenic LOCalization of complex trait heritability)
PolyPred (POLYgenic Prediction of complex traits)

This repo contains the code of the methods PolyFun for functionally-informed fine-mapping, PolyLoc for polygenic localization of complex trait heritability, and PolyPred for complex trait prediction. PolyFun and PolyLoc are described in Weissbrod et al. 2020 Nat Genet. PolyPred is described in Weissbrod*, Kanai*, Shi* et al. 2022 Nat Genet.

PolyFun estimates prior causal probabilities for SNPs, which can then be used by fine-mapping methods like SuSiE or FINEMAP. Unlike previous methods for functionally-informed fine-mapping, PolyFun can aggregate polygenic data from across the entire genome and hundreds of functional annotations.

PolyLoc generalizes fine-mapping by constructing minimal sets of SNPs that causally explain a given proportion (e.g. 50%) of SNP heritability.

PolyPred exploits fine-mapping to improve cross-population polygenic risk scores, by predicting with causal effect estimates instead of tagging effect estimates.

We also provide a script called finemapper that facilitates fine-mapping with methods like SuSiE, sparing you many of the preprocessing steps often required to perform fine-mapping (e.g. handling allelic flips between the summary statistics and the reference genotypes).
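For illustration, a typical invocation looks like the following (a sketch modeled on the wiki examples reproduced further down this page; the paths and sample size are placeholders):

python finemapper.py \
    --ld <ld_prefix> \
    --sumstats <sumstats_file> \
    --n <gwas_sample_size> \
    --chr 1 \
    --start 46000001 \
    --end 49000001 \
    --method susie \
    --max-num-causal 5 \
    --out finemap.1.46000001.49000001.gz

See the Wiki page for the full set of flags.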

The files in the ldstore directory are an adaptation of the ldstore package (written by Christian Benner) to Python 3.



Manual

We provide a detailed manual of PolyFun, PolyLoc and PolyPred in the Wiki page. If you run into any issues, please check the FAQ first.



Installation

We provide several installation options.

Install option 1: Create an Anaconda environment

The easiest way to install polyfun is by creating a dedicated environment through the Anaconda Python distribution. To do this, please install Anaconda on your machine and then type the following commands:

git clone https://github.com/omerwe/polyfun
cd polyfun
conda env create -f polyfun.yml
conda activate polyfun

Note: You can speed up the installation by ~100x by installing mamba and then replacing conda with mamba in the commands above.

This will install all the dependencies except for FINEMAP and LDstore. This allows you to perform fine-mapping using SuSiE, but not using FINEMAP. Please see installation instructions for FINEMAP and LDstore below.

After the installation, you can always invoke the PolyFun environment with the command conda activate polyfun. We recommend that you frequently make sure you have the latest version of polyfun installed by going to the polyfun directory and typing git pull.

If you have any trouble with the conda environment you created, you can instead try installing the dependencies using the conda lockfile polyfun.yml.lock. This installs software versions that are known to work, but they will be outdated compared to installing a fresh conda environment.

mamba create --name polyfun-lock --file polyfun.yml.lock
conda activate polyfun-lock

Install option 2: Manually install packages

PolyFun and PolyLoc are designed for Python >=3.6 and require the following freely available Python packages: numpy, scipy, scikit-learn, pandas, tqdm, pyarrow, bitarray, networkx and pandas-plink.

It is recommended (but not required) to also install rpy2 (together with R version 3.5.1 or higher) and the R package Ckmeans.1d.dp.

If rpy2 or Ckmeans.1d.dp are not installed, PolyFun and PolyLoc will fall back to suboptimal clustering via scikit-learn. If you'd like to use FINEMAP instead of SuSiE for fine-mapping, you will also require:

  1. FINEMAP v1.4.1.
  2. (optional) The program LDstore 2.0 for computing LD directly from .bgen files (imputed genotypes)

Please see installation instructions for these packages below.

We recommend running PolyFun/PolyLoc via the Anaconda Python distribution. In Anaconda, you can install all the Python packages with the command "conda install <package_name>". Alternatively, the Python packages can be installed with the command "pip install --user <package_name>".

Once all the prerequisite packages are installed, you can install PolyFun/PolyLoc on a git-enabled machine by typing:

git clone https://github.com/omerwe/polyfun

We recommend that you frequently make sure you have the latest version of polyfun installed by going to the polyfun directory and typing git pull.

Installing FINEMAP v1.4.1

To install FINEMAP v1.4.1, please run one of the following command pairs.
If you use Linux:

wget http://christianbenner.com/finemap_v1.4.1_x86_64.tgz
tar xvf finemap_v1.4.1_x86_64.tgz

If you use Mac OS X:

wget http://christianbenner.com/finemap_v1.4.1_MacOSX.tgz
tar xvf finemap_v1.4.1_MacOSX.tgz

Installing LDstore 2.0

To install LDstore 2.0, please run one of the following command pairs.
If you use Linux:

wget http://www.christianbenner.com/ldstore_v2.0_x86_64.tgz
tar xzvf ldstore_v2.0_x86_64.tgz

If you use Mac OS X:

wget http://www.christianbenner.com/ldstore_v2.0_MacOSX.tgz
tar xzvf ldstore_v2.0_MacOSX.tgz



Testing the installation

We recommend testing PolyFun by invoking the script:

python test_polyfun.py --python3 <python3_exe>

where python3_exe (optional) is the command you type to start a python3 session (default is python). If the script completes without an error, everything is fine. If you see any errors, please consult the FAQ.

To test FINEMAP integration, provide the path to the executable when invoking the script (if not provided, the script will not test FINEMAP integration):

python test_polyfun.py --python3 <python3_exe> --finemap-exe <finemap_exe>



Contact

For questions and comments, please open a Github issue.


polyfun's Issues

bitarray dependency

The python module bitarray also seems to be required for polyfun (in step 2 of Approach 2), but isn't mentioned in the README.

compute_ldscores.py: line 200 TypeError: can't multiply sequence by non-int of type 'float'

Hello,

I am trying to run compute_ldscores.py to calculate LD scores for my annotations. I get an error at the step [INFO] Applying initial ldscores loop; this is the error:

Traceback (most recent call last):
  File "compute_ldscores.py", line 190, in <module>
    df_ldscores = compute_ldscores(args)
  File "compute_ldscores.py", line 149, in compute_ldscores
    ldscores = geno_array.ldScoreVarBlocks(block_left, args.chunk_size, annot=annot_values)
  File "/home/doug/Documents/Polyfun/polyfun/ldsc_polyfun/ldscore.py", line 127, in ldScoreVarBlocks
    return self.corSumVarBlocks(block_left, c, func, snp_getter, annot)
  File "/home/doug/Documents/Polyfun/polyfun/ldsc_polyfun/ldscore.py", line 200, in corSumVarBlocks
    cor_sum[l_A:l_A+b, :] += np.dot(rfuncAB, annot[l_B:l_B+c, :])
  File "<__array_function__ internals>", line 6, in dot
TypeError: can't multiply sequence by non-int of type 'float'

I am not sure what is causing this error, do you have any ideas?

This is the command I am using:

python compute_ldscores.py \
    --bfile ../AD31chr1polyfun \
    --annot ../gFIUDWchr1anno.gz \
    --out ../chr1.parquet \
    --ld-wind-kb 2000

where AD31chr1polyfun is the stem of the PLINK-format .bed/.bim/.fam files. The SNP name format in the PLINK files is chr:pos:A1:A2, where A1 and A2 are ordered alphabetically.

The format of gFIUDWchr1anno.gz looks like this
SNP CHR BP A1 A2 rsID Z N ANNO1 ANNO2
1:100000012:G:T 1 100000012 T G rs10875231 -1.2578195807 652942.54 1 0

Both files contain exactly the same SNPs. Any ideas what I can do to resolve the problem?

Cheers,
Doug

LD directly from alkesgroup server

Hey Omer,

I've been looking for ways to save space on my computer when using the UKB LD files, so I extended the load_ld() function in run_finemapper.py to be able to download LD directly from the alkesgroup server.

I also added a helper function find_ld_prefix() to identify which LD file you need based on some info.

Just wanted to share these in case you found them helpful to integrate into PolyFun.

Best,
Brian

import numpy as np
import pandas as pd
import logging
import os
import scipy.sparse as sparse
from io import BytesIO
import requests


def find_ld_prefix(chrom, min_pos):
    # min_pos = 40590585; chrom=12
    alkes_url = "https://data.broadinstitute.org/alkesgroup/UKBB_LD"
    bp_starts = list(range(1, 252000001, 1000000))
    bp_ends = [x+3000000 for x in bp_starts]
    i = max([i for i,x in enumerate(bp_starts) if x<=min_pos])
    prefix = "chr%d_%d_%d" % (chrom, bp_starts[i], bp_ends[i])
    ld_prefix = os.path.join(alkes_url, prefix)
    return ld_prefix

def load_ld(ld_prefix, server=True):
    # ld_prefix="https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr12_41000001_44000001"
    # load SNPs info
    snps_filename_parquet = ld_prefix + '.parquet'
    snps_filename_gz = ld_prefix + '.gz'

    # note: mixing "|" with "==" binds in an unintended order; read the local
    # parquet file when not using the server, else stream the .gz from the URL
    if (not server) and os.path.exists(snps_filename_parquet):
        df_ld_snps = pd.read_parquet(snps_filename_parquet)
    elif server or os.path.exists(snps_filename_gz):
        df_ld_snps = pd.read_table(snps_filename_gz, delim_whitespace=True)
    else:
        raise ValueError('couldn\'t find SNPs file %s or %s' % (snps_filename_parquet, snps_filename_gz))

    # load LD matrix
    R_filename = ld_prefix + '.npz'
    if server:
        print("Reading .npz file from alkesgroup server")
        r = requests.get(R_filename, stream=True)
        R = sparse.load_npz(BytesIO(r.raw.read())).toarray()
    else:
        if not os.path.exists(R_filename):
            raise IOError('%s not found' % (R_filename))
        R = sparse.load_npz(R_filename).toarray()
    R = R + R.T
    assert np.allclose(np.diag(R), 1.0)
    logging.info('Loaded an LD matrix for %d SNPs from %s' % (R.shape[0], R_filename))

    # sanity checks
    assert R.shape[0] == R.shape[1]
    if R.shape[0] != df_ld_snps.shape[0]:
        raise ValueError('LD matrix has a different number of SNPs than the SNPs file')

    return R, df_ld_snps
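For example, the two functions can be chained like so (an untested sketch):

ld_prefix = find_ld_prefix(chrom=12, min_pos=40590585)
R, df_ld_snps = load_ld(ld_prefix, server=True)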

padding zero for chromosome

Hi Omer,

finemapper.py pads a 0 in front of chromosome numbers 1-9; this creates a SNP mismatch between the .bcor file and the input passed from the finemapper.py script during the execution of FINEMAP. There should probably be a note in the Wiki about it.
If the .bcor file is prepared independently of finemapper.py, then the chromosome can be represented as either 1 or 01.

if df_z['CHR'].iloc[0]<10: df_z['CHR'] = '0' + df_z['CHR'].astype(str)
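A minimal sketch of a workaround (hypothetical; zero-pad the chromosome field on the .bcor side so it matches the values produced by finemapper.py):

# df_bcor_snps is a hypothetical DataFrame of SNP metadata from the .bcor file
df_bcor_snps['CHR'] = df_bcor_snps['CHR'].astype(str).str.zfill(2)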

SNP

Thank you for your answer. When I was using PolyFun, I found that there were more than 9 million SNPs in my summary statistics file. I used the most complete reference files you provide, but I only took the first 5 annotations due to memory constraints, and many SNPs were dropped after each successive merge with the weights and LD scores. After the chi^2 step, only just over 1,000 SNPs remain, which is too few. Could you please explain the reason? How is it possible to retain more SNPs and get better results?

Handling case-control sum stats

munge_sumstats from LDSC handles (log) odds ratios in place of beta. Is it intentional to not handle this in PolyFun?

Thanks.
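As a stopgap, the odds ratios can be converted to log-odds effect sizes before munging (a sketch; it assumes an OR column and that the reported SE is already on the log-odds scale, as is typical):

import numpy as np

df_sumstats['BETA'] = np.log(df_sumstats['OR'])  # beta = log odds ratio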

Column merging

extract_snpvar.py

For scenarios where someone's summary stats file has both SNP and CHR/POS, I modified the merging procedure so that column names don't get duplicated.
lines 66-76

# merge the dfs
# BMS edits
logging.info('Merging metadata files by...')
if all(x in df_snps.columns for x in ['SNP', 'BP', 'CHR']):
    logging.info('   SNP and position.')
    df = df_meta.merge(df_snps, on=['SNP', 'A1', 'A2', 'BP', 'CHR'], how='inner')
elif 'SNP' in df_snps.columns:
    logging.info('   SNP.')
    df = df_meta.merge(df_snps, on=['SNP', 'A1', 'A2'], how='inner')
else:
    logging.info('   position.')
    df = df_meta.merge(df_snps, on=['CHR', 'BP', 'A1', 'A2'], how='inner')

susieR update

Hey Omer,

It looks like susieR has been updated so that in the newest version susie_bhat() is no longer a function (it was merged into the new function susie_suff_stat()). Unfortunately this change causes an error in run_finemapper.py:

python ./echolocatoR/tools/polyfun/run_finemapper.py --ld Data/GWAS/Nalls23andMe_2019/LRRK2/plink/chr12_40000001_43000001 --sumstats Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/output/PD_GWAS.12.snpvar_constrained.gz --n 1474097 --chr 12 --start 40515388 --end 40713899 --method susie --max-num-causal 10 --out Data/GWAS/Nalls23andMe_2019/LRRK2/finemap.UKBB.LRRK2.gz
*********************************************************************
* Fine-mapping Wrapper
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************

[INFO]  Loaded an LD matrix for 22522 SNPs from Data/GWAS/Nalls23andMe_2019/LRRK2/plink/chr12_40000001_43000001.npz
[INFO]  Loading sumstats file...
[INFO]  Loaded sumstats for 373589 SNPs
[1] "/usr/local/anaconda3/envs/polyfun_venv/lib/R"

[INFO]  Removing 21839 SNPs in the LD matrix that are not in the sumstats file
[WARNING]  Flipping the alleles of 380 SNPs in the LD matrix that are flipped compared to the sumstats file
[INFO]  Starting functionally-informed SuSiE fine-mapping for chromosome 12 BP 40515388-40713899 (683 SNPs)
Traceback (most recent call last):
  File "./echolocatoR/tools/polyfun/run_finemapper.py", line 118, in <module>
    verbose=args.verbose, ld=ld, df_ld_snps=df_ld_snps)
  File "/Users/schilder/Desktop/Fine_Mapping/echolocatoR/tools/polyfun/finemapper.py", line 492, in finemap
    susie_obj = self.susieR.susie_bhat(
AttributeError: module 'susieR' has no attribute 'susie_bhat'

To handle this, in my R scripts I wrote a little snippet of code that searches for the new function and if it can't find it, uses the old one. Maybe you could include a pythonic version of this in your script?:

susie_func <- if (length(find("susie_bhat")) == 0) {
  susieR::susie_suff_stat
} else {
  susieR::susie_bhat
}
fitted_bhat <- susie_func(...)
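A pythonic equivalent of the same fallback (a sketch using rpy2 attribute lookup, not the repo's actual code) could be:

# prefer the new susieR entry point; fall back to the old name if absent
susie_func = getattr(self.susieR, 'susie_suff_stat', None)
if susie_func is None:
    susie_func = self.susieR.susie_bhat
susie_obj = susie_func(
    bhat=bhat.reshape((m,1)),
    shat=np.ones((m,1)),
    R=self.df_ld.values,
    n=self.n,
    L=num_causal_snps)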

Moreover, as a way of dealing with these little dependency/versioning issues, would it be possible for you to provide a yaml file that contains all of the python, R, and other packages necessary for running PolyFun? This would be a really great way to make the right Conda environment and get started with PolyFun right away. Here's one example from a related repo by OpenTargets.

Thanks,
Brian

Max causal N vs. Exact Causal N

In regards to the --max-num-causal argument described here:

From my understanding, SuSiE takes the max number of causal SNPs as a parameter and iteratively fine-maps all scenarios from 1 up to that maximum to identify the optimal number of causal SNPs for the given data, whereas FINEMAP takes the exact number of SNPs. Do I have this mixed up?

Compute LD-scores for each SNP bin using .bcor files

Hi,

I am interested in creating priors with PolyFun using approach 3. I would like to pass estimates of LD from the data I have, stored in .bcor files. Is this possible? The --ld-ukb flag does something similar, but seems to be set up for using UK Biobank specifically. I can see how to do the finemapping step with my bcor LD estimates, but I suspect accurate LD estimation in creating the priors will also be very important.

Thanks,

Joni

Calculating _M files to accompany .l2.ldscore files

Should these be calculated as part of my compute_ldscores.py run? I thought it was just something like the number of SNPs in each annotation, but I'm not sure what the equivalent is for continuous annotations. I'm not finding the answer easily in the LDSC wiki either. Thanks!
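For what it's worth, LDSC's .l2.M files are a single tab-delimited row of per-annotation totals: the sum of each annotation column, which reduces to a SNP count for binary annotations. A sketch of computing one (the file names are placeholders):

import pandas as pd

df_annot = pd.read_parquet('annotations.1.annot.parquet')  # placeholder path
M = df_annot.drop(columns=['CHR', 'SNP', 'BP', 'A1', 'A2']).sum(axis=0)
M.to_frame().T.to_csv('annotations.1.l2.M', sep='\t', header=False, index=False)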

assert len(np.unique(chr_num)) > 1 AssertionError

Hi,

I ran the sample data successfully. Then I tried to analyze a significant hit using downloaded UKB data. After renaming the files to match the example_data names, I am able to run the first step:

python munge_polyfun_sumstats.py \
    --sumstats chr19.sumstat.txt.gz \
    --n 637154 \
    --out chr19.sumstat_munged.parquet \
    --min-info 0.6 \
    --min-maf 0.001

python polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix 19_40853017_41853107_sumstats_munged \
    --sumstats 19_40853017_41853107_sumstats_munged.parquet \
    --ref-ld-chr ../baselineLF2.2.UKB/annotations. \
    --w-ld-chr ../baselineLF2.2.UKB/weights.


*********************************************************************
* PolyFun (POLYgenic FUNctionally-informed fine-mapping)
* Version 1.0.0
* (C) 2019-2021 Omer Weissbrod
*********************************************************************

[INFO] Reading summary statistics from 19_40853017_41853107_sumstats_munged.parquet ...
[INFO] Read summary statistics for 5042 SNPs.
[INFO] Reading reference panel LD Score from ../baselineLF2.2.UKB/annotations.[1-22] ...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [05:01<00:00, 13.69s/it]
[INFO] Read reference panel LD Scores for 19386297 SNPs.
[INFO] Reading regression weight LD Score from ../baselineLF2.2.UKB/weights.[1-22] ...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [02:08<00:00, 5.83s/it]
[INFO] Read regression weight LD Scores for 18275613 SNPs.
[INFO] After merging with reference panel LD, 4791 SNPs remain.
[INFO] After merging with regression SNP LD, 4784 SNPs remain.
[WARNING] number of SNPs is smaller than 200k; this is almost always bad.
Traceback (most recent call last):
  File "polyfun/polyfun.py", line 848, in <module>
    polyfun_obj.polyfun_main(args)
  File "polyfun/polyfun.py", line 772, in polyfun_main
    self.polyfun_h2_L2(args)
  File "polyfun/polyfun.py", line 594, in polyfun_h2_L2
    self.run_ldsc(args, use_ridge=True, nn=False, evenodd_split=False, keep_large=False)
  File "polyfun/polyfun.py", line 231, in run_ldsc
    nnls_exact=args.nnls_exact
  File "polyfun/ldsc_polyfun/regressions.py", line 414, in __init__
    nnls_exact=nnls_exact
  File "/group/tools/polyfun/ldsc_polyfun/regressions.py", line 257, in __init__
    skip_ridge_jackknife=skip_ridge_jackknife, num_chr_sets=num_chr_sets)
  File "polyfun/ldsc_polyfun/jackknife.py", line 582, in __init__
    assert len(np.unique(chr_num)) > 1
AssertionError

Any suggestion?

zcat vs. gzcat

In the README, you say to use zcat to view .gz files, but only gzcat seems to work for me. Could this just be a Mac vs. Windows thing?

Versioning details:
Apple gzip 272.250.1
MacBook Pro, Mojave V10.14.5

sample size N

Hello, I want to use PolyFun to calculate posterior probabilities, but when running method 2 of the first step, the code always shows a dimension error. I don't understand the relationship between the dimension and the sample size N; what does this sample size N refer to?

Regularization and continuous annotations

I'm trying to understand how the regularization is affected by the magnitude of the coefficient for continuous annotations (which is affected by the scale at which the continuous annotation is reported). Should novel continuous annotations be normalized prior to inclusion in polyfun?
Thanks!
Gabe
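For context, a sketch of what standardizing a continuous annotation before building the .annot files might look like (hypothetical; assumes a column named ANNO1 as in the annotation example earlier on this page):

# z-score the continuous annotation so its coefficient is on a per-SD scale
df_annot['ANNO1'] = (df_annot['ANNO1'] - df_annot['ANNO1'].mean()) / df_annot['ANNO1'].std()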

Add print-coefficient flag

Hey Omer,

Awesome tool. Really enjoying it so far in our simulations. It would be great to include a flag to output the penalized taus, similar to the --print-coefficient flag in LDSC.

Thanks!

ModuleNotFoundError: No module named 'compute_ldscores_ukb'

(Sorry, this one is my fault following #34...!)

When trying to run create_finemapper_jobs.py, it fails with the following error:

File "resources/finemapping/polyfun/create_finemapper_jobs.py", line 6, in
from polyfun import configure_logger, check_package_versions
File "/scratch/mdd-meta/resources/finemapping/polyfun/polyfun.py", line 13, in
from compute_ldscores_ukb import compute_ldscores_chr
ModuleNotFoundError: No module named 'compute_ldscores_ukb'

I think this is because "compute_ldscores_ukb.py" is now "compute_ldscores_from_ld.py"?
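A version-tolerant import (a sketch based on the rename described above) would be:

try:
    from compute_ldscores_from_ld import compute_ldscores_chr  # newer file name
except ImportError:
    from compute_ldscores_ukb import compute_ldscores_chr      # older file name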

polyfun+ finemap

Hello, I used PolyFun to obtain prior probability files for more than 800,000 SNPs across the 22 chromosomes of my schizophrenia GWAS, and then wanted to calculate posterior probabilities. However, I see that the reference files you provide are divided into chromosome segments; could some segments end up containing no SNPs? I tested my prior probability file with the FINEMAP method of Example 3 in Approach 3, and the posterior probabilities were only 1 or 0. Could you tell me why? I'm also a little confused about the choice of parameters. Is there an efficient way to get the PIPs of all SNPs on all 22 chromosomes?


Column renaming

munge_polyfun_sumstats.py

It seems SNP was missing from the columns that get renamed. Added this.
lines 75-77

# BMS edit: SNP col wasn't being renamed below
return df.rename(columns={snp_column: 'SNP', allele1_col: 'A1', allele0_col: 'A2',
                          a1freq_col: 'MAF', bp_column: 'BP', chr_column: 'CHR',
                          info_col: 'INFO', beta_col: 'BETA', se_col: 'SE',
                          pvalue_col: 'P', z_col: 'Z'}, errors='ignore')

RNULLType error

Hi!

I'm trying to run the following example from the Wiki page:

#run fine-mapper
python3 run_finemapper.py \
    --ld LD_temp/chr1_46000001_49000001 \
    --sumstats example_data/chr1.finemap_sumstats.txt.gz \
    --n 383290 \
    --chr 1 \
    --start 46000001 \
    --end 49000001 \
    --method susie \
    --max-num-causal 5 \
    --out output/finemap.UKB.1.46000001.49000001.gz

...but I'm receiving the following error:

*********************************************************************
* Fine-mapping Wrapper
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************

[INFO]  Loaded an LD matrix for 19286 SNPs from LD_temp/chr1_46000001_49000001.npz
[INFO]  Loading sumstats file...
[INFO]  Loaded sumstats for 966 SNPs
Traceback (most recent call last):
  File "run_finemapper.py", line 115, in <module>
    cache_dir=args.cache_dir)
  File "/Users/schilder/Desktop/Fine_Mapping/echolocatoR/tools/polyfun/finemapper.py", line 428, in __init__
    self.RNULLType = rpy2.rinterface.RNULLType
AttributeError: module 'rpy2.rinterface' has no attribute 'RNULLType'

I checked and rpy2 is up to date for both Python 2 and 3.7 (the latter of which I'm using to run the command).
Here's my versioning details:

  • OS
    • MacBook Pro, Mojave V10.14.5
  • Python
    • rpy2==3.1.0
  • R
    • R version 3.6.1 (2019-07-05), Action of the Toes

Thanks in advance,
Brian

load_ld() - feature request

run_finemapper.py

load_ld()

Hey Omer,

Finally getting around to trying the finemapper script. One thing that might be nice is being able to specify different locations/prefixes for LD and SumStats files (I noticed it uses ld_prefix for both). In my case, I have sum stats from the PD_GWAS and LD from UKBB.
E.g.:

def load_ld(ld_prefix, sumstats_prefix):
...
...

Thanks,
Brian

susie wrapper error

I get this error when running finemapper.py with SuSiE. It seems that df_ld_snps and df_sumstats have the same snpids but in a different order. I sorted snpvar_ridge_constrained.gz by position, which fixed the problem, but I'm not sure why this file is not sorted; my munged sumstats are sorted by chr and position...

[INFO]  Computing LD matrix for chromosome 17 BP 29786491-31538425
[INFO]  Running LDStore...
[INFO]  done in 7.71 seconds
[INFO]  Running LDStore meta...
[INFO]  Extracting LD matrix to a text file
[INFO]  Running LDStore LD extraction...
[INFO]  done in 15.35 seconds
Traceback (most recent call last):
  File "/project2/xinhe/software/polyfun/run_finemapper.py", line 124, in <module>
    verbose=args.verbose, ld=ld, df_ld_snps=df_ld_snps, debug_dir=args.debug_dir)
  File "/project2/xinhe/software/polyfun/finemapper.py", line 401, in finemap
    assert np.all(self.df_ld_snps['SNP'] == self.df_sumstats_locus['SNP'])
  File "/project2/xinhe/software/miniconda3/envs/polyfun/lib/python3.6/site-packages/pandas/core/ops/common.py", line 64, in new_method
    return method(self, other)
  File "/project2/xinhe/software/miniconda3/envs/polyfun/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 521, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects

original 1.snpvar_ridge_constrained.gz

CHR     SNP     BP      A1      A2      SNPVAR  Z       N
1       rs10875231      100000012       T       G       1.6348e-08      1.2727e+00      80000
1       rs186077422     10000006        A       G       1.8599e-08      -1.7100e+00     80000
1       rs6678176       100000827       T       C       2.2472e-08      1.1075e+00      80000
1       rs78286437      100000843       T       C       1.6348e-08      2.5414e-01      80000
1       rs144406489     100001138       A       G       1.6348e-08      1.9074e-02      80000

sorted by BP column:

CHR     SNP     BP      A1      A2      SNPVAR  Z       N
1       rs575272151     11008   C       G       2.3577e-08      -1.0408e+00     80000
1       rs544419019     11012   C       G       2.3577e-08      -1.0408e+00     80000
1       rs540538026     13110   A       G       3.8092e-08      -1.1355e+00     80000
1       rs62635286      13116   T       G       3.6471e-08      -1.9663e-01     80000
1       rs200579949     13118   A       G       3.6471e-08      -1.9663e-01     80000

Thanks.
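A sketch of the sorting workaround described above (assuming the snpvar file is tab-delimited, as shown):

import pandas as pd

df = pd.read_table('1.snpvar_ridge_constrained.gz', sep='\t')
df.sort_values(['CHR', 'BP'], inplace=True)
df.to_csv('1.snpvar_ridge_constrained.sorted.gz', sep='\t', index=False)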

/tmp filled up quickly by genome wide fine mapping

Hello,

Thanks for create_finemapper_jobs.py; genome-wide fine-mapping has become much easier. At the same time, I got an email from our system admin that I quickly filled the /tmp folder with ld.npz files, so nothing else could be dispatched. I can clean up /tmp after each job finishes, but I thought it might be a good idea to have cleanup code inside finemapper.py itself. I used SuSiE in finemapper.py.

Best regards,

Yunling

Susie Update

Hi Omer,

The SuSiE code base has been improved, and the current version adds a bunch of bug fixes and performance improvements.
One of the things that has changed is the susie_bhat function:

        except:
            susie_obj = self.susieR.susie_bhat(
                    bhat=bhat.reshape((m,1)),
                    shat=np.ones((m,1)),
                    R=self.df_ld.values,
                    n=self.n,
                    L=num_causal_snps,
                    scaled_prior_variance=(0.0001 if (prior_var is None) else prior_var),
                    estimate_prior_variance=(prior_var is None),
                    residual_variance=(self.R_null if (residual_var is None) else residual_var),
                    estimate_residual_variance=(residual_var is None),
                    verbose=verbose,
                    prior_weights=(prior_weights.reshape((m,1)) if use_prior_causal_prob else self.R_null)
                )

which has been removed from the package; susie_rss is used instead. This will throw an error for people using the newer susieR version whenever the code reaches this except branch.

Per Locus

Kind of a random (and possibly dumb) question: does it make any sense to run LDSC on a single user-defined locus (e.g. a locus from a GWAS that spans anywhere between 500 and 5000 bps)? Or does LDSC always require genome-wide data?

missing SNPs

So I'm on step 2 of PolyFun approach 3 and I'm getting the following error (despite having the --allow-missing flag on). I'm using 1KG Phase 1 as a reference (--bfile-chr) at the moment. Do you know what might be going on?

python ./echolocatoR/tools/polyfun/polyfun.py --compute-ldscores --output-prefix ./Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/output/PD_GWAS --bfile-chr /sc/orga/projects/pd-omics/data/1000_Genomes/Phase1/1000G.mac5eur. --chr 12 --allow-missing
*********************************************************************
* PolyFun (POLYgenic FUNctionally-informed fine-mapping)
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************

[WARNING]  no ld-wind argument specified.  PolyFun will use --ld-cm 1.0
Traceback (most recent call last):
  File "./echolocatoR/tools/polyfun/polyfun.py", line 902, in <module>
    polyfun_obj.polyfun_main(args)
  File "./echolocatoR/tools/polyfun/polyfun.py", line 835, in polyfun_main
    self.compute_ld_scores(args)
  File "./echolocatoR/tools/polyfun/polyfun.py", line 697, in compute_ld_scores
    df_ldscores_chr = self.compute_ldscores_plink_chr(args, chr_num, df_bins_chr)
  File "./echolocatoR/tools/polyfun/polyfun.py", line 747, in compute_ldscores_plink_chr
    raise ValueError('Not all SNPs were assigned a bin (meaning some SNPS are not in the annotation files)')
ValueError: Not all SNPs were assigned a bin (meaning some SNPS are not in the annotation files)

Handling missing SNPs (polyfun.py)

polyfun.py

When df_snpvar contains SNPs that aren't in df_sumstats, the script throws an error and doesn't continue. I added several lines to simply remove the extra SNPs from df_snpvar and continue.
lines 634-639

    if df_snpvar.shape[0] < df_sumstats.shape[0]:
        # raise ValueError('not all SNPs in the sumstats file are also in the annotations file')
        # BMS edit: remove the extra SNPs instead of stopping
        logging.info('Not all SNPs in the sumstats file are also in the annotations file.')
        snp_filt = df_snpvar.SNP.isin(df_sumstats.SNP)
        logging.info('Removing %d extra SNPs...' % (~snp_filt).sum())
        df_snpvar = df_snpvar.loc[snp_filt, :]

In my case, this edit resulted in the removal of 34872 SNPs (using the sample annotations downloaded with the repo and Nalls et al GWAS summary stats).

lower case alleles

Hi Omer, I recently found a bug in the munge function: A1 and A2 alleles that are lowercase cause all SNPs to be thrown out during fine-mapping. The failure showed up during SuSiE fine-mapping, when 0 SNPs were left after overlapping with the reference and LD matrices; it was raised as a warning rather than an error. I propose making all alleles uppercase to match the UKB SNPs. I tested this, and it saved 3 of the 20 GWAS sumstats I'm working with.

Also btw, pandas v1.0.0 doesn't seem to work with one obscure polyfun_utils.py function due to new syntax. Works with later versions.

def rename_df_columns(df_sumstats):
    chr_column = find_df_column(df_sumstats, ['CHR', 'CHROMOSOME', 'CHROM'])
    bp_column = find_df_column(df_sumstats, ['BP', 'POS', 'POSITION', 'COORDINATE', 'BASEPAIR'])
    snp_column = find_df_column(df_sumstats, ['SNP', 'RSID', 'RS', 'NAME','MarkerName'])
    a1freq_col = find_df_column(df_sumstats, ['A1FREQ', 'freq', 'MAF', 'FRQ'], allow_missing=True)
    info_col = find_df_column(df_sumstats, 'INFO', allow_missing=True)
    beta_col = find_df_column(df_sumstats, ['BETA', 'EFF', 'EFFECT', 'EFFECT_SIZE', 'OR'], allow_missing=True)
    se_col = find_df_column(df_sumstats, ['SE'], allow_missing=True)
    pvalue_col = find_df_column(df_sumstats, ['P_BOLT_LMM', 'P', 'PVALUE', 'P-VALUE', 'P_value', 'PVAL'], allow_missing=True)
    z_col = find_df_column(df_sumstats, ['Z', 'ZSCORE', 'Z_SCORE'], allow_missing=True)    
    n_col = find_df_column(df_sumstats, ['N', 'sample_size'], allow_missing=True)    
    ncase_col = find_df_column(df_sumstats, ['N_cases', 'Ncase', 'Nca','Total_NCase'], allow_missing=True)    
    ncontrol_col = find_df_column(df_sumstats, ['N_controls', 'Ncontrol','Nco','Total_NControl'], allow_missing=True)    
    try:
        allele1_col = find_df_column(df_sumstats, ['ALLELE1', 'A1', 'a_1'])
        allele0_col = find_df_column(df_sumstats, ['ALLELE0', 'A0', 'a_0'])
    except ValueError:
        allele1_col = find_df_column(df_sumstats, ['ALLELE1', 'A1'])
        allele0_col = find_df_column(df_sumstats, ['ALLELE2', 'A2'])
    # make alleles uppercase
    df_sumstats[allele1_col] = df_sumstats[allele1_col].str.upper()
    df_sumstats[allele0_col] = df_sumstats[allele0_col].str.upper()
    # rename columns
    return df_sumstats.rename(columns={snp_column:'SNP', allele1_col:'A1', 
            allele0_col:'A2', a1freq_col:'MAF', bp_column:'BP', 
            chr_column:'CHR', info_col:'INFO', beta_col:'BETA', 
            se_col:'SE', pvalue_col:'P', z_col:'Z', n_col:'N',
            ncase_col:'N_CASES', ncontrol_col:'N_CONTROLS'}, errors='ignore')

Enrichment .results file

I noticed in the original S-LDSC that a .results file with the enrichment scores for each annotation is also produced. Is PolyFun also capable of producing this file (or something comparable)?

Thanks,
Brian

From the LDSC wiki:

.results file

This file has the results of the analysis in tab-delimited form. If any category contains all SNPs, then that category will not appear in this file. There is one row for each category and columns summarizing the results: Proportion of SNPs, Proportion of heritability, Enrichment, and standard errors. Enrichment is (Prop. heritability) / (Prop. SNPs). If you use the --print-coefficients flag, then there will also be columns for the regression coefficients. (See Finucane, Bulik-Sullivan et al., bioRxiv for a discussion of the relationship between the coefficients and proportions of heritability.)
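Given that definition, the enrichment column can be recomputed from the proportions directly (a sketch; the file and column names are hypothetical, mirroring LDSC's .results output):

import pandas as pd

df_results = pd.read_table('mytrait.results')  # hypothetical S-LDSC output file
enrichment = df_results['Prop._h2'] / df_results['Prop._SNPs']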

ValueError: FINEMAP object does not support --allow-missing

Hello Omer,

I used UKB LD in my fine-mapping process with the finemap method as follows:

python finemapper.py \
    --ld ../UKB_LD/chr19_41000001_44000001 \
    --sumstats output/ea.bolt.19.snpvar_ridge_constrained.gz \
    --n 80000 \
    --chr 19 \
    --start 40853017 \
    --end 41853107 \
    --method finemap \
    --finemap-exe finemap/default/finemap \
    --max-num-causal 10 \
    --cache-dir LD_cache \
    --out output/finemap.finemap_exe.1.46000001.49000001.gz

I got the following message:

[INFO] Loading LD file ../UKB_LD/chr19_41000001_44000001
[INFO] Done in 45.04 seconds
Traceback (most recent call last):
  File "/group/tools/polyfun/finemapper.py", line 1129, in <module>
    verbose=args.verbose, ld_file=args.ld, debug_dir=args.debug_dir, allow_missing=args.allow_missing, susie_outfile=args.susie_outfile)
  File "/group/tools/polyfun/finemapper.py", line 880, in finemap
    self.sync_ld_sumstats(ld_arr, df_ld_snps, allow_missing=allow_missing)
  File "/group/tools/polyfun/finemapper.py", line 275, in sync_ld_sumstats
    raise ValueError(error_msg)
ValueError: not all SNPs in the sumstats file were found in the LD matrix! You could drop the missing SNPs with the flag --allow-missing, but please note that these omitted SNPs may be

I added the --allow-missing option:

python finemapper.py \
    --ld ../UKB_LD/chr19_41000001_44000001 \
    --sumstats output/ea.bolt.19.snpvar_ridge_constrained.gz \
    --n 80000 \
    --chr 19 \
    --start 40853017 \
    --end 41853107 \
    --method finemap \
    --finemap-exe finemap/default/finemap \
    --max-num-causal 10 \
    --cache-dir LD_cache \
    --out output/finemap.finemap_exe.1.46000001.49000001.gz \
    --allow-missing

I got:
[INFO] Loading sumstats file...
[INFO] Loaded sumstats for 272448 SNPs in 4.43 seconds
Traceback (most recent call last):
  File "/group/tools/polyfun/finemapper.py", line 1129, in <module>
    verbose=args.verbose, ld_file=args.ld, debug_dir=args.debug_dir, allow_missing=args.allow_missing, susie_outfile=args.susie_outfile)
  File "/group/tools/polyfun/finemapper.py", line 841, in finemap
    raise ValueError('FINEMAP object does not support --allow-missing')
ValueError: FINEMAP object does not support --allow-missing

Best regards,

Yunling

Handling missing SNPs (extract_snpvar.py)

extract_snpvar.py

Currently, when "Not all SNPs in the SNPs file were found in the meta file. Wrote a list of missing SNPs" appears, it registers as an error and stops the script. In my fork I modified this so that it simply removes the missing SNPs and continues (lines 99-102).

plink reference files

Hi Omer,

So I finally got my Conda issues sorted out! Since then things have been running pretty smoothly, until I got to this step. I'm realizing it's because I haven't provided any plink reference files.
In the documentation it says to use a closely related reference database (e.g. 1000 Genomes), but since I'm using UKBB for every other step, is there somewhere I can find the UKBB ref files? (either on your server or UKBB's).

Thanks in advance!
Brian

 (polyfun_venv) [schilb03@lh03c03 Fine_Mapping]$ python ./echolocatoR/tools/polyfun/polyfun.py --compute-ldscores --output-prefix ./Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/output/PD_GWAS --bfile-chr /sc/orga/projects/pd-omics/tools/polyfun/annotations/baselineLF2.2.UKB/reference.  --allow-missing
*********************************************************************
* PolyFun (POLYgenic FUNctionally-informed fine-mapping)
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************

[WARNING]  no ld-wind argument specified.  PolyFun will use --ld-cm 1.0
Traceback (most recent call last):
  File "./echolocatoR/tools/polyfun/polyfun.py", line 898, in <module>
    check_files(args)
  File "./echolocatoR/tools/polyfun/polyfun.py", line 130, in check_files
    get_file_name(args, 'bim', chr_num, verify_exists=True)
  File "./echolocatoR/tools/polyfun/polyfun.py", line 238, in get_file_name
    raise IOError('%s file not found: %s'%(file_type, file_name))
OSError: bim file not found: /sc/orga/projects/pd-omics/tools/polyfun/annotations/baselineLF2.2.UKB/reference.1.bim

Error after executing test run

Hi, I am receiving the following error after executing the test run for polyfun. I ran the following command in the polyfun conda environment:

python test_polyfun.py --python3 python --ldstore ldstore_v1.1_x86_64/ldstore

The following was the error message:
Writing fine-mapping results to /tmp/tmp4xe7wax0/finemap.1.46000001.49000001.gz
Traceback (most recent call last):
  File "test_polyfun.py", line 226, in <module>
    test_finemapper(temp_dir, args.ldstore, args.python3)
  File "test_polyfun.py", line 208, in test_finemapper
    compare_dfs(tmpdir, gold_dir, outfile)
  File "test_polyfun.py", line 35, in compare_dfs
    assert np.all(df1[c] == df2[c]), 'found mismatch between %s and %s in column %s'%(file1, file2, c)
AssertionError: found mismatch between /tmp/tmp4xe7wax0/finemap.1.46000001.49000001.gz and /home/tkamath/polyfun/gold/finemap.1.46000001.49000001.gz in column SNP

pyarrow

Hi Omer,

I seem to be having an issue with pyarrow (on our cluster, but not on my laptop) in extract_snpvar.py. On our cluster, I get the error:

from pyarrow import ArrowIOError
ImportError: cannot import name 'ArrowIOError' from 'pyarrow' (unknown location)

Is there a specific version of pyarrow that works best? Here's my Conda env:

(echoR) [schilb03@li03c04 Fine_Mapping]$ conda list
# packages in environment at /sc/orga/work/schilb03/conda/envs/echoR:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                      1_llvm    conda-forge
bitarray                  1.2.1            py38h516909a_0    conda-forge
ca-certificates           2019.11.28           hecc5488_0    conda-forge
certifi                   2019.11.28               py38_0    conda-forge
decorator                 4.4.1                      py_0    conda-forge
fastparquet               0.3.3            py38hc1659b7_0    conda-forge
joblib                    0.14.1                     py_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_8    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libffi                    3.2.1             he1b5a44_1006    conda-forge
libgcc-ng                 9.2.0                h24d8f2e_2    conda-forge
libgfortran-ng            7.3.0                hdf63c60_5    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libllvm8                  8.0.1                hc9558a2_0    conda-forge
libopenblas               0.3.7                h5ec1e0e_7    conda-forge
libstdcxx-ng              9.2.0                hdf63c60_2    conda-forge
llvm-openmp               9.0.1                hc9558a2_2    conda-forge
llvmlite                  0.31.0           py38h8b12597_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
networkx                  2.4                        py_0    conda-forge
numba                     0.48.0           py38hb3f55d8_0    conda-forge
numpy                     1.18.1           py38h95a1406_0    conda-forge
openssl                   1.1.1d               h516909a_0    conda-forge
pandas                    1.0.1            py38hb3f55d8_0    conda-forge
pip                       20.0.2                     py_2    conda-forge
pyarrow                   0.16.0                   pypi_0    pypi
python                    3.8.1                h357f687_2    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
pytz                      2019.3                     py_0    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
scikit-learn              0.22.1           py38hcdab131_1    conda-forge
scipy                     1.4.1            py38h921218d_0    conda-forge
setuptools                45.2.0                   py38_0    conda-forge
six                       1.14.0                   py38_0    conda-forge
sqlite                    3.30.1               hcee41ef_0    conda-forge
tabix                     0.2.6                ha92aebf_0    bioconda
thrift                    0.11.0          py38he1b5a44_1001    conda-forge
tk                        8.6.10               hed695b0_0    conda-forge
tqdm                      4.42.1                     py_0    conda-forge
wheel                     0.34.2                     py_1    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h516909a_1006    conda-forge

Thanks,
Brian

is_indel

Hi Omer,

There is this code block in the polyfun_utils.py that checks for indels.

def set_snpid_index(df, copy=False, allow_duplicates=False):
    if copy:
        df = df.copy()
    is_indel = (df['A1'].str.len()>1) | (df['A2'].str.len()>1)
    df['A1_first'] = (df['A1'] < df['A2']) | is_indel
    df['A1s'] = df['A2'].copy()

The logic here creates a new snpid index, which in the case of indels fails when those indels are flipped in the summary stats compared to the LD panel. Example below:

snpid
22.51164013.T.TGTG                          22_51164013_226995   22  51164013      T                          TGTG

This ends up throwing the warning "X variants with sumstats were not found in the LD file and will be omitted". I am not sure if this is the intended behavior or a bug. If I comment out the is_indel logic in polyfun_utils.py, then finemapper.py identifies the flipped SNPs between the LD panel and the summary stats that are indels. Any help here is appreciated.

Best
Jerome

MAF munging

munge_polyfun_sumstats.py

In line 62 I noticed that when the columns are renamed, freq and MAF seem to be treated the same. But couldn't these two things be different in a summary stats file? Perhaps one way would be to check if 1-freq ≤ .5, and if it is then you know it's the minor allele (and then can flip the ref/alt alleles and effect, though the specifics of this might depend on the particular file format).
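A sketch of that check (hypothetical; assumes the column holds the A1 allele frequency):

import numpy as np

# fold an A1 allele frequency into a minor allele frequency
df_sumstats['MAF'] = np.minimum(df_sumstats['FRQ'], 1.0 - df_sumstats['FRQ'])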

cannot specify specific annotations when using multiple annotation files

I am trying to run LDSC with multiple annotation files. I would like to specify the set of annotations from each file, but I believe there is a bug in the code that does not allow that to happen. I noticed that in the read_ld_sumstats function the following code is problematic:

    if args.anno is not None:
        cols_to_keep = np.zeros(len(ref_ld.columns), dtype=np.bool)        
        annotations = args.anno.split(',')
        if np.any(~np.isin(annotations, ref_ld.columns.str[:-2])):
            raise ValueError('Not all annotations specified with --anno are found in the LD scores file')        
        cols_to_keep = (ref_ld.columns.str[:-2].isin(annotations)) | (ref_ld.columns.isin(['CHR', 'SNP']))
        assert np.sum(cols_to_keep) == len(annotations)+2
        M_cols_to_keep = ref_ld.drop(columns=['CHR', 'SNP']).columns.str[:-2].isin(annotations)
        assert np.sum(M_cols_to_keep) == len(annotations)
        ref_ld = ref_ld.loc[:, cols_to_keep]
        M_annot = M_annot[:, M_cols_to_keep]
        log.log('Keeping only annotations specified with --anno')

the code

ref_ld.columns.str[:-2]

is a problem because if there are multiple files then the ref_ld matrix will have column names with something like "L2_0" or "L2_1" instead of just L2. This causes a mismatch in the annotation names and the column names in the ref_ld column. I thought I could modify it by just putting a -4 throughout that part but that seems to cause an error with the following function call

overlap_matrix, M_tot = _read_annot(args, log)

It's not clear how I can easily fix this problem. I'm wondering if you have any suggestions for this situation.

Chirag Lakhani
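One possible direction (a sketch, not a tested fix) is to strip the numeric suffix together with the L2 marker when matching annotation names, using the ref_ld and annotations variables from the snippet above:

# 'MyAnnotL2' -> 'MyAnnot', 'MyAnnotL2_0' -> 'MyAnnot'
base_names = ref_ld.columns.str.replace(r'L2(_\d+)?$', '', regex=True)
cols_to_keep = base_names.isin(annotations) | ref_ld.columns.isin(['CHR', 'SNP'])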

More munging

munge_polyfun_sumstats.py

PolyFun seems to be working fine on my laptop with the example annotations, but when I try to run it on our cluster with the full annotations I'm running into some errors I can't seem to figure out.

Input:

ml python/3.7.3 && python3 ./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py --sumstats /sc/orga/projects/pd-omics/data/nallsEtAl2019/combined_meta/nallsEtAl2019_allSamples_allVariants.mod.txt --n 1474097 --out ./Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/sumstats_munged.parquet --min-info 0 --min-maf 0

Output:

[INFO]  Reading sumstats file...
./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py:201: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  df_sumstats = pd.read_table(args.sumstats, delim_whitespace=True)

[INFO]  Done in 9.44 seconds
Traceback (most recent call last):
  File "./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py", line 205, in <module>
    df_sumstats = rename_df_columns(df_sumstats, min_info_score=args.min_info, min_maf=args.min_maf)
  File "./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py", line 76, in rename_df_columns
    chr_column:'CHR', info_col:'INFO', beta_col:'BETA', se_col:'SE', pvalue_col:'P', z_col:'Z'}, errors='ignore')
  File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/core/frame.py", line 4025, in rename
    return super(DataFrame, self).rename(**kwargs)
  File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 1072, in rename
    'argument "{0}"'.format(list(kwargs.keys())[0]))
TypeError: rename() got an unexpected keyword argument "errors"

It seems to be related to the sumstats renaming, but I can't find anything wrong with that part of the code.

XtX is not symmetric

While running finemapper.py, susieR sometimes crashes with the error:
rpy2.rinterface_lib.embedded.RRuntimeError: Error in (function (bhat, shat, R, n, var_y = 1, XtX, Xty, yty, maf = NULL, :
XtX is not a symmetric matrix.

I was able to fix it with the code below; it would be good to have a fix for this upstream.

try:
    # symmetrize the LD matrix: keep the upper triangle, mirror it, and correct the diagonal
    ldm = np.triu(self.df_ld.values)
    ldm = ldm + ldm.T - np.diag(np.diag(ldm))

    susie_obj = self.susieR.susie_suff_stat(
        bhat=bhat.reshape((m,1)),
        shat=np.ones((m,1)),
        R=ldm,
        n=self.n,
        L=num_causal_snps,
        scaled_prior_variance=(0.0001 if (prior_var is None) else prior_var),
        estimate_prior_variance=(prior_var is None),
        residual_variance=(self.R_null if (residual_var is None) else residual_var),
        estimate_residual_variance=(residual_var is None),
        verbose=verbose,
        prior_weights=(prior_weights.reshape((m,1)) if use_prior_causal_prob else self.R_null)
    )

Error with overlap annot

Hi, I am running S-LDSC with baseline and my own annotations and am running into the following error:

    overlap_matrix_prop[i, :] = overlap_matrix[i, :] / M_annot
ValueError: operands could not be broadcast together with shapes (1,121) (1,137)

Baseline has 16 annotations and my own annotations have 121. Both have the SNP,CHR,BP,A1,A2 columns. Both have .M files that agree with the number of annotations. Both have the same number of SNPs in each chromosome, and they are all tab-delimited.

There must be some mismatch between the two I suppose. I tried running each individually and there were no problems.

MAF

My schizophrenia summary data does not have a MAF column. Can I simply set the MAF to 0? Is that going to make a big difference?

AttributeError: 'DataFrame' object has no attribute 'ukb'

I am attempting to run step 3. Compute LD-scores for each SNP bin of PolyFun approach 3: Computing prior causal probabilities non-parametrically. My code is essentially the same as the example. The only difference is I have added the flag --allow-missing

mkdir -p LD_cache
python polyfun.py \
    --compute-ldscores \
    --output-prefix output/testrun \
    --ld-ukb \
    --ld-dir LD_cache \
    --chr 1 \
    --allow-missing

When I run this code with the current HEAD commit df5c072, I get the following error message:

Traceback (most recent call last):
  File "bin/polyfun/polyfun.py", line 848, in <module>
    polyfun_obj.polyfun_main(args)
  File "bin/polyfun/polyfun.py", line 776, in polyfun_main
    self.compute_ld_scores(args)
  File "bin/polyfun/polyfun.py", line 637, in compute_ld_scores
    df_ldscores_chr = compute_ldscores_chr(df_bins_chr, ld_dir)
  File "/redacted/bin/polyfun/compute_ldscores_from_ld.py", line 193, in compute_ldscores_chr
    if args.ukb:
  File "/redacted/mambaforge/envs/polyfun/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'ukb'

The issue appears to be that polyfun.py passes a pandas DataFrame:

df_ldscores_chr = compute_ldscores_chr(df_bins_chr, ld_dir)

when compute_ldscores_chr() expects args to be the first argument:

def compute_ldscores_chr(args, df_annot_chr):

I was going to attempt to fix this by passing args to compute_ldscores_chr() and replacing args.ukb with args.ld_ukb. However, I noticed that args.ukb is valid when invoking compute_ldscores_from_ld.py directly, as documented in Computing LD-scores with pre-computed UK Biobank LD matrices.

parser.add_argument('--ukb', default=False, action='store_true', help='if specified, the script will download and use UK Biobank LD files')

Thus it's unclear to me how best to update compute_ldscores_chr() to support both of these use cases. In case it could be helpful, I traced the function signature change from compute_ldscores_chr(df_annot_chr, ld_dir, no_cache=False) to compute_ldscores_chr(args, df_annot_chr) in commit b6674ce, which is the same commit where compute_ldscores_ukb.py was renamed to compute_ldscores_from_ld.py.

input format for summary statistics for munging?

hello there,

Thanks for this wonderful software and wiki. I was able to install it seamlessly using conda.
I have GWAS summary statistics for an admixed minority population. The summary statistics follow a CHR:POS:Allele1:Allele2 format. The data were imputed in-house, so SNP names follow this pattern rather than rsIDs.

I would like to create a parquet-format file for the summary stats.

May I know what column headers are needed by munge_polyfun_sumstats.py? Also, is there a specific column order that the script requires? The wiki mentions:

with SNP rsids, chromosome and base pair info, and either a p-value, an effect size estimate and its standard error, or a Z-score.

The sample data from BOLT-LMM has these columns: SNP, CHR, BP, INFO, BETA, SE

Thanks.
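For reference, a minimal input that munge_polyfun_sumstats.py should accept can be built like this (a sketch; the values mirror the example shown earlier on this page, and the script matches column names against common synonyms rather than requiring a fixed order):

import pandas as pd

df = pd.DataFrame({
    'SNP': ['1:100000012:G:T'],
    'CHR': [1],
    'BP':  [100000012],
    'A1':  ['T'],
    'A2':  ['G'],
    'Z':   [-1.2578],
    'N':   [652943],
})
df.to_csv('my_sumstats.txt.gz', sep='\t', index=False)
# then: python munge_polyfun_sumstats.py --sumstats my_sumstats.txt.gz --out my_sumstats.parquet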
