Giter Site home page Giter Site logo

omerwe / s-pcgc Goto Github PK

View Code? Open in Web Editor NEW
14.0 14.0 3.0 11.25 MB

Heritability, genetic correlation and functional enrichment estimation for case-control studies

License: GNU General Public License v3.0

Python 99.85% Rouge 0.15%
gwas statistical-genetics

s-pcgc's People

Contributors

lrousselet avatar omerwe avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

s-pcgc's Issues

Failed tests with anaconda3

I am not sure how to handle this error:

conda 4.3.30

conda env create PCGC
conda create --name PCGC
source activate PCGC
conda install numpy
conda install scipy
conda install scikit-learn
conda install pandas
conda install pandas-plink
conda install -c conda-forge pandas-plink
conda install tqdm

git clone https://github.com/omerwe/S-PCGC

python test_sumstats.py
temporary directory: /tmp/2qujmo39
Creating summary statistics for study s3...
Command python pcgc_sumstats_creator.py --bfile example/s3 --covar example/s3.cov --prev 0.01 --pheno example/s3.phe --frqfile /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model.1. --annot /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model.1. --out /tmp/2qujmo39/s3.1 --sync /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model. returned 1 with the following stdout:
b'pcgc_sumstats_creator.py:388: DeprecationWarning: np.int is a deprecated alias for the builtin int. To silence this warning, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n df_pheno.iloc[:,-1] = df_pheno.iloc[:,-1].astype(np.int)\npcgc_sumstats_creator.py:50: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n df_covar[c] = df_covar[c].astype(np.float)\n[INFO] reading frq file...\n[INFO] Reading plink file from disk (this may take a while...)\n*********************************************************************\n* S-PCGC sumstats creator\n* Version 2.0.0\n* (C) 2018 Omer Weissbrod\n*********************************************************************\n\n\rMapping files: 0%| | 0/3 [00:00<?, ?it/s]\rMapping files: 100%|\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88\xe2\x96\x88| 3/3 [00:00<00:00, 81.52it/s]\n[INFO] 1500 SNPs remained after removing SNPs without MAFs\npcgc_sumstats_creator.py:373: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n is_relevant_col = np.zeros(df.shape[1], dtype=np.bool)\npcgc_sumstats_creator.py:373: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n is_relevant_col = np.zeros(df.shape[1], dtype=np.bool)\npcgc_sumstats_creator.py:373: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.\nDeprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n is_relevant_col = np.zeros(df.shape[1], dtype=np.bool)\nTraceback (most recent call last):\n File "pcgc_sumstats_creator.py", line 592, in \n sumstats_creator = PCGC_Sumstats(args)\n File "pcgc_sumstats_creator.py", line 75, in init\n self.bfile, df_pheno, df_maf, self.num_snps, self.sample_size = self.read_plink(args, df_pheno, df_maf)\n File "pcgc_sumstats_creator.py", line 486, in read_plink\n is_consistent_snp = (df_maf[allele1_col] == df_bim['a1']) & (df_maf[allele0_col] == df_bim['a0'])\n File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/ops/common.py", line 70, in new_method\n return method(self, other)\n File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/arraylike.py", line 40, in eq\n return self._cmp_method(other, operator.eq)\n File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/series.py", line 5617, in _cmp_method\n raise ValueError("Can only compare identically-labeled Series objects")\nValueError: Can only compare identically-labeled Series objects\n'
Traceback (most recent call last):
File "test_sumstats.py", line 106, in
test_sumstats(temp_dir)
File "test_sumstats.py", line 58, in test_sumstats
run_command(ss_command)
File "test_sumstats.py", line 20, in run_command
raise IOError()
OSError

Unable to use pcgc_r2.py in combination with --extract

Dear Omer,

I'm trying to implement s-PCGC and running the example with 1000G data. I'm able to run all steps with the toy example. I'm also able to tun pcgc_sync.py.

I run into an error with pcgc_r2.py:

[INFO] reading annot file...
/hpc/hers_en/mbakker2/tools/S-PCGC/pcgc_utils.py:56: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_chr = pd.read_table(fname, delim_whitespace=True, index_col=index_col, header=header, usecols=usecols)
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:29: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_sync = pd.read_table(args.sync+'sync', index_col='Category')
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:42: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_extract = pd.read_table(args.extract, squeeze=True, header=None)
[INFO] Extracting 60 SNPs
[INFO] Computing r^2 products for 60 SNPs
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:63: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_fam = pd.read_table(args.bfile+'.fam', header=None)
[INFO] Loading SNP file...
After filtering, 60 SNPs remain
Traceback (most recent call last):
File "/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py", line 115, in
df_prod_r2 = compute_r2_prod(args)
File "/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py", line 73, in compute_r2_prod
df_annotations = df_annotations.iloc[geno_array.kept_snps]
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 2221, in _getitem_axis
return self._get_list_axis(key, axis=axis)
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 2203, in _get_list_axis
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

I used:

chr=22
python $PCGC/pcgc_r2.py \
    --annot $baseline/baselineLD.${chr}. \
    --sync $baseline/baselineLD. \
    --bfile $refDir/1000G_EUR_Phase3_plink/1000G.EUR.QC.${chr} \
    --extract $PCGC/example/good_snps.txt \
    --out $baseline/baselineLD.goodSNPs.${chr}

When I leave out --extract $PCGC/example/good_snps.txt PCGC runs fine and I get the expected output. But for all SNPs and not the 'good SNPs'

I tried to subset good_snps.txt to only the chr22 SNPs, but the same error occurs.

I am using conda to create an environment to run PCGC. This is created using
conda create -n pcgc python=2.7 numpy pandas scikit-learn scipy tqdm bitarray
And conda list -n pcgc gives:

# packages in environment at /hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc:
#
# Name Version Build Channel
atomicwrites 1.3.0 py_0 conda-forge
attrs 19.1.0 py_0 conda-forge
backports_abc 0.5 py_1 conda-forge
bitarray 0.8.3 py27h14c3975_0
blas 1.0 mkl
bokeh 1.0.4 py27_1000 conda-forge
ca-certificates 2019.3.9 hecc5488_0 conda-forge
certifi 2019.3.9 py27_0 conda-forge
cffi 1.12.2 py27hf0e25f4_1 conda-forge
click 7.0 py_0 conda-forge
cloudpickle 0.8.0 py_0 conda-forge
cytoolz 0.9.0.1 py27h14c3975_1001 conda-forge
dask 1.1.4 py_0 conda-forge
dask-core 1.1.4 py_0 conda-forge
distributed 1.26.0 py27_1 conda-forge
freetype 2.9.1 h94bbf69_1005 conda-forge
funcsigs 1.0.2 py_3 conda-forge
futures 3.2.0 py27_1000 conda-forge
heapdict 1.0.0 py27_1000 conda-forge
intel-openmp 2019.1 144
jinja2 2.10 py_1 conda-forge
jpeg 9c h14c3975_1001 conda-forge
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.36 h84994c4_1000 conda-forge
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h648cc4a_1001 conda-forge
locket 0.2.0 py_2 conda-forge
markupsafe 1.1.1 py27h14c3975_0 conda-forge
mkl 2019.1 144
mkl_fft 1.0.10 py27ha843d7b_0
mkl_random 1.0.2 py27hd81dba3_0
more-itertools 4.3.0 py27_1000 conda-forge
msgpack-python 0.6.1 py27h6bb024c_0 conda-forge
ncurses 6.1 he6710b0_1
numpy 1.16.2 py27h7e9f1db_0
numpy-base 1.16.2 py27hde5b4d6_0
olefile 0.46 py_0 conda-forge
openssl 1.1.1b h14c3975_1 conda-forge
packaging 19.0 py_0 conda-forge
pandas 0.24.1 py27he6710b0_0
pandas-plink 1.2.30 py27h14c3975_0 conda-forge
partd 0.3.9 py_0 conda-forge
pathlib2 2.3.3 py27_1000 conda-forge
pillow 5.3.0 py27h00a061d_1000 conda-forge
pip 19.0.3 py27_0
pluggy 0.9.0 py_0 conda-forge
psutil 5.6.1 py27h14c3975_0 conda-forge
py 1.8.0 py_0 conda-forge
pycparser 2.19 py27_1 conda-forge
pyparsing 2.3.1 py_0 conda-forge
pytest 4.3.1 py27_0 conda-forge
pytest-runner 4.4 py_0 conda-forge
python 2.7.15 h9bab390_6
python-dateutil 2.8.0 py27_0
pytz 2018.9 py27_0
pyyaml 5.1 py27h14c3975_0 conda-forge
readline 7.0 h7b6447c_5
scandir 1.10.0 py27h14c3975_0 conda-forge
scikit-learn 0.20.3 py27hd81dba3_0
scipy 1.2.1 py27h7c811a0_0
setuptools 40.8.0 py27_0
singledispatch 3.4.0.3 py27_1000 conda-forge
six 1.12.0 py27_0
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.27.2 h7b6447c_0
tblib 1.3.2 py_1 conda-forge
tk 8.6.8 hbc83047_0
toolz 0.9.0 py_1 conda-forge
tornado 5.1.1 py27h14c3975_1000 conda-forge
tqdm 4.31.1 py_0
wheel 0.33.1 py27_0
xz 5.2.4 h14c3975_1001 conda-forge
yaml 0.1.7 h14c3975_1001 conda-forge
zict 0.1.4 py_0 conda-forge
zlib 1.2.11 h7b6447c_3

How to perform an analysis using just summary statistics

I apologize for my obtusity, but I Can't figure out from the manual how to actually analyze my data.
I have, let's say, two summary statistics. I do not have access to the subject left genotypes or phenotypes. Just the summary statistics (and they were done with Regenie too).

I can build an LD panel from 1000 G (I am a bit confused regarding whether I need to select good SNPs since the summary stats is already seriously filtered) but how to I turn a summary statistics from Regenie into the one that would work for you?

Would it be correct to say that all I need is to preprocess it with LDSC tools? specifically pass them through munge_sumstats.py and then into S-PCGA?

Sincerely

Different shape/ length error

Hey There,

Thanks for your reading. I'm using this package and get an error:

[WARNING] 8634629 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
100%|???????????????????????????????????????????????????????????????????????| 22/22 [16:14<00:00, 44.29s/it]
/S-PCGC/pcgc_main.py:397: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself.
Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
gencov_arr = np.empty((len(pcgc_data_list), len(pcgc_data_list)), dtype=np.object)
Traceback (most recent call last):
File "/S-PCGC/pcgc_main.py", line 857, in
pcgc_obj = SPCGC(args)
File "/S-PCGC/pcgc_main.py", line 402, in init
cov_ij = self.create_cov_obj(args, oi, oj,
File "/pcgc_main.py", line 628, in create_cov_obj
self.compute_taus(args, oi, oj,
File "/S-PCGC/pcgc_main.py", line 753, in compute_taus
z1_anno = df_annotations_sumstats_noneg.values * sumstats1[:, np.newaxis] * np.sqrt(trace_ratios1)
ValueError: operands could not be broadcast together with shapes (8634629,97) (8636723,1)

And with the same data, same codes we get a different error message when performed by another person:

[WARNING] 8636723 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
[INFO] reading annot files...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [21:54<00:00, 59.77s/it]
Traceback (most recent call last):
File "pcgc_main.py", line 857, in
pcgc_obj = SPCGC(args)
File "pcgc_main.py", line 394, in init
self.load_annotations_data(args, df_prodr2, index_intersect)
File "pcgc_main.py", line 488, in load_annotations_data
is_same = (df.index == index_intersect).all()
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 123, in cmp_method
raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare

Do you have any idea how this error comes and how to solve it? Thanks a lot and looking forward to your reply :))

pcgc_sumstats_creator.py : Error after removing "because they have different alleles in plink and MAF files"

When a SNP with different alleles in the plink and MAF files are present they are successfully removed via:

is_consistent_snp = (df_maf[allele1_col] == df_bim['a1']) & (df_maf[allele0_col] == df_bim['a0'])
is_consistent_snp = is_consistent_snp | (df_maf[allele1_col] == df_bim['a0']) & (df_maf[allele0_col] == df_bim['a1'])
if not np.all(is_consistent_snp):
     logging.info('%d SNPs will be removed because they have different alleles in plink and MAF files'%(np.sum(~is_consistent_snp)))
     df_maf = df_maf.loc[is_consistent_snp]
     df_bim = df_bim.loc[is_consistent_snp]

However, this produces an error later in the code due to a size mismatch.

0: Traceback (most recent call last):
0:   File "path/S-PCGC/pcgc_sumstats_creator.py", line 593, in <module>
0:     sumstats_creator.compute_all_sumstats(args.chunk_size)
0:   File "path/S-PCGC/pcgc_sumstats_creator.py", line 271, in compute_all_sumstats
0:     self.set_locus(snp1, snp2)
0:   File "path/S-PCGC/pcgc_sumstats_creator.py", line 318, in set_locus
0:     snp_maf = self.mafs[snp1+j]
0:   File "path/lib/python3.8/site-packages/pandas/core/series.py", line 879, in __getitem__
0:     return self._values[key]
0: IndexError: index 44269 is out of bounds for axis 0 with size 44269

If I force removal of additional SNPs, the index at which the error occurs reduces by a corresponding interval which leads me to believe this is what causes the problem. E.g. I forced removal of 2 additional SNPs the error becomes.

IndexError: index 44267 is out of bounds for axis 0 with size 44267

I am able to get around this by identifying the mismatched SNPs and removing them from my good_snps.txt for now

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.