shz9 / magenpy

Modeling and Analysis of (Statistical) Genetics data in python

Home Page: https://shz9.github.io/magenpy/

License: MIT License

Python 91.58% Cython 5.42% Makefile 0.20% C++ 1.80% Shell 0.32% Dockerfile 0.69%
genotype gwas ldsc prs simulation linkage-disequilibrium phenotype

magenpy's Introduction

magenpy: Modeling and Analysis of (Statistical) Genetics data in python


magenpy is a Python package for modeling and analyzing statistical genetics data. The package provides tools for:

  • Reading and processing genotype data in plink BED format.
  • Efficient LD matrix construction and storage in Zarr array format.
  • Data structures for harmonizing various GWAS data sources.
    • Includes parsers for commonly used GWAS summary statistics formats.
  • Simulating polygenic traits (continuous and binary) using complex genetic architectures.
    • Multi-cohort simulation scenarios (beta)
    • Simulations incorporating functional annotations in the genetic architecture (beta)
  • Interfaces for performing association testing on simulated and real phenotypes.
  • Preliminary support for processing and integrating genomic annotations with other data sources.


magenpy's People

Contributors: dependabot[bot], shz9

magenpy's Issues

Add AnnotatedPhenotypeSimulator

The current PhenotypeSimulator class has some functionality for simulating with annotations. We need to remove those functionalities and embed them in a new class AnnotatedPhenotypeSimulator that inherits from PhenotypeSimulator and incorporates annotations in simulate_mixture_assignments and simulate_betas.

Tasks:

  • Remove the annotation_weights property (as well as other related methods) from GWASimulator.
  • Create AnnotatedPhenotypeSimulator class
  • In AnnotatedPhenotypeSimulator, overwrite the simulate_mixture_assignment method to incorporate annotations.
  • In AnnotatedPhenotypeSimulator, overwrite the simulate_betas method to incorporate annotations.

Additionally, we still need to figure out:

  • How to simulate betas according to user-defined enrichments for each (or some) of the annotation classes (see the sketch below).
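A rough sketch of what the new subclass could look like is below. Note that the method names, the self.annotations matrix, and the way enrichment weights enter the simulation are all assumptions at this stage, not settled design:

import numpy as np

class AnnotatedPhenotypeSimulator(PhenotypeSimulator):
    """
    Hypothetical subclass that incorporates functional annotations into the
    simulation. Assumes the parent class exposes `self.annotations` (an M x K
    annotation matrix) and that per-annotation enrichment weights are
    supplied by the user.
    """

    def __init__(self, *args, annotation_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.annotation_weights = annotation_weights  # shape (K, )

    def simulate_mixture_assignments(self):
        # Turn annotation-weighted scores into per-SNP causal probabilities:
        logits = self.annotations.dot(self.annotation_weights)
        p_causal = 1. / (1. + np.exp(-logits))
        self.mixture_assignments = np.random.binomial(1, p_causal)

    def simulate_betas(self):
        # Illustrative choice: scale the per-SNP effect size variance by the
        # annotation-weighted score of each variant:
        per_snp_var = np.exp(self.annotations.dot(self.annotation_weights))
        self.betas = self.mixture_assignments * np.random.normal(0., np.sqrt(per_snp_var))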

Implement heritability estimators

I'd like to provide implementations of both summary statistics-based and individual-level-data heritability estimators. This is important for many downstream models as well as for data interpretation.

Here are some popular heritability estimators that we can easily implement with the magenpy interfaces:

  • LD Score regression (ldsc). We already have a simple estimator implemented, but it would be nice to have a fully functional implementation that incorporates annotations, estimates standard errors, etc.
  • The GWASH estimator: This is an interesting estimator proposed by Schwartzman et al. a few years ago based on important work by Lee Dicker. The estimator seems simple enough to implement with the magenpy LD interfaces.
  • The Haseman-Elston heritability estimator based on individual-level data.

We may add other estimators down the line if there is interest.
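For the first item, a minimal (unweighted, fixed-intercept) version of the LD Score regression estimator might look like the following sketch; a full implementation would add regression weights, intercept estimation, annotation stratification, and standard errors:

import numpy as np

def simple_ldsc_h2(chisq, ld_scores, n, m):
    """
    Crude LD Score regression estimate of SNP heritability, assuming the
    single-component model E[chi^2_j] = 1 + n * h2 * l_j / m with the
    intercept fixed at 1 and no regression weights.
    """
    # Unweighted least-squares slope of (chi^2 - 1) on the LD scores:
    slope = np.polyfit(ld_scores, np.asarray(chisq) - 1., deg=1)[0]
    return slope * m / n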

Add support for reading summary statistics

We need to add a method to the GWASDataLoader class to read pre-computed summary statistics obtained from other software tools (e.g. PLINK). The method should take either one big file and break it up by chromosome, or one file per chromosome (as we do with the genotypes).

Sample summary statistics files can be found here: independent_sumstats.

Method signature is already written:

def read_summary_stats(self, sumstats_file):
    """
    TODO: implement reading functions for summary statistics
    """
    pass
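A minimal sketch of what the method body could do, assuming a whitespace-delimited, PLINK-style file with a CHR column (the column name and the attribute used to store the result are assumptions):

import pandas as pd

def read_summary_stats(self, sumstats_file):
    """
    Read pre-computed GWAS summary statistics and split them by chromosome,
    mirroring how genotype files are handled.
    """
    ss = pd.read_csv(sumstats_file, sep=r'\s+')
    self.sumstats = {chrom: df.reset_index(drop=True)
                     for chrom, df in ss.groupby('CHR')}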

Modify `LDMatrix` to allow for reordering/sorting LD matrices

There are cases where we would like to re-order the LD matrix by shuffling its rows and columns to align with some external requirements. This is challenging to implement because it will likely disrupt the LD boundaries and affect the sparsity structure of the original matrix.

As a starting point, we can limit ourselves to solutions that do not re-order the matrix itself, but instead change the order in which we iterate over its rows/columns. This may require that, instead of storing LD boundaries (the start and end point for each row), we store the indices of all neighboring SNPs whose LD is non-zero, similar to the sparse matrix formats implemented by scipy and other libraries.
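To make the idea concrete, here is a sketch of converting per-row (start, end) boundaries into an explicit, CSR-style index representation (the input names are illustrative):

from scipy.sparse import csr_matrix

def boundaries_to_csr(ld_rows, boundaries, n_snps):
    """
    Convert a boundary-based LD representation (one dense window of values
    per row) into a scipy CSR matrix that stores explicit column indices,
    which can then be permuted or iterated over in any order.
    """
    data, indices, indptr = [], [], [0]
    for row, (start, end) in zip(ld_rows, boundaries):
        data.extend(row)
        indices.extend(range(start, end))
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(len(ld_rows), n_snps))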

Will keep this issue open to explore various ideas for how we can go about implementing this.

Add LD matrices to S3/Google Cloud storage services

Instead of asking the user to download the LD matrices every time they need to run, e.g., viprs, we can leverage Zarr's APIs for cloud storage and read the matrices from a central repository on, e.g., Amazon S3 or Google Cloud. For this to happen, we need to extend the LDMatrix class to handle various distributed storage systems, with the help of the MutableMapping interface from Dask.

A simple way to go about this is to add methods to the LDMatrix class, such as .from_s3() and .from_gc() to read matrices from the s3 and Google Cloud storage systems, respectively.
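A possible starting point, assuming the s3fs package is installed so that Zarr can resolve s3:// URLs (the function name and the wrapping into an LDMatrix object are left out here):

import zarr

def ld_matrix_from_s3(s3_path):
    """
    Sketch of a `.from_s3()`-style constructor: open a Zarr-backed LD matrix
    directly from an S3 bucket, e.g. 's3://my-bucket/ld/chr_22.zarr'.
    """
    store = zarr.storage.FSStore(s3_path)
    return zarr.open(store, mode='r')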

Package is slow to import

At the moment, the package takes at least several seconds to import. This is likely because we're importing very heavy dependencies at the top of some of the modules, including scipy, pandas, numpy, matplotlib, etc. It would be good to refactor the import statements to reduce the time it takes to load the main modules.
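One low-effort pattern is to defer heavy imports to the functions that actually need them, so that importing magenpy itself stays fast. An illustrative example (the function and column names are made up):

def manhattan_plot(sumstats_table):
    # Heavy plotting/numerical dependencies are only imported at call time:
    import numpy as np
    import matplotlib.pyplot as plt

    plt.scatter(sumstats_table['POS'], -np.log10(sumstats_table['PVAL']), s=2)
    plt.xlabel('Position')
    plt.ylabel('-log10(p-value)')
    plt.show()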

Explore using `bed-reader` as a backend

I was curious about the possibility of using bed-reader as a backend for some of the operations on the genotype matrix. From the preliminary testing that I have done, it seems fast and provides a lot of nice functionality.

Here's a sample implementation that inherits from GenotypeMatrix and implements some of the useful operations, such as linear scoring and reading the genotype file in chunks. However, in early experiments, it seems a bit slow compared to pandas-plink or using plink directly, as implemented in plinkBEDGenotypeMatrix. May need to improve this for it to be a practical competitor.

class rustBEDGenotypeMatrix(GenotypeMatrix):
    """
    NOTE: Still experimental... lots more work to do...
    """

    def __init__(self, sample_table=None, snp_table=None, temp_dir='temp', bed_mat=None):
        super().__init__(sample_table=sample_table, snp_table=snp_table, temp_dir=temp_dir)

        # bed-reader matrix object (as returned by open_bed):
        self.bed_mat = bed_mat

    @classmethod
    def from_file(cls, file_path, temp_dir='temp'):

        from bed_reader import open_bed

        # Open the .bed file with bed-reader (metadata is read lazily):
        bed_mat = open_bed(file_path)

        # Set the sample table:
        sample_table = pd.DataFrame({
            'FID': bed_mat.fid,
            'IID': bed_mat.iid,
            'fatherID': bed_mat.father,
            'motherID': bed_mat.mother,
            'sex': bed_mat.sex,
            'phenotype': bed_mat.pheno
        }).astype({
            'FID': str,
            'IID': str,
            'fatherID': str,
            'motherID': str,
            'sex': float,
            'phenotype': float
        })

        sample_table['phenotype'] = sample_table['phenotype'].replace({-9.: np.nan})
        sample_table = sample_table.reset_index()

        # Set the snp table:
        snp_table = pd.DataFrame({
            'CHR': bed_mat.chromosome,
            'SNP': bed_mat.sid,
            'cM': bed_mat.cm_position,
            'POS': bed_mat.bp_position,
            'A1': bed_mat.allele_1,
            'A2': bed_mat.allele_2
        }).astype({
            'CHR': int,
            'SNP': str,
            'cM': float,
            'POS': int,
            'A1': str,
            'A2': str
        })

        snp_table = snp_table.reset_index()

        g_mat = cls(sample_table=SampleTable(sample_table),
                    snp_table=snp_table,
                    temp_dir=temp_dir,
                    bed_mat=bed_mat)

        return g_mat

    @property
    def sample_index(self):
        return self.sample_table.table['index'].values

    @property
    def snp_index(self):
        return self.snp_table['index'].values

    def score(self, beta, standardize_genotype=False, skip_na=True):
        """
        Perform linear scoring on the genotype matrix.
        :param beta: A vector or matrix of effect sizes for each variant in the genotype matrix.
        :param standardize_genotype: If True, standardize the genotype when computing the polygenic score.
        :param skip_na: If True, skip missing values when computing the polygenic score.
        """

        pgs = None

        if standardize_genotype:
            from .stats.transforms.genotype import standardize
            for (start, end), chunk in self.iter_col_chunks(return_slice=True):
                if pgs is None:
                    pgs = standardize(chunk).dot(beta[start:end])
                else:
                    pgs += standardize(chunk).dot(beta[start:end])
        else:
            for (start, end), chunk in self.iter_col_chunks(return_slice=True):
                if skip_na:
                    chunk_pgs = np.nan_to_num(chunk).dot(beta[start:end])
                else:
                    chunk_pgs = np.where(np.isnan(chunk), self.maf[start:end], chunk).dot(beta[start:end])

                if pgs is None:
                    pgs = chunk_pgs
                else:
                    pgs += chunk_pgs

        return pgs

    def perform_gwas(self, **gwa_kwargs):

        raise NotImplementedError

    def compute_allele_frequency(self):
        self.snp_table['MAF'] = (np.concatenate([np.nansum(bed_chunk, axis=0)
                                                 for bed_chunk in self.iter_col_chunks()]) / (2. * self.n_per_snp))

    def compute_sample_size_per_snp(self):

        self.snp_table['N'] = self.n - np.concatenate([np.sum(np.isnan(bed_chunk), axis=0)
                                                       for bed_chunk in self.iter_col_chunks()])

    def iter_row_chunks(self, chunk_size='auto', return_slice=False):

        if chunk_size == 'auto':
            matrix_size = self.estimate_memory_allocation()
            # By default, we aim for roughly 128MB per chunk:
            chunk_size = int(self.n // max(matrix_size // 128, 1))

        for i in range(int(np.ceil(self.n / chunk_size))):
            start, end = int(i * chunk_size), min(int((i + 1) * chunk_size), self.n)
            chunk = self.bed_mat.read(np.s_[self.sample_index[start:end], self.snp_index], num_threads=1)
            if return_slice:
                yield (start, end), chunk
            else:
                yield chunk

    def iter_col_chunks(self, chunk_size='auto', return_slice=False):

        if chunk_size == 'auto':
            matrix_size = self.estimate_memory_allocation()
            # By default, we aim for roughly 128MB per chunk:
            chunk_size = int(self.m // max(matrix_size // 128, 1))

        for i in range(int(np.ceil(self.m / chunk_size))):
            start, end = int(i * chunk_size), min(int((i + 1) * chunk_size), self.m)
            chunk = self.bed_mat.read(np.s_[self.sample_index, self.snp_index[start:end]], num_threads=1)
            if return_slice:
                yield (start, end), chunk
            else:
                yield chunk

Re-organize the code for modularity

Currently, the GWASDataLoader class incorporates many functionalities that make it brittle and harder to work with. We would like to move some of those functionalities into their own classes/objects.

Here are a number of ideas that we can implement:

  • Move summary statistics to their own class SumStats
  • Move LD functionalities to their own class LD
  • Move annotations to their own class `Annotations`
  • (Need discussion) Move genotypes to their own class Genotype

In this sort of setup, the GWASDataLoader object would be an interface to harmonize these data classes, instead of performing many different, unrelated functions.
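Under this design, the loader would roughly look like the following sketch (class and attribute names follow the proposal above and are not final):

class GWASDataLoader:
    """
    The loader becomes a thin harmonization layer over dedicated data
    classes rather than implementing their logic itself.
    """

    def __init__(self, genotype=None, sumstats=None, ld=None, annotations=None):
        self.genotype = genotype        # Genotype object
        self.sumstats = sumstats        # SumStats object
        self.ld = ld                    # LD object
        self.annotations = annotations  # Annotations object

    def harmonize(self):
        # Align SNPs/samples across whichever data sources are present
        # (details to be worked out):
        raise NotImplementedError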

Make LD matrices triangular

Currently, the interfaces for calculating and handling LD matrices stored in Zarr arrays assume that the matrices are square. This can be very expensive for extremely large matrices, as we scale to hundreds of thousands or millions of SNPs. To reduce storage requirements, we can store only the upper triangular part of the matrix and reconstruct the lower triangular part on the fly when processing the data (perhaps with the help of scipy's csr_matrix? see the sketch after the task list).

  • Create utilities for converting part or all of the Zarr LD matrix to csr_matrix.
  • Modify iterators to allow for triangular matrices where the full data is reconstructed on the fly.
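For instance, a full symmetric block could be reconstructed from its stored upper triangle roughly as follows (a sketch; assumes the stored triangle includes the unit diagonal):

from scipy.sparse import csr_matrix, diags

def reconstruct_symmetric(upper_tri):
    """
    Rebuild a full symmetric LD matrix from its upper-triangular part by
    adding the transpose and removing the doubly-counted diagonal.
    """
    ut = csr_matrix(upper_tri)
    return ut + ut.T - diags(ut.diagonal())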

Add GDLMerger class

In some cases, we would like to merge GWAS data from multiple cohorts or datasets. For these cases, it would be great to facilitate this with a GDLMerger object that handles merging the data smoothly in the background.

Here are some basic tasks that can be implemented within this framework:

  • Create a basic GDLMerger class that only merges the datasets based on SNPs and their identifiers (ID, a1, a2, etc.). This can be helpful for some applications (see the sketch below).
  • Create a GDLMerger class that extends the basic one and performs meta-analysis on shared GWAS SNPs (if summary statistics are provided).
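The basic merge could boil down to an inner join on SNP identifiers and alleles, e.g. (column names are assumptions):

from functools import reduce

def merge_snp_tables(tables):
    """
    Intersect SNPs across datasets (a list of pandas DataFrames) on their
    identifier and alleles, keeping only variants shared by all cohorts.
    """
    key_cols = ['SNP', 'A1', 'A2']
    return reduce(lambda left, right: left.merge(right[key_cols], on=key_cols),
                  tables)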

Check for and find nearest PSD LD matrix

In some applications, we may require that the LD matrix (SNP-by-SNP correlation matrix) be positive definite (PD) or positive semi-definite (PSD). Very often, this is not the case, which may result in instabilities for some downstream models. Therefore, it would be great to augment our LD functionalities to check for PSD and to also find the nearest PSD matrix to the sample LD matrix.

A good starting point may be the code snippet in this StackOverflow thread. I modified this function to find the nearest PSD while retaining the sparsity structure in the original matrix:

import numpy as np
from numpy import linalg as la

def nearestPD(A):
    """Find the nearest positive-definite matrix to input

    A Python/Numpy port of John D'Errico's `nearestSPD` MATLAB code [1], which
    credits [2].
    
    Credit: https://stackoverflow.com/a/43244194
    Modified by Shadi Zabad to retain the sparsity structure in the original matrix.

    [1] https://www.mathworks.com/matlabcentral/fileexchange/42885-nearestspd

    [2] N.J. Higham, "Computing a nearest symmetric positive semidefinite
    matrix" (1988): https://doi.org/10.1016/0024-3795(88)90223-6
    """

    np.fill_diagonal(A, 1.)
    B = (A + A.T) / 2
    _, s, V = la.svd(B)

    H = np.dot(V.T, np.dot(np.diag(s), V))

    A2 = (B + H) / 2

    A3 = (A2 + A2.T) / 2
    A3[A == 0.] = 0.

    if isPD(A3):
        return A3

    spacing = np.spacing(la.norm(A))
    # The above is different from [1]. It appears that MATLAB's `chol` Cholesky
    # decomposition will accept matrices with exactly zero eigenvalues, whereas
    # Numpy's will not. So where [1] uses `eps(mineig)` (where `eps` is MATLAB
    # for `np.spacing`), we use the above definition. CAVEAT: our `spacing`
    # will be much larger than [1]'s `eps(mineig)`, since `mineig` is usually on
    # the order of 1e-16, and `eps(1e-16)` is on the order of 1e-34, whereas
    # `spacing` will, for Gaussian random matrices of small dimension, be on
    # the order of 1e-16. In practice, both ways converge, as the unit tests in
    # the original answer suggest.
    I = np.eye(A.shape[0])
    k = 1
    while not isPD(A3):
        mineig = np.min(np.real(la.eigvals(A3)))
        A3 += I * (-mineig * k**2 + spacing)
        A3[A == 0.] = 0.
        k += 1

    return A3


def isPD(B):
    """Returns true when input is positive-definite, via Cholesky"""
    try:
        _ = la.cholesky(B)
        return True
    except la.LinAlgError:
        return False

The main bottleneck here is that we need to perform SVD on the LD matrix, which may be computationally expensive, even with sparse arrays. One way to get around this difficulty is to deploy this within LD blocks, which may be more manageable.

To experiment with this, we can try to tackle the following tasks:

  • Extract blocks from LDMatrix corresponding to the LD blocks defined by the block estimator.
  • Find nearest PSD for each block separately.
  • Check if this alleviates computational or numerical instabilities for some downstream tasks (e.g. viprs).
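Reusing the nearestPD() function above, the block-wise idea could be prototyped on a dense matrix with something like this (assuming `blocks` is a list of (start, end) index pairs from the block LD estimator):

def nearest_psd_by_block(ld_mat, blocks):
    """
    Apply nearestPD() within pre-defined LD blocks instead of on the whole
    chromosome-wide matrix, which keeps each SVD small and manageable.
    """
    for start, end in blocks:
        ld_mat[start:end, start:end] = nearestPD(ld_mat[start:end, start:end])
    return ld_mat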

Replace LD matrix matching logic

Currently, when we match LD matrices with GWAS summary statistics, a new matrix is created in temp_dir with only the subset of SNPs that are shared between the two data sources. This can be very slow and expensive for large LD matrices, potentially replicating hundreds of gigabytes of data unnecessarily.

As an alternative, we can instead generate a mask that ensures that the posterior estimates for SNPs that only exist in the LD matrix are set to zero by default. This mask would be generated in the matching step and passed down to downstream tasks, such as PRS model fitting.

Details TBD.
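A sketch of what the masking step could look like (function and argument names are illustrative):

import numpy as np

def generate_snp_mask(ld_snps, sumstats_snps):
    """
    Build a boolean mask over the LD matrix's SNPs marking which ones are
    present in the GWAS summary statistics. Downstream models (e.g. PRS
    fitting) would fix the posterior estimates of masked-out SNPs to zero,
    instead of copying/subsetting the LD matrix on disk.
    """
    shared = set(sumstats_snps)
    return np.array([snp in shared for snp in ld_snps])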
