SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

SNP2Vec is a scalable self-supervised pre-training approach for understanding SNP patterns in genomic sequences. We evaluated SNP2Vec on predicting Alzheimer's disease risk in a Chinese cohort, where it significantly outperforms existing polygenic risk score methods and all other deep learning baselines, including those trained on haploid sequences.

Research Paper

SNP2Vec was accepted at BioNLP 2022; the details are in our paper (https://aclanthology.org/2022.bionlp-1.14). If you use any component of SNP2Vec in your work, including the token mapping resources, the cached chromosome matrix, or the Alzheimer's disease risk dataset, please cite the following paper:

@inproceedings{cahyawijaya-etal-2022-snp2vec,
    title = "{SNP}2{V}ec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study",
    author = "Cahyawijaya, Samuel  and
      Yu, Tiezheng  and
      Liu, Zihan  and
      Zhou, Xiaopu  and
      Mak, Tze Wing Tiffany  and
      Ip, Yuk Yu Nancy  and
      Fung, Pascale",
    booktitle = "Proceedings of the 21st Workshop on Biomedical Language Processing",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.bionlp-1.14",
    doi = "10.18653/v1/2022.bionlp-1.14",
    pages = "140--154",
    abstract = "Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics has also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability towards understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which is crucial for genome-wide association study. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNP. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer{'}s disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences.",
}

Repository Structure

We provide the code and resources for generating the SNP pre-training dataset that we used to build the Dipformer model in our paper.

Chromosome Matrix

We provide two pre-processed chromosome matrices, for Chromosome-19 and Chromosome-21, both built from GRCh37 and dbSNP 153.

To generate other chromosome matrices, see the gen_chromosome_matrix.ipynb notebook provided in this repository.
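
As a rough intuition for what a chromosome matrix can encode, the sketch below combines a reference sequence with known SNP positions into a per-position allele table. It is only an illustration under stated assumptions: the function name `build_allele_matrix`, the 0/1 four-column layout (A, C, G, T), and the toy inputs are all hypothetical and are not taken from gen_chromosome_matrix.ipynb, whose actual output format may differ.

```python
# Hypothetical illustration only: this is NOT the format produced by
# gen_chromosome_matrix.ipynb. It shows one simple way to merge a reference
# sequence (e.g. a slice of GRCh37) with SNP alleles (e.g. from dbSNP).

# Column index for each nucleotide in a row of the matrix.
NUCLEOTIDE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def build_allele_matrix(reference, snps):
    """Return one row per position with four 0/1 flags (A, C, G, T).

    reference: reference sequence as a string
    snps: dict mapping a 0-based position to an iterable of alternate alleles
    """
    matrix = []
    for pos, ref_base in enumerate(reference.upper()):
        row = [0, 0, 0, 0]
        # Mark the reference allele; unknown bases such as 'N' stay all-zero.
        if ref_base in NUCLEOTIDE_INDEX:
            row[NUCLEOTIDE_INDEX[ref_base]] = 1
        # Mark every single-nucleotide alternate allele seen at this position.
        for alt in snps.get(pos, ()):
            alt = alt.upper()
            if alt in NUCLEOTIDE_INDEX:
                row[NUCLEOTIDE_INDEX[alt]] = 1
        matrix.append(row)
    return matrix


# Toy inputs (made up for illustration, not real genomic data).
toy_reference = "ACGTN"
toy_snps = {1: ["T"], 3: ["A", "G"]}
matrix = build_allele_matrix(toy_reference, toy_snps)

for base, row in zip(toy_reference, matrix):
    print(base, row)
```

Rows with more than one flag set correspond to positions where a SNP is known, which is the information a haploid reference sequence alone cannot express.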

Alzheimer's Disease Risk Dataset

To access the Alzheimer's disease risk dataset used for evaluating the model in our paper, you need to request and sign a Data Use Agreement (DUA) by contacting Tiffany T.W MAK ([email protected]) or Xiaopu Zhou ([email protected]).
