This project forked from ncbi-hackathons/clusterduck

Disease Clustering from Literature Based on Minimal Training Data

License: MIT License

Python 64.09% Jupyter Notebook 35.91%

ClusterDuck

Disease clustering from phenotypic literature data through Document Understanding, Comprehension and Knowledge

(doi: )

What's the problem?

Typically, single-nucleotide polymorphisms (SNPs) are studied in terms of a "one disease, one SNP" relationship. As a result, researchers and clinicians often have deep knowledge of a disease but incomplete knowledge of all potentially relevant SNPs.

Why should we solve it?

Knowing a larger set of SNPs that are potentially relevant to a collection of phenotypes would make it possible to find a novel set of relevant publications.

What is ClusterDuck?

ClusterDuck is a tool that automatically identifies genetically relevant publications for a set of phenotypic terms and returns a list of novel, genetically related topics.

How to use ClusterDuck?

Prerequisite

Installation

  • Install the required Python packages: pip3 install -r requirements.txt

  • Download the PubMed database and the required NLTK data: python3 setup.py

Example

  1. Use the easy-to-start command-line tool ClusterDuck.py

    python3 ClusterDuck.py "Autistic behavior" "Restrictive behavior" "Impaired social interactions" "Poor eye contact" "Impaired ability to form peer relationships" "No social interaction" "Impaired use of nonverbal behaviors" "Lack of peer relationships" "Stereotypy"

  2. A case study

    python3 generate_csv.py

  3. Train Topic Models

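    Once you have both corpora, call the following function from train_lda.py to obtain the topic models:

    lda1, lda2 = train_ldas(corpus1, corpus2, n_topics=N_TOPICS, alpha=ALPHA, eta=ETA)

    where N_TOPICS, ALPHA and ETA are shared by both topic models. In LDA, these conventionally denote the number of topics and the Dirichlet priors on the document-topic and topic-word distributions, respectively.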

Test Suite

python3 ./dc/test_utils.py

ClusterDuck Workflow

Workflow Pipeline

Input

A set of phenotypic terms from the Human Phenotype Ontology (HPO).

Workflow

  • A 'phenotypic' corpus of literature is extracted from PubMed using the user-input HPO phenotypic terms.
  • All SNPs mentioned in the 'phenotypic' corpus are identified.
  • PubMed is queried using the phenotypically-relevant SNPs to extract a second 'phenotypic + genetic' corpus.
  • Topic modeling is run on each corpus separately.
  • Topic distributions are compared to discover new genetically-inspired and relevant topics.
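
The SNP identification and second-query steps above can be sketched as follows. This is a minimal illustration and not ClusterDuck's own code: it assumes SNPs are cited by dbSNP-style rsIDs (for example "rs111") and that the PubMed query is a plain OR of quoted terms. The helpers extract_snps and build_query are hypothetical names, and the abstracts are made-up text.

```python
import re

# Assumption: SNPs appear in abstracts as dbSNP-style reference IDs ("rs" + digits).
RSID_PATTERN = re.compile(r"\brs\d+\b", re.IGNORECASE)


def extract_snps(abstracts):
    """Return the sorted, de-duplicated rsIDs mentioned in any abstract."""
    found = set()
    for text in abstracts:
        for match in RSID_PATTERN.findall(text):
            found.add(match.lower())  # normalize "RS222" and "rs222" to one form
    return sorted(found)


def build_query(terms):
    """OR-join quoted terms into a single PubMed-style query string."""
    return " OR ".join('"{}"'.format(term) for term in terms)


# Step 1: query built from the user's HPO terms (the 'phenotypic' corpus).
phenotype_query = build_query(["Autistic behavior", "Poor eye contact"])

# Step 2: made-up abstracts standing in for the retrieved 'phenotypic' corpus.
abstracts = [
    "Variant rs111 was genotyped in all participants.",
    "Both RS222 and rs111 were examined.",
    "No variants were reported.",
]
snps = extract_snps(abstracts)

# Step 3: query built from the SNPs (the 'phenotypic + genetic' corpus).
snp_query = build_query(snps)
print(phenotype_query)
print(snp_query)
```

Lower-casing the matches before de-duplication keeps one entry per SNP even when papers capitalize the identifier differently.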

Output

A list of novel, genetically related topics relevant to the initial phenotypic input.
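
One way to picture how "novel" topics could be selected is sketched below. The README does not say which comparison metric ClusterDuck uses, so this example assumes Jensen-Shannon divergence between topic-word distributions over a shared vocabulary, with an arbitrary threshold. The function names and the toy numbers are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def novel_topics(topics1, topics2, threshold=0.5):
    """Indices of topics in topics2 whose closest topic in topics1 is farther than threshold."""
    novel = []
    for i, topic in enumerate(topics2):
        nearest = min(js_divergence(topic, ref) for ref in topics1)
        if nearest > threshold:
            novel.append(i)
    return novel


# Toy topic-word distributions over a shared 4-word vocabulary (made-up numbers).
phenotypic_topics = [[0.7, 0.3, 0.0, 0.0], [0.3, 0.7, 0.0, 0.0]]
combined_topics = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
print(novel_topics(phenotypic_topics, combined_topics))
```

With base-2 logarithms the divergence lies between 0 (identical topics) and 1 (topics with no words in common), which makes the threshold easy to interpret.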

Planned Features

  • Synonym search from user input: HPO provides a synonym list for each of its controlled vocabulary terms. This can be incorporated as a preprocessor on the user input.
  • Make use of hierarchy: HPO is an ontology of terms, and user-input terms are likely to have sub- and super-class terms.
  • Filtering different types of research articles: Optionally add a [PT] query filter to the PubMed query to limit the types of publications returned.
  • Use of EMR-type data to build the corpus as opposed to PubMed: An EMR-based corpus is more likely to be associated with diseases (especially with ICD terms) than a PubMed-based corpus.
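
As an illustration of the planned [PT] filter, the sketch below wraps an existing query string and appends a publication-type clause using PubMed's [PT] field tag. The function name add_publication_type_filter is hypothetical, and how ClusterDuck builds its base query is an assumption.

```python
def add_publication_type_filter(query, publication_types):
    """Restrict a PubMed query string to the given publication types via the [PT] tag."""
    if not publication_types:
        return query  # nothing to filter on, so leave the query untouched
    pt_clause = " OR ".join('"{}"[PT]'.format(pt) for pt in publication_types)
    # Parenthesize both sides so the AND binds to the whole original query.
    return "({}) AND ({})".format(query, pt_clause)


base_query = '"Autistic behavior" OR "Stereotypy"'
filtered = add_publication_type_filter(base_query, ["Review", "Clinical Trial"])
print(filtered)
```

Keeping the filter as a separate wrapper means it stays optional: passing an empty list returns the original query unchanged.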

People/Team

  • Jennifer Dong
  • Larry Gray
  • Joseph Halstead
  • Yi Hsiao
  • Wayne Pereanu
  • Neelay Trivedi
  • Nathan Wan
  • Donghui Wu

Presentations
