ancestral-genome-reconstruction's Introduction

Ancestral Genome Reconstruction - Genome Simulator

Motivation

We are interested in reconstructing ancestral SNPs for known, highly inter-related pedigrees. This code is part of the ongoing research of Gabriela Brown and Dr. Sara Mathieson at Swarthmore College.

We are building off of the work done by Georgi et al. in 2014 on the heritability of Bipolar Affective Disorder. [1]

[1] - http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004229

Installation

This code uses pipenv to install and manage python dependencies: https://pypi.python.org/pypi/pipenv

To install, run pip install pipenv or brew install pipenv.

To create a pipenv environment and install all Python dependencies, run pipenv install, which reads the Pipfile in the repository.

(Optional) To install GERMLINE, follow the instructions at: http://www.cs.columbia.edu/~gusev/germline/

Usage

pipenv run python3 genome_simulator.py [-h] [--verbose] [--length LENGTH] [--output OUTPUT] pedigree_file family_id

Arguments

pedigree_file - Path to the file that specifies the pedigree structure to use in the simulation.
family_id - The FAMILY_ID of the pedigree in the pedigree_file.
--length - Integer length in base pairs of chromosome to simulate. Default value is 5000.
--output - Base filename for the output files (.ped, .map, .pdf). Default value is output.
--verbose - Increase output verbosity.
-h, --help - Output usage information.
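
As an illustration of how these arguments fit together, here is a minimal sketch of an equivalent command-line parser. It assumes argparse (suggested by the usage string, but not confirmed) and it is not the project's actual code; build_parser is a hypothetical name, and the default of "output" for --output is inferred from the Output section.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the documented CLI (not the project's code)."""
    parser = argparse.ArgumentParser(prog="genome_simulator.py")
    parser.add_argument("pedigree_file",
                        help="path to the file that specifies the pedigree structure")
    parser.add_argument("family_id",
                        help="FAMILY_ID of the pedigree in the pedigree file")
    parser.add_argument("--length", type=int, default=5000,
                        help="length in base pairs of the chromosome to simulate")
    parser.add_argument("--output", default="output",
                        help="base filename for the .ped, .map and .pdf files")
    parser.add_argument("--verbose", action="store_true",
                        help="increase output verbosity")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
# "pedigree.txt" and "FAM1" are placeholder values.
args = build_parser().parse_args(["pedigree.txt", "FAM1", "--length", "100000"])
print(args.pedigree_file, args.family_id, args.length, args.output, args.verbose)
```

The -h/--help option is added by argparse automatically, which is why it does not appear in the sketch.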

File Formats

Input

The pedigree data is supplied by the pedigree_file and family_id arguments. The pedigree file should contain a single pedigree in a format similar to the PLINK format. [2] If you have a PLINK file, run convert_format.py to get the right input format.

The pedigree file must be space delimited, with no header, and have the following columns:

  1. FAMILY_ID -- A single word or number per pedigree. This should be the same for every row.
  2. INDIVIDUAL_ID -- A unique identifying number.
  3. FATHER_ID -- The INDIVIDUAL_ID of the father. Use 0 for unknown/missing individuals.
  4. MOTHER_ID -- The INDIVIDUAL_ID of the mother. Use 0 for unknown/missing individuals.
  5. SEX -- Use 1 for male, 2 for female.

In addition,

  1. Every INDIVIDUAL_ID present in the FATHER_ID and MOTHER_ID columns must also have its own row.
  2. The mother and father of an individual must precede the individual in the dataset.

[2] - http://zzz.bwh.harvard.edu/plink/data.shtml
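
The input rules above can be checked before running a simulation. The following is a minimal sketch of such a check, not part of the repository; validate_pedigree is a hypothetical helper, and it splits on any whitespace, which is slightly more lenient than strictly space-delimited input.

```python
def validate_pedigree(lines):
    """Return a list of problems found in the pedigree rows (empty if none)."""
    problems = []
    seen = set()        # INDIVIDUAL_IDs read so far
    family_ids = set()  # distinct FAMILY_IDs encountered
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 5:
            problems.append(f"line {number}: expected 5 columns, got {len(fields)}")
            continue
        family, individual, father, mother, sex = fields
        family_ids.add(family)
        if individual in seen:
            problems.append(f"line {number}: duplicate INDIVIDUAL_ID {individual}")
        for role, parent in (("father", father), ("mother", mother)):
            # 0 marks an unknown parent; any other ID must have an earlier row.
            if parent != "0" and parent not in seen:
                problems.append(
                    f"line {number}: {role} {parent} does not precede {individual}")
        if sex not in ("1", "2"):
            problems.append(f"line {number}: SEX must be 1 or 2, got {sex}")
        seen.add(individual)
    if len(family_ids) > 1:
        problems.append(f"multiple FAMILY_IDs found: {sorted(family_ids)}")
    return problems


# A made-up three-person pedigree: two founders and their daughter.
example = """\
FAM1 1 0 0 1
FAM1 2 0 0 2
FAM1 3 1 2 2
"""
print(validate_pedigree(example.splitlines()))
```

Because a parent must appear before its child, a single pass that remembers the IDs seen so far covers both of the additional rules at once.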

Output

The simulation writes PLINK files to output.map and output.ped and a visualization to output.pdf. A different base name can be specified with the --output argument.
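
A quick sanity check on the two PLINK files is to confirm that they agree on the number of markers. The sketch below assumes the standard PLINK text layout: one row per marker in the .map file, and six leading columns followed by two allele columns per marker in each .ped row. Whether the simulator's output follows this layout exactly is an assumption; adjust leading_columns if it does not. Both function names are hypothetical.

```python
def count_markers(map_lines):
    """Count the non-empty rows of a PLINK .map file (one row per marker)."""
    return sum(1 for line in map_lines if line.strip())


def check_ped_against_map(ped_lines, n_markers, leading_columns=6):
    """Return the 1-based numbers of .ped rows with an unexpected column count.

    Assumes each row has `leading_columns` metadata fields followed by
    two allele columns per marker.
    """
    expected = leading_columns + 2 * n_markers
    mismatched = []
    for number, line in enumerate(ped_lines, start=1):
        fields = line.split()
        if fields and len(fields) != expected:
            mismatched.append(number)
    return mismatched


# Made-up in-memory example with two markers and one individual.
map_lines = ["1 snp1 0 100", "1 snp2 0 200"]
ped_lines = ["FAM1 1 0 0 1 0 A A G G"]
n_markers = count_markers(map_lines)
print(n_markers, check_ped_against_map(ped_lines, n_markers))
```

To run the check on real output, pass open file objects (for example open("output.map") and open("output.ped")) in place of the lists.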

The output PLINK files can be run through GERMLINE to obtain IBD information on the simulated genomes. You may have to fix this bug before using the software: https://stackoverflow.com/questions/12961336/i-am-unable-to-run-a-c-program-in-debianubuntu-that-works-in-redhatcentos

Runtime

Informal experiments suggest that, for a fixed pedigree structure, the runtime is polynomial in the length of the chromosome being simulated. For example, simulating the length of Chromosome 4 (about 190 million bp) takes about 20 minutes, and simulating the length of the entire human genome (about 3 billion bp) takes about 3 days.
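
The two figures quoted above give a rough idea of the degree of that polynomial. The sketch below fits runtime ~ length**k through those two points only, so the result is a back-of-the-envelope estimate rather than a measured complexity; scaling_exponent is a hypothetical helper.

```python
import math

# The two informal data points quoted above: (length in bp, runtime in minutes).
chromosome_4 = (190e6, 20)
whole_genome = (3e9, 3 * 24 * 60)


def scaling_exponent(point_a, point_b):
    """Estimate k in runtime ~ length**k from two (length, runtime) points."""
    (length_a, time_a), (length_b, time_b) = point_a, point_b
    return math.log(time_b / time_a) / math.log(length_b / length_a)


k = scaling_exponent(chromosome_4, whole_genome)
print(f"estimated exponent: {k:.2f}")
```

The estimate comes out just under 2, which is consistent with roughly quadratic growth in chromosome length for these two runs.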
