We are interested in reconstructing ancestral SNPs for known, highly inter-related pedigrees. This code is part of the ongoing research of Gabriela Brown and Dr. Sara Mathieson at Swarthmore College.
We are building off of the work done by Georgi et al. in 2014 on the heritability of Bipolar Affective Disorder. [1]
[1] - http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004229
This code uses pipenv to install and manage Python dependencies: https://pypi.python.org/pypi/pipenv
To install pipenv, run one of:
pip install pipenv
brew install pipenv
To start a pipenv environment and install all Python dependencies, run
pipenv install
which will use the Pipfile in the repository.
(Optional) To install GERMLINE, follow the instructions at: http://www.cs.columbia.edu/~gusev/germline/
pipenv run python3 genome_simulator.py [-h] [--verbose] [--length LENGTH] [--output OUTPUT] pedigree_file family_id
pedigree_file - Path to the file that specifies the pedigree structure to use in the simulation.
family_id - The FAMILY_ID of the pedigree in the pedigree_file.
--length - Integer length in base pairs of chromosome to simulate. Default value is 5000.
--output - Filename to write output files to (.ped, .map, .pdf).
--verbose - Increase output verbosity.
--help - Output usage information.
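For readers who want to script around the simulator, the interface documented above can be mirrored with argparse. This is only a sketch of how such an interface could be declared, not the repository's actual code: the function name build_parser and the sample arguments are hypothetical, and the default output name "output" is inferred from the output file names this README mentions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="genome_simulator.py")
    parser.add_argument(
        "pedigree_file",
        help="path to the file that specifies the pedigree structure")
    parser.add_argument(
        "family_id",
        help="FAMILY_ID of the pedigree in the pedigree file")
    parser.add_argument(
        "--length", type=int, default=5000,
        help="length in base pairs of the chromosome to simulate")
    parser.add_argument(
        "--output", default="output",  # assumed default base name
        help="base filename for the .ped, .map and .pdf output files")
    parser.add_argument(
        "--verbose", action="store_true",
        help="increase output verbosity")
    return parser


if __name__ == "__main__":
    # Sample arguments are made up for illustration.
    args = build_parser().parse_args(
        ["pedigree.txt", "FAM1", "--length", "10000"])
    print(args.pedigree_file, args.family_id, args.length, args.output)
```

argparse adds -h/--help automatically, which matches the [-h] entry in the usage line.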
The pedigree data is supplied by the pedigree_file and family_id arguments.
The pedigree file should contain a single pedigree in a format similar to the PLINK format. [2] If you have a PLINK file, run convert_format.py to get the right input format.
The pedigree file must be space delimited, with no header, and have the following columns:
FAMILY_ID -- A single word or number per pedigree. This should be the same for every row.
INDIVIDUAL_ID -- A unique identifying number.
FATHER_ID -- The INDIVIDUAL_ID of the father. Use 0 for unknown/missing individuals.
MOTHER_ID -- The INDIVIDUAL_ID of the mother. Use 0 for unknown/missing individuals.
SEX -- Use 1 for male, 2 for female.
In addition:
- Every INDIVIDUAL_ID present in the FATHER_ID and MOTHER_ID columns must also have its own row.
- The mother and father of an individual must precede the individual in the dataset.
[2] - http://zzz.bwh.harvard.edu/plink/data.shtml
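The format rules above can be checked before running a simulation. The following is a minimal sketch of such a check, not part of the repository: the function name validate_pedigree and the sample rows are hypothetical, and skipping blank lines is an assumption.

```python
def validate_pedigree(lines):
    """Return a list of error messages for space-delimited pedigree rows."""
    errors = []
    seen = set()        # INDIVIDUAL_IDs defined on earlier rows
    family_ids = set()  # distinct FAMILY_ID values encountered
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 5:
            errors.append(
                f"line {number}: expected 5 columns, found {len(fields)}")
            continue
        family, individual, father, mother, sex = fields
        family_ids.add(family)
        if individual in seen:
            errors.append(
                f"line {number}: duplicate INDIVIDUAL_ID {individual}")
        # Parents must be 0 (unknown) or appear on an earlier row.
        for role, parent in (("father", father), ("mother", mother)):
            if parent != "0" and parent not in seen:
                errors.append(
                    f"line {number}: {role} {parent} has no earlier row")
        if sex not in ("1", "2"):
            errors.append(f"line {number}: SEX must be 1 or 2, found {sex}")
        seen.add(individual)
    if len(family_ids) > 1:
        errors.append("more than one FAMILY_ID found")
    return errors


if __name__ == "__main__":
    good = ["F1 1 0 0 1", "F1 2 0 0 2", "F1 3 1 2 1"]
    bad = ["F1 3 1 2 1", "F1 1 0 0 1", "F1 2 0 0 2"]  # child before parents
    print(validate_pedigree(good))
    for message in validate_pedigree(bad):
        print(message)
```

Because parents must precede their children, a single pass with a set of already-seen IDs covers both the "own row" rule and the ordering rule.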
The simulation will write PLINK files to output.map and output.ped and a visualization file to output.pdf. An alternate base name can be specified with the --output argument.
The output PLINK files can be run through GERMLINE to obtain IBD information on the simulated genomes. You may have to fix this bug before using the software: https://stackoverflow.com/questions/12961336/i-am-unable-to-run-a-c-program-in-debianubuntu-that-works-in-redhatcentos
Some informal experiments have shown that, given a fixed pedigree structure, the runtime is polynomial in the length of the chromosome being simulated. For example, simulating the length of Chromosome 4 (about 190 million bp) takes about 20 minutes, and simulating the length of the entire human genome (about 3 billion bp) takes about 3 days.
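The two timings quoted above are enough for a rough estimate of the polynomial degree. The sketch below assumes the runtime follows a single power law, t = c * L^k, and solves for k from the two data points. The function names are hypothetical, and with only two informal measurements the result is an indication, not a benchmark.

```python
import math


def scaling_exponent(length_a, minutes_a, length_b, minutes_b):
    """Exponent k such that minutes scale as length**k between two points."""
    return math.log(minutes_b / minutes_a) / math.log(length_b / length_a)


def estimate_minutes(length, ref_length, ref_minutes, exponent):
    """Extrapolate a runtime from one reference point under the power law."""
    return ref_minutes * (length / ref_length) ** exponent


if __name__ == "__main__":
    chr4_length, chr4_minutes = 190e6, 20            # about 20 minutes
    genome_length, genome_minutes = 3e9, 3 * 24 * 60  # about 3 days

    k = scaling_exponent(chr4_length, chr4_minutes,
                         genome_length, genome_minutes)
    print(f"estimated exponent: {k:.2f}")

    # Hypothetical query: a chromosome of 100 million bp.
    guess = estimate_minutes(100e6, chr4_length, chr4_minutes, k)
    print(f"estimated runtime for 100 Mbp: {guess:.1f} minutes")
```

The estimated exponent comes out close to 2, so under this assumption the runtime grows roughly quadratically with chromosome length.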