PMB-project

The aim of this project is to identify the amino acids encoded by a genome sequence.
The results can then be used to estimate the amino acid requirements of different populations and to see whether and how they correlate with diet.

1000 Genomes Project

The human genomes considered in this study come from the 1000 Genomes Project (https://www.internationalgenome.org/data/).
The project is a catalogue of common human genetic variation, built from openly consented samples from people who declared themselves healthy at the time of collection.
It was the first project to sequence the genomes of a large number of people, and its data were quickly made available to the worldwide scientific community through freely accessible public databases.
In fact, the reference data resources generated by the project are heavily used by the biomedical science community.
The final data set (associated with the third and final phase, dated 2nd May 2013) contains data for 2,504 individuals from 26 populations and 84.4 million variants. Low-coverage and exome sequence data are present for all of these individuals; 24 individuals were also sequenced to high coverage for validation purposes.

Pygeno package

The package chosen for the analysis is Pygeno (https://github.com/tariqdaouda/pyGeno), a Python package for precision medicine and proteogenomics.

VCF files

The data analyzed were provided in VCF (Variant Call Format).
VCF is a format used in bioinformatics for storing gene sequence variations. Only the variations relative to a reference genome are stored. The file is divided into a header, which provides metadata describing the body of the file, and a body, which contains the actual data.
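
As a rough illustration of the header/body split described above, the sketch below separates the "##" metadata lines, the "#CHROM" column line, and the data rows. The function name split_vcf and the tiny inline file are made up for this example; they are not part of the project or of Pygeno.

```python
import io


def split_vcf(handle):
    """Split a VCF text stream into metadata lines, column names and data rows."""
    meta, columns, records = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Header metadata, e.g. the file format version.
            meta.append(line)
        elif line.startswith("#"):
            # The single line naming the columns of the body.
            columns = line.lstrip("#").split("\t")
        else:
            # A body row: one variation relative to the reference genome.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Hand-written VCF used only for illustration (not project data).
EXAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "21\t9411239\t.\tG\tA\t100\tPASS\t.\n"
)

meta, columns, records = split_vcf(io.StringIO(EXAMPLE_VCF))
print(len(meta), columns[:2], records[0]["REF"], records[0]["ALT"])
```

Real VCF files usually carry additional per-sample columns after INFO; the sketch keeps whatever columns the header line declares.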

Keep in mind that VCF files come in many variants and Pygeno supports only some of them.
In our case the parser ran into problems with some files, so we chose files generated by PLINK: they came already filtered and could be used without modifying the Pygeno parser. For further information see https://github.com/tariqdaouda/pyGeno.

Installation

To install the application, clone the PMB-project repository and install the dependencies with pip:

git clone https://github.com/Emma-si/PMB-project.git
cd PMB-project
pip install -r requirements.txt

The file requirements.txt lists all the required packages and specifies the versions needed to run the program.

Note: this application is only compatible with Linux and macOS machines.

Usage

Once installed, the program can be run from the command line. The project uses Python 3.

python3 main.py --genome <reference genome> --vcf_file <vcf file path> --num_processes <number of processes>

where:

<reference genome> is the identifier of the genome to use as reference.
<vcf file path> is the path to the vcf file on the user's computer.
<number of processes> is the number of processes to use for multiprocessing (optional; if not specified, it is set automatically according to the available CPUs).
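
The three options above could be parsed roughly as follows. This is a minimal sketch using argparse; how main.py actually parses its arguments is not shown here, so the helper names build_parser and resolve_num_processes, and the choice of os.cpu_count() as the automatic default, are assumptions.

```python
import argparse
import os


def build_parser():
    """Build a parser for the three documented command line options."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--genome", required=True,
                        help="identifier of the reference genome")
    parser.add_argument("--vcf_file", required=True,
                        help="path to the vcf file")
    parser.add_argument("--num_processes", type=int, default=None,
                        help="number of worker processes (optional)")
    return parser


def resolve_num_processes(requested):
    """Fall back to the CPU count when no value is given (assumed default)."""
    if requested is None:
        return os.cpu_count() or 1
    if requested < 1:
        raise ValueError("num_processes must be a positive integer")
    return requested


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["--genome", "GRCh37.75", "--vcf_file", "./example/chr21_NA20502.vcf"]
)
print(args.genome, args.vcf_file, resolve_num_processes(args.num_processes))
```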

List of available reference genomes

The program only supports reference genomes that are already included in the Pygeno package list. The list is reproduced here for clarity and ease of use.

Human -> GRCh37.75
Human -> GRCh37.75_Y-Only
Human -> GRCh38.78
Mouse -> GRCm38.78

Structure of the project

The program takes as input the reference genome (one of those in the list above), the path of the vcf file to use and, optionally, the number of processes. It then compresses the files into the 'tar.gz' format, which Pygeno requires.

After the compression, an SNP file is created with the information collected from the original vcf file. An SNP (single nucleotide polymorphism) is a germline substitution of a single nucleotide at a specific position in the genome. Pygeno requires the files to be organized in a specific way before the SNP file can be created: a manifest.ini file must be included in the same archive as the gzipped vcf file. This manifest.ini file lists information about the SNPs and the maintainer; in this program these fields are filled with dummy data.
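
The packaging step described above can be sketched with the standard library alone. The function name build_snp_archive is hypothetical, and the section and key names written to manifest.ini are assumptions for illustration; the exact fields Pygeno expects should be checked against its documentation.

```python
import configparser
import gzip
import io
import tarfile
import tempfile
from pathlib import Path


def build_snp_archive(vcf_path, archive_path, set_name="dummy_set"):
    """Bundle a gzipped vcf file and a manifest.ini into one tar.gz archive."""
    vcf_path = Path(vcf_path)
    gz_name = vcf_path.name + ".gz"

    # Manifest filled with dummy data; all key names below are assumed.
    manifest = configparser.ConfigParser()
    manifest["package_infos"] = {
        "description": "dummy description",
        "maintainer": "dummy maintainer",
        "maintainer_contact": "dummy contact",
        "version": "1",
    }
    manifest["set_infos"] = {"species": "human", "name": set_name}
    manifest["snps"] = {"filename": gz_name}
    buffer = io.StringIO()
    manifest.write(buffer)

    payloads = {
        "manifest.ini": buffer.getvalue().encode("utf-8"),
        gz_name: gzip.compress(vcf_path.read_bytes()),
    }
    with tarfile.open(archive_path, "w:gz") as tar:
        for name, payload in payloads.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return archive_path


# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    vcf = Path(tmp) / "tiny.vcf"
    vcf.write_text("##fileformat=VCFv4.1\n")
    archive = build_snp_archive(vcf, Path(tmp) / "tiny.tar.gz")
    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())
print(names)
```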

Set the number of processes

The user can pass the number of processes to use as an input. This number should be chosen carefully: too many processes can overload the CPU without bringing any benefit. If the user does not specify a number of processes, the program assigns one automatically based on the available CPUs.

A pool of processes is initialized to speed up the processing, with each process handling a separate chunk of the list of protein ids to be analyzed. Each chunk is passed to the protein worker function, which extracts the modified sequence of each protein in the chunk and filters it by substituting the variations.
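
One way to divide the list of protein ids among the processes is to cut it into nearly equal contiguous chunks. The helper below is a sketch under that assumption; the name split_into_chunks and the placeholder ids are invented for the example and the project may split the list differently.

```python
def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks contiguous, nearly equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be a positive integer")
    # Never create more chunks than there are items (but at least one).
    num_chunks = min(num_chunks, len(items)) or 1
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for index in range(num_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if index < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Placeholder ids, not real protein identifiers.
protein_ids = [f"PROT{i:03d}" for i in range(10)]
chunks = split_into_chunks(protein_ids, 4)
print([len(chunk) for chunk in chunks])
```

Keeping the chunks contiguous and close in size helps each process receive a comparable share of the work.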

Asynchronous multiprocessing

The processing is parallelized to reduce the running time, which would otherwise be extremely long for a complete genome. Breaking the work into chunks and running each chunk in a separate process yields a speedup. Asynchronous processing was chosen because, in this program, each process can work independently of the others, without having to wait for a response.
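
The pattern can be sketched with multiprocessing.Pool.apply_async, which submits every chunk without waiting for the previous one to finish. The worker below is a stand-in that only counts residues in made-up sequences; the real protein worker, which queries Pygeno, is not reproduced here, and the names DUMMY_SEQUENCES, protein_worker and run_async are assumptions.

```python
import multiprocessing as mp

# Made-up sequences keyed by placeholder ids, used instead of Pygeno data.
DUMMY_SEQUENCES = {"P1": "MKV", "P2": "MAAK", "P3": "MK", "P4": "MVVV"}


def protein_worker(protein_ids):
    """Stand-in worker: count the residues of the proteins in one chunk."""
    counts = {}
    for protein_id in protein_ids:
        for residue in DUMMY_SEQUENCES[protein_id]:
            counts[residue] = counts.get(residue, 0) + 1
    return counts


def run_async(chunks, num_processes):
    """Submit every chunk asynchronously, then collect the partial results."""
    with mp.Pool(processes=num_processes) as pool:
        # apply_async returns immediately; the chunks run independently.
        pending = [pool.apply_async(protein_worker, (chunk,)) for chunk in chunks]
        # get() blocks only when the results are finally gathered.
        return [job.get() for job in pending]


if __name__ == "__main__":
    results = run_async([["P1", "P2"], ["P3", "P4"]], num_processes=2)
    print(results)
```

The `if __name__ == "__main__":` guard keeps the pool from being created again when a worker process imports the module.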

When all the processes have finished, the temporary tables created by each process are combined into a final merged table, which is written to a separate output file.
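
The merge step might look like the sketch below, assuming each temporary table maps an amino acid to a count. The table layout is not specified in this README, so both that assumption and the name merge_tables are illustrative only.

```python
from collections import Counter


def merge_tables(partial_tables):
    """Sum per-process count tables into one merged table."""
    merged = Counter()
    for table in partial_tables:
        # Counter.update adds counts instead of overwriting them.
        merged.update(table)
    return dict(merged)


# Two made-up partial tables, one per process.
partials = [{"M": 2, "K": 1}, {"M": 1, "A": 3}]
merged = merge_tables(partials)
print(merged)
```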

To avoid memory issues, since the amount of data to analyze can be substantial, all the temporary tables and the loaded SNPs are deleted.

Testing

The "test" folder contains all the data required to run the tests.py file. This file contains the tests run to assess the correctness of the program, using both unit and integration testing. To run it from the command line:

pytest tests.py

An example

This section clarifies how to use the application. Below is an example run of the program, which can be reproduced with the files contained in the "example" folder.

The vcf file in this folder is a small selection of genome variations on chromosome 21 alone. Such a small file was chosen so that the behavior and the expected results of the program can be understood in a short time. The reference genome in this case is GRCh37.75. To run it from the command line:

python3 main.py --genome GRCh37.75 --vcf_file ./example/chr21_NA20502.vcf

Despite the small size of the example file, the processing will take some time to complete, depending on the power of the user's machine. This is because the program has to scan the entire genome even when the variations are confined to a specific region.
