PMB-project

The aim of this project is to identify the amino acids encoded by a genome sequence.
The results can be used to estimate the amino acid requirements of different populations and to check whether, and how, these correlate with diet.

1000 Genomes Project

The human genomes considered in this study come from the 1000 Genomes Project (https://www.internationalgenome.org/data/).
The project provides a catalogue of common human genetic variation, built from openly consented samples from people who declared themselves to be healthy at the time of collection.
It was the first project to sequence the genomes of a large number of people, and its data were quickly made available to the worldwide scientific community through freely accessible public databases.
In fact, the reference data resources generated by the project are heavily used by the biomedical science community.
The final data set (associated with the third and final phase, dated 2nd May 2013) contains data for 2,504 individuals from 26 populations and 84.4 million variants. Low-coverage and exome sequence data are available for all of these individuals; 24 individuals were also sequenced to high coverage for validation purposes.

Pygeno package

The package chosen to carry out the analysis is Pygeno (https://github.com/tariqdaouda/pyGeno), a Python package for precision medicine and proteogenomics.

VCF files

The data analyzed were provided in VCF (Variant Call Format).
VCF is a format used in bioinformatics for storing gene sequence variations. Only the variations relative to a reference genome are stored. The file is divided into a header, which provides metadata describing the body of the file, and a body, which contains the actual data.
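The header/body split can be illustrated with a minimal sketch in plain Python. It is not the parser used by Pygeno or by this project; the function name split_vcf and the tiny in-memory VCF are made up for illustration. Lines starting with "##" are metadata, the single line starting with "#" names the columns, and the remaining lines are variant records.

```python
import io


def split_vcf(handle):
    """Split a VCF stream into meta lines, column names and data records."""
    meta, columns, records = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Header: metadata describing the body of the file.
            meta.append(line)
        elif line.startswith("#"):
            # Last header line: tab-separated column names.
            columns = line.lstrip("#").split("\t")
        else:
            # Body: one variant per line, relative to the reference genome.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Made-up content, used only to show the structure of the format.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "##reference=GRCh37\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "21\t9411239\t.\tG\tA\t.\tPASS\t.\n"
)

meta, columns, records = split_vcf(io.StringIO(EXAMPLE_VCF))
print(len(meta), columns[:2], records[0]["REF"], records[0]["ALT"])
```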

Keep in mind that VCF files come in many variants and Pygeno supports only some of them.
In our case some problems with the parser occurred, so files generated by PLINK were chosen: they came already filtered and did not require modifying the Pygeno parser. For further information see https://github.com/tariqdaouda/pyGeno.

Installation

To install the application, clone the PMB-project repository and install the dependencies with pip:

git clone https://github.com/Emma-si/PMB-project.git
cd PMB-project
pip install -r requirements.txt

The file requirements.txt lists all the required packages, together with the versions needed to run the program.

Attention: this application is only compatible with Linux and macOS machines.

Usage

Once installed, the program can be run from the command line. The project uses Python 3.

python3 main.py --genome <reference genome> --vcf_file <vcf file path> --num_processes <number of processes>

where:

  • <reference genome> is the identifier of the genome to use as reference.
  • <vcf file path> is the path to the VCF file on the user's computer.
  • <number of processes> is the number of processes to use for multiprocessing (optional; if not specified, it is set automatically according to the available CPUs).
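As an illustration of how these three options could be declared, here is a minimal argparse sketch. It is not the code of main.py: the helper names build_parser and resolve_num_processes are hypothetical, and falling back to os.cpu_count() is an assumption about what "automatic" means.

```python
import argparse
import os


def build_parser():
    """Declare the three command-line options described above."""
    parser = argparse.ArgumentParser(description="Amino acids from a VCF file")
    parser.add_argument("--genome", required=True,
                        help="identifier of the reference genome")
    parser.add_argument("--vcf_file", required=True,
                        help="path to the VCF file")
    parser.add_argument("--num_processes", type=int, default=None,
                        help="number of worker processes (optional)")
    return parser


def resolve_num_processes(requested):
    """Use the requested value, or fall back to the CPU count (assumption)."""
    if requested is None:
        return os.cpu_count() or 1
    if requested < 1:
        raise ValueError("num_processes must be at least 1")
    return requested


# Parse an explicit argument list so the sketch runs without a real shell.
args = build_parser().parse_args(
    ["--genome", "GRCh37.75", "--vcf_file", "./example/chr21_NA20502.vcf"]
)
print(args.genome, args.vcf_file, resolve_num_processes(args.num_processes))
```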

List of available reference genomes

The program only supports reference genomes already included in the Pygeno package list. The list is reported here for clarity and ease of use.

  • Human -> GRCh37.75
  • Human -> GRCh37.75_Y-Only
  • Human -> GRCh38.78
  • Mouse -> GRCm38.78

Structure of the project

After receiving as input the reference genome (one of those in the list above), the path of the VCF file and, optionally, the number of processes, the program compresses the VCF file: to be used by Pygeno, it must be provided as a 'tar.gz' archive.

After the compression, an SNP set is created from the information collected from the original VCF file. An SNP (single nucleotide polymorphism) is a germline substitution of a single nucleotide at a specific position in the genome. Pygeno requires the files to be organized in a specific way: a manifest.ini file must be included in the same archive as the gzipped VCF file. This manifest.ini file contains information about the SNP set and its maintainer, which this program fills in with dummy data.
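The packaging step can be sketched with the standard library alone. This is an illustration, not the project's code: the function name build_snp_archive is hypothetical, and the manifest text is passed in as a placeholder because the sections and keys that Pygeno expects in manifest.ini are not reproduced here.

```python
import gzip
import os
import shutil
import tarfile
import tempfile


def build_snp_archive(vcf_path, manifest_text, archive_path):
    """Write a tar.gz holding manifest.ini and a gzipped copy of the VCF."""
    with tempfile.TemporaryDirectory() as tmp:
        # Gzip the VCF file into a scratch directory.
        gz_name = os.path.basename(vcf_path) + ".gz"
        gz_path = os.path.join(tmp, gz_name)
        with open(vcf_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Write the manifest next to it.
        manifest_path = os.path.join(tmp, "manifest.ini")
        with open(manifest_path, "w") as handle:
            handle.write(manifest_text)
        # Bundle both files at the top level of the archive.
        with tarfile.open(archive_path, "w:gz") as tar:
            tar.add(manifest_path, arcname="manifest.ini")
            tar.add(gz_path, arcname=gz_name)
    return archive_path


# Demo with a throwaway one-line VCF and a placeholder manifest.
workdir = tempfile.mkdtemp()
vcf = os.path.join(workdir, "sample.vcf")
with open(vcf, "w") as handle:
    handle.write("##fileformat=VCFv4.2\n")
archive = build_snp_archive(vcf, "; placeholder manifest\n",
                            os.path.join(workdir, "sample.tar.gz"))
with tarfile.open(archive) as tar:
    names = sorted(tar.getnames())
print(names)
```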

Set the number of processes

The user can pass the number of processes as input. This number should be chosen carefully: too many processes can overload the CPU without bringing any benefit. If the user does not specify it, the program sets it automatically based on the available CPUs.

A pool of processes is initialized to speed up the processing, with each process handling a separate chunk of the list of protein ids to analyze. Each chunk is passed to the protein worker function, which extracts the modified sequences of the proteins in the chunk and filters them by substituting the variations.
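One way to split a list of protein ids into one chunk per process is sketched below. The helper name split_into_chunks and the ids are placeholders, and the project may chunk its list differently.

```python
def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for index in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if index < extra else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


# Placeholder ids, not real protein identifiers.
protein_ids = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
chunks = split_into_chunks(protein_ids, 3)
print(chunks)
```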

Asynchronous multiprocessing

Processing was parallelized to reduce the processing time, which would otherwise be extremely long for a complete genome. The work is broken into chunks and each chunk runs separately, which yields a speedup. Asynchronous processing was chosen because, in this program, each process can work independently of the others, without having to wait for a response.

When all the processes have finished, the temporary tables created by each process are combined into a final merged table, which is available as output in a separate file.

To avoid memory issues, since the amount of data to analyze can be substantial, all the temporary tables and the loaded SNPs are then deleted.
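The submit-then-merge pattern can be sketched with multiprocessing.Pool.apply_async. Everything specific here is an assumption made for illustration: the worker count_residues, the idea that each temporary table is a count of amino acid letters, and the pool_factory parameter, which only exists so that the same function can also run on a thread pool.

```python
from collections import Counter
from multiprocessing import Pool


def count_residues(sequences):
    """Hypothetical worker: count amino acid letters in a chunk of sequences."""
    table = Counter()
    for sequence in sequences:
        table.update(sequence)
    return table


def run_chunks_async(chunks, worker=count_residues, num_processes=None,
                     pool_factory=Pool):
    """Submit every chunk without blocking, then merge the partial tables."""
    with pool_factory(num_processes) as pool:
        # apply_async returns immediately, so all chunks are queued up front.
        pending = [pool.apply_async(worker, (chunk,)) for chunk in chunks]
        # get() blocks until the corresponding chunk has been processed.
        partial_tables = [job.get() for job in pending]
    merged = Counter()
    for table in partial_tables:
        merged.update(table)
    return merged
```

With the default process pool, the call should be made under an `if __name__ == "__main__":` guard on platforms that start workers by spawning a new interpreter, for example `run_chunks_async([["MKV", "AK"], ["MA"]], num_processes=2)`.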

Testing

The "test" folder contains all the data required to run the tests.py file. This file contains the tests used to assess the correctness of the program, using both unit and integration testing. To run it from the command line:

pytest tests.py

An example

This section clarifies how to use the application. Below is an example run of the program, which can be reproduced with the files in the "example" folder.

The VCF file in this folder is a small selection of genome variations on chromosome 21 only. Such a small file was chosen so that the functioning and expected results of the program can be understood in a short time. The reference genome in this case is GRCh37.75. To run it from the command line:

python3 main.py --genome GRCh37.75 --vcf_file ./example/chr21_NA20502.vcf

Despite the small size of the example file, the processing will take some time to complete, depending on the power of the user's machine. This is because, even though the variations are located only in a specific region of the genome, the program needs to scan the genome in its entirety.
