This project forked from knoweng-research/gene_prioritization_pipeline

Network based prioritization of genes-associated-phenotype

KnowEnG's Gene Prioritization Pipeline

This is the Gene Prioritization Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.

This pipeline ranks the rows of a given spreadsheet, whose rows correspond to gene labels and whose columns correspond to sample labels. The ranking is based on correlating the gene expression data (optionally network-smoothed) against phenotype data.
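As a minimal sketch of the idea (not the pipeline's own code), the snippet below ranks genes by the absolute value of their Pearson correlation with a single phenotype. The function name pearson_rank_genes, the toy data, and the choice to sort by absolute value are assumptions made for illustration.

```python
import numpy as np


def pearson_rank_genes(expression, phenotype, gene_names):
    """Rank genes by |Pearson r| against one phenotype (hypothetical helper).

    expression: (n_genes, n_samples) array, one row per gene.
    phenotype:  (n_samples,) array, one value per sample.
    """
    expression = np.asarray(expression, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    # Center each gene row and the phenotype vector.
    x = expression - expression.mean(axis=1, keepdims=True)
    y = phenotype - phenotype.mean()
    denom = np.sqrt((x ** 2).sum(axis=1) * (y ** 2).sum())
    # Genes with zero variance get a correlation of 0 instead of NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(denom > 0, (x @ y) / denom, 0.0)
    # Assumption: strongest association first, regardless of sign.
    order = np.argsort(-np.abs(r), kind="stable")
    return [(gene_names[i], float(r[i])) for i in order]


# Toy spreadsheet: 4 genes (rows) by 4 samples (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],   # gene_A: rises with the phenotype
    [4.0, 3.0, 2.0, 1.5],   # gene_B: falls with the phenotype
    [1.0, 1.0, 1.0, 1.0],   # gene_C: constant, carries no signal
    [2.0, 1.0, 2.0, 1.0],   # gene_D: weak association
]
phenotype = [0.1, 0.2, 0.3, 0.4]
ranked = pearson_rank_genes(expression, phenotype, ["gene_A", "gene_B", "gene_C", "gene_D"])
for gene, r in ranked:
    print(f"{gene}\t{r:+.3f}")
```

Sorting by absolute value treats strongly negative and strongly positive correlations as equally interesting; whether that matches a given analysis is a design choice.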

There are four prioritization methods, using either pearson or t-test as the measure of correlation:

| Options | Method | Parameters |
|---|---|---|
| Simple Correlation | simple correlation | correlation |
| Bootstrap Correlation | bootstrap sampling correlation | bootstrap_correlation |
| Correlation with network regularization | network-based correlation | net_correlation |
| Bootstrap Correlation with network regularization | bootstrapping with network correlation | bootstrap_net_correlation |

Note: each of the methods above uses either Pearson or t-test as its correlation measure.
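The bootstrap variants can be pictured as repeating the correlation on random subsets of the sample columns and combining the per-run rankings. The sketch below is one way to do that under stated assumptions: columns are drawn without replacement, and the runs are combined by averaging ranks. The function names are hypothetical, and the pipeline's actual aggregation rule may differ.

```python
import numpy as np


def abs_pearson_per_gene(expression, phenotype):
    """Absolute Pearson correlation of every gene row with the phenotype."""
    x = expression - expression.mean(axis=1, keepdims=True)
    y = phenotype - phenotype.mean()
    denom = np.sqrt((x ** 2).sum(axis=1) * (y ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(denom > 0, (x @ y) / denom, 0.0)
    return np.abs(r)


def bootstrap_mean_rank(expression, phenotype, number_of_bootstraps=5,
                        cols_sampling_fraction=0.9, seed=0):
    """Average rank of each gene over several column subsamples.

    Rank 0 is the strongest association. Averaging ranks is an assumed
    aggregation rule chosen for this illustration.
    """
    expression = np.asarray(expression, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes, n_samples = expression.shape
    n_keep = max(2, int(round(cols_sampling_fraction * n_samples)))
    rank_sum = np.zeros(n_genes)
    for _ in range(number_of_bootstraps):
        # Assumption: sample columns without replacement.
        cols = rng.choice(n_samples, size=n_keep, replace=False)
        scores = abs_pearson_per_gene(expression[:, cols], phenotype[cols])
        # Double argsort converts scores into ranks (0 = highest score).
        rank_sum += np.argsort(np.argsort(-scores))
    return rank_sum / number_of_bootstraps


# Toy data: gene 0 tracks the phenotype exactly, the others are noise.
toy_rng = np.random.default_rng(1)
phenotype = np.linspace(0.0, 1.0, 10)
expression = toy_rng.normal(size=(4, 10))
expression[0] = 2.0 * phenotype + 1.0
mean_rank = bootstrap_mean_rank(expression, phenotype)
print(mean_rank)
```

The default values mirror the number_of_bootstraps and cols_sampling_fraction entries of the run_parameters file.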


How to run this pipeline with our data


1. Clone the Gene_Prioritization_Pipeline Repo

 git clone https://github.com/KnowEnG/Gene_Prioritization_Pipeline.git

2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.19.1
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage

3. Change directory to Gene_Prioritization_Pipeline

cd Gene_Prioritization_Pipeline

4. Change directory to test

cd test

5. Create a local directory "run_dir" and place all the run files in it

make env_setup

6. Use one of the following "make" commands to select and run a gene prioritization option:

| Command | Option |
|---|---|
| make run_pearson | pearson correlation |
| make run_bootstrap_pearson | bootstrap sampling with pearson correlation |
| make run_net_pearson | pearson correlation with network regularization |
| make run_bootstrap_net_pearson | bootstrap pearson correlation with network regularization |
| make run_t_test | t-test correlation |
| make run_bootstrap_t_test | bootstrap sampling with t-test correlation |
| make run_net_t_test | t-test correlation with network regularization |
| make run_bootstrap_net_t_test | bootstrap t-test correlation with network regularization |

How to run this pipeline with your data


Follow steps 1-3 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_parameters file (YAML Format)

Look for examples of run_parameters in ./Gene_Prioritization_Pipeline/data/run_files/zTEMPLATE_GP_BENCHMARKS.yml

* Modify run_parameters file (YAML Format)

Set the spreadsheet, network and phenotype data file names to point to your data

* Run the Gene Prioritization Pipeline:

  • Update PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
  • Run (in test directory with env_setup as described above)
python3 ../src/gene_prioritization.py -run_directory ./run_dir -run_file zTEMPLATE_GP_BENCHMARKS.yml
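A run_parameters file for your own data could be generated with a few lines of Python, as sketched below. The keys come from the description of the run_parameters file in this README; the /path/to/your/... directories, the file name my_run_parameters.yml, and the helper write_run_file are placeholders invented for this example. The file is written as plain "key: value" lines, which is valid YAML for these simple scalar values; PyYAML's yaml.safe_dump would work as well.

```python
import os
import tempfile

# Placeholder paths: replace them with the location of your own files.
run_parameters = {
    "method": "bootstrap_net_correlation",
    "correlation_measure": "pearson",
    "gg_network_name_full_path": "/path/to/your/network/STRING_experimental_gene_gene.edge",
    "spreadsheet_name_full_path": "/path/to/your/data/CCLE_Expression_ensembl.df",
    "phenotype_name_full_path": "/path/to/your/data/CCLE_drug_ec50_cleaned_NAremoved_pearson.txt",
    "results_directory": "/path/to/your/run_directory/results_directory",
    "number_of_bootstraps": 5,
    "cols_sampling_fraction": 0.9,
    "rwr_max_iterations": 100,
    "rwr_convergence_tolerence": 1.0e-2,  # key spelled as in this README
    "rwr_restart_probability": 0.5,
    "top_beta_of_sort": 100,
    "top_gamma_of_sort": 50,
    "max_cpu": 4,
}


def write_run_file(params, path):
    """Write a flat dictionary as one 'key: value' YAML line per entry."""
    with open(path, "w") as handle:
        for key, value in params.items():
            handle.write(f"{key}: {value}\n")


# A temporary directory stands in for your run directory in this sketch.
run_directory = tempfile.mkdtemp()
run_file = os.path.join(run_directory, "my_run_parameters.yml")
write_run_file(run_parameters, run_file)
print(open(run_file).read())
```

The resulting file name and directory would then be passed to gene_prioritization.py through the -run_file and -run_directory options shown above.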

Description of "run_parameters" file


| Key | Value | Comments |
|---|---|---|
| method | correlation or net_correlation or bootstrap_correlation or bootstrap_net_correlation | Choose gene prioritization method |
| correlation_measure | pearson or t_test | Choose correlation measure method |
| gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4-column network file |
| spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user supplied gene sets |
| phenotype_name_full_path | directory+phenotype_response | Path and file name of user supplied phenotype response file |
| results_directory | directory | Directory to save the output files |
| number_of_bootstraps | 5 | Number of random samplings |
| cols_sampling_fraction | 0.9 | Select 90% of spreadsheet columns |
| rwr_max_iterations | 100 | Maximum number of iterations without convergence in random walk with restart |
| rwr_convergence_tolerence | 1.0e-2 | Frobenius norm tolerance of spreadsheet vector in random walk |
| rwr_restart_probability | 0.5 | alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo |
| top_beta_of_sort | 100 | Number of top genes selected |
| top_gamma_of_sort | 50 | Number of top genes reported |
| max_cpu | 4 | Maximum number of processors to use in the parallel correlation section |

gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = CCLE_Expression_ensembl.df
phenotype_name = CCLE_drug_ec50_cleaned_NAremoved_pearson.txt
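The three rwr_* parameters control a random walk with restart that smooths the spreadsheet over the gene-gene network. The sketch below implements the update V_(n+1) = alpha * N * Vn + (1-alpha) * Vo exactly as written in the table, stopping once the Frobenius norm of the change drops below the tolerance. The function name, the column normalization of the network, and the three-gene toy network are assumptions for illustration, not the pipeline's own code.

```python
import numpy as np


def random_walk_with_restart(network, v0, restart_probability=0.5,
                             max_iterations=100, tolerance=1.0e-2):
    """Iterate V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0.

    network: (n_genes, n_genes) normalized network matrix N.
    v0:      (n_genes, n_samples) starting spreadsheet V_0.
    Returns the smoothed spreadsheet and the number of iterations used.
    """
    alpha = restart_probability
    v0 = np.asarray(v0, dtype=float)
    v = v0.copy()
    for iteration in range(1, max_iterations + 1):
        v_next = alpha * (network @ v) + (1.0 - alpha) * v0
        # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
        if np.linalg.norm(v_next - v) < tolerance:
            return v_next, iteration
        v = v_next
    return v, max_iterations


# Toy network: a path of three genes, g0 - g1 - g2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
# Assumption: normalize each column to sum to 1.
network = adjacency / adjacency.sum(axis=0, keepdims=True)
# One sample in which only g0 carries signal.
v0 = np.array([[1.0], [0.0], [0.0]])
smoothed, iterations = random_walk_with_restart(network, v0)
print(smoothed.ravel(), iterations)
```

With a column-normalized network the total signal in each sample column is preserved, so smoothing only redistributes it toward network neighbors.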


Description of Output files saved in results directory


  • Any method saves separate files per phenotype with name {phenotype}_{method}_{correlation_measure}_{timestamp}_viz.tsv. Genes are sorted in descending order based on visualization_score.

| Response | Gene_ENSEMBL_ID | quantitative_sorting_score | visualization_score | baseline_score |
|---|---|---|---|---|
| phenotype 1 | gene 1 | float | float | float |
| ... | ... | ... | ... | ... |
| phenotype 1 | gene n | float | float | float |
  • Any method saves sorted genes for each phenotype with name ranked_genes_per_phenotype_{method}_{correlation_measure}_{timestamp}_download.tsv.

| Ranking | phenotype 1 | phenotype 2 | ... | phenotype n |
|---|---|---|---|---|
| 1 | gene (most significant) | gene (most significant) | ... | gene (most significant) |
| ... | ... | ... | ... | ... |
| n | gene (least significant) | gene (least significant) | ... | gene (least significant) |
  • Any method saves spreadsheet with top ranked genes per phenotype with name top_genes_per_phenotype_{method}_{correlation_measure}_{timestamp}_download.tsv.

| Genes | phenotype 1 | ... | phenotype n |
|---|---|---|---|
| gene 1 | 1/0 | ... | 1/0 |
| ... | ... | ... | ... |
| gene n | 1/0 | ... | 1/0 |
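To illustrate how the ranked-genes file relates to the top-genes file, the sketch below turns a per-phenotype ranking into the 1/0 membership table described above. The helper name top_gene_indicator, the gene and phenotype labels, and the cutoff top_n are hypothetical; which run parameter sets the cutoff in the real pipeline is not stated here.

```python
def top_gene_indicator(ranked_genes_per_phenotype, top_n):
    """Build a genes-by-phenotypes table of 1/0 flags.

    ranked_genes_per_phenotype: dict mapping a phenotype name to its gene
    list, ordered from most to least significant.
    A gene gets 1 for a phenotype if it is among that phenotype's top_n genes.
    """
    phenotypes = list(ranked_genes_per_phenotype)
    all_genes = sorted({gene
                        for genes in ranked_genes_per_phenotype.values()
                        for gene in genes})
    top_sets = {p: set(ranked_genes_per_phenotype[p][:top_n]) for p in phenotypes}
    return {gene: {p: int(gene in top_sets[p]) for p in phenotypes}
            for gene in all_genes}


# Toy ranking with made-up gene and phenotype labels.
ranked = {
    "drug_A": ["g1", "g3", "g2", "g4"],
    "drug_B": ["g4", "g1", "g3", "g2"],
}
table = top_gene_indicator(ranked, top_n=2)

# Print the table in tab-separated form, one gene per row.
print("Genes\t" + "\t".join(ranked))
for gene, flags in table.items():
    print(gene + "\t" + "\t".join(str(flags[p]) for p in ranked))
```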
