KnowEnG's Gene Set Characterization Pipeline

This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Gene Set Characterization Pipeline.

This pipeline ranks a user supplied gene set against a KnowEnG's gene sets collection.

There are three gene set characterization methods that one can choose from:

Options	Method	Parameters
Fisher exact test	Fisher	fisher
Discriminative Random Walks with Restart	DRaWR	DRaWR
Net Path	Net Path	net_path

How to run this pipeline with Our data

###1. Clone the GeneSet_Characterization_Pipeline Repo

 git clone https://github.com/KnowEnG-Research/GeneSet_Characterization_Pipeline.git

###2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.18.0
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage

###3. Change directory to GeneSet_Characterization_Pipeline

cd GeneSet_Characterization_Pipeline

###4. Change directory to test

cd test

###5. Create a local directory "run_dir" and place all the run files in it

make env_setup

###6. Select and run a gene set characterization option:

Run fisher pipeline

make run_fisher

Run DRaWR pipeline

make run_drawr

Run DRaWR pipeline

make run_netpath

How to run this pipeline with Your data

Follow steps 1-3 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_paramters file (YAML Format)

Look for examples of run_parameters in the GeneSet_Characterization_Pipeline/data/run_files template_run_parameters.yml

* Run the GeneSet Characterization Pipeline:

Update PYTHONPATH enviroment variable

export PYTHONPATH='./src':$PYTHONPATH

python3 ../src/geneset_characterization.py -run_directory ./ -run_file template_net_path.yml

Description of "run_parameters" file

Key	Value	Comments
method	DRaWR or fisher or net_path	Choose DRaWR or fisher or Net Path as the gene set characterization method
pg_network_name_full_path	directory+pg_network_name	Path and file name of the 4 col property file
gg_network_name_full_path	directory+gg_network_name	Path and file name of the 4 col network file(only needed in DRaWR)
spreadsheet_name_full_path	directory+spreadsheet_name	Path and file name of user supplied gene sets
gene_names_map	directory+gene_names_map	Map ENSEMBL names to user specified gene names
results_directory	directory	Directory to save the output files
rwr_max_iterations	500	Maximum number of iterations without convergence in random walk with restart(needed in DRaWR or Net Path)
rwr_convergence_tolerence	0.0001	Frobenius norm tolerence of spreadsheet vector in random walk(needed in DRaWR or Net Path)
rwr_restart_probability	0.5	alpha in `V_(n+1) = alpha * N * Vn + (1-alpha) * Vo` (needed in DRaWR or Net Path)
k_space	100	number of the new space dimensions in SVD(only needed in Net Path)
pg_network_name = kegg_pathway_property_gene.edge
gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
gene_names_map = ProGENI_rwr20_STExp_GDSC_500_MAP.rname.gxc.tsv

Description of Output files saved in results directory

Output files of all three methods save sorted properties for each gene set with name {method}_ranked_by_property{timestamp}.df.

user gene set name1	user gene set name2	...	user gene set name n
property name (string) (most significant)	property name (string) (most significant)	...	property name (string) (most significant)
...	...	...	...
property name (string) (least significant)	property name (string) (least significant)	...	property name (string) (least significant)

Fisher method saves one output file with seven columns and it is sorted in ascending order based on pval. The name of the file is fisher_sorted_by_property_score_{timestamp}.df.

user_gene_set	property_gene_set	pval	universe_count	user_count	property_count	overlap_count
string	string	float	int	int	int	int

DRaWR method saves two output file with five columns and it is sorted in ascending order based on difference_score. The files are DRaWR_sorted_by_gene_score_{timestamp}.df and DRaWR_sorted_by_property_score_{timestamp}.df

user_gene_set	gene_node_id	difference_score	query_score	baseline_score
string	string	float	float	float

user_gene_set	property_gene_set	difference_score	query_score	baseline_score
string	string	float	float	float

Net Path method saves one output file with three columns and it is sorted in ascending order based on cosine_sum. The name of the file is net_path_sorted_by_property_score_{timestamp}.df.

user_gene_set	property_gene_set	cosine_sum
string	string	float

nahilsobh / geneset_characterization_pipeline Goto Github PK

geneset_characterization_pipeline's Introduction

KnowEnG's Gene Set Characterization Pipeline

How to run this pipeline with Our data

How to run this pipeline with Your data

* Create your run directory

* Change directory to the run_directory

* Create your results directory

* Create run_paramters file (YAML Format)

* Run the GeneSet Characterization Pipeline:

Description of "run_parameters" file

Description of Output files saved in results directory

geneset_characterization_pipeline's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent