This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Gene Set Characterization Pipeline.
This pipeline ranks a user-supplied gene set against a KnowEnG gene set collection.
There are three gene set characterization methods that one can choose from:
Options | Method | Parameters |
---|---|---|
Fisher exact test | Fisher | fisher |
Discriminative Random Walks with Restart | DRaWR | DRaWR |
Net Path | Net Path | net_path |
###1. Clone the GeneSet_Characterization_Pipeline Repo
git clone https://github.com/KnowEnG-Research/GeneSet_Characterization_Pipeline.git
###2. Install the following (Ubuntu or Linux)
apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.18.0
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage
###3. Change directory to GeneSet_Characterization_Pipeline
cd GeneSet_Characterization_Pipeline
###4. Change directory to test
cd test
###5. Create a local directory "run_dir" and place all the run files in it
make env_setup
###6. Select and run a gene set characterization option:
- Run fisher pipeline
make run_fisher
- Run DRaWR pipeline
make run_drawr
- Run Net Path pipeline
make run_netpath
Follow steps 1-3 above, then do the following:
mkdir run_directory
cd run_directory
mkdir results_directory
Look for example run parameters in template_run_parameters.yml under GeneSet_Characterization_Pipeline/data/run_files
- Update the PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run
python3 ../src/geneset_characterization.py -run_directory ./ -run_file template_net_path.yml
Key | Value | Comments |
---|---|---|
method | DRaWR or fisher or net_path | Choose DRaWR or fisher or Net Path as the gene set characterization method |
pg_network_name_full_path | directory+pg_network_name | Path and file name of the 4 col property file |
gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4 col network file (only needed in DRaWR) |
spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user supplied gene sets |
gene_names_map | directory+gene_names_map | Map ENSEMBL names to user specified gene names |
results_directory | directory | Directory to save the output files |
rwr_max_iterations | 500 | Maximum number of iterations without convergence in random walk with restart (needed in DRaWR or Net Path) |
rwr_convergence_tolerence | 0.0001 | Frobenius norm tolerance of the spreadsheet vector in random walk (needed in DRaWR or Net Path) |
rwr_restart_probability | 0.5 | alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo (needed in DRaWR or Net Path) |
k_space | 100 | Number of dimensions of the new space in SVD (only needed in Net Path) |
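
The table above implies that each method needs a different subset of keys. The following is a minimal sketch of a pre-flight check, not part of the pipeline itself: `required_keys` and `missing_keys` are hypothetical helper names, and the key groupings follow the "needed in" comments in the table literally.

```python
# Keys the table lists without any "only needed in" restriction.
COMMON_KEYS = {
    "method",
    "pg_network_name_full_path",
    "spreadsheet_name_full_path",
    "gene_names_map",
    "results_directory",
}

# Keys the table marks as "needed in DRaWR or Net Path".
# The spelling "tolerence" is copied from the table on purpose.
RWR_KEYS = {
    "rwr_max_iterations",
    "rwr_convergence_tolerence",
    "rwr_restart_probability",
}


def required_keys(method):
    """Return the run-file keys the table lists for the given method."""
    if method not in ("fisher", "DRaWR", "net_path"):
        raise ValueError("unknown method: %r" % (method,))
    keys = set(COMMON_KEYS)
    if method in ("DRaWR", "net_path"):
        keys |= RWR_KEYS
    if method == "DRaWR":
        keys.add("gg_network_name_full_path")  # table: only needed in DRaWR
    if method == "net_path":
        keys.add("k_space")  # table: only needed in Net Path
    return keys


def missing_keys(run_parameters):
    """Return a sorted list of required keys absent from a parameter dict."""
    return sorted(required_keys(run_parameters["method"]) - set(run_parameters))
```

Calling `missing_keys` on the dictionary loaded from a run file before starting a run would report absent keys early, assuming the file has been parsed into a plain `dict`.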

pg_network_name = kegg_pathway_property_gene.edge

gg_network_name = STRING_experimental_gene_gene.edge

spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv

gene_names_map = ProGENI_rwr20_STExp_GDSC_500_MAP.rname.gxc.tsv
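
To illustrate how the three `rwr_*` parameters interact, here is a minimal pure-Python sketch of the update rule V_(n+1) = alpha * N * Vn + (1-alpha) * Vo from the table. It is not the pipeline's implementation: the function name `random_walk_with_restart`, the list-of-rows matrix layout, and the assumption that N is already normalized are all illustrative choices. For a vector, the Frobenius norm reduces to the Euclidean norm used here.

```python
import math


def random_walk_with_restart(N, v0, alpha=0.5, max_iterations=500, tolerance=1e-4):
    """Iterate V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0.

    N is a square matrix given as a list of rows (assumed already
    normalized), v0 is the restart vector. Returns the final vector and
    the number of iterations performed.
    """
    v = list(v0)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        v_next = [
            alpha * sum(n_ij * v_j for n_ij, v_j in zip(row, v)) + (1 - alpha) * v0_i
            for row, v0_i in zip(N, v0)
        ]
        # Euclidean norm of the change between successive vectors.
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_next, v)))
        v = v_next
        if change < tolerance:
            break  # converged before reaching max_iterations
    return v, iteration


# Two-node toy network in which each node passes all its mass to the other.
toy_network = [[0.0, 1.0], [1.0, 0.0]]
scores, iterations_used = random_walk_with_restart(toy_network, [1.0, 0.0])
```

With the defaults from the table (alpha 0.5, tolerance 0.0001), the toy example stops well before the 500-iteration cap, with the scores close to 2/3 and 1/3.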
- All three methods save the sorted properties for each gene set in a file named {method}_ranked_by_property{timestamp}.df.
user gene set name1 | user gene set name2 | ... | user gene set name n |
---|---|---|---|
property name (string) (most significant) | property name (string) (most significant) | ... | property name (string) (most significant) |
... | ... | ... | ... |
property name (string) (least significant) | property name (string) (least significant) | ... | property name (string) (least significant) |
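
The ranked layout above can be derived from per-pair scores. The sketch below shows one way to do that, assuming the scores arrive as (user gene set, property name, score) tuples and that a smaller score is more significant, as with a p-value. `rank_properties` is a hypothetical helper, not a function from the pipeline.

```python
from collections import defaultdict


def rank_properties(rows, descending=False):
    """Group (user_gene_set, property_name, score) rows by gene set and
    return, for each gene set, its property names ordered by score."""
    by_set = defaultdict(list)
    for gene_set, prop, score in rows:
        by_set[gene_set].append((score, prop))
    return {
        gene_set: [prop for _, prop in sorted(pairs, reverse=descending)]
        for gene_set, pairs in by_set.items()
    }


# Made-up scores, for illustration only.
example_rows = [
    ("set_A", "pathway_1", 0.5),
    ("set_A", "pathway_2", 0.01),
    ("set_B", "pathway_1", 0.2),
]
ranked = rank_properties(example_rows)
```

Each key of `ranked` corresponds to one column of the table, with the most significant property first. Pass `descending=True` for scores where larger means more significant.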
- The Fisher method saves one output file with seven columns, sorted in ascending order by pval. The name of the file is fisher_sorted_by_property_score_{timestamp}.df.
user_gene_set | property_gene_set | pval | universe_count | user_count | property_count | overlap_count |
---|---|---|---|---|---|---|
string | string | float | int | int | int | int |
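
The four count columns are enough to recompute a Fisher-style p-value by hand. The sketch below computes a one-sided (enrichment) p-value as a hypergeometric tail, using only the standard library (`math.comb` requires Python 3.8 or later). Whether the pipeline reports a one-sided or two-sided value is not stated here, so treat the one-sided choice and the function name `fisher_right_tail_pval` as assumptions.

```python
from math import comb


def fisher_right_tail_pval(universe_count, user_count, property_count, overlap_count):
    """Probability of seeing at least overlap_count shared genes when
    user_count genes are drawn at random from a universe of
    universe_count genes, property_count of which carry the property."""
    max_overlap = min(user_count, property_count)
    total = comb(universe_count, user_count)
    tail = sum(
        comb(property_count, k) * comb(universe_count - property_count, user_count - k)
        for k in range(overlap_count, max_overlap + 1)
    )
    return tail / total


# Toy numbers: universe of 10 genes, 3 user genes, 4 property genes, 2 shared.
pval = fisher_right_tail_pval(10, 3, 4, 2)
```

For the toy numbers, the tail sums 40 of the 120 possible draws, giving a p-value of 1/3.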
- The DRaWR method saves two output files, each with five columns and sorted in ascending order by difference_score. The files are DRaWR_sorted_by_gene_score_{timestamp}.df and DRaWR_sorted_by_property_score_{timestamp}.df.
user_gene_set | gene_node_id | difference_score | query_score | baseline_score |
---|---|---|---|---|
string | string | float | float | float |

user_gene_set | property_gene_set | difference_score | query_score | baseline_score |
---|---|---|---|---|
string | string | float | float | float |
- The Net Path method saves one output file with three columns, sorted in ascending order by cosine_sum. The name of the file is net_path_sorted_by_property_score_{timestamp}.df.
user_gene_set | property_gene_set | cosine_sum |
---|---|---|
string | string | float |