Giter Site home page Giter Site logo

geneset_characterization_pipeline's Introduction

KnowEnG's Gene Set Characterization Pipeline

This is the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence, Gene Set Characterization Pipeline.

This pipeline ranks a user supplied gene set against a KnowEnG's gene sets collection.

There are three gene set characterization methods that one can choose from:

Options Method Parameters
Fisher exact test Fisher fisher
Discriminative Random Walks with Restart DRaWR DRaWR
Net Path Net Path net_path

How to run this pipeline with Our data


###1. Clone the GeneSet_Characterization_Pipeline Repo

 git clone https://github.com/KnowEnG-Research/GeneSet_Characterization_Pipeline.git

###2. Install the following (Ubuntu or Linux)

apt-get install -y python3-pip
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install scipy==0.18.0
pip3 install scikit-learn==0.17.1
apt-get install -y libfreetype6-dev libxft-dev
pip3 install matplotlib==1.4.2
pip3 install pyyaml
pip3 install knpackage

###3. Change directory to GeneSet_Characterization_Pipeline

cd GeneSet_Characterization_Pipeline

###4. Change directory to test

cd test

###5. Create a local directory "run_dir" and place all the run files in it

make env_setup

###6. Select and run a gene set characterization option:

  • Run fisher pipeline
make run_fisher
  • Run DRaWR pipeline
make run_drawr
  • Run DRaWR pipeline
make run_netpath

How to run this pipeline with Your data


Follow steps 1-3 above then do the following:

* Create your run directory

mkdir run_directory

* Change directory to the run_directory

cd run_directory

* Create your results directory

mkdir results_directory

* Create run_paramters file (YAML Format)

Look for examples of run_parameters in the GeneSet_Characterization_Pipeline/data/run_files template_run_parameters.yml

* Run the GeneSet Characterization Pipeline:

  • Update PYTHONPATH enviroment variable
export PYTHONPATH='./src':$PYTHONPATH    
  • Run
python3 ../src/geneset_characterization.py -run_directory ./ -run_file template_net_path.yml

Description of "run_parameters" file


Key Value Comments
method DRaWR or fisher or net_path Choose DRaWR or fisher or Net Path as the gene set characterization method
pg_network_name_full_path directory+pg_network_name Path and file name of the 4 col property file
gg_network_name_full_path directory+gg_network_name Path and file name of the 4 col network file(only needed in DRaWR)
spreadsheet_name_full_path directory+spreadsheet_name Path and file name of user supplied gene sets
gene_names_map directory+gene_names_map Map ENSEMBL names to user specified gene names
results_directory directory Directory to save the output files
rwr_max_iterations 500 Maximum number of iterations without convergence in random walk with restart(needed in DRaWR or Net Path)
rwr_convergence_tolerence 0.0001 Frobenius norm tolerence of spreadsheet vector in random walk(needed in DRaWR or Net Path)
rwr_restart_probability 0.5 alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo (needed in DRaWR or Net Path)
k_space 100 number of the new space dimensions in SVD(only needed in Net Path)
pg_network_name = kegg_pathway_property_gene.edge
gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
gene_names_map = ProGENI_rwr20_STExp_GDSC_500_MAP.rname.gxc.tsv

Description of Output files saved in results directory


  • Output files of all three methods save sorted properties for each gene set with name {method}_ranked_by_property{timestamp}.df.
user gene set name1 user gene set name2 ... user gene set name n
property name (string)
(most significant)
property name (string)
(most significant)
... property name (string)
(most significant)
... ... ... ...
property name (string)
(least significant)
property name (string)
(least significant)
... property name (string)
(least significant)
  • Fisher method saves one output file with seven columns and it is sorted in ascending order based on pval. The name of the file is fisher_sorted_by_property_score_{timestamp}.df.
user_gene_set property_gene_set pval universe_count user_count property_count overlap_count
string string float int int int int
  • DRaWR method saves two output file with five columns and it is sorted in ascending order based on difference_score. The files are DRaWR_sorted_by_gene_score_{timestamp}.df and DRaWR_sorted_by_property_score_{timestamp}.df
user_gene_set gene_node_id difference_score query_score baseline_score
string string float float float
user_gene_set property_gene_set difference_score query_score baseline_score
string string float float float
  • Net Path method saves one output file with three columns and it is sorted in ascending order based on cosine_sum. The name of the file is net_path_sorted_by_property_score_{timestamp}.df.
user_gene_set property_gene_set cosine_sum
string string float

geneset_characterization_pipeline's People

Contributors

xichen24 avatar candicegjing avatar nahilsobh avatar schen154 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.