COVID19_secondary_analysis

Secondary analysis of transcriptomes of SARS-CoV-2 infection models to characterize COVID-19

This repository contains the various input files, generated outputs and scripts associated with our paper titled above. These scripts can be used to generate or reproduce the data published in our main text and other supplemental items.

Requirements
All required packages and dependencies (both R and Python) are listed in the requirements.txt and requirements.R files. They can be installed manually, or through a Docker container that quickly sets up an environment containing all the dependencies. If Docker is not installed, see https://docs.docker.com/get-docker/ for installation instructions.

Building a Docker container
We provide a Dockerfile that can be used to build a Docker image containing all the modules and packages needed to run our framework. After cloning this repository, run the command below from the base directory of the project to build the image.

  $ docker build -t mycontainer .

Once the build finishes, the following command runs the image and starts a bash session:

  $ docker run --name mycontainer -it mycontainer bash

To move files to and from the Docker container, use the docker cp command (for example, docker cp mycontainer:<path in container> <path on host>).

We also share the R objects and Cytoscape (https://cytoscape.org/) session files associated with all the figures published in our work. The folder structure of this repository is described below:

input_data/: This directory contains the various files used as inputs to our study
- Count data/ - Raw counts from the three SARS-CoV-2 studies (two in vitro models and one in vivo model) used in our research. In case of the two animal models, the corresponding human orthologs used are also available
- Lung Markers/ - Lung scRNA-seq markers from three different human lung studies utilized in our work.
- SARS-CoV-2 DEGs/ - Individual differentially expressed gene (DEG) lists identified from the three input studies, along with the consensus transcriptomic signature. Also included are the DEGs from nasopharyngeal swabs from human COVID-19 patients (GSE152075) and SARS-CoV-2 human interactants.
- other data/ - Gene-phenotype/trait associations (compressed files) from both GWAS Catalog (https://www.ebi.ac.uk/gwas/) and PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) used in our study.
Scripts/: This directory includes the script files used in our analysis.
- GetConsensus.R - To filter the differentially expressed gene (DEG) lists from each individual study and obtain a consensus transcriptomic signature. It has the following options:
  - --files: Comma-separated list of result files from differential expression analysis from RUVSeq. In this project we use the results from running the DifferentialExpression module in CSBB-v3.0 (https://github.com/praneet1988/Computational-Suite-For-Bioinformaticians-and-Biologists)
  - --org_assemblies (optional): A comma-separated list of Ensembl assembly IDs, one for each input study. Used to identify and map the human ortholog gene symbols for studies with non-human samples. Valid assembly IDs can be found at https://uswest.ensembl.org/info/about/species.html. If not given, all gene symbols are assumed to belong to the same organism.
  - --logFC: A log₂ fold-change threshold for filtering significant DEGs in each individual study (default value = 0.6)
  - --pvalue: A p-value (FDR corrected) threshold for filtering significant DEGs (default value = 0.6)
  - --k: Genes upregulated or downregulated in k or more studies are considered to be part of the consensus signature (default value = 2)
  - --outpath: Path to the output directory where the consensus DEGs will be written to.
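The filtering and voting logic described by these options can be sketched in Python as follows. This is an illustration only, not the code in GetConsensus.R: the function name consensus_signature, the dictionary layout, the gene labels, and all numeric values (including the FDR cutoff of 0.05 passed in the example) are hypothetical.

```python
from collections import Counter

def consensus_signature(studies, pvalue, logfc=0.6, k=2):
    """Return (up, down) gene sets supported by at least k studies.

    studies: list of dicts mapping gene symbol -> (log2 fold change, FDR).
    The input layout is an assumption made for this sketch.
    """
    up_votes, down_votes = Counter(), Counter()
    for results in studies:
        for gene, (lfc, fdr) in results.items():
            if fdr > pvalue or abs(lfc) < logfc:
                continue  # not significant in this study
            if lfc > 0:
                up_votes[gene] += 1
            else:
                down_votes[gene] += 1
    up = {g for g, n in up_votes.items() if n >= k}
    down = {g for g, n in down_votes.items() if n >= k}
    return up, down

# Toy input: three studies with made-up values for placeholder genes.
studies = [
    {"GENE_A": (2.1, 0.001), "GENE_B": (-1.2, 0.01), "GENE_C": (0.3, 0.001)},
    {"GENE_A": (1.4, 0.02), "GENE_B": (-0.9, 0.04), "GENE_C": (1.0, 0.2)},
    {"GENE_A": (0.2, 0.5), "GENE_B": (0.8, 0.01), "GENE_C": (1.1, 0.01)},
]
up, down = consensus_signature(studies, pvalue=0.05)
print("up:", sorted(up), "down:", sorted(down))
```

Counting up- and down-regulation separately means a gene that changes direction between studies does not accumulate votes toward either list.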
- MCL_Clustering.R: To build an interaction network of consensus DEGs and run the Markov clustering (MCL) algorithm to identify perturbed protein modules from a given set of DEGs. The options for this script include:
  - --deg_file: A file containing the consensus DEGs and the virus-host interactome (if used). Must contain one gene per line.
  - --PPI_file (Optional): A tab-delimited file containing the set of human PPI. The latest version of STRING human PPIs can be downloaded from https://string-db.org/cgi/download?species_text=Homo+sapiens. Alternatively, interactions from other sources can also be used. If not given, this script uses the filtered STRING PPIs used in our project.
  - --filter: A condition to filter the PPI links prior to running the clustering algorithm. In this study, we only retained the interactions with a combined_score ≥ 0.9 or experimental_score ≥ 0.7.
  - --inflation_value: Inflation parameter in MCL algorithm (default value = 2.5)
  - --max_iter: Maximum number of iterations for the MCL algorithm (default value = 100)
  - --outpath: Path to the output directory where the file containing the final MCL cluster memberships will be stored.
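As a rough picture of what the --filter, --inflation_value and --max_iter options control, here is a minimal pure-Python sketch of link filtering followed by Markov clustering (alternating expansion and inflation of a column-stochastic matrix). It is not the code in MCL_Clustering.R: the function names, the tuple layout of the links, the protein labels P1 to P6, the scores, and the convergence tolerance are all assumptions made for illustration.

```python
def filter_edges(edges, min_combined=0.9, min_experimental=0.7):
    """Keep links passing either score threshold (scores assumed scaled to 0-1)."""
    return [(a, b) for a, b, combined, experimental in edges
            if combined >= min_combined or experimental >= min_experimental]

def normalize(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def mcl(nodes, links, inflation=2.5, max_iter=100, tol=1e-9):
    """Cluster an undirected, unweighted graph with a basic MCL loop."""
    idx = {g: i for i, g in enumerate(nodes)}
    n = len(nodes)
    # Adjacency matrix with self-loops, made column-stochastic.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in links:
        m[idx[a]][idx[b]] = m[idx[b]][idx[a]] = 1.0
    m = normalize(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        inflated = normalize([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)

# Toy links: (protein A, protein B, combined score, experimental score).
edges = [
    ("P1", "P2", 0.95, 0.10), ("P1", "P3", 0.50, 0.80), ("P2", "P3", 0.99, 0.90),
    ("P3", "P4", 0.40, 0.20),  # fails both thresholds and is dropped
    ("P4", "P5", 0.92, 0.00), ("P4", "P6", 0.91, 0.30), ("P5", "P6", 0.60, 0.75),
]
nodes = ["P1", "P2", "P3", "P4", "P5", "P6"]
modules = mcl(nodes, filter_edges(edges))
print(modules)
```

In this toy network the only link between the two triangles is removed by the filter, so the clustering returns the two triangles as separate modules. Dense list-of-lists matrices are only practical for small examples; a real interactome would need sparse matrices.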
- Marker_enrichments.R: To compute cell type marker enrichments among a given set of candidate gene modules. Supported options for this script are:
  - --marker_file: Text file containing cell type marker genes. It should contain 4 mandatory columns corresponding to the cell type ("cell"), gene marker ("gene"), fold change ("logFC") and the adjusted p-value ("pval_adj").
  - --cluster_file: A two-column, tab-delimited file containing genes (first column) and their corresponding MCL cluster memberships (second column).
  - --min_genes: Minimum number of genes that must be present in a candidate cluster (default value = 5).
  - --outpath: Path to the output directory.
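The statistical test used by Marker_enrichments.R is not stated here. One common choice for this kind of overlap analysis is a one-sided hypergeometric test, sketched below under that assumption. The function names, the dictionary inputs, the gene labels and the background of 20 genes are hypothetical.

```python
from math import comb

def hypergeom_pvalue(overlap, cluster_size, marker_size, universe_size):
    """P(X >= overlap) when cluster_size genes are drawn without replacement
    from a universe that contains marker_size marker genes."""
    upper = min(cluster_size, marker_size)
    total = comb(universe_size, cluster_size)
    tail = sum(comb(marker_size, x) * comb(universe_size - marker_size, cluster_size - x)
               for x in range(overlap, upper + 1))
    return tail / total

def marker_enrichment(clusters, markers, universe, min_genes=5):
    """Test every (cluster, cell type) pair; skip clusters below min_genes."""
    results = {}
    for cluster_id, genes in clusters.items():
        genes = set(genes) & universe
        if len(genes) < min_genes:
            continue
        for cell, marker_genes in markers.items():
            marker_genes = set(marker_genes) & universe
            overlap = len(genes & marker_genes)
            results[(cluster_id, cell)] = hypergeom_pvalue(
                overlap, len(genes), len(marker_genes), len(universe))
    return results

# Toy input with placeholder gene names.
universe = {f"G{i}" for i in range(20)}
clusters = {"1": ["G0", "G1", "G2", "G3", "G4"], "2": ["G5", "G6"]}
markers = {"cell_A": ["G0", "G1", "G2", "G3", "G10"]}
results = marker_enrichment(clusters, markers, universe)
for key, p in results.items():
    print(key, p)
```

Cluster "2" has only two genes and is skipped by the min_genes filter. In practice the resulting p-values would also need a multiple-testing correction across all cluster and cell type pairs.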
- GWAS_enrichments.py: To compute enrichments of phenotypic traits compiled from the NHGRI-EBI GWAS Catalog database (https://www.ebi.ac.uk/gwas/home). Before performing the enrichment analysis, the Experimental Factor Ontology (EFO) tree is parsed to obtain the child terms (and their associations) for each of the GWAS Catalog traits. The EFO OBO file can be found at https://www.ebi.ac.uk/efo/efo.obo, while the latest version of the GWAS Catalog associations is available at https://www.ebi.ac.uk/gwas/docs/file-downloads.
  - --obo_file: Path to the EFO OBO file (.txt).
  - --assoc_file: A tab-delimited file containing the GWAS Catalog associations.
  - --cluster_file: A two-column, tab-delimited file containing genes (first column) and their corresponding MCL cluster memberships (second column).
  - --min_genes: Minimum number of genes that must be present in a candidate cluster (default value = 5).
  - --remove_intergenic: A Boolean flag to indicate whether to remove the intergenic associations.
  - --outpath: Path to the output directory.
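The ontology step described for GWAS_enrichments.py can be pictured with the following sketch, which reads the [Term] stanzas of an OBO file, follows is_a links downward, and pools the genes associated with a trait and all of its descendant terms. It is not the script's own code: the function names, the TOY: identifiers, the term names and the gene sets are made up, and only the id: and is_a: tags of the OBO format are handled.

```python
from collections import defaultdict

def parse_obo(text):
    """Map each parent term ID to the set of its direct children (via is_a)."""
    children = defaultdict(set)
    term_id, in_term = None, False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[4:].strip()
        elif in_term and term_id and line.startswith("is_a: "):
            parent = line[6:].split("!")[0].strip()  # drop the trailing comment
            children[parent].add(term_id)
    return children

def descendants(children, term):
    """Collect every term reachable from `term` through child links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def genes_with_descendants(term, children, assoc):
    """Union of the genes associated with a term and with all its descendants."""
    genes = set(assoc.get(term, ()))
    for child in descendants(children, term):
        genes |= set(assoc.get(child, ()))
    return genes

# Toy ontology and associations (identifiers and genes are invented).
toy_obo = """
[Term]
id: TOY:0001
name: respiratory trait

[Term]
id: TOY:0002
name: lung trait
is_a: TOY:0001 ! respiratory trait

[Term]
id: TOY:0003
name: airway trait
is_a: TOY:0002 ! lung trait

[Typedef]
id: part_of
"""
children = parse_obo(toy_obo)
assoc = {"TOY:0001": {"G1"}, "TOY:0003": {"G2", "G3"}}
pooled = genes_with_descendants("TOY:0001", children, assoc)
print(sorted(descendants(children, "TOY:0001")), sorted(pooled))
```

The pooled gene sets could then be tested against each MCL cluster in the same way as any other gene list.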
- PheGenI_enrichments.R: To compute module enrichments among the NCBI PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) traits. The allowed options include:
  - --assoc_file: A tab-delimited file containing phenotype-genotype associations from NCBI PheGenI.
  - --cluster_file: A two-column, tab-delimited file containing genes (first column) and their corresponding MCL cluster memberships (second column).
  - --min_genes: Minimum number of genes that must be present in a candidate cluster (default value = 5).
  - --p_value: A p-value threshold for filtering associations (default value = 1e-05).
  - --remove_intergenic: A Boolean flag to indicate whether to remove the intergenic associations.
  - --outpath: Path to the output directory.
- Miscellaneous scripts:
  - COVID_enrichments.R - Useful for producing the Supplemental tables from our work.
  - Utils.R - Contains helper functions used in our enrichment analysis script.
  - COVID_benchmarking.R - Randomized trials conducted to test the robustness of individual DEGs and the consensus transcriptome from the three input disease models used in our framework. Also included are the experiments used to validate the level of connectivity observed among the consensus signature along with their interactions with the SARS-CoV-2 virus-host interactants.
  - RUVSeq.R - Can be used to find DEGs from raw transcript counts using RUVSeq and edgeR packages.
RData/ - R objects to reproduce the results from our benchmarking experiments.
Figures_data/ - Contains the Cytoscape session files to generate the network figures (both main and supplemental) presented in our work. These session files also contain the input networks that were used to generate the visualizations.
