COVID19_secondary_analysis

Secondary analysis of transcriptomes of SARS-CoV-2 infection models to characterize COVID-19


This repository contains the various input files, generated outputs and scripts associated with our paper titled above. These scripts can be used to generate or reproduce the data published in our main text and other supplemental items.

Requirements
All required packages and dependencies (both R and Python) are listed in the requirements.txt and requirements.R files. They can be installed manually, or through a Docker container that quickly sets up an environment containing all the dependencies. If Docker is not installed, see https://docs.docker.com/get-docker/ for installation instructions.

Building a Docker container
We provide a Dockerfile that can be used to build a Docker image containing all the modules and packages needed to run our framework. After cloning this repository, run the command below from the base directory of the project to build the image.

  $ docker build -t mycontainer .

Once the build finishes, the following command runs the image and starts a bash session:

  $ docker run --name mycontainer -it mycontainer bash

To move files to and from the Docker container, use the docker cp command.

We also share the R objects and cytoscape (https://cytoscape.org/) session files associated with all the figures published in our work. The folder structure of this repository is described below:

  • input_data/: This directory contains the various files used as inputs to our study

    • Count data/ - Raw counts from the three SARS-CoV-2 studies (two in vitro models and one in vivo model) used in our research. In the case of the two animal models, the corresponding human orthologs used are also available.

    • Lung Markers/ - Lung scRNA-seq markers from three different human lung studies utilized in our work.

    • SARS-CoV-2 DEGs/ - Individual differentially expressed gene (DEG) lists identified from the three input studies, along with the consensus transcriptomic signature. Also included are the DEGs from nasopharyngeal swabs of human COVID-19 patients (GSE152075) and the SARS-CoV-2 human interactants.

    • other data/ - This folder contains gene-phenotype/trait associations (compressed files) from both the GWAS Catalog (https://www.ebi.ac.uk/gwas/) and PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) used in our study.

  • Scripts/: This directory includes the script files used in our analysis.

    • GetConsensus.R - Filters the differentially expressed gene (DEG) lists from each individual study and derives a consensus transcriptomic signature. It has the following options:

      • --files: Comma-separated list of result files from differential expression analysis with RUVSeq. In this project, we use the results from running the DifferentialExpression module in CSBB-v3.0 (https://github.com/praneet1988/Computational-Suite-For-Bioinformaticians-and-Biologists).
      • --org_assemblies (optional): A comma-separated list of Ensembl assembly IDs, one for each input study. Used to identify and map the human ortholog gene symbols for studies with non-human samples. Valid assembly IDs can be found at https://uswest.ensembl.org/info/about/species.html. If not given, all gene symbols are assumed to belong to the same organism.
      • --logFC: A log2 fold change threshold for filtering significant DEGs in each individual study (default value = 0.6)
      • --pvalue: A p-value (FDR corrected) threshold for filtering significant DEGs (default value = 0.6)
      • --k: Genes upregulated or downregulated in k or more studies are considered to be part of the consensus signature (default value = 2)
      • --outpath: Path to the output directory where the consensus DEGs will be written to.
    • MCL_Clustering.R: Builds an interaction network of the consensus DEGs and runs the Markov clustering (MCL) algorithm to identify perturbed protein modules from a given set of DEGs. The options for this script include:

      • --deg_file: A file containing the consensus DEGs and the virus-host interactome (if used). Must contain one gene per line.
      • --PPI_file (optional): A tab-delimited file containing the set of human protein-protein interactions (PPIs). The latest version of the STRING human PPIs can be downloaded from https://string-db.org/cgi/download?species_text=Homo+sapiens. Alternatively, interactions from other sources can also be used. If not given, this script uses the filtered STRING PPIs used in our project.
      • --filter: A condition to filter the PPI links before running the clustering algorithm. In this study, we retained only the interactions with a combined_score ≥ 0.9 or experimental_score ≥ 0.7.
      • --inflation_value: Inflation parameter in MCL algorithm (default value = 2.5)
      • --max_iter: Maximum number of iterations for the MCL algorithm (default value = 100)
      • --outpath: Path to the output directory where the file containing the final MCL cluster memberships will be stored.
    • Marker_enrichments.R: Computes cell type marker enrichments among a given set of candidate gene modules. Supported options for this script are:

      • --marker_file: Text file containing cell type marker genes. It must contain 4 mandatory columns: the cell type ("cell"), the marker gene ("gene"), the fold change ("logFC") and the adjusted p-value ("pval_adj").
      • --cluster_file: A two-column, tab-delimited file containing genes (first column) and their corresponding MCL cluster memberships (second column).
      • --min_genes: Minimum number of genes that must be present in a candidate cluster (default value = 5).
      • --outpath: Path to the output directory.
    • GWAS_enrichments.py: Computes enrichments of phenotypic traits compiled from the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/home). Before the enrichment analysis is performed, the Experimental Factor Ontology (EFO) tree is parsed to obtain the child terms (and their associations) for each GWAS Catalog trait. The EFO OBO file can be found at https://www.ebi.ac.uk/efo/efo.obo, and the latest version of the GWAS Catalog associations is available at https://www.ebi.ac.uk/gwas/docs/file-downloads. The options for this script are:

      • --obo_file: Path to the EFO OBO file (.txt).
      • --assoc_file: A tab-delimited file containing the GWAS Catalog associations.
      • --cluster_file: A two-column, tab-delimited file containing genes (first column) and their corresponding MCL cluster memberships (second column).
      • --min_genes: Minimum number of genes that must be present in a candidate cluster (default value = 5).
      • --remove_intergenic: A Boolean flag indicating whether to remove the intergenic associations.
      • --outpath: Path to the output directory.
    • PheGenI_enrichments.R: Computes module enrichments among the NCBI PheGenI (https://www.ncbi.nlm.nih.gov/gap/phegeni) traits. The allowed options include:

      • --assoc_file: A tab-delimited file containing phenotype-genotype associations from NCBI PheGenI.
      • --cluster_file: A two-column, tab-delimited file containing genes (first column) and their corresponding MCL cluster memberships (second column).
      • --min_genes: Minimum number of genes that must be present in a candidate cluster (default value = 5).
      • --p_value: A p-value threshold for filtering associations (default value = 1e-05).
      • --remove_intergenic: A Boolean flag indicating whether to remove the intergenic associations.
      • --outpath: Path to the output directory.
    • Miscellaneous scripts:

      • COVID_enrichments.R - Produces the Supplemental tables from our work.
      • Utils.R - Contains helper functions used in our enrichment analysis scripts.
      • COVID_benchmarking.R - Randomized trials conducted to test the robustness of the individual DEGs and the consensus transcriptome from the three input disease models used in our framework. Also included are the experiments used to validate the level of connectivity observed among the consensus signature genes, along with their interactions with the SARS-CoV-2 virus-host interactants.
      • RUVSeq.R - Finds DEGs from raw transcript counts using the RUVSeq and edgeR packages.
  • RData/ - R objects to reproduce the results from our benchmarking experiments.

  • Figures_data/ - Contains the Cytoscape session files used to generate the network figures (both main and Supplemental) presented in our work. These session files also contain the input networks that were used to generate the visualizations.
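
The consensus rule behind the --logFC, --pvalue and --k options of GetConsensus.R can be illustrated with a minimal Python sketch. This is not the repository's R implementation. The function name consensus_signature, the input layout (one dict per study mapping a gene symbol to its log2 fold change and FDR), the inclusive threshold comparisons and the toy values are all assumptions made for illustration only.

```python
from collections import Counter


def consensus_signature(studies, *, fdr_cutoff, logfc_cutoff=0.6, k=2):
    """Return (up, down) gene sets supported by at least k studies.

    studies: list of dicts mapping gene symbol -> (log2 fold change, FDR).
    This input layout is hypothetical; it is not the file format that
    GetConsensus.R reads.
    """
    up_votes, down_votes = Counter(), Counter()
    for study in studies:
        for gene, (log2fc, fdr) in study.items():
            if fdr > fdr_cutoff:
                continue  # not significant in this study
            if log2fc >= logfc_cutoff:
                up_votes[gene] += 1      # one "up" vote from this study
            elif log2fc <= -logfc_cutoff:
                down_votes[gene] += 1    # one "down" vote from this study
    # Keep genes that move in the same direction in k or more studies.
    up = {gene for gene, n in up_votes.items() if n >= k}
    down = {gene for gene, n in down_votes.items() if n >= k}
    return up, down


if __name__ == "__main__":
    # Made-up toy values, not results from the paper.
    toy = [
        {"IFIT1": (2.1, 0.001), "ACE2": (-0.9, 0.01), "TNF": (0.7, 0.20)},
        {"IFIT1": (1.4, 0.002), "ACE2": (-1.2, 0.03), "TNF": (1.1, 0.01)},
        {"IFIT1": (0.2, 0.500), "ACE2": (0.8, 0.04), "TNF": (0.9, 0.02)},
    ]
    up, down = consensus_signature(toy, fdr_cutoff=0.05)
    print(sorted(up), sorted(down))
```

Up and down votes are counted separately, so a gene that is upregulated in one study and downregulated in another does not reach the consensus unless a single direction is supported by k or more studies. The FDR cutoff is a required argument in this sketch so that the caller chooses it explicitly.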
