
nardus / isg_composition


Code for the machine learning component of Shaw, Rihn, Mollentze, et al. (2021) "The ‘antiviral state’ has shaped the CpG composition of the vertebrate interferome".

License: GNU General Public License v3.0

R 87.73% Makefile 12.27%


Prediction of interferon stimulated and repressed genes using sequence composition


Code for the machine learning component of Shaw, Rihn, Mollentze, et al. (2021) "The ‘antiviral state’ has shaped the CpG composition of the vertebrate interferome". Raw data for this publication are available from DOIs 10.5525/gla.researchdata.1159 and 10.5281/zenodo.5035606, and will be downloaded automatically as needed by this analysis pipeline.

The aim of this part of the analysis was to investigate to what extent interferon-stimulated genes, interferon-repressed genes, and remaining genes are compositionally distinct, and to identify the most important compositional features allowing them to be distinguished.

Usage

Analyses were run using R version 3.5.1. Required R libraries are managed by packrat. A full run requires approximately 15 GB of disk space and, by default, runs in parallel across 8 threads.

To re-run the entire pipeline, use

Rscript packrat/init.R --args --bootstrap-packrat
make all

Run make help for further details.
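
Because a full run needs roughly 15 GB of disk space, it can be worth checking free space before starting. The snippet below is a minimal sketch of such a check and is not part of the pipeline: the helper name has_free_space is hypothetical, and it assumes a POSIX shell with df and awk available and that it is run from the directory where the pipeline will write its output.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of the repository).
# Succeeds if DIR (default: current directory) has at least NEED_GB gigabytes free.
# Usage: has_free_space NEED_GB [DIR]
has_free_space() {
    need_gb="$1"
    dir="${2:-.}"
    # df -P prints POSIX-format output; on line 2, column 4 is the
    # available space in 1 KB blocks (because of -k).
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# 15 GB is the approximate requirement quoted above.
if has_free_space 15; then
    echo "Enough free space for a full run"
else
    echo "Less than 15 GB free here; a full run may fail" >&2
fi
```

If the check passes, the two commands above (the packrat bootstrap followed by make all) can be run as usual.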

Folder structure

[*] Indicates folders which will be created while running the pipeline

└─isg_composition/
   ├─CalculatedData/...................... [*] Final pre-processed data used to train models
   │ 
   ├─Data/ ............................... Raw data (most files downloaded from the main data  
   │   │                                   repositories as needed)
   │   ├─ISG_identities.csv                Interferome data, from Shaw et al. 2017 (downloaded
   │   │                                   from https://isg.data.cvr.ac.uk)
   │   ├─consistent_irgs.csv               IRGs consistently found across all experiments
   │   ├─MouseHoldout/                     [*] Holdout data describing the mouse interferome
   │   │                                       (from Dölken et al., 2008)
   │   ├─Other_Experiments/                [*] Data from the knockdown- and other experiments
   │   │                                       described in the manuscript
   │   ├─Part1_Dinucs/                     [*] Expression data and dinucleotide biases (see 
   │   │                                       "Data" below)
   │   ├─Part2_Codons/                     [*] Codon and codon-pair biases, etc
   │   └─SelectedGenes/                    [*] Genes analyzed in the manuscript
   │
   ├─Output/ ............................. [*] Generated output files
   │
   ├─packrat/ ............................ Record of R libraries used
   │
   ├─Scripts/ ............................ Main pipeline scripts
   │
   └─Makefile ............................ Record of workflow and dependencies between files

Data

Data will be downloaded as needed while running the pipeline, but can also be obtained using make download_data. Note that this does not extract all data from the manuscript, and most files in the "Other_Experiments" folder are renamed; refer to the original repositories linked above for the canonical versions of these files. The core genome composition data retrieved are split across several files:

  • In Data/Part1_Dinucs/:
    • This folder contains the expression data and dinucleotide composition summaries for the longest transcript of each gene
    • Files named [species]_cds_new_dat_fpkm_dups.txt contain data for coding sequences only
    • Files named [species]_cdna_new_dat_fpkm_dups.txt contain the same calculations across the entire transcript (cDNA)
    • Most files are derived from cds/cdna datasets provided by the Ensembl FTP server
      • The exception is files where [species] is replaced with 'biomart': these cover humans only and contain all human genes, downloaded manually from Ensembl Biomart
  • In Data/Part2_Codons/:
    • Contains codon and codon-pair biases for the same transcripts as above
    • File names have the format [species]_cds_new_cpb_dat_dups.txt
    • Note that there is no corresponding cdna version here, since these features are valid for coding regions only
