A tidyverse-style package for plotting mutational signatures and context counts.

License: MIT License

R 99.10% Rebol 0.06% Makefile 0.17% Python 0.67%
mutational-signatures ggplot2 tidyverse tidy-data cancer-genomics

tidysig

Eric T. Dawson
February 2020

Introduction

tidysig is an R package for plotting mutational signatures / mutational contexts in the tidyverse style. It produces ggplot2 plots of SBS96 and ID83 features which can then be modified with standard ggplot2 layers. It attempts to make plotting signatures simpler by requiring strict formatting and abstracting as much as possible.

Installation

tidysig can be installed with devtools:

library(devtools)
devtools::install_github("edawson/tidysig")

To build from the GitHub source:

git clone --recursive https://github.com/edawson/tidysig
cd tidysig
Rscript scripts/devtools_install.R

If you want compatibility with SignatureAnalyzer, you'll need to install HDF5. If you've installed SignatureAnalyzer locally, this is already on your computer. Otherwise, you can install it on Linux (Debian/Ubuntu):

sudo apt-get install libhdf5-serial-dev

Or on macOS, using Homebrew:

brew install hdf5

The required R packages are listed in DESCRIPTION. Most are standard tidyverse packages (plus cowplot). In addition, hdf5r is needed for SignatureAnalyzer files.

Compatibility

tidysig is currently compatible with SigProfilerExtractor (as of version 1.0.3), with plans to support SignatureAnalyzer.

Nota bene: There is a known incompatibility with some SigProfilerExtractor outputs, whose first column is titled "MutationsType" instead of "MutationType". This can be remedied with the following dplyr::rename call after reading in the file:

library(readr)
library(dplyr)

sigprofiler_results <- read_tsv("sigprofiler_results/SBS96/Suggested_Solution/De_Novo_Solution/De_Novo_Solution_Signatures_SBS96.txt")

## Rename the misspelled first column to the name tidysig expects.
sigprofiler_results <- sigprofiler_results %>%
  rename(MutationType = MutationsType)
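
If the same script has to handle files with either spelling, the rename can be made conditional. The following is a minimal sketch, assuming dplyr 1.0 or later (for any_of() inside rename()). The toy tibble and its SBS96A column are made up for illustration and stand in for a real SigProfilerExtractor file.

```r
library(dplyr)
library(tibble)

## Toy stand-in for a SigProfilerExtractor table with the misspelled column
## (values are made up).
sigprofiler_results <- tibble(
  MutationsType = c("A[C>A]A", "A[C>A]C"),
  SBS96A = c(0.01, 0.02)
)

## any_of() renames the column only if it is present, so files that already
## use "MutationType" pass through unchanged instead of raising an error.
sigprofiler_results <- sigprofiler_results %>%
  rename(any_of(c(MutationType = "MutationsType")))

names(sigprofiler_results)
```

Unlike a plain rename(), this version does not fail when the column is already named correctly.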

Tidy signature representation

Internally, tidysig converts SigProfiler outputs to a tidy data format with four variables for SBS96 signatures and six variables for ID83 signatures.

SBS96 Columns:

Column Name | Description
Signature | The name of the signature or sample.
Change | The genomic change (i.e., T>N or C>N).
Context | The trinucleotide context of the variant.
Amount | The amount, either as raw counts or as a proportion.
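
To make the four SBS96 variables concrete, here is a rough sketch of how a wide SigProfiler-style table could be reshaped into them with tidyr. This is not tidysig's actual implementation (transform_sigprofiler_df does the real work). The toy values, the signature names SBS96A and SBS96B, and the choice to store Context as a plain trinucleotide such as "ACA" are all assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

## Toy wide table: one row per mutation type, one column per signature
## (values are made up).
wide <- tibble(
  MutationType = c("A[C>A]A", "T[T>G]C"),
  SBS96A = c(10, 5),
  SBS96B = c(0, 20)
)

tidy <- wide %>%
  ## One row per (mutation type, signature) pair.
  pivot_longer(-MutationType, names_to = "Signature", values_to = "Amount") %>%
  mutate(
    ## "A[C>A]A" -> "C>A"
    Change = sub("^.\\[(.>.)\\].$", "\\1", MutationType),
    ## "A[C>A]A" -> "ACA" (5' base, reference base, 3' base)
    Context = sub("^(.)\\[(.)>.\\](.)$", "\\1\\2\\3", MutationType)
  ) %>%
  select(Signature, Change, Context, Amount)

tidy
```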

ID83 Columns:

Column Name | Description
Signature | The name of the signature or sample.
Type | INS or DEL (insertion or deletion).
Length | The length of the insertion or deletion.
Motif | The motif surrounding the variant (i.e., within a C/T homopolymer, within a repeat, within microhomology).
Motif Length | The length of the motif (in base pairs or repeat units).
Amount | The amount, either as raw counts or as a proportion.

Usage

Load a SigProfilerExtractor file as input and plot all signatures:

library(tidysig)
library(readr)

sigprofiler_results <- read_tsv("sigprofiler_results/SBS96/Suggested_Solution/De_Novo_Solution/De_Novo_Solution_Signatures_SBS96.txt")

df <- transform_sigprofiler_df(sigprofiler_results)

all_sig_plot <- plot_SBS96_signature(df)

In addition, the resulting plots can be modified:

## Counts can be normalized to proportions using the countsAsProportions argument
all_sig_plot_proportions <- plot_SBS96_signature(df, countsAsProportions=TRUE)

## You can apply the same y-axis limits to all subplots to make comparison between signatures easier.
all_sig_plot_proportions_norm <- plot_SBS96_signature(df, countsAsProportions=TRUE, ylimits=c(0,0.5))

## To plot a specific signature, you can filter using standard dplyr commands.
sig_96A_plot <- plot_SBS96_signature(df %>% dplyr::filter(Signature == "96A"))

## Or, use the %in% operator for multiple signatures:
sig_96A_96B_plot <- plot_SBS96_signature(df %>% dplyr::filter(Signature %in% c("96A", "96B")))

## Plots can be saved using cowplot's save_plot function.
save_plot("all_sigs.pdf", all_sig_plot, base_height = 6, base_asp = 2)

## For single signatures, you can use the save_signature_plot function
save_signature_plot(sig_96A_plot, "sig_96A_plot.pdf")
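
For intuition about countsAsProportions, the sketch below normalizes counts by hand on a tidy table: each Amount is divided by the total for its signature. This is an assumption about what the option amounts to, not tidysig's code, and the toy values are made up.

```r
library(dplyr)
library(tibble)

## Toy tidy table of raw counts (values are made up).
counts <- tibble(
  Signature = c("96A", "96A", "96B", "96B"),
  Change    = c("C>A", "C>T", "C>A", "C>T"),
  Context   = c("ACA", "ACA", "ACA", "ACA"),
  Amount    = c(30, 10, 5, 15)
)

## Divide each count by its signature's total so every signature sums to 1.
proportions <- counts %>%
  group_by(Signature) %>%
  mutate(Amount = Amount / sum(Amount)) %>%
  ungroup()

proportions
```

Normalizing within each signature puts signatures with very different mutation totals on a common scale, which is what makes a shared y-axis limit meaningful.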

You can layer on standard ggplot2 layers. Here's an example where we remove sample names and change the theme to theme_bw():

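library(dplyr)
library(ggplot2)

activ <- transform_sigprofiler_df(
    read_tsv("sigprofiler_results/SBS96/Suggested_Solution/De_Novo_Solution/De_Novo_Solution_Activities_SBS96.txt")
)

## Label each sample "High" or "Low" by its total amount, then facet on that label.
plot_signature_activities(activ %>%
       group_by(Sample) %>%
       mutate(high = ifelse(sum(Amount) > 1000, "High", "Low")),
    countsAsProportions = FALSE,
    showSampleNames = TRUE,
    facetGroupVariable = "high") +
## Add the complete theme first: theme_bw() placed last would reset the axis.text.x setting.
theme_bw() +
theme(axis.text.x = element_blank())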

Prerequisites

You'll need to run SigProfiler to generate the inputs for tidysig. SigProfiler can be installed with pip. Note that I've frozen on the specific versions below; check PyPI for the latest releases if you want to try them.

## if on a compute cluster, run:
## module load python

## Install SigProfiler to user directory using pip:
pip install --user SigProfilerExtractor==1.0.3

## Install SigProfilerMatrixGenerator:
pip install --user SigProfilerMatrixGenerator==1.1.0

The SigProfilerHelper utilities can be used to run SigProfiler from the command line, rather than running it in a Python REPL:

git clone https://github.com/edawson/SigProfilerHelper sigprofilerhelper

You need to first install a reference genome, such as GRCh37 (hg19):

python sigprofilerhelper/install_reference.py -g GRCh37

You can then generate a mutational counts file:

python sigprofilerhelper/generate_matrix -m <maf_file>

This will produce a directory (default name: sigprof_input) which contains the inputs for SigProfilerExtractor. Another helper script can take this as input and produce mutational signatures:

## Run SigProfilerExtractor for an SBS96 counts matrix,
## for 1 to 7 signatures,
## using 16 cores and 1000 iterations
python sigprofilerhelper/run_sigrofiler.py -t sigprof_input/output/SBS/PROJECT.SBS96.all -s 1 -e 7 -i 1000 -c 16

If you're on Biowulf (or another cluster using SLURM), you can write the following wrapper script:

#!/bin/bash
module load python

python sigprofilerhelper/run_sigrofiler.py -t sigprof_input/output/SBS/PROJECT.SBS96.all -s 1 -e 7 -i 1000 -c ${SLURM_CPUS_PER_TASK}

Save this file (as an example, to "run_sigpro.sh") and submit it to a queue like so:

sbatch --cpus-per-task=16 --mem=20g --error=sigpro.err.txt --output=sigpro.out.txt run_sigpro.sh

In a few hours (usually 3-5), you'll get output in a directory called sigpro_results, which will contain the inputs for tidysig.

Citing the R package

You are free to use tidysig under the broadly-permissive MIT license. We ask that you cite it in the following manner:

tidysig: a tidyverse-style package for plotting mutational signatures. https://github.com/edawson/tidysig, Version <VERSION>. Eric T. Dawson. 2020.

tidysig's Issues

Some PCAWG CSVs trigger warnings / errors

When calling transform_sigprofiler_df, some CSV files will trigger warnings like this:

Transforming SigProfiler SBS96 signature / counts file.
Warning messages:
1: Trying to compute distinct() for variables not found in the data:
- `MutationType`
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged. 
2: Trying to compute distinct() for variables not found in the data:
- `MutationType`
This is an error, but only a warning is raised for compatibility reasons.
The operation will return the input unchanged.

This is likely because the variable munging process is still not completely robust to every version of SigProfiler dataframe observed. Hopefully, further testing will catch more of these. The best fix is likely to use something like janitor to automatically relabel the variables.
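
One possible shape for such a relabeling step, sketched in base R rather than with janitor: the helper standardize_mutation_type below is hypothetical (it is not part of tidysig), and the list of accepted spellings is an assumption. It renames a single column whose name looks like a variant of "MutationType" and leaves everything else alone.

```r
## Hypothetical helper (not part of tidysig): rename a column whose name looks
## like a variant of "MutationType" to exactly "MutationType".
standardize_mutation_type <- function(df) {
  ## Compare names case-insensitively, ignoring spaces, dots, underscores, and digits.
  normalized <- tolower(gsub("[^A-Za-z]", "", names(df)))
  hits <- which(normalized %in% c("mutationtype", "mutationstype", "mutationtypes"))
  if (length(hits) == 1) {
    names(df)[hits] <- "MutationType"
  } else {
    ## Zero or several candidates: do not guess, just report and return the input.
    warning("Expected exactly one MutationType-like column; returning input unchanged.")
  }
  df
}

## Toy input with a differently spelled header (values are made up).
raw <- data.frame(`Mutation Types` = "A[C>A]A", SBS1 = 0.5, check.names = FALSE)
fixed <- standardize_mutation_type(raw)
names(fixed)
```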

Function: plot_context

While it's currently possible to plot mutational context proportions, the API for doing so is bad. There should just be a plot_context wrapper to plot_signature().

ylims remove large values

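As is typical for ggplot2, ylim() removes data that fall outside the limits rather than just "zooming" the axes, so values larger than the upper limit disappear from the plot.

To fix this, I need to pass the limits through coord_cartesian(expand=FALSE, ylim=ylimits) rather than through ylim().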
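
The difference can be reproduced outside tidysig with a few lines of plain ggplot2. This is a minimal sketch on made-up data, not the package's plotting code; it assumes a version of the scales package that provides oob_censor() (1.1.0 or later).

```r
library(ggplot2)

## Toy data (made up): one bar exceeds the intended upper limit of 0.5.
toy <- data.frame(Context = c("ACA", "ACC", "ACG"), Amount = c(0.2, 0.8, 0.1))
ylimits <- c(0, 0.5)

base_plot <- ggplot(toy, aes(x = Context, y = Amount)) + geom_col()

## ylim() sets scale limits: the 0.8 value is out of bounds, is turned into NA,
## and its bar is dropped when the plot is drawn.
truncated <- base_plot + ylim(ylimits)

## coord_cartesian() only changes the visible window: all three bars are kept,
## and the tall one is clipped at the top of the panel.
zoomed <- base_plot + coord_cartesian(ylim = ylimits, expand = FALSE)

## The default out-of-bounds rule for continuous scales, applied by hand.
censored <- scales::oob_censor(toy$Amount, range = ylimits)
is.na(censored)
```

Printing truncated and zoomed side by side shows the missing bar in the first plot and the clipped bar in the second.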

Integrate site classification from presig

It would be useful to add the site classification functions from presig to tidysig. This way, users would not have to switch programming languages. The classification algorithm of presig is pretty simple, so there's not a lot of work to accomplish this assuming we can get an easy-to-use VCF reader.
