griffithlab / genvisr

Genome data visualizations

License: Creative Commons Zero v1.0 Universal

R 100.00%

genvisr's Introduction

GenVisR

Please cite: Skidmore et al. 2016, "GenVisR: Genomic Visualizations in R", Bioinformatics 32, 3012-3014 (PubMed)

Bioconductor

Intuitively visualizing and interpreting data from high-throughput genomic technologies continues to be challenging. "Genomic Visualizations in R" (GenVisR) attempts to alleviate this burden by providing highly customizable publication-quality graphics supporting multiple species and focused primarily on a cohort level (i.e., multiple samples/patients). GenVisR attempts to maintain a high degree of flexibility while leveraging the abilities of ggplot2 and bioconductor to achieve this goal.

Install from Bioconductor

For the majority of users we recommend installing GenVisR from the release branch of Bioconductor. Installation instructions using this method can be found on the GenVisR landing page on Bioconductor.

Please note that GenVisR imports a few packages that have "system requirements". In most cases these requirements will already be installed. If they are not, please follow the instructions given in the R terminal to install them. Briefly, the required system libraries are "libcurl4-openssl-dev" and "libxml2-dev".

Development

Development for GenVisR occurs on the Griffith Lab GitHub repository available here. For users wishing to contribute to development, we recommend cloning the GenVisR repo there and submitting a pull request. Please note that development occurs on the R version that will be available at each Bioconductor release cycle. This ensures that GenVisR will be stable for each Bioconductor release, but it may require developers to download R-devel.

We also encourage users to report bugs and suggest enhancements to GenVisR on the GitHub issue page available here.

To install the latest development version of GenVisR (not recommended for most users):

# install and load devtools package
install.packages("devtools")
library(devtools)

# install GenVisR from github
install_github("griffithlab/GenVisR")

Vignettes

Documentation for GenVisR is available as vignettes on the GenVisR landing page on Bioconductor. Tutorials can also be found on biostars.org, and the vignettes can be viewed from within R.

# view vignettes
vignette(package="GenVisR")


genvisr's Issues

gene_plot error detected for CBFB gene

PTEN has only 1 transcript in UCSC; however, three transcripts are being plotted. This might be a bug in the master table creation within this function.

Change Plot track to align vertically

We should change the plot_track function to not only plot labels/tracks horizontally (default) but as an option vertically as well

Cache txdb for given range

We should investigate caching all relevant data from txdb in first call, to speed up subsequent calls to the same region.
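One way this could look is a small in-memory cache keyed by the requested region. The sketch below is only an illustration of the idea, not GenVisR code: `cached_fetch`, `fetch_fun`, and the key format are hypothetical, and the slow txdb query is replaced by an ordinary stand-in function.

```r
# Hypothetical region-keyed cache; all names here are illustrative only.
region_cache <- new.env(parent = emptyenv())

cached_fetch <- function(chr, start, end, fetch_fun) {
  key <- paste(chr, start, end, sep = ":")
  if (!exists(key, envir = region_cache, inherits = FALSE)) {
    # First call for this region: run the expensive query and store the result.
    assign(key, fetch_fun(chr, start, end), envir = region_cache)
  }
  get(key, envir = region_cache, inherits = FALSE)
}

# Stand-in for a slow txdb lookup; it counts how often it is really called.
n_calls <- 0
slow_lookup <- function(chr, start, end) {
  n_calls <<- n_calls + 1
  data.frame(chr = chr, start = start, end = end)
}

first  <- cached_fetch("chr10", 89622195, 89729532, slow_lookup)
second <- cached_fetch("chr10", 89622195, 89729532, slow_lookup)
```

The second call returns the stored result without running `slow_lookup` again. A real implementation would also need to decide when to invalidate the cache, for example when a different txdb object is supplied.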

waterfall hierarchy does not remove duplicate entries

The waterfall_hierarchyTRV function, which is designed to selectively remove mutations based on a hierarchy, does not remove duplicate entries. As an example:
if in =
samp1 MLL3 missense
samp1 MLL3 missense
samp1 MLL3 intronic

out would =
samp1 MLL3 missense
samp1 MLL3 missense

This does not matter for plotting in the main plot however it would affect the mutation recurrence cutoff parameter in theory.

The fix should be to just unique the data frame at the end of the function
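A minimal sketch of the proposed fix on the example above. The column names and the two-level hierarchy are assumptions made for illustration, and the hierarchy step is a simplified stand-in for what waterfall_hierarchyTRV does.

```r
# Toy input mirroring the example above (column names are assumptions).
muts <- data.frame(
  sample = c("samp1", "samp1", "samp1"),
  gene = c("MLL3", "MLL3", "MLL3"),
  trv_type = c("missense", "missense", "intronic"),
  stringsAsFactors = FALSE
)

# Stand-in for the hierarchy step: keep only the top-ranked type per sample/gene.
hierarchy <- c("missense", "intronic")
muts$rank <- match(muts$trv_type, hierarchy)
best <- ave(muts$rank, muts$sample, muts$gene, FUN = min)
kept <- muts[muts$rank == best, c("sample", "gene", "trv_type")]

# Proposed fix: drop exact duplicate rows before returning.
deduped <- unique(kept)
```

Here `kept` still holds the two identical missense rows, while `deduped` holds one, which is what a recurrence count per gene would need.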

`NULL` in `transform` errors out

NULL is not usable as input for the transform attribute in genCov.

Point binning

We should introduce a means of binning points (like in a coverage plot) to a reasonable size, given the parameters of the resulting plot.
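As a rough sketch of what binning could mean here, the example below averages single-base coverage into a fixed number of equal-width bins. The function name `bin_coverage`, the column names, and the choice of 100 bins are assumptions; in practice the bin count might be derived from the width of the plotting device.

```r
# Toy single-base coverage; the values and bin count are arbitrary.
cov <- data.frame(position = 1:1000, depth = rep(c(10, 30), each = 500))

bin_coverage <- function(x, n_bins) {
  # Assign each position to one of n_bins equal-width bins.
  bin <- cut(x$position, breaks = n_bins, labels = FALSE)
  # Summarise each bin by its mean position and mean depth.
  data.frame(
    position = as.numeric(tapply(x$position, bin, mean)),
    depth = as.numeric(tapply(x$depth, bin, mean))
  )
}

binned <- bin_coverage(cov, n_bins = 100)
```

The mean is only one possible summary. A median or maximum per bin may be preferable if narrow coverage spikes need to stay visible.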

X axis numbers

They're meaningless. We should give them meaning.

Intron ribbons

We need to highlight when we are viewing a compressed region.

Gene buffers

We need a mechanism for identifying distinct genes, and keeping a minimum gene.buffer distance between them.

Intron buffer

We need to remove the large intron before/after the gene in the gene_plot data frame. Consider adding a specified intron buffer on either end of the data frame (default 1kb?)

Cosmic Track significantly increases runtime

The Cosmic track in the lolliplot significantly increases run time due to biomaRt inefficiencies. Another methodology should be used to address this issue.

When specifying isoforms in genCov plots no longer align

The new feature in genCov is causing plots not to align; specifically, the x limits for the plots are NaN. This is related to the log transforms occurring. @ahwagner do you have any ideas?

allow user to select isoform in gencov function

Often a user may only be interested in select isoforms. We should allow the user to select which isoforms to display.

Currently the only options that exist are to display all isoforms or to reduce into a summary view.

Add calculate gender parameter

LOH plots should reflect whether or not there are two 'X' chromosomes. Default status should be to not calculate 'X', unless user specifies to.

mutSpec

When plotting clinical colors, it's currently difficult to differentiate between variables. For example, if I wanted ER status to be (positive=blue) and HER2 status to be (positive=red), I'd have to rename the variable values to something like HER2_positive, ER_positive, HER2_negative, etc. An alternative solution would be for the user to input a clinical data frame of colors (rather than variable values).

After opening a PDF graphics device, calling mutSpec multiple times only produces a single page of a plot. I'm guessing the layers from different mutSpec calls are getting put on top of each other.

You may want to rename the mutRecur.layers and main.layers. When I was trying to add ggtitle as a layer to the entire plot, I assumed it would be added to main.layers. But adding it to main.layers plotted nothing (ggtitle gets overridden internally there).

When drop_mutation=T, the mutation type colors can change depending on which types are present for a given set of samples. I think it would make more sense to have a static set of default colors that the user can alter manually.

NA in master table when specifying a genomic region with 1 or 0 UTR/CDS segments

waterfall doesn't align when number of genes == 1

The waterfall plot "internal objects" do not align with the bar chart when genes == 1 and main.grid == T. This is due to how the grid is constructed.

Add option to specify samples to plot in mutSpec

This will help when there are samples with no mutations in the viewed space but also if you want to only plot a subset of the samples in your input file.

Set up lolliplot gene track to be proportional to plotting space

Currently lolliplot will plot lollis on top of each other, continually shrinking the gene track as lollis are placed. This should be altered such that the gene track always takes up x% of the vertical plotting space.

Install requires RMySQL

Hi there,

When I tried to install using the instructions in README.md, I got the following error:

> devtools::install_github("griffithlab/GenVisR")
Downloading GitHub repo griffithlab/GenVisR@master
Installing GenVisR
Installing 1 packages: FField
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore CMD INSTALL  \
  '/private/var/folders/zz/zyxvpxvq6csfxvn_n00004c0000130/T/RtmpRvZbaZ/devtools73056e75bea7/griffithlab-GenVisR-17ba59b'  \
  --library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library' --install-tests 

* installing *source* package ‘GenVisR’ ...
** R
** data
*** moving datasets to lazyload DB
** tests
** preparing package for lazy loading
Creating a generic function for ‘nchar’ from package ‘base’ in package ‘S4Vectors’
Warning in .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "externalRefMethod" of class "expressionORfunction"; definition not updated
Warning in .recacheSubclasses(def@className, def, doSubclasses, env) :
  undefined subclass "externalRefMethod" of class "functionORNULL"; definition not updated
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  there is no package called ‘RMySQL’
ERROR: lazy loading failed for package ‘GenVisR’
* removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/GenVisR’
Error: Command failed (1)

After running install.packages('RMySQL') and re-running devtools::install_github("griffithlab/GenVisR"), the install completed successfully.

Reduce functionality broken in geneViz

Somewhere along the way the reduce functionality was broken (i.e. currently the function errors out if reduce is set to TRUE):

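library(GenVisR)

# transcript annotations and reference genome for hg19
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19

# region to plot (chr10:89622195-89729532, + strand)
gr <- GRanges(seqnames=c("chr10"), ranges=IRanges(start=c(89622195), end=c(89729532)), strand=strand(c("+")))

# errors out when reduce=TRUE
geneViz(txdb, gr, genome, reduce=TRUE)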

I think this occurred with the addition of txnames in the data frame.
@ahwagner can you take a look at this when you get a chance?

Add the option to apply a # of genes cutoff to mutSpec plot

This will allow for cutting the figure at a reasonable level when the difference between recurrence cutoffs is large (one shows too few genes while the other shows too many for a pretty plot).

GenCov additional data quality filters needed

After looking over this code some more, I feel additional quality checks would be useful to limit potential problems encountered by users. Specifically:

If a user specifies an ambiguous strand in the GRanges object, we should grab features for both strands. Currently ambiguous strands are not supported.

We should check that an isoform specified by the user actually exists, and return an error if it does not. Currently the code errors out with an uninformative message.
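The isoform check could be as simple as comparing the requested names against those found in the region and stopping with a clear message. This is a sketch only: `check_isoforms`, its arguments, and the isoform names are hypothetical placeholders, not part of genCov.

```r
# Hypothetical helper; the argument names are illustrative, not GenVisR's.
check_isoforms <- function(requested, available) {
  missing_iso <- setdiff(requested, available)
  if (length(missing_iso) > 0) {
    stop("Requested isoform(s) not found in the specified region: ",
         paste(missing_iso, collapse = ", "), call. = FALSE)
  }
  invisible(requested)
}

# Placeholder isoform names for the demonstration.
available <- c("isoformA", "isoformB")
check_isoforms("isoformA", available)  # passes silently

# A missing isoform now produces an informative message.
res <- tryCatch(check_isoforms("isoformZ", available),
                error = function(e) conditionMessage(e))
```

Running the check before any plotting code means the user sees which name was wrong rather than a downstream failure.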

consistent color palettes

Currently the MAF and the MGI file inputs have different color palettes, which is confusing when switching between them (e.g., Nonsense mutations are grey in the MAF format while Silent mutations are grey in the MGI format).
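One possible approach is a single named color vector shared by both input formats, indexed by mutation type. The mutation types and hex colors below are invented for illustration and are not GenVisR's defaults; `colors_for` is a hypothetical helper.

```r
# Illustrative mutation types and colors; not GenVisR's actual defaults.
mutation_palette <- c(
  nonsense = "#4f00A8",
  missense = "#59CF74",
  silent = "#A9A9A9"
)

colors_for <- function(types, palette = mutation_palette) {
  unknown <- setdiff(types, names(palette))
  if (length(unknown) > 0) {
    stop("No color defined for: ", paste(unknown, collapse = ", "))
  }
  # Index by name so each type keeps its color regardless of what is present.
  palette[types]
}

maf_colors <- colors_for(c("silent", "missense"))
mgi_colors <- colors_for(c("nonsense", "silent"))
```

Because the lookup is by name, "silent" gets the same color in both calls. A named vector like this can be passed as the `values` argument of ggplot2's `scale_fill_manual()`.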

In the waterfall plot add the option to display clinical information

in the plotting space below the main plot, add in the option to display clinical information from a user imported data frame and align to the main plot

Get GenVisR vignette to work with the vignette() function

It should be possible to get the GenVisR vignette to show in the list of available vignettes when one does vignette() in an R session after loading the GenVisR library.

Then this will open the PDF:
vignette("GenVisR")

Cleanup Namespace

Currently the package depends on "UniProt.ws". Putting this package in the Imports field instead of Depends causes lolliplot to fail.

It would be good to eliminate the dependency if possible, more research is needed to determine the cause of this.

Lolliplot should have a parameter for hard cutoffs when stacking points

Currently there is no limit on points stacking on one another, which would make the graphic unreadable given a large degree of stacking (100+). A parameter should be created to cap this behavior. Further, gene height should be set proportional to the graphics device to avoid inconsistent gene heights between graphics.

example figure for genCov draws UTRs inconsistently

Notice that in the vignette example for genCov the UTRs for the different isoforms look odd. In the top isoform there is just one UTR feature plotted in the middle of the gene. In the second isoform the UTRs are plotted on top of, and overlapping with, the coding portion of the first and last exon. In the third isoform the UTRs look conventional (as expected).

rename mutSpec function?

The mutation landscape function is currently named mutSpec. This could be confusing as it makes me think of mutation spectrum which this plot is not and for which we kind of have another function (TvTi).

Add coercion of factors entered in the "genes" argument of mutSpec

I got the following error when using a factor for my gene list

mutSpec(tNHL_variants, file_type="MGI",label_x=T,rmv_silent=T,genes=three$gene_name)
Error: Aesthetics must either be length one, or the same length as the dataProblems:gene, trv_type

However, this worked.

mutSpec(tNHL_variants, file_type="MGI",label_x=T,rmv_silent=T,genes=as.character(three$gene_name))
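A small illustration of why factors need explicit handling and how the coercion could be done at the top of the function. `normalize_genes` is a hypothetical helper, and this sketch does not claim to reproduce the exact cause of the error above.

```r
genes <- factor(c("TP53", "MLL3", "PTEN"))

# as.integer() on a factor returns level codes, not the labels.
codes <- as.integer(genes)

# Coercing on entry makes factor and character input equivalent.
normalize_genes <- function(x) {
  if (is.factor(x)) x <- as.character(x)
  x
}

clean <- normalize_genes(genes)
```

With this in place, `genes=three$gene_name` and `genes=as.character(three$gene_name)` would reach the rest of the function as the same character vector.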

Add in "User Intron" space

In gene_plot.R, we should provide some mechanism for leaving specified intronic regions uncompressed if the transformIntronic flag is set to true.

Does conversion of c.notation to p.notation really work as described?

In the vignette docs it is stated "It is recommended for amino acid change to be in p. notation however lolliplot will attempt to convert from c. notation to p.notation by subtracting the 5’ UTR transcript length from the c. coordinate, when employing this functionality the user must specify an ensembl data set via the ensembl.dataset parameter." Isn't the c.notation at the cDNA/transcript level? Converting to p.notation would require more than just subtracting the UTR length.
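For context, HGVS c. coordinates are numbered from the A of the ATG start codon, so for a plain coding position the codon number is the c. position divided by three, rounded up. The sketch below covers only that simple case and is not lolliplot's implementation: `c_to_codon` is a hypothetical helper, and intronic offsets (c.123+4), UTR positions (c.-12, c.*5), and ranges are returned as NA rather than converted.

```r
# Hypothetical helper: map a simple c. position to its codon (amino acid) number.
c_to_codon <- function(c_notation) {
  # Accept only plain coding positions such as "c.35G>A"; anything with an
  # intronic offset, UTR prefix, or range is flagged as not simple.
  simple <- grepl("^c\\.[0-9]+([^0-9+*_-].*)?$", c_notation)
  pos <- suppressWarnings(as.numeric(sub("^c\\.([0-9]+).*$", "\\1", c_notation)))
  # Three coding bases per codon, so round the quotient up.
  ifelse(simple, ceiling(pos / 3), NA_real_)
}

codons <- c_to_codon(c("c.35G>A", "c.1A>G", "c.123+4G>T", "c.-12C>T"))
```

This gives the affected residue number only. Producing a full p. annotation would additionally require the transcript sequence to determine the reference and altered amino acids.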

number of samples in mutSpec is incorrect

During the various optional subsets of data in mutSpec, samples have the potential to be removed from the levels of the data frames the function uses. When plotting the title, the levels of x$sample are grabbed to plot n=x.

This functionality used to work fine but now if a sample is plotted with nothing (i.e. NA) it is not counted toward the number of levels resulting in a lower than expected number.

Summary: Everything is still plotted as it should be but the way n is calculated for the title needs to be fixed.
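One possible way to keep n stable is to fix the sample levels before any subsetting and count those levels for the title. The data and column names below are toy assumptions, not mutSpec internals.

```r
# Toy data: samp3 has no plotted mutation (NA) and samp2 was filtered out earlier.
all_samples <- c("samp1", "samp2", "samp3")
plotted <- data.frame(
  sample = factor(c("samp1", "samp3"), levels = all_samples),
  trv_type = c("missense", NA)
)

# Counting only samples that carry a mutation undercounts the cohort.
n_with_mutation <- length(unique(plotted$sample[!is.na(plotted$trv_type)]))

# Counting levels that were fixed before subsetting keeps every sample.
n_title <- nlevels(plotted$sample)
plot_title <- paste0("n=", n_title)
```

This only works if nothing downstream calls `droplevels()` or rebuilds the factor, so the levels would need to be set once and carried through.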

Set X-axis limits on coverage plot

Currently X-axis limits are inferred from the coverage input file for the coverage plot and from the user-defined GRanges object for the gene plot. The function expects the GRanges object to match the coverage in terms of range.

This should be changed: we should set the x-axis limits in ggplot for both plots based on the GRanges object the user defines.

Bug when specifying a small genomic range

When specifying a small range in a genomic range object the coverage plot produces an error:

Error in grid.Call.graphics(L_raster, x$raster, x$x, x$y, x$width, x$height, :
Empty raster

To reproduce, specify a GRanges object for input as follows:
gr <- GRanges(seqnames=c("chr16"), ranges=IRanges(start=c(67063051), end=c(67063191)), strand=strand(c("+")))

File Format Conversion Function

I think it would be helpful to have a function (or series of functions) that convert from one file format to another. For example, from a MAF file to the long data frame format that a lot of genviz functions take. Maybe also something handy like a MAF to VCF (and vice versa) converter for offline use. There are probably other file formats too that could be included.

Warnings: In loop_apply(n, do.ply) : Stacking not well defined when ymin != 0

Happens on the gene samples with mutation subplot of the waterfall plot, it occurs because the x axis is reversed, i.e. negative becomes positive, a necessity for matching the main plot.

It would be good to get rid of this warning somehow: either suppress it, or the warning might go away if the code is modified to use stat='identity'.
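If suppression is the route taken, the warning could be muffled selectively so that other warnings still reach the user. This is a sketch in base R: `noisy_plot` is a stand-in for the plotting call, `muffle_stacking` is a hypothetical wrapper, and the outer handler exists only so the demo can record what gets through.

```r
# Stand-in for the plotting call that emits the warning.
noisy_plot <- function() {
  warning("Stacking not well defined when ymin != 0")
  warning("some other problem")
  "plot"
}

# Hypothetical wrapper: silence only the known, harmless warning.
muffle_stacking <- function(expr) {
  withCallingHandlers(expr, warning = function(w) {
    if (grepl("Stacking not well defined", conditionMessage(w), fixed = TRUE)) {
      invokeRestart("muffleWarning")
    }
  })
}

# For the demo, capture whatever warnings escape the wrapper.
escaped <- character(0)
result <- withCallingHandlers(
  muffle_stacking(noisy_plot()),
  warning = function(w) {
    escaped <<- c(escaped, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Matching on the message text is brittle if the upstream wording changes, so switching to stat='identity' may still be the cleaner fix if it removes the warning at the source.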

Error: Error: Results must be all atomic, or all data frames

In the mutations_heatmap function, in the subfunction plot_bar, an error occurs when processing large data sets.

For example, running the code with a recurrence cutoff of 20 seems to work fine; however, a recurrence cutoff of 0 will produce the error. This needs to be looked into further (test file is the bcla maf file).

Add some visualizations of what you get from GenVisR right to the README.md

A lot of people will quickly glance at a GitHub repo to see if it's something they are interested in. Maybe we should have some images right there so that people can get a visual sense of what it is all about without getting into the vignette.

add strand information to genCov plot

We should add arrows to denote strand for the gene features in the genCov plot (similar to UCSC). This would require grid and the arrow parameter in geom_segment(), a new data frame will have to be created to accomplish this.

Transition/Transversion ratio add on

From a quick literature search the Transition/Transversion ratio can significantly vary not only between species but also on the type of data, WGS/Exome/Mitochondrial etc.

Given this I propose letting the user add in the rates as a pre-defined data frame structure if that is something the user wishes to plot.

Support for additional file_types in waterfall plot

Currently only TGI annotation files and MAF version 2.4 are supported by the waterfall plot. It might be worthwhile to support additional file types (VCF, older MAF versions, etc.)

This would mean adding code in the following functions:
hiearchial_remove_trv_type.R
mutation_heatmap.R
plot_heatmap.R

Option to reduce resolution of genCov

We should have an option to reduce the resolution of genCov by x%. For most cases it is not necessary to plot at single base resolution.

At the least this would help speed up vignette creation

Lolliplot fetch domain function inoperable

The biomaRt query used in lolliplot.fetchdomain no longer works. I have checked, and the code has not changed since its inception in Feb.

Either the biomaRt query structure for the interProd-1 database has changed or biomaRt functionality is broken.

I believe it will be possible in the interim to change the biomaRt query and restore functionality using the ensembl mart. However, it would be best to somehow store protein domain information within GenVisR, or set up a server to hold this information, at least for H. sapiens.

Bug in lolliplot: disconnect between amino acid change and protein

Currently the protein and its domains are plotted in amino acid coordinates. A discrepancy would occur if someone supplied c. nomenclature instead of p. in the amino_acid_change column (i.e. gave the coding DNA sequence location instead of the amino acid sequence location).

This needs to be corrected by requiring p. notation or converting everything into it.

Alternatively, switch everything to c. notation and require that.

Functions to change: mutationObs.R and possibly cosmicObs.R