tidymodels / corrr

Explore correlations in R

License: Other

corrr's Introduction

corrr

corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse.

[Figure: the correlation data frame and the primary corrr functions]

You can install:

the latest released version from CRAN with

install.packages("corrr")

the latest development version from GitHub with

# install.packages("remotes") 
remotes::install_github("tidymodels/corrr")

Using corrr

Using corrr typically starts with correlate(), which acts like the base correlation function cor(). It differs by defaulting to pairwise deletion, and returning a correlation data frame (cor_df) of the following structure:

A tbl with an additional class, cor_df
An extra “term” column
Standardized variances (the matrix diagonal) set to missing values (NA) so they can be ignored.
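
As a quick check of the structure described above, here is a minimal sketch using the built-in mtcars data. The quiet = TRUE argument only suppresses the informational message; the name of the first column assumes corrr 0.4.3 or later.

```r
library(corrr)

# quiet = TRUE suppresses the message about method and missing-value handling
x <- correlate(mtcars, quiet = TRUE)

class(x)        # includes "cor_df" alongside the tibble classes
names(x)[1]     # the extra "term" column (corrr >= 0.4.3)

# The diagonal is set to NA by default
diag_values <- diag(as.matrix(x[, -1]))
all(is.na(diag_values))
```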

API

The corrr API is designed with data pipelines in mind (e.g., to use %>% from the magrittr package). After correlate(), the primary corrr functions take a cor_df as their first argument, and return a cor_df or tbl (or output like a plot). These functions serve one of three purposes:

Internal changes (cor_df out):

shave() the upper or lower triangle (set to NA).
rearrange() the columns and rows based on correlation strengths.

Reshape structure (tbl or cor_df out):

focus() on select columns and rows.
stretch() into a long format.

Output/visualizations (console/plot out):

fashion() the correlations for pretty printing.
rplot() the correlations with shapes in place of the values.
network_plot() the correlations in a network.
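
The functions listed above can be tried one at a time before chaining them. The following is a small sketch on mtcars; the object names are only illustrative.

```r
library(corrr)

x <- correlate(mtcars, quiet = TRUE)

# Internal changes: still a cor_df
shaved <- shave(x)        # upper triangle set to NA
sorted <- rearrange(x)    # columns and rows reordered by correlation strength

# Reshape structure: a tbl
focused <- focus(x, mpg, hp)            # keep mpg and hp as columns, drop them from the rows
long <- stretch(shaved, na.rm = TRUE)   # one row per remaining pair: columns x, y, r

# Output: formatted for printing
fashion(shaved)
```

Because shave() leaves only the lower triangle, stretching with na.rm = TRUE returns each pair of variables once.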

Databases and Spark

The correlate() function also works with database tables. The function will automatically push the calculations of the correlations to the database, collect the results in R, and return the cor_df object. This allows those results to integrate with the rest of the corrr API.
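
A minimal sketch of the database workflow, assuming the DBI, RSQLite and dbplyr packages are installed. An in-memory SQLite database stands in for a real backend, and the table name is arbitrary. The use = "complete.obs" argument is passed explicitly on the assumption that the database method supports only complete observations.

```r
library(corrr)
library(dplyr)

# Assumption: an in-memory SQLite database is used purely for illustration
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db_mtcars <- copy_to(con, mtcars[, 1:4], name = "mtcars_small")

# The sums needed for the correlations are computed in the database;
# only the results are collected into R as a cor_df
db_cor <- correlate(db_mtcars, use = "complete.obs")

DBI::dbDisconnect(con)

db_cor
```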

Examples

library(MASS)
library(corrr)
set.seed(1)

# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))

# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA

# Correlate
x <- correlate(d)
class(x)
#> [1] "cor_df"     "tbl_df"     "tbl"        "data.frame"
x
#> # A tibble: 6 × 7
#>   term        v1       v2       v3       v4       v5      v6
#>   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
#> 1 v1    NA        0.684    0.716    0.00187 -0.00769 -0.0237
#> 2 v2     0.684   NA        0.702   -0.0248   0.00495 -0.0161
#> 3 v3     0.716    0.702   NA       -0.00171  0.0205  -0.0566
#> 4 v4     0.00187 -0.0248  -0.00171 NA        0.452    0.442 
#> 5 v5    -0.00769  0.00495  0.0205   0.452   NA        0.424 
#> 6 v6    -0.0237  -0.0161  -0.0566   0.442    0.424   NA

NOTE: Prior to corrr 0.4.3, the first column of a cor_df data frame was named “rowname”. As of corrr 0.4.3, the name of this first column changed to “term”.

As a tbl, we can use functions from data frame packages like dplyr, tidyr, ggplot2:

library(dplyr)

# Filter rows by correlation size
x %>% filter(v1 > .6)
#> # A tibble: 2 × 7
#>   term     v1     v2     v3       v4      v5      v6
#>   <chr> <dbl>  <dbl>  <dbl>    <dbl>   <dbl>   <dbl>
#> 1 v2    0.684 NA      0.702 -0.0248  0.00495 -0.0161
#> 2 v3    0.716  0.702 NA     -0.00171 0.0205  -0.0566

corrr functions work in pipelines (cor_df in; cor_df or tbl out):

x <- datasets::mtcars %>%
       correlate() %>%    # Create correlation data frame (cor_df)
       focus(-cyl, -vs, mirror = TRUE) %>%  # Focus on cor_df without 'cyl' and 'vs'
       rearrange() %>%  # rearrange by correlations
       shave() # Shave off the upper triangle for a clean result
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
       
fashion(x)
#>   term  mpg drat   am gear qsec carb   hp   wt disp
#> 1  mpg                                             
#> 2 drat  .68                                        
#> 3   am  .60  .71                                   
#> 4 gear  .48  .70  .79                              
#> 5 qsec  .42  .09 -.23 -.21                         
#> 6 carb -.55 -.09  .06  .27 -.66                    
#> 7   hp -.78 -.45 -.24 -.13 -.71  .75               
#> 8   wt -.87 -.71 -.69 -.58 -.17  .43  .66          
#> 9 disp -.85 -.71 -.59 -.56 -.43  .39  .79  .89
rplot(x)

datasets::airquality %>% 
  correlate() %>% 
  network_plot(min_cor = .2)
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
If you think you have encountered a bug, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.

corrr's Issues

Allow empty argument for rgather()

Substitute an empty argument to rgather() with everything()

to create/add a cor_df class?

My "correlation" is not generated by correlate(). It is a pairwise summary of intersection. I can convert it to tibble. But it is still not accepted by rearrange or rplot.
Is there any way I can add a cor_df class to my "correlation"? It seems that this is the only addition correlate() makes.
Thank you in advance.
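
One possible approach, not necessarily the maintainers' answer, is the package's as_cordf() function, which coerces a square matrix or data frame to a cor_df. In this sketch, cor() output stands in for the custom pairwise summary.

```r
library(corrr)

# Stand-in for any square, symmetric pairwise summary with column names
m <- cor(mtcars[, 1:4])

# Coerce to a correlation data frame; the diagonal is set to NA by default
x <- as_cordf(m)
class(x)

# The result is accepted by functions that expect a cor_df
rearrange(x)
```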

p-values with p.adjust corrections

The bigger the correlation matrix, the greater the chance that many of the correlations that make such a pretty pattern don't differ significantly from zero. It would be nice to see just a matrix of p-values, and to use them as color-coding for the correlations themselves. Thanks for considering it!
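
Until something like this exists in the package, adjusted p-values can be computed with base R. This is a rough sketch using stats::cor.test() and stats::p.adjust(); the Holm method and the column subset are arbitrary choices for illustration.

```r
d <- mtcars[, 1:5]

# All unique pairs of variable names, one pair per row
pairs <- t(combn(names(d), 2))

# Raw p-value of the Pearson correlation test for each pair
p_raw <- apply(pairs, 1, function(p) stats::cor.test(d[[p[1]]], d[[p[2]]])$p.value)

# Correct for multiple comparisons
p_adj <- stats::p.adjust(p_raw, method = "holm")

result <- data.frame(x = pairs[, 1], y = pairs[, 2], p = p_raw, p_adj = p_adj)
result
```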

New version of / alternative to `rplot`

Below is code for a similar plot to rplot, but making use of geom_tile and geom_text:

library(corrr)
library(dplyr)
library(ggplot2)

rs <- mtcars %>% correlate() %>% rearrange(absolute = FALSE)

# Column order after rearranging ("term" was named "rowname" before corrr 0.4.3)
order <- rs %>% 
  select(-term) %>% 
  colnames()

rs %>% 
  stretch(na.rm = TRUE) %>%
  # Fix the axis order to match the rearranged correlations
  mutate(x = factor(x, levels = order), y = factor(y, levels = order)) %>% 
  ggplot(aes(x, y, fill = r)) +
    geom_tile() +
    geom_text(aes(label = as.character(fashion(r))), color = "white", size = 3) +
    scale_fill_gradientn(colors = c("darkred", "firebrick2", "goldenrod2", "gray", "springgreen2", "dodgerblue2", "darkblue")) +
    # Apply the complete theme first so it does not reset the rotated labels
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = NULL, y = NULL)

With some tweaking, it can be applied to a cor_df after focus():

# Assumes corrr, dplyr and ggplot2 are already attached
library(tidyr)  # for gather()

rs <- mtcars %>% 
  correlate() %>% 
  focus(mpg:hp)

# Seriate rows and columns ("term" was named "rowname" before corrr 0.4.3)
order <- rs %>% 
  select(-term) %>% 
  as.matrix() %>%
  #abs() %>% 
  seriation::seriate()

row_order <- rs$term[seriation::get_order(order, dim = 1)]
col_order <- colnames(rs[-1])[seriation::get_order(order, dim = 2)]

rs %>% 
  gather(colname, r, -term) %>% 
  mutate(term = factor(term, levels = row_order),
         colname = factor(colname, levels = col_order)) %>% 
  ggplot(aes(term, colname, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(corrr::fashion(r))), color = "white", size = 3) +
  scale_fill_gradientn(colors = c("darkred", "firebrick2", "goldenrod2", "gray", "springgreen2", "dodgerblue2", "darkblue"))

Create standard evaluation for focus()

Feature request: arithmetic with correlation data frames

Showing which correlations change between treatments would be of great use. Is there a way to subtract one correlation data frame of one treatment with the other treatment, and still be in a object class that corrr can plot?
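
One way this might be done today is to go through matrices and back. The sketch below uses as_matrix() and as_cordf() from corrr; splitting mtcars by the am column is only a stand-in for two treatments.

```r
library(corrr)

# Correlations within each "treatment" (here: automatic vs. manual cars)
automatic <- correlate(mtcars[mtcars$am == 0, 1:5], quiet = TRUE)
manual <- correlate(mtcars[mtcars$am == 1, 1:5], quiet = TRUE)

# Subtract as plain matrices; both have the same variables in the same order
diff_mat <- as_matrix(automatic, diagonal = 1) - as_matrix(manual, diagonal = 1)

# Convert the differences back to a cor_df so corrr functions accept it
diff_df <- as_cordf(diff_mat)
diff_df
```

The result holds differences rather than correlations, so values can fall outside the usual range of -1 to 1. Plots such as rplot(diff_df) should be read with that in mind.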

Feature Request: New function that returns covariance matrix

Great package. Something to consider is a function similar to correlate() that would return the covariance matrix. This will greatly assist with financial analysis.
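
As a possible workaround, base R's cov() returns the full covariance matrix, and corrr's colpair_map() (available since corrr 0.4.3, as far as I know) applies an arbitrary two-column function to every pair of columns. A small sketch:

```r
library(corrr)

d <- mtcars[, 1:4]

# Base R: full covariance matrix, variances on the diagonal
cov_mat <- stats::cov(d)

# corrr: pairwise covariances in cor_df form; the diagonal is NA,
# so the variances are not included
cov_df <- colpair_map(d, stats::cov)
cov_df
```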

Adding `cor_df` method for `fashion`

After reading the docs of fashion I understand that it's not entirely made to work with a cor_df object and that's nice because you don't depend on it. However, there's no actual method for cor_df.

Perhaps it would be more consistent to add also fashion.cor_df that performs the same thing but preserves the cor_df and tbl classes for nicer printing and consistency within the pipeline.

library(corrr)
#> Loading required package: dplyr
#> Warning: package 'dplyr' was built under R version 3.4.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
d <- correlate(mtcars)
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'

class(d)
#> [1] "cor_df"     "tbl_df"     "tbl"        "data.frame"

d %>% fashion() %>% class()
#> [1] "data.frame" "noquote"

.default methods for plotting functions.

The plotting methods (rplot and network_plot) require default methods.

They need to either use as_cordf and then dispatch to the .cordf methods, or do the plotting with something like a matrix, with the .cordf methods converting to a matrix first.

Straight lines in the network plot

Hi Simon,
Just wondering if it would be possible to have the option in the future to make the connecting lines straight instead of round in the network plot.

why is R-3.3.1 required (vs. 3.3.0)?

Are there any 3.3.1-specific features that corrr requires?

`focus_if` as method to focus on variables that meet certain criteria.

Hi,

I am trying to use dplyr with corrr. I opened a question on Stack Overflow and was advised to come here for an answer or to report a bug (to be honest, I'm not sure if it is a bug).

I have a data frame of correlations and am trying to show only the correlations above 10%. I then want to plot this using the corrr package.

I take the correlation of my data set, then filter to where the absolute value is > .1, but it fails on the network plot segment:

Error in UseMethod("network_plot") : no applicable method for 'network_plot' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

A reproducible example is below. Thank you for your time and for investing in such a useful package

library(tidyverse)
library(corrr)

# Create the Dataframe
mydf <- data.frame(a=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   b=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   c=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   d=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   e=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   f=sample(rnorm(n = 100,sd = 15),replace=TRUE))


test <- mydf %>% 
  correlate(method = "spearman") %>% 
  gather("n", "corr", 2:7, na.rm = TRUE) %>% 
  filter(abs(corr) > 0.1) %>% 
  spread(rowname,corr) %>%
  network_plot(legend = TRUE)
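
The error arises because gather(), filter() and spread() return a plain tibble, which network_plot() does not accept. One alternative, sketched here with mtcars as stand-in data, is to keep the cor_df intact and let the min_cor argument of network_plot() hide the weaker correlations.

```r
library(corrr)

# Keep the cor_df class and filter inside the plotting function instead
p <- mtcars %>%
  correlate(quiet = TRUE) %>%
  network_plot(min_cor = 0.3)   # only draw paths where abs(r) exceeds 0.3

p
```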

Integrate plotting functions

As the number of plotting functions grows, this can become arbitrary and confusing.

Given that ggplot2 is used to create them, perhaps there is a way to unify them within a plot function that specifies the geoms and other params. This might work as the plots often share parameters such as colour scales, use of legends, etc.

If it's not possible to get it to this level in terms of the API for users, it might still be a good thing to implement under the hood.

rplot() prints error

When using rplot(), the following error (with times adjusted) is printed repeatedly:

Jul 20 15:51:57 rsession[3832] : Error: this application, or a library it uses, has passed an invalid numeric value (NaN, or not-a-number) to CoreGraphics API and this value is being ignored. Please fix this problem.

Change docs to reference correlation data frames (cor_df)

Currently, docs refer to a correlate() correlation matrix. Rather, describe this as a correlation data frame (cor_df).

Add `graph_from_cordf` function

ggraph package is now on CRAN and will be excellent for plotting networks of correlations.

There is a routine operation to convert a cor_df to a graph suitable for plotting (example that prunes correlations with abs value less than .3):

cor_df %>%
  shave() %>% 
  stretch(na.rm = TRUE) %>%
  filter(abs(r) > .3) %>% 
  igraph::graph_from_data_frame()  # Note requirement of igraph package

The resulting object can be fed directly to ggraph.

This could be a unique function like graph_from_cordf, or an extension of the igraph S3 method graph_from_data_frame. I'd lean towards the former so as not to create unexpected behaviour for the igraph operation.
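
A rough sketch of what such a function might look like. The name graph_from_cordf and the min_cor argument are hypothetical, taken from this proposal rather than from the package, and the igraph package is assumed to be installed.

```r
library(corrr)
library(dplyr)

# Hypothetical helper: convert a cor_df to an undirected igraph object,
# keeping only pairs whose absolute correlation exceeds min_cor
graph_from_cordf <- function(x, min_cor = 0.3) {
  x %>%
    shave() %>%
    stretch(na.rm = TRUE) %>%
    filter(abs(r) > min_cor) %>%
    igraph::graph_from_data_frame(directed = FALSE)
}

g <- graph_from_cordf(correlate(mtcars, quiet = TRUE))
g
```

The r column is carried along as an edge attribute, so it stays available for mapping to edge colour or width when plotting.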

Rearrange function files

There are three main types of functions:

Those that alter internal information but retain square matrix structure, and thus the cor_df class (e.g., rarrange, shave)
Those that reshape the square matrix structure (e.g., rselect, rgather)
Those used for generating output (e.g., rplot)

Each of these sets of functions should be placed in separate and appropriately named R files.

Create function to handle rowname

A generic method is required to execute a function over a cor_df in which the column rowname is removed, function executed, and then rowname appended back. This is becoming a routine task, thus need for function.
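
A minimal sketch of such a helper, written against the current column name term instead of rowname. The name over_cors is hypothetical, and the sketch assumes the supplied function returns a square numeric matrix with the same column names.

```r
library(corrr)

# Hypothetical helper: strip the term column, apply f to the numeric
# matrix of correlations, then rebuild a cor_df from the result
over_cors <- function(x, f, ...) {
  cors <- as.matrix(x[, colnames(x) != "term"])
  as_cordf(f(cors, ...))
}

x <- correlate(mtcars[, 1:5], quiet = TRUE)

abs_x <- over_cors(x, abs)                    # absolute correlations
rounded_x <- over_cors(x, round, digits = 1)  # extra arguments are passed on to f
```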

Feature: cor.test results

Results of cor.test() are considered useful by many (e.g., p-values). However, at present, there's no foreseeable way to incorporate this information. One thought is to add this information as attributes to a cor_df.

Either way, here's some example code that tidies the output of cor.test(), which could replace the use of cor in correlate() if used correctly:

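library(dplyr)

d <- mtcars

# All unique pairs of variable names, one pair per row
var_pairs <- t(combn(names(d), 2)) %>%
  as.data.frame(stringsAsFactors = FALSE) %>% 
  setNames(c("x", "y"))

# Run cor.test() on each pair and tidy the results into one row per pair
var_pairs %>% 
  dplyr::mutate(r.test = purrr::map2(x, y, ~ stats::cor.test(d[[.x]], d[[.y]])),
                r.test = purrr::map(r.test, broom::tidy)) %>%
  tidyr::unnest(r.test)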

`group_by_factors`

group_by_factors could be a boolean argument added to correlate(). If group_by_factors == TRUE, then correlate() is run separately for each grouping created by any factor variables present in the data set. Basically, it's a shorthand function to implement the tidyr::nest() method described here.

Few notes for implementing:

If TRUE, it would return a grouped data frame. Will need ways to apply other corrr functions to such an object.
By default (group_by_factors == FALSE), correlate() will need to keep() only numeric variables. This is a good idea anyway. If any variables (factors, character, etc) are dropped, this should be printed out as a warning. When group_by_factors == TRUE, non-numeric variables are used by nest() first, leaving only numeric for the correlations.

Conceived after this tweet.

r-to-Z (and back) for correlation means

Your corrr package is really handy and for getting means of correlations, I'd love to see a function that does the r-to-Z transform, gets the mean, and then transforms it back. Thanks for considering it!
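
In the meantime, the transform is short enough to write in base R: Fisher's r-to-Z transform is atanh() and its inverse is tanh(). The helper name mean_r below is only illustrative.

```r
# Average correlations on Fisher's Z scale, then transform back to r
mean_r <- function(r) {
  z <- atanh(r)               # r-to-Z
  tanh(mean(z, na.rm = TRUE)) # mean on the Z scale, then Z-to-r
}

mean_r(c(0.3, 0.5, 0.7))
```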

Dealing with missing values

Great package! Just curious: you chose to make use = pairwise.complete.obs the default for the correlate function. That approach is not without problems, see e.g. B.W. Lewis' examples here. Doesn't choosing pairwise.complete.obs as the default increase the risk that users won't even notice that they might have a problem? (Or that they might have NA values at all?)
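
To see how much the choice matters for a given data set, the two settings can be compared directly. A small sketch with the built-in airquality data, which has missing values in two of its columns:

```r
library(corrr)

# How many values are missing in each column
colSums(is.na(airquality))

# Default: each pair uses all rows where both variables are observed
pairwise <- correlate(airquality, quiet = TRUE)

# Alternative: only rows with no missing values in any column are used
complete <- correlate(airquality, use = "complete.obs", quiet = TRUE)

# Even a pair without missing values (Wind and Temp) differs,
# because complete.obs drops rows that are incomplete elsewhere
pairwise$Temp[pairwise$term == "Wind"]
complete$Temp[complete$term == "Wind"]
```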

Guides show when rplot() legend = TRUE

When legend = TRUE, rplot() produces a size legend and also tags the colour scale with r. This is not the case in network_plot().

Doesn't work (dev version of ggplot?)

Doesn't work for me. I have installed the latest development version of ggplot2 and all my ggplot extensions stopped working. The developer of ggplot2 thinks this is tough luck for the developers of those extensions.

airquality %>% correlate() %>% network_plot()
Warning: Ignoring unknown aesthetics: x, y
Error in zero_range(from) : x must be length 1 or 2

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] corrr_0.2.1 dplyr_0.5.0

loaded via a namespace (and not attached):
 [1] ggrepel_0.5        Rcpp_0.12.7        assertthat_0.1     grid_3.3.1         R6_2.2.0           plyr_1.8.4         gtable_0.2.0       DBI_0.5-1          magrittr_1.5       scales_0.4.0       ggplot2_2.1.0.9001 lazyeval_0.2.0    
[13] labeling_0.3       tools_3.3.1        munsell_0.4.3      colorspace_1.2-6   tibble_1.2

BTW - have you seen the DescTools::PlotWeb function?

Rotating x-axis labels in rplot()

Hello and thanks for this cool package!

Following your 2017 lightning talk:

library(corrr)  # also provides the %>% pipe

corrr::correlate(mtcars[, 1:4], method = "pearson") %>%
  corrr::rearrange(absolute = FALSE) %>%           # rearrange cols and rows
  corrr::shave() %>%                               # shave upper triangle
  corrr::rplot(print_cor = TRUE, legend = TRUE)    # dot plot

I get this graph. Now I want to do it on a larger dataset (59 columns), with long names (10-15 characters).

I did not find a way to rotate X-axis labels, so that the column names do not overlap and hide each other. Is it possible?

thanks,
Abel
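
Since rplot() is built on ggplot2, one likely workaround is to add a theme() layer to the returned plot. A sketch assuming ggplot2 is attached; the angle is an arbitrary choice.

```r
library(corrr)
library(ggplot2)

p <- correlate(mtcars, quiet = TRUE) %>%
  rearrange(absolute = FALSE) %>%
  shave() %>%
  rplot(print_cor = TRUE, legend = TRUE)

# Rotate the x-axis labels so long column names do not overlap
p <- p + theme(axis.text.x = element_text(angle = 60, hjust = 1))
p
```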

Allow shave() to be called after fashion()?

Thanks for a great package. At this time, the following works:

mtcars %>% 
    dplyr::select(hp, wt, mpg) %>% 
    correlate() %>% 
    shave() %>% 
    fashion()

But changing the order of shave() and fashion():

mtcars %>% 
    dplyr::select(hp, wt, mpg) %>% 
    correlate() %>% 
    fashion() %>% 
    shave()

causes the following error to print:

Error in UseMethod("shave") : 
  no applicable method for 'shave' applied to an object of class "c('data.frame', 'noquote')"

Would it be sensible to make a change to allow shave() to be called after fashion()?

Improvements to network_plot()

ggraph (https://github.com/thomasp85/ggraph) looks promising as a package for visualising the correlations. For an example, see: http://varianceexplained.org/r/stacksurveyr/

This might be used to replace current approach to network_plot(), as well as open new plotting possibilities.

However, at present, ggraph is in development and relies on a development version of ggplot2. This issue should be investigated once there has been a stable release.

GeomRepel break with ggplot2 2.2.0

network_plot gives error after updating ggplot2 to 2.2.0:

produces an error message:     Error: GeomTextRepel was built with an incompatible version of ggproto.
Please reinstall the package that provides this extension.

fashion() for all tbl

Currently, fashion() only handles a cor_df, but this is an unnecessary restriction. Instead, let it handle any tbl, in which it fashions numeric values and missing values. Return the result as same class as input object.

How to custom color nodes in network_plot()?

I have a data set, where the data comes in different categories. These categories I would like to indicate as the color of the nodes. The categories I store in a numeric vector AnnotVector.

I tried:

corrr::correlate(DATA, method = "pearson") %>% network_plot()
# Works without colors.

# Then
# COLOR = AnnotVector # Use once works
COLOR = 'darkblue'
corrr::correlate(DATA, method = "pearson") %>% network_plot() + geom_point(color=COLOR)
# no coloring

Create support for matrix class

If someone wants to submit something other than a cor_df (e.g., a square matrix) to any relevant functions, create default conditions that attempt to coerce the object to a cor_df (with new function as_cor_df), and then run the function.

Network_plot in README has legend despite code not specifying it

Issue: dplyr mutate(), filter() and various other funs coerce "cor_df" to "tbl_df"

Just a heads up. I don't know if this is something you will want to consider, but if you try to filter() or mutate() objects with the "cor_df" class they will be coerced to "tbl_df". Here's the link that describes the issue and the resolution:

tidyverse/dplyr#719

Add legend to rplot() and network_plot()

Both plot functions can offer a boolean argument (legend = TRUE/FALSE) that determines whether to display a legend for the size of the correlations.

My initial preference is to have this set as FALSE by default, so as not to distract from the immediate visual information.

Error-information from corrr-package

Dr. Simon Jackson,

First – thanks for the nice R package “corrr” you have made. When I used version 0.2.1, I ran into a problem. I’m not sure whether it is a bug or I’m doing something wrong. I’m confronted with the following problem:

I used the following code:

rdf <- correlate(newRANG, method = "spearman")
network_plot(rdf, legend = TRUE, colours = c("slategrey", "palegreen"))

When I tried to visualize the relations, I got the following indication that something is wrong:

Error in calcCurveGrob(x, x$debug) : end points must not be identical

Note! I did not run into any problem when I used, method = "pearson", i.e. the plot was made correctly.
I hope you have time to look into the problem. It will be of great help.

Kind regards,
githubtobben

Why is the color scale opposite to intuition?

For many people (including me) blue is associated with negative values and red with positives. This is for instance reflected in the temperature scale:

Why is corrr using an opposite scale, where positive correlation is indicated by blue and negative correlation with red? Is it just coincidence or was it picked on purpose?

To support my argument the pals package for instance suggests a coolwarm scale as an efficient scale, going from blue (small/negative value) to red (high values). See https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html

Especially since the legend is not shown by default (which is the wrong default imho), colors should match intuition.
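
For anyone who prefers the blue-to-red direction, the colours argument of rplot() can be used to override the default scale. A brief sketch; the specific colour names are arbitrary.

```r
library(corrr)

x <- shave(correlate(mtcars, quiet = TRUE))

# Low (negative), mid and high (positive) colours, in that order
p <- rplot(x, colours = c("dodgerblue3", "white", "firebrick3"))
p
```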

rplot not working with plain tibbles

I find it odd that this code below is not working:

datasets::mtcars %>%
correlate() %>%
filter(mpg > .1) %>%
rplot

Error in UseMethod("rplot") :
no applicable method for 'rplot' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

Can we have an option to keep the leading zeros in fashion()?

Just a TRUE/FALSE argument would do it, what do you think?
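
Current versions of fashion() appear to offer this through a leading_zeros argument. A short sketch, assuming that argument is available in the installed version:

```r
library(corrr)

x <- shave(correlate(mtcars[, 1:4], quiet = TRUE))

fashion(x)                         # default: leading zeros dropped
fashion(x, leading_zeros = TRUE)   # leading zeros kept

# Also works on a plain numeric value
half <- fashion(0.5, leading_zeros = TRUE)
```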

Create tests for fashion()

See commit: 53daf0a

R Version

Just wanted to report that all tests and vignettes work fine in R version 3.1.2. While I see that is not sufficient motivation to reduce the tried and true required version number, there are people out there who cannot update their R versions for a variety of reasons. So if you would be willing to do that I am sure you will appease some. A workaround is to fork and change the required version number.

Make colours adjustable in rplot() and network_plot()

Add an argument that allows the colours to be adjusted in the plotting functions.

In rplot(), this could be a 3-element vector of the low, mid and high colours. E.g., c("red", "white", "blue").

In network_plot(), this could either be a 2-element vector of colours for negative and positive, or adjust code to accommodate a 3-element vector just like suggestion above for rplot()

Remove mirror from rgather, but create relevant functions

Setting upper and lower triangles can be independently handled by alternative functions (e.g., na_upper, na_lower). These would be useful for plotting too.

convert na_x functions to a single shave()

instead of na_upper, have shave(upper = TRUE).
Will centralise the functionality

Border around circles as an option?

Example:

rplot(shave(correlate(mtcars)), legend = TRUE)

See x=drat and y=carb, it is invisible, of course it is almost transparent as there is no correlation. But it would be nice to have (as an option) a black circle around it to help with focusing on that spot?

Something like this:

rplot(shave(correlate(mtcars)), legend = TRUE, border = TRUE, borderCol = "black")

Thanks to @jcberny for the suggestion.