corrr

corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse. This, along with the primary corrr functions, is represented below:

You can install:

  • the latest released version from CRAN with
install.packages("corrr")
  • the latest development version from GitHub with
# install.packages("remotes") 
remotes::install_github("tidymodels/corrr")

Using corrr

Using corrr typically starts with correlate(), which acts like the base correlation function cor(). It differs by defaulting to pairwise deletion, and returning a correlation data frame (cor_df) of the following structure:

  • A tbl with an additional class, cor_df
  • An extra “term” column
  • Standardized variances (the matrix diagonal) set to missing values (NA) so they can be ignored.
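As a rough illustration of those differences, the sketch below runs cor() and correlate() on the same three mtcars columns. The column selection and the use of quiet = TRUE are choices made here for brevity, and the first column is assumed to be named "term" (corrr 0.4.3 or later).

```r
library(corrr)

d <- mtcars[, c("mpg", "hp", "wt")]

# Base R returns a plain matrix with 1s on the diagonal
m <- cor(d, use = "pairwise.complete.obs")
class(m)
diag(m)

# correlate() returns a cor_df: a tibble with a leading "term" column
x <- correlate(d, quiet = TRUE)
class(x)
names(x)

# The diagonal of the cor_df is NA rather than 1
diag(as.matrix(as.data.frame(x)[, -1]))
```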

API

The corrr API is designed with data pipelines in mind (e.g., to use %>% from the magrittr package). After correlate(), the primary corrr functions take a cor_df as their first argument, and return a cor_df or tbl (or output like a plot). These functions serve one of three purposes:

Internal changes (cor_df out):

  • shave() the upper or lower triangle (set to NA).
  • rearrange() the columns and rows based on correlation strengths.

Reshape structure (tbl or cor_df out):

  • focus() on select columns and rows.
  • stretch() into a long format.
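A small sketch of the two reshaping verbs, assuming four mtcars columns chosen only for illustration. It shows focus() keeping a single column and stretch() producing one row per pair.

```r
library(corrr)

x <- correlate(mtcars[, c("mpg", "hp", "wt", "qsec")], quiet = TRUE)

# focus(): keep 'mpg' as the only column; the other variables remain as rows
focused <- focus(x, mpg)
focused

# stretch(): long format with columns x, y and r;
# na.rm drops the NA diagonal, remove.dups keeps each pair once
long <- stretch(x, na.rm = TRUE, remove.dups = TRUE)
long
```

Because the long format is an ordinary tbl, it can be filtered or plotted with the usual data frame tools.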

Output/visualizations (console/plot out):

  • fashion() the correlations for pretty printing.
  • rplot() the correlations with shapes in place of the values.
  • network_plot() the correlations in a network.

Databases and Spark

The correlate() function also works with database tables. The function will automatically push the calculations of the correlations to the database, collect the results in R, and return the cor_df object. This allows those results to integrate with the rest of the corrr API.
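A minimal sketch of the database workflow, assuming the DBI, RSQLite and dbplyr packages are installed and using an in-memory SQLite database as a stand-in for a real backend. The table name mtcars_small is made up for this example, and use = "complete.obs" is passed on the assumption that database tables do not support pairwise deletion.

```r
library(corrr)
library(dplyr)

# In-memory SQLite database holding a copy of three mtcars columns
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db_mtcars <- copy_to(con, mtcars[, c("mpg", "hp", "wt")], name = "mtcars_small")

# The correlations are computed in the database and collected as a cor_df
x_db <- correlate(db_mtcars, use = "complete.obs")
x_db

DBI::dbDisconnect(con)
```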

Examples

library(MASS)
library(corrr)
set.seed(1)

# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))

# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA

# Correlate
x <- correlate(d)
class(x)
#> [1] "cor_df"     "tbl_df"     "tbl"        "data.frame"
x
#> # A tibble: 6 × 7
#>   term        v1       v2       v3       v4       v5      v6
#>   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
#> 1 v1    NA        0.684    0.716    0.00187 -0.00769 -0.0237
#> 2 v2     0.684   NA        0.702   -0.0248   0.00495 -0.0161
#> 3 v3     0.716    0.702   NA       -0.00171  0.0205  -0.0566
#> 4 v4     0.00187 -0.0248  -0.00171 NA        0.452    0.442 
#> 5 v5    -0.00769  0.00495  0.0205   0.452   NA        0.424 
#> 6 v6    -0.0237  -0.0161  -0.0566   0.442    0.424   NA

NOTE: Prior to corrr 0.4.3, the first column of a cor_df data frame was named “rowname”. As of corrr 0.4.3, this first column is named “term”.

As a tbl, we can use functions from data frame packages like dplyr, tidyr, ggplot2:

library(dplyr)

# Filter rows by correlation size
x %>% filter(v1 > .6)
#> # A tibble: 2 × 7
#>   term     v1     v2     v3       v4      v5      v6
#>   <chr> <dbl>  <dbl>  <dbl>    <dbl>   <dbl>   <dbl>
#> 1 v2    0.684 NA      0.702 -0.0248  0.00495 -0.0161
#> 2 v3    0.716  0.702 NA     -0.00171 0.0205  -0.0566

corrr functions work in pipelines (cor_df in; cor_df or tbl out):

x <- datasets::mtcars %>%
       correlate() %>%    # Create correlation data frame (cor_df)
       focus(-cyl, -vs, mirror = TRUE) %>%  # Focus on cor_df without 'cyl' and 'vs'
       rearrange() %>%  # rearrange by correlations
       shave() # Shave off the upper triangle for a clean result
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
       
fashion(x)
#>   term  mpg drat   am gear qsec carb   hp   wt disp
#> 1  mpg                                             
#> 2 drat  .68                                        
#> 3   am  .60  .71                                   
#> 4 gear  .48  .70  .79                              
#> 5 qsec  .42  .09 -.23 -.21                         
#> 6 carb -.55 -.09  .06  .27 -.66                    
#> 7   hp -.78 -.45 -.24 -.13 -.71  .75               
#> 8   wt -.87 -.71 -.69 -.58 -.17  .43  .66          
#> 9 disp -.85 -.71 -.59 -.56 -.43  .39  .79  .89
rplot(x)

datasets::airquality %>% 
  correlate() %>% 
  network_plot(min_cor = .2)
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

corrr's People

Contributors

antoine-sachet, cimentadaj, drsimonj, edgararuiz, edgararuiz-zz, emilhvitfeldt, hfrick, jameslairdsmith, jsta, juliasilge, krlmlr, mattwarkentin, michaelgrund, s-scherrer, thisisdaryn, topepo

corrr's Issues

Create support for matrix class

If someone wants to submit something other than a cor_df (e.g., a square matrix) to any relevant functions, create default conditions that attempt to coerce the object to a cor_df (with new function as_cor_df), and then run the function.
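For reference, corrr exports a coercion function under the name as_cordf(). The sketch below, using three mtcars columns picked for illustration, shows a plain matrix being coerced explicitly before a corrr verb is applied.

```r
library(corrr)

# A plain correlation matrix from base R
m <- cor(mtcars[, c("mpg", "hp", "wt")])

# Coerce explicitly, then use the usual corrr verbs
x <- as_cordf(m)
class(x)
shave(x)
```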

Why is the color scale opposite to intuition?

For many people (including me) blue is associated with negative values and red with positives. This is for instance reflected in the temperature scale:


Why is corrr using an opposite scale, where positive correlation is indicated by blue and negative correlation with red? Is it just coincidence or was it picked on purpose?

To support my argument the pals package for instance suggests a coolwarm scale as an efficient scale, going from blue (small/negative value) to red (high values). See https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html

Especially since the legend is not shown by default (which is the wrong default, imho), colors should match intuition.
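One possible workaround, sketched under the assumption that the installed corrr version lets rplot() take a colours vector of low, mid and high colours: pass the colours in the order that matches the blue-to-red convention described above.

```r
library(corrr)

x <- shave(correlate(mtcars, quiet = TRUE))

# Low, mid and high colours: blue for negative, red for positive
p <- rplot(x, colours = c("blue", "white", "red"))
p
```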

Add legend to rplot() and network_plot()

Both plot functions can offer a boolean argument (legend = TRUE/FALSE) that determines whether to display a legend for the size of the correlations.

My initial preference is to have this set as FALSE by default, so as not to distract from the immediate visual information.

p-values with p.adjust corrections

The bigger the correlation matrix, the greater the chance that many of the correlations that make such a pretty pattern don't differ significantly from zero. It would be nice to see just a matrix of p-values, and perhaps to use them as color-coding for the correlations themselves. Thanks for considering it!
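Until something like this exists in the package, the idea can be sketched in base R. The example below is not a corrr feature: it runs cor.test() on every pair of four mtcars columns and applies p.adjust() with the Holm method, and both the variable selection and the choice of correction are assumptions made for illustration.

```r
d <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# All unordered pairs of variable names, one pair per row
var_pairs <- t(combn(names(d), 2))

# Correlation and raw p-value for each pair
r_vals <- apply(var_pairs, 1, function(p) cor(d[[p[1]]], d[[p[2]]]))
p_raw  <- apply(var_pairs, 1, function(p) cor.test(d[[p[1]]], d[[p[2]]])$p.value)

res <- data.frame(
  x     = var_pairs[, 1],
  y     = var_pairs[, 2],
  r     = r_vals,
  p     = p_raw,
  p_adj = p.adjust(p_raw, method = "holm")  # correct for multiple testing
)
res
```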

Border around circles as an option?

Example:

rplot(shave(correlate(mtcars)), legend = TRUE)

See x=drat and y=carb, it is invisible, of course it is almost transparent as there is no correlation. But it would be nice to have (as an option) a black circle around it to help with focusing on that spot?

Something like this:

rplot(shave(correlate(mtcars)), legend = TRUE, border = TRUE, borderCol = "black")

Integrate plotting functions

As the number of plotting functions grows, this can become arbitrary and confusing.

Given that ggplot2 is used to create them, perhaps there is a way to unify them within a plot function that specifies the geoms and other params. This might work as the plots often share parameters such as colour scales, use of legends, etc.

If it's not possible to get it to this level in terms of the API for users, it might still be a good thing to implement under the hood.

Manually adjust correlation limits in plots

Currently, network_plot() and rplot() limit the colours and alpha etc to -1 and 1. However, with the addition of operations, these limits can be escaped.

Furthermore, it might just be nice for users to adjust these limits to suit their own purpose.

Thanks to @jcberny for the suggestion.

rplot() prints error

When using rplot(), the following error (with times adjusted) is printed repeatedly:

Jul 20 15:51:57 rsession[3832] : Error: this application, or a library it uses, has passed an invalid numeric value (NaN, or not-a-number) to CoreGraphics API and this value is being ignored. Please fix this problem.

Feature: cor.test results

Results of cor.test() are considered useful by many (e.g., p-values). However, at present, there's no foreseeable way to incorporate this information. One thought is to add this information as attributes to a cor_df.

Either way, here's some example code that tidies the output of cor.test(), which could replace the use of cor in correlate() if used correctly:

library(dplyr)

d <- mtcars
var_pairs <- t(combn(names(d), 2)) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%  # as_data_frame() is deprecated in tibble
  setNames(c("x", "y"))

var_pairs %>% 
  dplyr::mutate(r.test = purrr::map2(x, y, ~ stats::cor.test(d[[.x]], d[[.y]])),
                r.test = purrr::map(r.test, broom::tidy)) %>%
  tidyr::unnest(r.test)

R Version

Just wanted to report that all tests and vignettes work fine in R version 3.1.2. While I see that is not sufficient motivation to reduce the tried and true required version number, there are people out there who cannot update their R versions for a variety of reasons. So if you would be willing to do that I am sure you will appease some. A workaround is to fork and change the required version number.

Allow shave() to be called after fashion()?

Thanks for a great package. At this time, the following works:

mtcars %>% 
    dplyr::select(hp, wt, mpg) %>% 
    correlate() %>% 
    shave() %>% 
    fashion()

But changing the order of shave() and fashion():

mtcars %>% 
    dplyr::select(hp, wt, mpg) %>% 
    correlate() %>% 
    fashion() %>% 
    shave()

causes the following error to print:

Error in UseMethod("shave") : 
  no applicable method for 'shave' applied to an object of class "c('data.frame', 'noquote')"

Would it be sensible to make a change to allow shave() to be called after fashion()?

.default methods for plotting functions.

The plotting methods (rplot and network_plot) require default methods.

They either need to use as_cordf and then implement .cordf methods,

OR

They need to do the plotting with something like a matrix, and .cordf methods need to convert to matrix first.
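The first option can be sketched with a toy S3 generic in base R. All names here (summarise_cor, cor_like, as_cor_like) are hypothetical stand-ins, not corrr functions; the point is only the pattern of a default method that coerces and then dispatches again.

```r
# Hypothetical generic, standing in for a plotting function such as rplot()
summarise_cor <- function(x, ...) UseMethod("summarise_cor")

# Method for the "native" class does the real work
summarise_cor.cor_like <- function(x, ...) {
  range(x$values[upper.tri(x$values)])
}

# Hypothetical coercion helper (the role as_cordf would play)
as_cor_like <- function(x) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x), nrow(x) == ncol(x))
  structure(list(values = x), class = "cor_like")
}

# Default method: coerce, then dispatch again on the coerced object
summarise_cor.default <- function(x, ...) {
  summarise_cor(as_cor_like(x), ...)
}

m <- cor(mtcars[, 1:3])
res <- summarise_cor(m)  # a plain matrix reaches the default method first
res
```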

rotating X-axis labels in Rplot

Hello and thanks for this cool package!

Following your 2017 lightning talk:

corrr::correlate(mtcars[, 1:4], method = "pearson") %>%
  corrr::rearrange(absolute=F) %>% #  rearrange cols and rows
  corrr::shave()%>% # shave upper triangle
  corrr::rplot(print_cor = T,legend = T)# dot plot

I get this graph. Now I want to do it on a larger dataset (59 columns), with long names (10-15 characters).

img

I did not find a way to rotate X-axis labels, so that the column names do not overlap and hide each other. Is it possible?

thanks,
Abel
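Since rplot() returns a ggplot object, one approach is to add a theme layer after the call. The sketch below assumes ggplot2 is attached and uses mtcars as a stand-in for the larger dataset; the 60 degree angle is an arbitrary choice.

```r
library(corrr)
library(ggplot2)

x <- shave(rearrange(correlate(mtcars, quiet = TRUE)))

# Rotate the x-axis labels so long names do not overlap
p <- rplot(x) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p
```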

New version of / alternative to `rplot`

Below is code for a similar plot to rplot, but making use of geom_tile and geom_text:

library(corrr)
library(dplyr)    # select(), mutate_at()
library(ggplot2)  # ggplot(), geom_tile(), geom_text()

rs <- mtcars %>% correlate() %>% rearrange(absolute = FALSE)

order <- rs %>% 
  select(-rowname) %>% 
  colnames()

rs %>% 
  stretch(na.rm = TRUE) %>%
  mutate_at(c("x", "y"), forcats::fct_relevel, ... = order) %>% 
  ggplot(aes(x, y, fill = r)) +
    geom_tile() +
    geom_text(aes(label = as.character(fashion(r))), color = "white", size = 3) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    scale_fill_gradientn(colors = c("darkred", "firebrick2", "goldenrod2", "gray", "springgreen2", "dodgerblue2", "darkblue")) +
    theme_minimal() +
    labs(x = NULL, y = NULL)

new_plot

With some tweaking, it can be applied to a cor_df after focus:

rs <- mtcars %>% 
  correlate() %>% 
  focus(mpg:hp)

order <- rs %>% 
  select(-rowname) %>% 
  as.matrix() %>%
  #abs() %>% 
  seriation::seriate()

row_order <- rs$rowname[seriation::get_order(order, dim = 1)]
col_order <- colnames(rs[-1])[seriation::get_order(order, dim = 2)]


rs %>% 
  gather(colname, r, -rowname) %>% 
  mutate(rowname = factor(rowname, levels = row_order),
         colname = factor(colname, levels = col_order)) %>% 
  ggplot(aes(rowname, colname, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(corrr::fashion(r))), color = "white", size = 3) +
  scale_fill_gradientn(colors = c("darkred", "firebrick2", "goldenrod2", "gray", "springgreen2", "dodgerblue2", "darkblue"))

focus_plot

`focus_if` as method to focus on variables that meet certain criteria.

Hi,

I am trying to use dplyr with corrr. I opened a question on Stack Overflow and was advised to come here for an answer or to report a bug (to be honest, I'm not sure if it is a bug).

I have a data frame of correlations where I'm trying to show only the correlations above 10%. I then want to plot this using the corrr package.

I take the correlation of my data set, then filter to where the absolute value is > .1, but it fails on the network plot segment:

Error in UseMethod("network_plot") : no applicable method for 'network_plot' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

A reproducible example is below. Thank you for your time and for investing in such a useful package

library(tidyverse)
library(corrr)

# Create the Dataframe
mydf <- data.frame(a=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   b=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   c=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   d=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   e=sample(rnorm(n = 100,sd = 15),replace=TRUE),
                   f=sample(rnorm(n = 100,sd = 15),replace=TRUE))


test <- mydf %>% 
  correlate(method = "spearman") %>% 
  gather("n", "corr", 2:7, na.rm = TRUE) %>% 
  filter(abs(corr) > 0.1) %>% 
  spread(rowname,corr) %>%
  network_plot(legend = TRUE)

fashion() for all tbl

Currently, fashion() only handles a cor_df, but this is an unnecessary restriction. Instead, let it handle any tbl, in which case it fashions numeric values and missing values. Return the result as the same class as the input object.

rplot not working with plain tibbles

I find it odd that this code below is not working:

datasets::mtcars %>%
correlate() %>%
filter(mpg > .1) %>%
rplot

Error in UseMethod("rplot") :
no applicable method for 'rplot' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

Add `graph_from_cordf` function

ggraph package is now on CRAN and will be excellent for plotting networks of correlations.

There is a routine operation to convert a cor_df to a graph suitable for plotting (example that prunes correlations with abs value less than .3):

cor_df %>%
  shave() %>% 
  stretch(na.rm = TRUE) %>%
  filter(abs(r) > .3) %>% 
  igraph::graph_from_data_frame()  # Note requirement of igraph package

The resulting object can be fed directly to ggraph.

This could be a unique function like graph_from_cordf, or an extension of the igraph S3 method graph_from_data_frame. I'd lean towards the former so as not to create unexpected behaviour for the igraph operation.

Feature request: arithmetic with correlation data frames

Showing which correlations change between treatments would be of great use. Is there a way to subtract the correlation data frame of one treatment from that of the other treatment, and still end up with an object class that corrr can plot?
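One way to approximate this, sketched here as a workaround rather than a built-in feature, is to subtract the plain correlation matrices and coerce the result with as_cordf(). The grouping by mtcars$am and the four variables are assumptions made for the example, and the differences can fall anywhere in [-2, 2], outside the usual correlation range.

```r
library(corrr)

vars <- c("mpg", "hp", "wt", "qsec")

# Correlation matrices for two groups (automatic vs manual transmission)
m_auto   <- cor(mtcars[mtcars$am == 0, vars])
m_manual <- cor(mtcars[mtcars$am == 1, vars])

# Subtract as matrices, then coerce back so corrr verbs still apply
diff_df <- as_cordf(m_auto - m_manual)
fashion(shave(diff_df))
```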

Straight lines in the network plot

Hi Simon,
Just wondering if it would be possible to have the option in the future to make the connecting lines straight instead of round in the network plot.

to create/add a cor_df class?

My "correlation" is not generated by correlate(). It is a pairwise summary of intersection. I can convert it to tibble. But it is still not accepted by rearrange or rplot.
Is there any way I can add a cor_df class to my "correlation"? It seems that this is the only addition correlate() makes.
Thank you in advance.

How to custom color nodes in network_plot()?

I have a data set, where the data comes in different categories. These categories I would like to indicate as the color of the nodes. The categories I store in a numeric vector AnnotVector.

I tried:

corrr::correlate(DATA, method = "pearson") %>% network_plot()
# Works without colors.

# Then
# COLOR = AnnotVector # Use once works
COLOR = 'darkblue'
corrr::correlate(DATA, method = "pearson") %>% network_plot() + geom_point(color=COLOR)
# no coloring

Create function to handle rowname

A generic method is required to execute a function over a cor_df in which the column rowname is removed, the function is executed, and rowname is then appended back. This is becoming a routine task, hence the need for a function.
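A minimal base R sketch of such a helper follows. The name with_id_removed and its arguments are hypothetical, and the identifier column defaults to "term" here (it would be "rowname" in corrr versions before 0.4.3). A plain data frame stands in for a cor_df.

```r
# Hypothetical helper: drop the id column, apply f, put the id column back
with_id_removed <- function(x, f, id_col = "term") {
  ids    <- x[[id_col]]
  values <- x[, setdiff(names(x), id_col), drop = FALSE]
  out    <- f(values)
  cbind(setNames(data.frame(ids, stringsAsFactors = FALSE), id_col), out)
}

toy <- data.frame(
  term = c("a", "b"),
  a    = c(NA, 0.4321),
  b    = c(0.4321, NA)
)

# Round only the numeric columns; 'term' is untouched
res <- with_id_removed(toy, function(v) round(v, 2))
res
```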

GeomRepel break with ggplot2 2.2.0

network_plot gives error after updating ggplot2 to 2.2.0:

produces an error message:

Error: GeomTextRepel was built with an incompatible version of ggproto.
Please reinstall the package that provides this extension.

Dealing with missing values

Great package! Just curious: you chose to make use = "pairwise.complete.obs" the default for the correlate function. That approach is not without problems, see e.g. B.W. Lewis' examples here. Doesn't choosing pairwise.complete.obs as the default increase the risk that users won't even notice that they might have a problem? (Or that they might have NA values at all?)
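To make the concern concrete, the base R sketch below uses airquality (which contains missing values) to count how many rows each pairwise correlation actually uses and to compare pairwise deletion with listwise deletion. It only illustrates the difference and takes no position on which default is better.

```r
m <- as.matrix(airquality)

# Number of rows available for each pair under pairwise deletion
n_obs <- crossprod((!is.na(m)) * 1)
n_obs

# The same correlations under the two missing-data strategies
r_pairwise <- cor(m, use = "pairwise.complete.obs")
r_complete <- cor(m, use = "complete.obs")

# Largest absolute difference between the two sets of estimates
max_diff <- max(abs(r_pairwise - r_complete))
max_diff
```

Under pairwise deletion each cell can be based on a different subset of rows, which is what n_obs makes visible.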

Error-information from corrr-package

Dr. Simon Jackson,

First – thanks for the nice R-package “corrr” you have made. When I used version 0.2.1, I ran into a problem. I’m not sure whether it is a bug or I’m doing something wrong. I’m confronted with the following problem:

I used the following code:

rdf <- correlate(newRANG, method = "spearman")
network_plot(rdf, legend = TRUE, colours = c("slategrey", "palegreen"))

When I tried to visualize the relations, I got the following indication that something is wrong:

Error in calcCurveGrob(x, x$debug) : end points must not be identical

Note! I did not run into any problem when I used, method = "pearson", i.e. the plot was made correctly.
I hope you have time to look into the problem. It will be of great help.

Kind regards,
githubtobben

Make colours adjustable in rplot() and network_plot()

Add an argument that allows the colours to be adjusted in the plotting functions.

In rplot(), this could be a 3-element vector of the low, mid and high colours, e.g., c("red", "white", "blue").

In network_plot(), this could either be a 2-element vector of colours for negative and positive, or the code could be adjusted to accommodate a 3-element vector, like the suggestion above for rplot().

Adding `cor_df` method for `fashion`

After reading the docs of fashion I understand that it's not entirely made to work with a cor_df object and that's nice because you don't depend on it. However, there's no actual method for cor_df.

Perhaps it would be more consistent to add also fashion.cor_df that performs the same thing but preserves the cor_df and tbl classes for nicer printing and consistency within the pipeline.

library(corrr)
#> Loading required package: dplyr
#> Warning: package 'dplyr' was built under R version 3.4.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
d <- correlate(mtcars)
#> 
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'

class(d)
#> [1] "cor_df"     "tbl_df"     "tbl"        "data.frame"

d %>% fashion() %>% class()
#> [1] "data.frame" "noquote"

Doesn't work (dev version of ggplot?)

Doesn't work for me. I have installed the latest development version of ggplot2 and all my ggplot extensions stopped working. The developer of ggplot2 thinks this is tough luck for the developers of those extensions.

airquality %>% correlate() %>% network_plot()
Warning: Ignoring unknown aesthetics: x, y
Error in zero_range(from) : x must be length 1 or 2

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] corrr_0.2.1 dplyr_0.5.0

loaded via a namespace (and not attached):
 [1] ggrepel_0.5        Rcpp_0.12.7        assertthat_0.1     grid_3.3.1         R6_2.2.0           plyr_1.8.4         gtable_0.2.0       DBI_0.5-1          magrittr_1.5       scales_0.4.0       ggplot2_2.1.0.9001 lazyeval_0.2.0    
[13] labeling_0.3       tools_3.3.1        munsell_0.4.3      colorspace_1.2-6   tibble_1.2        

BTW - have you seen the DescTools::PlotWeb function?

`group_by_factors`

group_by_factors could be a boolean argument added to correlate(). If group_by_factors == TRUE, then correlate() is run separately for each grouping created by any factor variables present in the data set. Basically, it's a shorthand function to implement the tidyr::nest() method described here.

A few notes for implementing:

  • If TRUE, would return a grouped data frame. Will need ways to apply other corrr functions to such an object.
  • By default (group_by_factors == FALSE), correlate() will need to keep() only numeric variables. This is a good idea anyway. If any variables (factors, character, etc) are dropped, this should be printed out as a warning. When group_by_factors == TRUE, non-numeric variables are used by nest() first, leaving only numeric for the correlations.

Conceived after this tweet.
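Until such an argument exists, the per-group idea can be sketched with split() and lapply(). This is a stand-in for the proposal, not its implementation; iris and its Species factor are used only as an example.

```r
library(corrr)

# Split the numeric columns by the factor, then correlate each piece
groups <- split(iris[, 1:4], iris$Species)
by_species <- lapply(groups, correlate, quiet = TRUE)

names(by_species)
by_species$setosa
```

The result is a named list with one cor_df per factor level, so any corrr verb can be applied to each element with another lapply().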

r-to-Z (and back) for correlation means

Your corrr package is really handy and for getting means of correlations, I'd love to see a function that does the r-to-Z transform, gets the mean, and then transforms it back. Thanks for considering it!
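For reference, the transform can be written in base R with atanh() and tanh(). The helper name mean_r below is made up for this sketch; note that correlations of exactly 1 or -1 map to infinite z values and would need handling separately.

```r
# Hypothetical helper: average correlations on the Fisher z scale
mean_r <- function(r) {
  z <- atanh(r)                # r-to-Z
  tanh(mean(z, na.rm = TRUE))  # mean on the z scale, then back to r
}

r <- c(0.2, 0.5, 0.8)
mean(r)    # simple arithmetic mean
mean_r(r)  # mean via the Fisher z transform
```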

Rearrange function files

There are three main types of functions:

  • Those that alter internal information but retain square matrix structure, and thus the cor_df class (e.g., rarrange, shave)
  • Those that reshape the square matrix structure (e.g., rselect, rgather)
  • Those used for generating output (e.g., rplot)

Each of these sets of functions should be placed in separate and appropriately named R files.
