
ppsr's Introduction

Monty Hall

My attempt at simulating the Monty Hall game show problem in both R and Python.

Both R and Python scripts follow a functional approach.

However, in R, I actually simulated the doors and Monty Hall opening them, whereas in Python I took a more direct approach and simply inverted the initial result when the player switches. This makes the R version of the code more flexible, for instance if someone wants to change the number of (wrong) doors Monty Hall opens.
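For illustration, a minimal sketch of the door-simulation approach in R could look like the following (this is not the repository's actual code; the function and argument names are made up):

play_monty_hall <- function(n_doors = 3, n_opened = 1, switch = TRUE) {
  safe_sample <- function(x, n) x[sample.int(length(x), n)]  # avoids sample()'s 1:x surprise for length-one inputs
  doors <- seq_len(n_doors)
  prize <- safe_sample(doors, 1)
  pick  <- safe_sample(doors, 1)
  # Monty opens wrong doors that hold no prize and are not the player's current pick
  openable <- setdiff(doors, c(prize, pick))
  opened   <- safe_sample(openable, min(n_opened, length(openable)))
  if (switch) {
    pick <- safe_sample(setdiff(doors, c(pick, opened)), 1)
  }
  pick == prize  # TRUE when the player wins
}

mean(replicate(10000, play_monty_hall(switch = TRUE)))   # approaches 2/3
mean(replicate(10000, play_monty_hall(switch = FALSE)))  # approaches 1/3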

I visualized the cumulative wins of a player who plays consecutive games against Monty Hall.

In R, I visualized these cumulative wins using ggplot2. For Python, I used matplotlib.

50 games of Monty Hall simulated in R

50 games of Monty Hall simulated in Python
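A sketch of the ggplot2 version of this plot, using a simulated 2/3 win rate for a switching player as a stand-in for the actual game results:

library(ggplot2)

set.seed(1)
games <- data.frame(game = 1:50,
                    win  = rbinom(50, size = 1, prob = 2/3))  # stand-in for switching-player outcomes
games$cumulative_wins <- cumsum(games$win)

ggplot(games, aes(x = game, y = cumulative_wins)) +
  geom_step() +
  labs(x = "Game", y = "Cumulative wins",
       title = "50 games of Monty Hall (always switching)")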

ppsr's People

Contributors

paulvanderlaken

ppsr's Issues

Prevent overfitting

Can we implement cross-validation in order to prevent overfitting?
Or should we pick different default hyperparameters?
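For reference, a generic k-fold cross-validation loop around an rpart tree might look like this sketch (not the package's internals; shown on iris purely for illustration):

library(rpart)

k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold assignment

accuracy <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train)
  pred  <- predict(fit, newdata = test, type = "class")
  mean(pred == test$Species)                         # held-out accuracy for fold i
})
mean(accuracy)                                       # cross-validated estimate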

New levels in field breaks something

Hello. Great work with this package. I've used the Python library previously and needed an R implementation for a project.

I can't share a reprex because of the sensitive nature of the data I'm working with. I'll try to create dummy data to recreate the issue.

The data has fewer than 3,000 rows. The field in question has 10 levels, but one of the levels occurs only once. The issue seems to arise when the data is split into train/test sets: levels with low frequency aren't captured in the training data, and parsnip then breaks when asked to predict on a level the model wasn't trained on.

I received the error while using ppsr::visualize_pps, but score_model looks like the culprit; if I had to guess, I'd say this line:

yhat = stats::predict(model, new_data = test)[[1]]
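A possible dummy-data reproduction of the behaviour described above (a sketch; the column names and sizes are invented):

set.seed(42)
n <- 3000
dummy <- data.frame(
  outcome = factor(sample(c("a", "b"), n, replace = TRUE)),
  field   = factor(c(sample(LETTERS[1:9], n - 1, replace = TRUE), "Z"))  # level "Z" occurs only once
)

# With a random train/test split, the single "Z" row may end up only in the test set,
# so the fitted model never sees that level and the predict step can fail.
ppsr::visualize_pps(dummy)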

Plots: Hard to read numeric values inside the darker boxes...

Paul, great package, thanks!

Issue: I'm trying the color yellow to more easily read the black numbers in the boxes:

visualize_matrix(iris, color = "#FFFF00")

But the color displayed in the plot is still the original default blue; the box colors do not change.

My objective is to make the numeric values (colored black) inside the darker boxes more easily readable. Right now those values are hard to read.

Help!
SFd99
Latest RStudio & R, Ubuntu Linux 20.04
ppsr package version: 0.0.0.9100
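A possible workaround, assuming visualize_matrix() returns a ggplot object: override the fill scale after the fact instead of relying on the color argument (ggplot2 will note that it replaces the existing fill scale):

library(ppsr)
library(ggplot2)

p <- visualize_matrix(iris)
p + scale_fill_gradient(low = "#FFFFFF", high = "#FFFF00", limits = c(0, 1))  # white-to-yellow boxes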

Error: For a classification model, the outcome should be a factor.

Great package, thanks for sharing @paulvanderlaken! I've had my eye on this Predictive Power Score methodology for a while now and I'm glad you did this!
I am trying to use the package as a plug-and-play function on the Titanic dataset, but I am getting the following error, which I think you could avoid by converting character columns into factors in the backend when they are detected. Is that correct?

> pp <- ppsr::visualize_predictors(dft, "Survived")
Error: For a classification model, the outcome should be a factor.

> rlang::last_error()
<error/rlang_error>
For a classification model, the outcome should be a factor.
Backtrace:
 1. ppsr::visualize_predictors(dft, "Survived")
 2. ppsr::score_predictors(df, y)
 3. ppsr::score(x = df[[x]], y = df[[y]], ...)
 5. parsnip::fit.model_spec(model, formula = y ~ x, data = df)
 6. parsnip:::form_form(object = object, control = control, env = eval_env)

I guess you've got the idea as you do something similar here.
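A possible user-side workaround in the meantime (a sketch; dft is assumed to be the Titanic data frame used above): convert the outcome and any character columns to factors before calling the function.

dft$Survived <- factor(dft$Survived)                                              # ensure the outcome is a factor
dft[] <- lapply(dft, function(col) if (is.character(col)) factor(col) else col)   # convert remaining character columns
pp <- ppsr::visualize_predictors(dft, "Survived")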

Error: For a classification model, the outcome should be a factor.

I love this package, thank you for sharing!

In this latest release, I'm unable to analyze data containing both ordered factors and character vectors ("Error: For a classification model, the outcome should be a factor."). Before the update, ppsr handled the same data just fine. For reference, my actual data has income brackets operationalized as ordered factors and a character vector for sex. I suspect it has something to do with parsnip? Here's a reprex using the built-in Orange data (I don't think the warnings are important, just the error at the very end):

library(ppsr)
data("Orange")
set.seed(seed = 1)
Orange$sex <- sample(c('male', 'female'), 35, replace= T)
visualize_pps(Orange)
#> Warning in score(df, x = g[["x"]][i], y = g[["y"]][i], ...): There are on average only 7 observations in each test-set for the age-Tree relationship.
#> Model performance will be highly instable. Fewer cv_folds are advised.
#> Warning in diag(cm)/rowSums(cm): longer object length is not a multiple of shorter object length
#> Warning in precision + recall: longer object length is not a multiple of shorter object length
#> Warning in 2 * precision * recall: longer object length is not a multiple of shorter object length
#> [the three warnings above repeat many times, and the "only 7 observations in each test-set" warning repeats for the circumference-Tree, sex-Tree, Tree-age, circumference-age, sex-age, Tree-circumference, age-circumference, sex-circumference and Tree-sex relationships]
#> Error: For a classification model, the outcome should be a factor.

Created on 2021-01-28 by the reprex package (v1.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2021-01-28                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                               
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.2)                       
#>  cli           2.2.0      2020-11-20 [1] CRAN (R 4.0.2)                       
#>  colorspace    2.0-0      2020-11-11 [1] CRAN (R 4.0.2)                       
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.2)                       
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.0.2)                       
#>  dplyr         1.0.3      2021-01-15 [1] CRAN (R 4.0.2)                       
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)                       
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.1)                       
#>  fansi         0.4.2      2021-01-15 [1] CRAN (R 4.0.2)                       
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                       
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.0.2)                       
#>  ggplot2       3.3.3      2020-12-30 [1] CRAN (R 4.0.2)                       
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                       
#>  gtable        0.3.0      2019-03-25 [1] CRAN (R 4.0.2)                       
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.2)                       
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)                       
#>  knitr         1.31       2021-01-27 [1] CRAN (R 4.0.3)                       
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.2)                       
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.0.2)                       
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 4.0.2)                       
#>  parsnip       0.1.5      2021-01-19 [1] CRAN (R 4.0.2)                       
#>  pillar        1.4.7      2020-11-20 [1] CRAN (R 4.0.2)                       
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.2)                       
#>  ppsr        * 0.0.0.9200 2021-01-29 [1] Github (paulvanderlaken/ppsr@6135d83)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.2)                       
#>  R6            2.5.0      2020-10-28 [1] CRAN (R 4.0.2)                       
#>  reprex        1.0.0      2021-01-27 [1] CRAN (R 4.0.2)                       
#>  rlang         0.4.10     2020-12-30 [1] CRAN (R 4.0.2)                       
#>  rmarkdown     2.6        2020-12-14 [1] CRAN (R 4.0.2)                       
#>  rpart         4.1-15     2019-04-12 [1] CRAN (R 4.0.3)                       
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.0.2)                       
#>  scales        1.1.1      2020-05-11 [1] CRAN (R 4.0.2)                       
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.2)                       
#>  stringi       1.5.3      2020-09-09 [1] CRAN (R 4.0.2)                       
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.2)                       
#>  tibble        3.0.5      2021-01-15 [1] CRAN (R 4.0.2)                       
#>  tidyr         1.1.2      2020-08-27 [1] CRAN (R 4.0.2)                       
#>  tidyselect    1.1.0      2020-05-11 [1] CRAN (R 4.0.2)                       
#>  vctrs         0.3.6      2020-12-17 [1] CRAN (R 4.0.2)                       
#>  withr         2.4.1      2021-01-26 [1] CRAN (R 4.0.2)                       
#>  xfun          0.20       2021-01-06 [1] CRAN (R 4.0.2)                       
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.2)                       
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

I'm unsure how to create a reprex to submit as an issue to the parsnip repo, if this is the cause of the problem.

Hope this helps!

Prettier ggplot2 themes

The PPS barplot does not need horizontal gridlines.
Similarly, the heatmap does not need gridlines at all.
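A rough sketch of the suggested tweaks, assuming the visualize_* functions return ggplot objects that can be modified afterwards:

library(ppsr)
library(ggplot2)

# barplot of predictors: drop the horizontal gridlines
visualize_predictors(iris, y = 'Species') +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank())

# matrix/heatmap: drop all gridlines
visualize_matrix(iris) +
  theme(panel.grid = element_blank())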

Suggestion: in the correlation functions, make include_missings = FALSE the default

Hi Paul,

The ppsr viz plots' values and colors are much more readable now!

An easy suggestion :-): in the correlation plot functions (single and "both"), please make include_missings = FALSE the default value.

Right now the default is TRUE, and including variables without correlations takes away valuable space in the correlation plots (e.g. the Species variable in iris is just extra empty space).

Thanks! / Dank u!
SFd99
RStudio / Ubuntu Linux 20.04
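Until the default changes, the behaviour can presumably be requested per call; a sketch, assuming visualize_correlations and visualize_both are the single and "both" plot functions referred to above and both expose the parameter:

library(ppsr)

visualize_correlations(iris, include_missings = FALSE)
visualize_both(iris, include_missings = FALSE)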

Uniform way of presenting data

score_predictors now returns a list, while score_matrix returns a matrix.
Would it be better to let both return a matrix?
And is it logical that the matrix is encoded [y, x]? It feels counterintuitive...
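For illustration, under the [y, x] convention described above, indexing would read like this (a sketch using iris):

m <- ppsr::score_matrix(iris)
m["Petal.Length", "Sepal.Length"]  # under [y, x]: the PPS of Sepal.Length predicting Petal.Length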

tuning parameters of rpart

Thanks a lot for implementing PPS in R.

The sklearn decision tree is not configured with the same options as the R one. Since sklearn's default parameters are so permissive, I think sklearn's trees are much finer-grained than R's, leading to better scores in Python than in R. For example, I can't reproduce the Survived vs. Sex example based on the Titanic dataset. Is there a way to tune the default parameters of rpart?

And more generally, how could I try other algorithms, given that you nicely implemented PPS on top of the huge library of models offered by parsnip?

Best.
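For reference, a less restrictive rpart specification can be built through parsnip as sketched below; whether ppsr accepts a user-supplied model spec is a question for the package author:

library(parsnip)
library(magrittr)  # for %>%

# closer to sklearn's permissive defaults: no complexity pruning, tiny minimum node size, deep trees
tree_spec <- decision_tree(cost_complexity = 0, min_n = 2, tree_depth = 30) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# other algorithms can be specified the same way, e.g. a random forest (needs the ranger package when fitted)
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")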

F1 is incorrect when cv_folds > 1

From my point of view, the factor transformation is handling the levels of y and yhat independently, which is incorrect.
Could you check the F1 calculation and my commit 37c9692?

I think there should be a test case for the F1 calculation.
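A sketch of an F1 computation that aligns the factor levels of y and yhat before building the confusion matrix (the function name below is illustrative, not the package's):

f1_macro <- function(y, yhat) {
  lvls <- union(levels(factor(y)), levels(factor(yhat)))
  y    <- factor(y,    levels = lvls)
  yhat <- factor(yhat, levels = lvls)
  cm <- table(yhat, y)                 # square confusion matrix by construction
  precision <- diag(cm) / rowSums(cm)  # per class: TP / predicted positives
  recall    <- diag(cm) / colSums(cm)  # per class: TP / actual positives
  f1 <- 2 * precision * recall / (precision + recall)
  mean(f1, na.rm = TRUE)               # macro-averaged F1
}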

The Titanic dataset is interesting for exercising various combinations of variable types. I have no time to work on it now, but it might be included in the package as a demo file. I think there might be a problem with the TicketID variable, as it has many levels, but I didn't check how this is handled in the Python code.

Best.

df = read.csv("https://raw.githubusercontent.com/8080labs/ppscore/master/examples/titanic.csv")
dim(df)
head(df)

# Preparation of the Titanic dataset
# - Selecting a subset of columns
# - Renaming the column names to be more clear
# - Changing some data types

df = df[,c("Survived", "Pclass", "Sex", "Age", "Ticket", "Fare", "Embarked")]
colnames(df) = c("Survived", "Class", "Sex", "Age", "TicketID", "TicketPrice", "Port")

sapply(df, class)

df = within(df, {
  Survived = factor(Survived)
  Class = factor(Class)
  Sex = factor(Sex)
  Port = factor(Port)
})
sapply(df, class)
sapply(df, table)
sapply(df, function(x) length(unique(x)))

Error trying to install ppsr

After running the following commands in RStudio:
install.packages('devtools')
devtools::install_github('https://github.com/paulvanderlaken/ppsr')
the following error message occurs:

Downloading GitHub repo paulvanderlaken/ppsr@HEAD
Error: Failed to install 'ppsr' from GitHub:
lazy-load database '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/pkgbuild/R/pkgbuild.rdb' is corrupt
In addition: Warning messages:
1: In get0(oNam, envir = ns) : restarting interrupted promise evaluation
2: In get0(oNam, envir = ns) : internal error -3 in R_decompress1


R version 4.0.3 (2020-10-10)
RStudio Version 1.4.970
MacBook Pro 2016 running macOS Catalina 10.15.7
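The corrupt lazy-load database seems to point at the local pkgbuild installation rather than at ppsr itself; one thing worth trying (a suggestion, not a confirmed fix) is to restart R and reinstall that package before retrying:

install.packages("pkgbuild")   # after a fresh restart of R
devtools::install_github("paulvanderlaken/ppsr")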

Parallelization now implemented twice/thrice, for the different score_* functions

One could call score_predictors from the score_df and score_matrix functions, and have all parallelization performed in score_predictors.

The upside is less code maintenance.
The downside is a bottleneck, with cores sitting idle for each y variable: for variable 1, cores 1 through n might be done and waiting for core m to finish, and only then can all cores move on to variable 2, where they again wait for everything to complete.

Not sure what the right move here is.

Very low PPS for correlated data

Hello,

I'm having a bit of a weird result where the scores are essentially all zero, despite the fact that there are clearly some correlations. Maybe it's a problem with my understanding of PPS, or a gremlin in the calculation. Here's a reprex:

library(ppsr)
library(tibble) # I'm manually providing a simplified real data frame so this makes it easier to type out!

data <- tribble(
  ~varA, ~varB,
  25.000, 0.08,
  20.000, 0.09,
  16.000, 0.08,
  10.000, 0.09,
  5.000, 0.10,
  2.000, 0.12,
  1.000, 0.16,
  0.500, 0.17,
  0.100, 0.25,
  0.045, 0.45
)

cor(data) # clearly (loosely) correlated
#>            varA       varB
#> varA  1.0000000 -0.5852726
#> varB -0.5852726  1.0000000

score_matrix(data) # but very low PPS here??
#>      varA         varB
#> varA    1 1.110223e-16
#> varB    0 1.000000e+00

plot(data$varA, data$varB) # maybe the form of the relationship is a problem?

# deliberately subsetting data to a roughly linear portion
reduced_data <- data[data$varA < 5 & data$varA > 0.045, ]

cor(reduced_data) # even better correlation coeff.
#>            varA       varB
#> varA  1.0000000 -0.8949743
#> varB -0.8949743  1.0000000

score_matrix(reduced_data) # even worse PPS!!!
#>      varA varB
#> varA    1    0
#> varB    0    1

plot(reduced_data$varA, reduced_data$varB) # looks linear enough

Created on 2021-01-20 by the reprex package (v0.3.0)

Is this a problem of data set size, i.e. does PPS not work well with small data because it uses decision trees?
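One way to probe the sample-size hypothesis is to generate many points from a similar decreasing relationship and score again (a sketch; the functional form below is invented purely for illustration):

library(ppsr)

set.seed(1)
n <- 500
varA <- runif(n, min = 0.045, max = 25)
varB <- 0.08 + 0.4 * exp(-varA) + rnorm(n, sd = 0.01)  # invented curve, roughly matching the shape above
big_data <- data.frame(varA = varA, varB = varB)

score_matrix(big_data)  # does the off-diagonal PPS recover with more observations?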
