
laresbernardo / lares


Analytics & Machine Learning R Sidekick

Home Page: https://laresbernardo.github.io/lares/

R 100.00%
r machine-learning analytics visualization r-package data-science rstats automl h2o api

lares's People

Contributors

bernardolares, laresbernardo, nfultz, patrikios, pbulsink, wibeasley

lares's Issues

Can a categorical numeric target variable be treated as a "regression" target?

Hi,
I have a target variable whose only values are 1, 2, 3, 4, and 5. When I use the h2o_automl function, it always treats it as a multiclass target and runs classification on it. I want to run it as a regression, but every attempt has failed. Could you please add an option to set the target type to "binary", "multiclass", or "regression"?

Thank you!

Is it possible to auto-compare different models?

Hello, and thanks for this nice package. After using pycaret in python, I found lares quite convenient to finish the low-code auto-ML jobs.

A question about functionality, please: is it possible for lares to compare different models? This is one killer feature of pycaret, where compare_models() gives the user a summary table. For simple academic use, it also helps demonstrate how different models perform as a fast exploration.

Not a valid input:

Good day to you. Very impressive package, which I am starting to familiarise myself with.
Using corr_var with any existing data frame column name as the target variable to focus on, I receive:
'not a valid input: column_name was transformed or does not exist.' The function does however produce output.
Am I getting something wrong or misunderstanding? Also, I am not sure about the significance of the 'dummy' argument and the reference to 'dummy' in the resulting bar chart.
Many thanks, and hoping you stay well. David.

How to handle dataset with NA

Hello, really nice package. I am having issues when my matrix has NA values, which prevents me from using any of the lares functionality. How can I overcome the "not enough finite observations" error? I tried the pvalue=FALSE option, but it generated an error indicating that the value is matched by multiple others. I would appreciate a workaround for this. I think the error is coming from cor.test. Ideally, I'd like to still calculate p-values while accounting for only those paired samples that have non-NA values, similar to using cor(use = "pairwise.complete.obs").
Thank you
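
A possible workaround outside lares, sketched in base R only: run cor.test on the complete observations of each pair of columns, and return NA for pairs with too few complete observations instead of stopping with "not enough finite observations". The function name pairwise_cor_test and its min_pairs argument are hypothetical and are not part of lares; this is an illustration of pairwise-complete p-values, not the package's implementation.

```r
# Hypothetical helper (not part of lares): pairwise-complete correlation tests.
pairwise_cor_test <- function(df, min_pairs = 3) {
  # Keep numeric columns only
  num <- df[vapply(df, is.numeric, logical(1))]
  # All unordered pairs of column names
  pairs <- utils::combn(names(num), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p) {
    x <- num[[p[1]]]
    y <- num[[p[2]]]
    # Rows where both variables are observed
    ok <- stats::complete.cases(x, y)
    n <- sum(ok)
    if (n < min_pairs) {
      # Too few paired observations: report NA instead of raising an error
      return(data.frame(var1 = p[1], var2 = p[2], n = n,
                        corr = NA_real_, pvalue = NA_real_))
    }
    ct <- stats::cor.test(x[ok], y[ok])
    data.frame(var1 = p[1], var2 = p[2], n = n,
               corr = unname(ct$estimate), pvalue = ct$p.value)
  })
  do.call(rbind, rows)
}

# Toy data with scattered NA values
df <- data.frame(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, NA, 6, 8, 10, 13),
  c = c(NA, NA, NA, 1, 2, NA)
)
res <- pairwise_cor_test(df)
print(res)
```

In this toy data only the a/b pair has enough complete observations (4), so it is the only pair with a correlation and p-value; the other two pairs are reported with NA.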

lares - plot_timeline size of group names

Hi @laresbernardo,

I find that my group names are truncated when I don't have enough entries per group to provide the padding needed to write out the full name of the group. Is there a way to pad the blocks in which the group names appear to prevent this?

I can share my data and script with you to demonstrate the issue.

Thanks!

Portfolio performance: some issues and questions

Hi Bernardo,

I tried out your portfolio performance functions. The graphs are very useful in conjunction with the automatic data scraping and processing. I have some questions and uncovered some issues you could have a look at if you have the time. I also have some suggestions for functionality but will put those in a separate issue.

Issues:

  • Christmas bug: sometimes in the period around Christmas some quotes disappear which has the effect that the daily performance (ROI) can be -100% or +1100% for some of these days. Below is an example from my experience.
    [image: Christmas bug]
  • Three tickers that I used to invest in were not found: PHAG.AS, PHPD.AS, ABLX (or ABLX.BR). The first two should be available since they are still traded on the Amsterdam stock exchange. See here and here as reference. I didn't find them myself on Yahoo Finance, so I don't know if it can be fixed easily. The last one is on Yahoo Finance but the company was taken over and removed from the stock exchange, so I am not sure if you could still incorporate this data. -> Since I closed these positions, I could add the transactions manually to the deposit tab (fondos) but this would confound the ROI results. Do you have suggestions?
  • Foreign stocks: at the moment, I only invest in stocks in euros. In the past, I have also invested in stocks in dollars and Swiss francs. As far as I can tell, the package does not take into account the exchange rates, which change the values of the stocks in euros. Is there an easy way to take this into account or use this consistently in the package?
  • Dividends: dividends are recognized in the scraped historical info but get lost when merging with the transaction data (daily_stocks function) so they are not taken into account.

Questions:

  • How should I treat closed positions? Do I put these tickers in the Portfolio tab of the Excel or only in the transaction history?
  • Is there an easy way to obtain the total value of your investments + leftover cash from deposits per day? It would be nice to have internal rate of return calculations on these values (see upcoming suggestion).
  • Is there a way to add costs not related to transactions outside of the deposit tab? E.g. broker costs for using certain services. For me, they are part of the ROI with respect to my deposits so I would prefer to not put them there.

Sorry for the long post. Thank you for all the help!

ElKron

Volume of mutual funds is returning 0

I tried my own file of mutual funds using lares' dummyPortfolio format. The data loaded correctly, but when I ran get_stocks_hist the volume column was set to all zeroes. When I run dfp<-stocks_objects(df), dfp returns lares' data, not mine.

error due to h2o package

Hello,

I'm trying to use the corr_cross() function from the {lares} package, but I encounter an issue, I suppose due to the {h2o} package:

[screenshot of the error]

I tried to install the h2o package (install.packages("h2o")) without any success.

Anyone had this issue in the past and managed to fix it?

Any help would be greatly appreciated!
Antoine

mplot_full and mplot_importance errors

Thank you for the lares::h2o_automl fix.
Worked through your blog examples. Dalex examples work fine.
Two issues:

  1. lares::mplot_full(tag = results$score$tag,
    score = results$scores$score,
    subtitle = "Titanic dataset")

throws this error: "Error in grouped_df_impl(data, unname(vars), drop) : Column tag is unknown"

  2. lares::mplot_importance(var = results$importance$variable,
    imp = results$importance$percentage,
    subtitle = "Titanic dataset")

throws this error: "The variables and importance values vectors should be the same length.
Currently, there are 6 variables and 0 importance values!
Error in lares::mplot_importance(var = results$importance$variable, imp = results$importance$percentage, :"

Error while trying out example code

Dear Bernardo,

I would very much like to use this package to supplement my own analysis of my investments. I tried to recreate the plots on your datascience+ post.

Unfortunately, I encountered the following error while running the code with the dummy portfolio provided:

> file <- system.file("docs", "dummyPortfolio.xlsx", package = "lares")
> df <- stocks_file(filename = file)
> dfp <- stocks_obj(df)
>>> Downloading historical data for each stock...
 00:00:04 [==============================================================] 100% | DONE                                                             
Calculations ready...
Error: Can't subset elements that don't exist.
x Location 1 doesn't exist.
i There are only 0 elements.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
In theme_lares2(legend = "top") :
  Font Arial Narrow is not installed, has other name, or can't be found

Do you have an idea of the source of this error? Additionally, on a first try with part of my own portfolio, I get an error message related to dividends:

> dfp <- stocks_obj(df2)
>>> Downloading historical data for each stock...
Error: Problem with `mutate()` input `Dividend`.
x Column `DivReal` not found in `.data`
i Input `Dividend` is `.data$DivReal * .data$CumQuant`.
i The error occurred in group 1: Symbol = "MJP.PA".
Run `rlang::last_error()` to see where the error occurred.

The dividend problem is maybe related to my portfolio containing accumulating index funds?

In any case thank you for your help!

ElKron

different p-values for correlations

Hi,
I'm new and happened to chance upon your package. :)

I love the corr_var function, but realised that it gives me different p-values.

A <- tibble(x=1:4,y=c(0,1,2,7),z=c(5,9,7,3))
corr_var(A,x, plot =F, type = "spearman")

  variables    corr pvalue
2         y  0.9135      1
3         z -0.4000      1

cor.test(A$x, A$z, method = "spearman", exact = F)

Spearman's rank correlation rho

data: A$x and A$z
S = 14, p-value = 0.6
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.4

Could you please kindly advise? Thank you so much!
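
One way to cross-check the numbers above using base R only (no lares call involved) is sketched here. It computes both the Pearson and the Spearman coefficient of x against each other column, plus the Spearman p-value from cor.test. For y, the Pearson coefficient is 0.9135 while the Spearman coefficient is 1, so the corr value printed by corr_var matches Pearson. This suggests, as an assumption not confirmed here, that `type` may not be the argument that selects the correlation method.

```r
# Same toy data as in the report, as a plain data.frame
A <- data.frame(x = 1:4, y = c(0, 1, 2, 7), z = c(5, 9, 7, 3))
others <- setdiff(names(A), "x")

check <- data.frame(
  variable = others,
  # Pearson coefficient of x against each other column
  pearson = vapply(others, function(v) {
    cor(A$x, A[[v]], method = "pearson")
  }, numeric(1)),
  # Spearman coefficient of x against each other column
  spearman = vapply(others, function(v) {
    cor(A$x, A[[v]], method = "spearman")
  }, numeric(1)),
  # Spearman p-value, using the same call as in the report
  spearman_p = vapply(others, function(v) {
    cor.test(A$x, A[[v]], method = "spearman", exact = FALSE)$p.value
  }, numeric(1)),
  row.names = NULL
)
print(check)
```

For z both coefficients happen to equal -0.4, so only the y row distinguishes the two methods.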

Error in `dplyr::filter()`:

When I run the command:
r <- h2o_automl(df, y = BG, max_models = 1, impute = FALSE, target = "C3")
the SELECTED MODEL: XGBoost_1_AutoML_4_20230324_103413 produces the following error msg:
Error in dplyr::filter():
ℹ In argument: .data$importance < 1/(nrow(imp) * 4).
Caused by error in .data$importance:
! Column importance not found in .data.

Getting error in corr_var

Hi Lares,

I am currently trying to use your 'corr_var' function and I am getting the following error:

"Not a valid input: Feature was transformed or does not exist.".

The variable is numeric and the dataset itself is completely numeric, so I am not sure what the problem is.

Any help would be greatly appreciated :)

lares - plot_timeline sorts alphabetically

Hi @laresbernardo! Thanks so much for developing this fun and easy to use package.

I tried to use it on my own data (deliverables for a project), but found that the deliverables (in your case the events) are sorted according to the alphabetical order of the groups.

I can share my script and data with you if you want to explore?

Thanks! Anelda

issue with date format

Hi Bernardo,

Thanks for this package!

I encounter an error with the date format. Here is my code :

today <- as.character(Sys.Date())
cv <- data.frame(rbind(
 c("PhD Student in Statistics", "UCLouvain", "Academic", "2017-09-01", today),
 c("MSc in Econometrics", "Maastricht University", "Academic", "2015-09-01", "2016-08-31"),
 c("MSc in Economics", "KULeuven", "Academic", "2013-09-01", "2015-01-01"),
 c("Faculty Exchange", "UIUC, USA", "Academic", "2014-08-01", "2014-12-31"),
 c("BSc in Economics", "UCLouvain", "Academic", "2010-09-01", "2013-08-31"),
 c("Teaching Assistant in Statistics", "UCLouvain", "Work Experience", "2017-09-01", today),
 c("Private Tutor in Statistics and R", " ", "Work Experience", "2019-01-01", today),
 c("Data Scientist Consultant", "Business & Decision", "Work Experience", "2016-09-01", "2017-07-31"),
 c("European Public Affairs Intern", "BNP Paribas", "Work Experience", "2015-04-01", "2015-07-31"),
 c("Audit Intern", "Deloitte Luxembourg", "Work Experience", "2015-01-01", "2015-03-31"),
 c("International Tennis Chair Umpire", "International Tennis Federation", "Extra", "2015-05-01", today)
))
colnames(cv) <- c("Role", "Place", "Type", "Start", "End")

plot_timeline(event = cv$Role, 
              start = cv$Start,
              end = cv$End,
              label = cv$Place, 
              group = cv$Type,
              save = FALSE,
              subtitle = "Antoine Soetewey")

And here is the error :

Error: Invalid input: time_trans works with objects of class POSIXct only

It used to work in the past as you can see here. I tried to change the format of the start and end dates by using as.Date(), ymd() and as.POSIXct but it wasn't successful.

By any chance do you know how I can fix this issue?

Thanks in advance.

Best,
Antoine

Correlations expressed as percentages

Hi there, very nice package here - as a social scientist I have found the correlation functions particularly useful (and beautiful) for exploring correlations.

I might suggest a small tweak -- in the corr_cross function and a few others, correlations are expressed as percentages, but this is really never done in the sciences that heavily use correlations. Perhaps an argument could be added to these functions to keep the value in the original r metric (-1 to 1, or abs(corr))?

Expressing it as a percentage can lead to confusion for a couple main reasons:

  1. A correlation coefficient ranges from -1 to 1 -- and a percentage can not be negative. (Of course this only applies to the functions that do not take the absolute value of the correlation.)
  2. Expressing it as a percentage may lead people to confuse it with the coefficient of determination, or R^2, which is sometimes expressed (and thought of) as a percentage, because it is the percentage of the variance in the outcome variable explained by the predictor(s).
  3. Most scientists examining correlations will want them in the -1 to 1 metric (or the absolute value of that).
  4. I noticed all this upon finding a bug when examining some data (see below): There appears to be a bug when setting the argument type = 2 in the corr_cross function: for correlations ranging from .51 to .59, the x-axis is mislabelled. Because I don't desire percentages anyway, this is easily sidestepped by deleting all arguments to the final call to scale_y_continuous within the function. Or even better, setting the arguments to limits = c(min(ret$corr), max(ret$corr)) to scale the axis nicely to the data.

image

And with the arguments removed in scale_y_continuous:
image

In the next ~3-6 months, once I learn to use git, I am happy to make a pull request and add this on behalf of anyone else who may have a similar suggestion!

Hopefully this is helpful, and it's really minor -- thanks again for the useful package!

dplyr 1.0.8

We're about to release dplyr 1.0.8 and as part of running our revdep checks, we've identified that the released version of lares has this issue:

── After ─────────────────────────────────────────────────────────────────────────────────────────────────────
> checking examples ... ERROR
  Running examples in ‘lares-Ex.R’ failed
  The error most likely occurred in:
  
  > ### Name: distr
  > ### Title: Compare Variables with their Distributions
  > ### Aliases: distr
  > 
  > ### ** Examples
  > 
  > Sys.unsetenv("LARES_FONT") # Temporal
  > data(dft) # Titanic dataset
  > 
  > # Relation for categorical/categorical values
  > dft %>% distr(Survived, Sex)
  Error in `distr()`: Can't subset `.data` outside of a data mask context.
  Backtrace:
      ▆
   1. ├─dft %>% distr(Survived, Sex)
   2. └─lares::distr(., Survived, Sex)
   3.   ├─<unknown>
   4.   └─rlang:::`$.rlang_fake_data_pronoun`(.data, "value")
   5.     └─rlang:::stop_fake_data_subset(call)
  Execution halted

1 error x | 0 warnings ✓ | 0 notes ✓

It appears, however, that this is fixed in the dev version.

Error at h2o_automl?

Hello Bernardo, your package certainly has great potential, but I cannot get it to work. With all attempts I get an error at "results <- lares::h2o_automl(df = dfm, seed = seed, max_time = 60)" or any such statement. The error message received is "Error in h2o::h2o.automl(x = setdiff(names(df), "tag"), y = "tag", training_frame = as.h2o(train), : unused argument (exclude_algos = c("StackedEnsemble", "DeepLearning"))". I have already successfully run the update for the package, so I'm wondering if the call to h2o is broken?

Error during installation

Hello,

I tried to install the library with:
devtools::install_github("laresbernardo/lares")

However when I do so I get the error:
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
namespace ‘processx’ 3.4.3 is being loaded, but >= 3.4.4 is required

I tried searching on Google and updating Rtools and RStudio, but it doesn't work.
I'm currently using R 4.0.

Thanks for your help,
Ben

corr_cross plot incomplete

Hi, thank you for your package, really easy to use and clear data within minutes! However, I might have found an issue regarding the corr_cross plot. Using plot=T without ceiling or max_pvalue, I found that some pairs of variables are not displayed in the corr_cross plot.
I double-checked with other packages as well as with corr_cross(df, plot=F, pvalue=T) and corr_var for the individual pairs, which verified that the underlying correlations are there, but not plotted.
Now, when I created dummies (to upload a reproducible sample here), the correlations were plotted correctly! I have no clue what to do next.
Thank you and have a nice day

holidays(countries = "Italy", years = 2022) issue

Output:

# A tibble: 9 x 10
  holiday    holiday_name                  holiday_type     national observance bank  nonwork season hother county
1 2022-03-02 Ash Wednesday                 Observance       FALSE    TRUE       FALSE FALSE   FALSE  FALSE  Italy
2 2022-03-19 Father's Day                  Observance       FALSE    TRUE       FALSE FALSE   FALSE  FALSE  Italy
3 2022-03-20 March Equinox                 Season           FALSE    FALSE      FALSE FALSE   TRUE   FALSE  Italy
4 2022-04-15 Good Friday                   Observance       FALSE    TRUE       FALSE FALSE   FALSE  FALSE  Italy
5 2022-04-17 Easter Sunday                 National holiday TRUE     FALSE      FALSE FALSE   FALSE  FALSE  Italy
6 2022-04-18 Easter Monday                 National holiday TRUE     FALSE      FALSE FALSE   FALSE  FALSE  Italy
7 2022-04-25 Liberation Day                National holiday TRUE     FALSE      FALSE FALSE   FALSE  FALSE  Italy
8 2022-04-25 The Feast of St Mark (Venice) Local holiday    FALSE    FALSE      FALSE FALSE   FALSE  TRUE   Italy
9 2022-11-01 All Saints' Day               National holiday TRUE     FALSE      FALSE FALSE   FALSE  FALSE  Italy

The output is missing many other days, such as Christmas.
Complete list: https://travel.thewom.it/destinazioni/news-lowcost/ponti-2023.html

It would also help to have a column for "bridge days" (ponti).

object 'cats' not found

Hi,

Great package! Thank you!! After I run the H2O AutoML I get the following error:

Check results in H2O Flow's nice interface: http://localhost:54321/flow/index.html
Model selected: GBM_1_AutoML_20191030_214222
NOTE: The following variables were NOT important: "VEHICLE_LICENSE_STATE_OTHER", "REPORT_STATE_OTHER", "ACCESS_CONTROL_DESC_Full.Control", "ACCESS_CONTROL_DESC_No.Control"

Running predictions for data2$tag...
Error in grepl(" ", cats) : object 'cats' not found

[attachment: ART.txt]

Any ideas on what object "cats" is or how I can get it to work? Thank you!!

Sincerely,

tom

Working with factors

Hi Bernardo,
Thanks for the function plot_timeline, which is very useful.
My remark is that the function doesn't handle a repeated Role or Place, since the number of factor levels should be equal to the number of rows. I have just tried the same job name in 2 different companies and it throws an error.
Thanks

Keep original row names when running h2o_automl

Hi Bernardo, and congrats on lares; it is a very interesting package. When I pass a data frame to h2o_automl, the original row names of the data frame seem to get lost. In my case, for example, the row names represent customer IDs, so it is very important to be able to link each prediction to its ID. Is there any way to keep the original row names so that they are still there when I extract the predictions with $scores_test?
Thank you in advance

Model evaluation plots

Hi,

I got to know your code thanks to your excellent article on datascienceplus, where you show how to make plots for model evaluation.
I noticed that in your mplot_splits code, if you choose splits equal to 10, the order of the stacked bars gets mixed up (you first get 1, then 10, then 2, ...).
When you create the plot p, your code has:

ggplot(aes(x = as.character(tag), y = p, label = as.character(p),
fill = as.character(quantile_tag)))

But if we reorder the fill (given by quantile_tag) by quantile, we can get past the issue:

ggplot(aes(x = as.character(tag), y = p, label = as.character(p),
fill = reorder(as.character(quantile_tag),quantile)))

Hope this helps.
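
The ordering problem described above comes from sorting numbers as text. A minimal base R sketch of the effect, and of an alternative to reorder() that sets the factor levels explicitly in numeric order, could look like this. The variable names as_text and as_factor are illustrative only and do not come from the mplot_splits source.

```r
# Split labels 1..10, as they might appear in quantile_tag
quantile_tag <- 1:10

# Converting to character and sorting compares text, not numbers
as_text <- as.character(quantile_tag)
text_order <- sort(as_text, method = "radix")
print(text_order)

# Setting the levels explicitly keeps the numeric order,
# which a discrete ggplot2 scale would then follow
as_factor <- factor(as_text, levels = as.character(sort(quantile_tag)))
print(levels(as_factor))
```

With the explicit levels, "10" comes last rather than right after "1".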

Error in (function (el, elname) : "axis.text.x.bottom" is not a valid theme element name.

When I run the h2o_automl part, I get the following error and no results.
Check results in H2O Flow's nice interface: http://localhost:54321/flow/index.html

Error in (function (el, elname) :
"axis.text.x.bottom" is not a valid theme element name.

I'm using a simple Titanic dataset.
############## 1. Import libraries and data

library(lares)
library(dplyr)
library(ggplot2)

############## 3. Data transformations

df <- balance_data(s, "tag", rate = 1, seed = seed) %>%
mutate_if(is.numeric, funs(log = log(.+1))) %>%
select(-name, -ticket, -cabin) %>%
mutate(tag = as.factor(tag),
pclass = as.factor(pclass))

results <- h2o_automl(df, max_time = 60, project = project, seed = seed)

corr_var for categorical variables?

Hi Lares,

Thank you for solving my p-value problem earlier on! I love using the corr_var function, for quick and easy analyses just to know how my data is.

I tried using corr_var with a factor variable, but it required me to list the factor specifically (e.g: gender_male). This caused the error message of not having enough observations to plot, since I wanted to plot only the max_pvalue = 0.05. However, if I used corr_cross, I would see that the gender is specifically correlated with my list of 60 variables.

Is there a similar way for corr_var to work with categorical variables like how it works with continuous variables? We don't need to know which factor in the categorical variables are correlated, just need to know which variables are correlated with the categorical variable of interest. :)

Hope to hear from you soon!

Thank you!

lares::holidays produces empty frame

I run the basic example:

> lares::holidays(countries = "Argentina")
>>> Extracting Argentina's holidays for 2023
# A tibble: 0 × 10
# ℹ 10 variables: holiday <date>, holiday_name <chr>, holiday_type <chr>, national <lgl>, observance <lgl>, bank <lgl>, nonwork <lgl>, season <lgl>,
#   hother <lgl>, county <fct>

It gives back an empty data.frame, similar to other countries I tried.

Session Info:

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so;  LAPACK version 3.8.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] parsnip_1.1.0            scales_1.2.1             rsample_1.1.1            timetk_2.8.3             recipes_1.0.6            dplyr_1.1.2             
 [7] modeltime.h2o_0.1.1.9000 h2o_3.40.0.4             modeltime_1.2.6.9000     lubridate_1.9.2          prophet_1.0              rlang_1.1.1             
[13] Rcpp_1.0.10              glue_1.6.2               bettermc_1.2.2           readr_2.1.4              plotly_4.10.2            ggplot2_3.4.2           
[19] readxl_1.4.2             data.table_1.14.8        withr_2.5.0             

loaded via a namespace (and not attached):
  [1] rstudioapi_0.14     jsonlite_1.8.5      magrittr_2.0.3      farver_2.1.1        fs_1.6.2            vctrs_0.6.3         RCurl_1.98-1.12    
  [8] htmltools_0.5.5     dials_1.2.0         progress_1.2.2      curl_5.0.1          cellranger_1.1.0    pROC_1.18.2         parallelly_1.36.0  
 [15] StanHeaders_2.26.27 htmlwidgets_1.6.2   plyr_1.8.8          extraDistr_1.9.1    zoo_1.8-12          lifecycle_1.0.3     iterators_1.0.14   
 [22] pkgconfig_2.0.3     Matrix_1.5-4.1      R6_2.5.1            fastmap_1.1.1       future_1.32.0       tune_1.1.1          selectr_0.4-2      
 [29] digest_0.6.32       colorspace_2.1-0    furrr_0.3.1         patchwork_1.1.2     ps_1.7.5            crosstalk_1.2.0     labeling_0.4.2     
 [36] fansi_1.0.4         yardstick_1.2.0     timechange_0.2.0    httr_1.4.6          compiler_4.3.0      bit64_4.0.5         backports_1.4.1    
 [43] inline_0.3.19       pkgbuild_1.4.2      highr_0.10          R.utils_2.12.2      MASS_7.3-60         lava_1.7.2.1        sessioninfo_1.2.2  
 [50] loo_2.6.0           tools_4.3.0         zip_2.3.0           future.apply_1.11.0 nnet_7.3-18         R.oo_1.25.0         Metrics_0.1.4      
 [57] callr_3.7.3         R.cache_0.16.0      grid_4.3.0          checkmate_2.2.0     generics_0.1.3      gtable_0.3.3        tzdb_0.4.0         
 [64] R.methodsS3_1.8.2   class_7.3-21        tidyr_1.3.0         hms_1.1.3           xml2_1.3.4          utf8_1.2.3          foreach_1.5.2      
 [71] pillar_1.9.0        stringr_1.5.0       splines_4.3.0       lhs_1.1.6           lattice_0.21-8      renv_0.17.3         survival_3.5-3     
 [78] bit_4.0.5           tidyselect_1.2.0    knitr_1.43          gridExtra_2.3       stats4_4.3.0        xfun_0.39           hardhat_1.3.0      
 [85] timeDate_4022.108   matrixStats_1.0.0   rstan_2.21.8        stringi_1.7.12      DiceDesign_1.9      lazyeval_0.2.2      yaml_2.3.7         
 [92] workflows_1.1.3     evaluate_0.21       codetools_0.2-19    rpart.plot_3.1.1    lares_5.2.2         tibble_3.2.1        cli_3.6.1          
 [99] RcppParallel_5.1.7  rpart_4.1.19        munsell_0.5.0       processx_3.8.1      styler_1.10.1       globals_0.16.2      ellipsis_0.3.2     
[106] gower_1.0.1         prettyunits_1.1.1   bitops_1.0-7        GPfit_1.0-8         listenv_0.9.0       viridisLite_0.4.2   ipred_0.9-14       
[113] xts_0.13.1          prodlim_2023.03.31  openxlsx_4.2.5.2    purrr_1.0.1         crayon_1.5.2        rvest_1.0.3  

Ranked Cross-Correlations

Hi, great package. I am trying to plot ranked cross-correlations. The plot looks fine, but how do I get the bars for negative correlations to go to the left of zero? I see it in some of your examples. Thanks!

I have attached a small part of my graph:
[image: Capture]

Non-zero Exit

I am trying to install lares package. Windows 10 64 bit, RStudio version 1.1.419, R version 3.6.1
I am installing like this:
install.packages('devtools')
devtools::install_github("laresbernardo/lares")

I am getting the following error when installing:
Error: Failed to install 'lares' from GitHub:
(converted from warning) installation of package ‘C:/Users/admin/AppData/Local/Temp/RtmpYXIxIf/file2ca01cf1145d/lares_4.7.tar.gz’ had non-zero exit status

Please help

Tax on capital gains and input of buy/sales

Great package!
Could you please include capital gains taxes in the calculations?
Also, how should buys and sales be distinguished in the input spreadsheet?
The tickers in the plots overlap each other and it is hard to distinguish one from the other. Including color coding might make it easier to distinguish them.

Thanks

Santiago

wrong categorization in missingness

Hey laresbernardo, thanks for the very nice package!
I'm currently investigating several packages for EDA, yours has a very broad range!

The missingness function has a slight bug: it does not display all variables containing NA in the "with" section and does not calculate the percentage missing for them.

lares/R/missings.R

Lines 30 to 61 in 6b87a60

if (plot) {
obs <- nrow(df)*ncol(df)
miss <- sum(m$missing)
missp <- 100*miss/obs
note <- paste0("Total values: ", formatNum(obs, 0),
" | Total missings: ", formatNum(miss, 0),
" (",formatNum(missp, 1),"%)")
p <- is.na(df) %>% data.frame() %>% tidyr::gather() %>%
{if (!full)
filter(., key %in% m$variable) else .} %>%
mutate(type = ifelse(key %in% m$variable, "with", "without")) %>%
group_by(key) %>%
mutate(row_num = row_number()) %>%
mutate(perc = round(100*sum(value)/nrow(df),2)) %>%
mutate(label = ifelse(type == "with", paste0(key, " | ", perc,"%"), key)) %>%
arrange(value) %>%
ggplot(aes(x = reorder(label, perc), y = row_num, fill = value)) +
geom_raster() +
coord_flip() +
{if (full)
facet_grid(type ~ ., space = "free", scales = "free")} +
{if (summary)
scale_y_comma(note, expand = c(0, 0)) else
scale_y_comma(NULL, expand = c(0, 0))} +
scale_fill_grey(name = NULL, labels = c("Present", "Missing"), expand = c(0, 0)) +
labs(title = "Missing values", x = "", subtitle = if (!is.na(subtitle)) subtitle) +
theme_lares2(legend = "top") +
theme(axis.text.y = element_text(size = 8))
return(p)

I've found this is due to tidyr::gather not working as expected, and it can be easily solved by using the newer function tidyr::pivot_longer.

Here is my version of the plot code (I also used glue instead of paste for the note):

if (plot) {
  obs <- nrow(df) * ncol(df)
  miss <- sum(m$missing)
  missp <- 100 * miss/obs
  note <- glue::glue(
    "Total values: {lares::formatNum(obs, 0)} | ",
    "Total missings: {lares::formatNum(miss, 0)} ",
    "({lares::formatNum(missp, 1)}%)"
  )
  p <- df %>% mutate_all(is.na) %>% 
    tidyr::pivot_longer(cols = tidyselect::everything()) %>%
    {if (!full) filter(., name %in% m$variable) else .} %>% 
    mutate(type = ifelse(name %in% m$variable, "with", "without")) %>% 
    group_by(name) %>% 
    mutate(row_num = row_number()) %>% 
    mutate(perc = round(100 * sum(value)/nrow(df), 2)) %>% 
    mutate(label = ifelse(type == "with", paste0(name, " | ", perc, "%"), name)) %>% 
    arrange(value) %>% 
    ggplot(aes(x = reorder(label, perc), y = row_num, fill = value)) + 
    geom_raster() + 
    coord_flip() + 
    {if (full) facet_grid(type ~ ., space = "free", scales = "free")} + 
    {if (summary) lares::scale_y_comma(note, expand = c(0, 0)) else lares::scale_y_comma(NULL, expand = c(0, 0))} + 
    scale_fill_grey(name = NULL, labels = c("Present", "Missing"), expand = c(0, 0)) + 
    labs(title = "Missing values", x = "", subtitle = if (!is.na(subtitle)) subtitle) + 
    lares::theme_lares2(legend = "top") + 
    theme(axis.text.y = element_text(size = 8))
  
  return(p)
}

corr_cross function, contains argument, inescapable "Faceting variables" error?

It looks to me like the contains argument of the corr_cross function always returns a "Faceting variables" error.

Can someone confirm it still works?

I am using R 4.0.0 and all CRAN packages have been updated. Lares package version is 4.8.4.

lares::corr_cross(mtcars,
           contains = "mpg")

Error: Faceting variables must have at least one value

Not an Issue - A Question

Great work. Can you hear my applause?

A quick request: Would you provide a bit more explanation of the Tag vs Score Splits Comparison? I am not able to really understand how to interpret it.

Similarly, I am not sure I understand the business value of Cuts by Score.

I appreciate any detail you might provide. Hopefully others will find this helpful too.

Much thanks for sharing your work.

Error in `combine_vars()`: ! Faceting variables must have at least one value

Hi there,

Thanks for the lares package.

I have this dataframe pisameans.zip

And I want to get a correlation plot of one variable in the df vs. all the others.

I am using

corr_cross(pmeans[,-c(1:2)], contains = c("BELONG"))

but I get this error:

Error in `combine_vars()`:
! Faceting variables must have at least one value
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
Problem while computing `hjust = ifelse(.data$abs < max(.data$abs)/1.5, -0.1, 1.1)`.
ℹ no non-missing arguments to max; returning -Inf 

I don't use a facet (as in ggplot?). Is this a bug?

Thanks!
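
While the corr_cross error is open, the "one variable vs. all the others" correlations described above can be computed with base R alone. This is only a sketch of a workaround, not the way lares computes them: one_vs_all is a hypothetical helper, and mtcars stands in for the attached dataframe.

```r
# Hypothetical helper (not part of lares): correlate one column with all others
one_vs_all <- function(df, var) {
  # Keep numeric columns only, since cor() cannot handle other types
  num <- df[vapply(df, is.numeric, logical(1))]
  stopifnot(var %in% names(num))
  # Full correlation matrix, then the column for the variable of interest
  cors <- cor(num, use = "pairwise.complete.obs")[, var]
  # Drop the variable's correlation with itself
  cors <- cors[names(cors) != var]
  # Strongest correlations (by absolute value) first
  cors[order(abs(cors), decreasing = TRUE)]
}

res <- one_vs_all(mtcars, "mpg")
print(res)
```

The result is a named numeric vector, so it can be passed to barplot() or turned into a data frame for ggplot2.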

Customising (applying limits to fill scale for) corr_cross

Hello, hope it's okay for me to ask for help here, otherwise happy to take it down and ask in the right place - it is probably less of an issue so much as me not yet understanding all the functions taking place "under the hood" of corr_cross. I'm still fiddling with aesthetics and I can't seem to get my plots to be consistent - I have a two-colour scale I'd like to use to split positive and negative correlations, but if no negative correlations are returned my positive correlations are given the fill usually given to the negative ones. I know it's a bit of a silly problem!

I've tried adding a limits argument to scale_fill_manual, which returns:
"Warning message: Continuous limits supplied to discrete scale. ℹ Did you mean limits = factor(...) or scale_*_continuous()?"

I tried the suggested limits=factor(-1, 1), which produced a plot that looked the same as without any limits argument.

I've tried using scale_fill_gradient / scale_fill_continuous, which returns:
Error: Discrete value supplied to continuous scale

Which has very much confused me, because it looks like the fill values are Schrödinger-style discrete and continuous?
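
The two messages above suggest the fill is mapped to a discrete variable. In plain ggplot2, one way to keep colours stable for a discrete fill is to pass a named vector to scale_fill_manual, so each colour is tied to a level name rather than to the order of the levels that happen to be present. The sketch below assumes a hypothetical data frame with a sign column; the variable corr_cross actually maps to fill is not confirmed here.

```r
library(ggplot2)

# Hypothetical data: only positive correlations are present
df <- data.frame(
  pair = c("a + b", "a + c", "b + c"),
  corr = c(0.8, 0.5, 0.3)
)
# Label the sign explicitly as a discrete value
df$sign <- ifelse(df$corr < 0, "negative", "positive")

# Named values pin each colour to a level, whichever levels occur in the data
sign_colours <- c(negative = "firebrick", positive = "steelblue")

p <- ggplot(df, aes(x = reorder(pair, corr), y = corr, fill = sign)) +
  geom_col() +
  coord_flip() +
  # Character limits (not numeric ones) keep both levels in the legend
  scale_fill_manual(values = sign_colours, limits = names(sign_colours))
```

With a named vector, "positive" stays steelblue even when no negative values are plotted.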

Portfolio performance: some suggestions for functionality

Hi Bernardo,

Some suggestions for the portfolio performance functionality. Of course, you decide which you deem useful!

  • Have an option to show everything in another currency (e.g. euro, GBP...) for the plots and the reporting. In principle this is just a quality of life improvement if all investments are in the same currency anyway.
  • Add functions to calculate and plot the internal rate of return (IRR) of the full portfolio (investments + cash). The internal rate of return is the interest rate that, applied to all your deposits (annually compounded) from the time of each deposit, yields the most recent full portfolio value. For a long-term diversified portfolio, this is often estimated at around 6-8%, so it is a nice performance measure for your investment strategy. I prefer to include (unused) cash from deposits in my portfolio value since it is an integral part of one's investment strategy: when to leave more cash for investments and when not. More information on IRR. IRR can be calculated through a root-finding (optimization) approach. The plotting functionality would probably be very similar to what you already do for the ROI. Below is an example from my own calculations (IRR from start of investments up to day t, including the mean IRR over all calculations; you could also go for a daily change in IRR-graph).
    [image]
  • Optionality for foreign currencies: would require scraping FX data and incorporating that so not a quick fix according to my experience.

Thank you for your consideration!

ElKron
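
The root-finding approach mentioned in this suggestion can be sketched with base R's uniroot. This is a minimal illustration under stated assumptions, not an existing lares function: irr, its argument names, and the example figures are all hypothetical, and deposits are assumed to compound annually from the time each one was made.

```r
# Hypothetical helper (not part of lares).
# deposits:    amount of each deposit
# years_ago:   time in years between each deposit and today
# final_value: current value of the full portfolio (investments + cash)
irr <- function(deposits, years_ago, final_value, interval = c(-0.99, 10)) {
  stopifnot(length(deposits) == length(years_ago))
  # Value of all deposits compounded annually at rate r, minus the actual value
  gap <- function(r) sum(deposits * (1 + r)^years_ago) - final_value
  # The IRR is the rate at which the gap is zero
  uniroot(gap, interval = interval, tol = 1e-10)$root
}

# Illustrative figures: 1000 deposited two years ago, 1000 one year ago,
# and a portfolio worth 2310 today
rate <- irr(deposits = c(1000, 1000), years_ago = c(2, 1), final_value = 2310)
print(round(rate, 4))
```

Here 1000 * 1.1^2 + 1000 * 1.1 = 2310, so the rate found is about 0.10. Calling the function once per day with the portfolio value up to that day would give the series for an IRR-over-time plot.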

"No valid data provided" error when running lares::h2o_automl

Hello,
Read your excellent blog.
Tried reproducing your h2o_automl blog example.
Ran this code: results <- lares::h2o_automl(df = Titanic, max_models = 10)
Getting this error: "Error in roc.default(results$scores$tag, results$scores$score, ci = T) : No valid data provided."
h2o starts, begins evaluating GBM models, then halts.
Retried using code from your blog--same result

MAPE error plotting in regression model result

Hi,
I am Mahabub working as a data scientist in ALSTOM Transport, France. I have seen your package which is very nice and useful. Thanks for that.

I have an additional question: in the regression model results, the error measures shown are adjusted R squared, RMSE, and MAE. For my project, I also need to calculate and show MAPE in this plot.
So, could you please advise how I can include MAPE on the same graph?
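
For reference, MAPE itself can be computed from the actual and predicted values in a few lines of base R. The sketch below uses a hypothetical helper and made-up numbers; it does not show how lares builds its regression plot.

```r
# Hypothetical helper: mean absolute percentage error, in percent.
# MAPE is undefined when any actual value is 0, so that case is rejected.
mape <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted), all(actual != 0))
  mean(abs((actual - predicted) / actual)) * 100
}

# Made-up values for illustration
actual <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# Text label that could be shown next to the other error measures
mape_label <- paste0("MAPE: ", round(mape(actual, predicted), 2), "%")
print(mape_label)
```

If the plot is a ggplot object, such a label could be attached with ggplot2::labs(caption = mape_label); whether that replaces a caption the plot already carries is something to check.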

Error in mplot_lineal

@laresbernardo
Hi,

I am trying to use the function

lares::mplot_lineal(tag = results$.outcome, score = results2$model$finalModel$fitted.values, subtitle = "regression", model_name = "regression linear", regression = TRUE)

The data for the tag are the actual values and the data for score are the predicted values from the model. I keep getting an Error:

Error: 'mapping' must be created by 'aes()'

Any idea on why it is causing this? I just downloaded your R package and am trying to learn how to use it.

Thanks!

Problem installing the package

I am trying to install the package but the h2o package cannot be installed. What I have tried is:

  1. install.packages("lares") gives:

      Warning in install.packages : downloaded length 0 != reported length 0
      Warning in install.packages : URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip': Timeout of 60 seconds was reached
      Error in download.file(url, destfile, method, mode = "wb", ...) : download from 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip' failed
      Warning in install.packages : download of package ‘h2o’ failed

  2. Then I tried install.packages("h2o"), but I get this error:

      trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip'
      Content type 'application/zip' length 250589495 bytes (239.0 MB)
      downloaded 167.3 MB
      Warning in install.packages : downloaded length 0 != reported length 0
      Warning in install.packages : URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip': Timeout of 60 seconds was reached
      Error in download.file(url, destfile, method, mode = "wb", ...) : download from 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip' failed
      Warning in install.packages : download of package ‘h2o’ failed

  3. Then I tried install.packages("h2o", dependencies = TRUE). The errors:

      Warning in install.packages : downloaded length 0 != reported length 0
      Warning in install.packages : URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip': Timeout of 60 seconds was reached
      Error in download.file(url, destfile, method, mode = "wb", ...) : download from 'https://cran.rstudio.com/bin/windows/contrib/4.3/h2o_3.42.0.2.zip' failed
      Warning in install.packages : download of package ‘h2o’ failed

  4. Then I tried to follow the instructions from [here](https://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/Ruser/Rinstall.html). The same error.

Any ideas why I am getting this error? I can't use lares without h2o. I am using R 4.3.1, RStudio 2023.09.0 Build 463, Windows 11.
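
The log shows the 239 MB download being cut off at R's default 60-second limit ("Timeout of 60 seconds was reached"). That limit comes from R's timeout option, which can be raised for the session before retrying. A small sketch follows; the value 600 is an arbitrary choice, and it is an assumption that the slow download is the only obstacle.

```r
# Raise the download timeout from the default 60 seconds to at least 600
# (600 is an arbitrary choice; pick a value that suits the connection speed)
options(timeout = max(600, getOption("timeout")))

# Confirm the new limit for this session
print(getOption("timeout"))
```

After this, running install.packages("h2o") again in the same session gives the download up to ten minutes to finish.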

font_exist is false if ttf file in subdir

Hi Bernardo,

I use Robyn at my company and tried to fix the warning "Arial Narrow not installed". I came across the font_exists function from the lares package, and I think my problem was that the font was installed, but inside a nested directory.
So I copied it into /usr/share/fonts, which fixed the problem for me:

rproc-00:/usr/share/fonts# ls
'Arial Narrow.ttf'   cmap   cMap   fonts-go   opentype   truetype   type1   woff   X11
rproc-00:/usr/share/fonts# ls truetype/msttcorefonts/Arial*
 truetype/msttcorefonts/Arial_Black.ttf         truetype/msttcorefonts/Arial_Italic.ttf
 truetype/msttcorefonts/Arial_Bold_Italic.ttf  'truetype/msttcorefonts/Arial Narrow Regular.ttf'
 truetype/msttcorefonts/Arial_Bold.ttf          truetype/msttcorefonts/Arial.ttf

So I just wanted to let you know, because maybe the function could look in subdirectories.
Thanks, and sorry for the poorly formatted issue.
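
The subdirectory lookup suggested here could be sketched with list.files and recursive = TRUE. The helper below is hypothetical and not how font_exists is implemented in lares; its name, the default directories, and the name-matching rule are all assumptions made for illustration.

```r
# Hypothetical helper (not part of lares): find font files in nested folders
find_font_files <- function(font, dirs = c("/usr/share/fonts", "~/.fonts")) {
  # Skip directories that do not exist on this machine
  dirs <- dirs[dir.exists(path.expand(dirs))]
  # recursive = TRUE also searches subfolders such as truetype/msttcorefonts
  files <- list.files(dirs, pattern = "\\.(ttf|otf)$", recursive = TRUE,
                      full.names = TRUE, ignore.case = TRUE)
  # Compare names ignoring case and treating "_" like a space
  clean <- function(x) tolower(gsub("_", " ", x))
  files[grepl(clean(font), clean(basename(files)), fixed = TRUE)]
}

# Returns the matching file paths, or an empty vector if none are found
print(find_font_files("Arial Narrow"))
```

With the directory listing above, a search for "Arial Narrow" would also match truetype/msttcorefonts/Arial Narrow Regular.ttf, without the file having to be copied to the top level.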
