
jhrcook / mustashe


A system for stashing and loading the results of long running computations.

Home Page: https://jhrcook.github.io/mustashe/

License: GNU General Public License v3.0

R 100.00%
r rlang cran package cache stash mustashe hacktoberfest

mustashe's Introduction

Joshua Cook

Computational genomics researcher at Vertex Pharmaceuticals.

jhrcook joshdoesa ORCID

For more information about me, my interests, and my hobbies, check out my website, CV, or resume (updates and tailored resumes coming soon). Feel free to get in touch!


👨🏻‍💻 Projects

Currently, I'm working on the 2023 Advent of Code in my attempt to learn the Rust programming language.

To learn new skills and play with fun tools, I have worked on a variety of personal side-projects including

I've also made small contributions to other projects such as snakemake and the Fig autocomplete repository.

📝 Most recent blog posts

All of my blog posts can be found here.



mustashe's Issues

Using stash() inside a function causes an error, "object 'yyy' not found"

I love mustashe, its here() awareness fits very neatly with my workflow.

However, now I want to use stash() inside a function, but code that works in the global environment does not work inside a function. The issue probably has to do with the environment in which the stash code is evaluated, but since I don't see any parameters to change that, I'm filing this as an issue. Below is a full reprex illustrating the problem.

A few points:

  • Renaming from xxx to yyy is only done to not get leakage of variable names across scope.
  • In my actual use case, I want to set yyy based on some logic inside the function and use depends_on to trigger a restash if the contents of yyy have changed. This results in a different error, but with the same symptoms (works in the global environment but not inside a function).
  • I'm including in the reprex both versions, with and without depends_on.
library(mustashe)
## clear_stash() # Commented in posted version to protect from inadvertent clearing
#> Clearing stash.
set.seed(42)

xxx <- letters
stash("result_xxx", {sample(xxx,1)}, functional=TRUE ) |> print()
#> Stashing object.
#> [1] "q"

funky <- function() {
  yyy <- letters
  stash("result_yyy", {sample(yyy,1)}, functional=TRUE ) |> print()
}
funky()
#> Stashing object.
#> Error in sample(yyy, 1): object 'yyy' not found

xxxx <- letters
stash("result_xxxx", {sample(xxxx,1)}, functional=TRUE, depends_on="xxxx" ) |> print()
#> Stashing object.
#> [1] "e"

funky2 <- function() {
  yyyy <- letters
  stash("result_yyyy", {sample(yyyy,1)}, functional=TRUE, depends_on="yyyy" ) |> print()
}
funky2()
#> Error in make_hash(depends_on, .TargetEnv): Some dependencies are missing from the environment.

Created on 2022-03-01 by the reprex package (v2.0.1)
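
The symptoms in this reprex are consistent with the stashed expression being evaluated in an environment that cannot see the function's local variables. The following base-R sketch reproduces that effect without mustashe. It makes no claim about mustashe's internals, and run_global() and run_caller() are hypothetical helpers written only for this illustration.

```r
# Hypothetical helpers (not part of mustashe) that evaluate a captured
# expression in two different environments.
run_global <- function(code) {
  expr <- substitute(code)          # capture the unevaluated expression
  eval(expr, envir = globalenv())   # only sees global variables
}

run_caller <- function(code) {
  expr <- substitute(code)
  env <- parent.frame()             # environment of whoever called run_caller()
  eval(expr, envir = env)           # sees the caller's local variables
}

funky <- function(runner) {
  yyy <- c("a", "b", "c")           # local to funky()
  runner(length(yyy))
}

# Evaluating in the global environment fails because yyy is local to funky().
global_result <- tryCatch(
  funky(run_global),
  error = function(e) conditionMessage(e)
)

# Evaluating in the caller's environment finds yyy.
caller_result <- funky(run_caller)
```

Here global_result holds the error message about the missing object, while caller_result holds the length of the local vector. A parameter that lets the user choose the evaluation environment would follow the run_caller() pattern.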

An option for stash to return invisibly the object itself

Since stash is currently returning NULL, a simple use for the return value would be the result of the expression. This would allow people to use stash in a more functional-programming way.

Here's a simplified example of some work I was trying to cache earlier.

library(mustashe)
library(ggplot2)
library(patchwork)

long_running_function <- function(i) {
  data.frame(x = runif(i^2), y = rnorm(i^2))
}

myplots <- list()
for(i in 1:4) {
  stash(paste0("df", i), {
    df <- long_running_function(i)
  })
  myplots[[i]] <- ggplot(df, aes(x = x,y = y)) + geom_point()
}
patchwork::wrap_plots(myplots) + plot_layout(ncol=2)

(This doesn't work since I misinterpreted the documentation)

What I'd like is something like this:

for(i in 1:4) {
  df <- stash(paste0("df", i), {
    long_running_function(i)
  }, functional = TRUE)
  myplots[[i]] <- ggplot(df, aes(x = x,y = y)) + geom_point()
}

In my thinking, the functional parameter would prevent the assign call to the global environment and would instead return the value.
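
To make the requested behavior concrete, here is a minimal base-R sketch of a cache that returns its value instead of assigning it. It is not the mustashe API: cache_value() and its dir argument are hypothetical, it stores values with saveRDS() in the session's temporary directory, and, unlike stash(), it does not check whether the code has changed.

```r
# Hypothetical helper (not the mustashe API): cache a value on disk and
# return it instead of assigning it into an environment.
cache_value <- function(name, compute,
                        dir = file.path(tempdir(), "cache-sketch")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0(name, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))           # cache hit: load the stored value
  }
  value <- compute()                # cache miss: run the computation
  saveRDS(value, path)
  value
}

long_running_function <- function(i) {
  data.frame(x = runif(i^2), y = rnorm(i^2))
}

# Each call returns its data frame, so the results can be collected directly.
dfs <- lapply(1:4, function(i) {
  cache_value(paste0("df", i), function() long_running_function(i))
})
```

Because the value is returned, the loop body can use it immediately (for example, to build a plot) without relying on a variable appearing in the global environment.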

stash() is unable to open .mustashe/x.hash

I am unable to use the stash function below.

require(mustashe)

stash("x", {
  Sys.sleep(5)
  x <- 5
})

While following a tutorial on your website, I received the following error:

Stashing object.
Error in c_qsave(x, file, preset, algorithm, compress_level, shuffle_control,  :
  Failed to open .mustashe/x.hash. Check file path.
In addition: Warning message:
In dir.create(.stash_dir, recursive = TRUE) :
  cannot create dir '.mustashe', reason 'Permission denied'

I used install.packages("mustashe").

What did I do wrong?
Has this issue, or a similar one, occurred in the past?

Any help would be appreciated.
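
The warning "cannot create dir '.mustashe', reason 'Permission denied'" suggests that the R process could not write to the directory in which it tried to create the stash folder. A quick base-R check, offered as a diagnostic sketch rather than a confirmed fix (can_write() is a hypothetical helper):

```r
# Hypothetical helper: TRUE if the current R process may write to `dir`.
can_write <- function(dir = getwd()) {
  unname(file.access(dir, mode = 2) == 0)   # mode 2 tests write permission
}

here_ok <- can_write()            # FALSE would match the "Permission denied" warning
temp_ok <- can_write(tempdir())   # the session's temporary directory is normally writable
```

If here_ok is FALSE, one thing to try is switching to a writable project directory with setwd() before calling stash().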

Failing macOS CI test on development version of R

Currently, the CI tests on macOS using the development version of R are failing. This is because the development version is not available for this operating system. It is available for Windows and Ubuntu, and the tests on these operating systems are successful. This issue will presumably be resolved when the development version of R reaches a stable beta.

If the functions in my code change?

Thank you for this package. Very useful.

I would like your advice.

In my code, I use functions from a package that is in active development. I want to re-run if any of the functions change. How would you set it up? Thanks.

Is renv needed?

I am finding that there is a lot of extra time needed to update the environment for renv (due to downloading old packages, which for some reason took a long time), and I wonder if it is needed and worth the extra time and complexity.

For a standalone data analysis project, renv makes a lot of sense, because you want to be able to run it without updating it to reflect any changes to the code due to updated dependency packages. For a package, however, if it does not work with the most recent versions of dependency packages, that would probably be considered a bug, and it will cause the package to be removed from CRAN eventually.

I therefore suggest removing renv from the project (a simple renv::deactivate() would update the .Rprofile file to do this).

[Mustashe Package] Stash function does not recognize user-defined function parameter 'vec'

Hey, it's @MichaelSodeke again. I opened a previous issue (now closed). I have a new issue LOL 😀.

Attempt

I am trying to create a data frame containing all region names and their coordinates.
I attempted to automate this task with a foreach loop, using the doParallel package,
and created a user-defined function called getBounds parameterized by vec, which
takes in a vector of country names from the variable cntry. However, this process is too
slow, so I decided to use the stash function from the mustashe package. For
some reason, the vec parameter is not recognized within stash for each variable:
lon1, lat1, lon2, lat2.

Why is the stash function not recognizing the vec parameter?


Code

# load packages
suppressMessages( require(dplyr) )
suppressMessages( require(doParallel) )
suppressMessages( require(mustashe) )
suppressMessages( require(tictoc) )
suppressMessages( require(maps) )
suppressMessages( require(mapdata) )

# create user-defined function to get bounds for each country
getBounds <- function(vec=NULL)
{
	lon1 <- NULL
	lat1 <- NULL
	lon2 <- NULL
	lat2 <- NULL

	no_cores <- detectCores() - 1
	cl <- makePSOCKcluster(no_cores)
	registerDoParallel(cl,cores=no_cores)

	tic()
	stash("lon1", {
		Sys.sleep(5)
		lon1 <- foreach (i=1:length(vec),.combine='c') %dopar% { maps::map("world",fill=F,region=vec[i],plot=F)$range[1] }
	})
	toc()

	tic()
	stash("lat1", {
		Sys.sleep(5)
		lat1 <- foreach (i=1:length(vec),.combine='c') %dopar% { maps::map("world",fill=F,region=vec[i],plot=F)$range[3] }
	})
	toc()

	tic()
	stash("lon2", {
		Sys.sleep(5)
		lon2 <- foreach (i=1:length(vec),.combine='c') %dopar% { maps::map("world",fill=F,region=vec[i],plot=F)$range[2] }
	})
	toc()

	tic()
	stash("lat2", {
		Sys.sleep(5)
		lat2 <- foreach (i=1:length(vec),.combine='c') %dopar% { maps::map("world",fill=F,region=vec[i],plot=F)$range[4] }
	})
	toc()
	
	stopCluster(cl)

	cbind(lon1,lat1,lon2,lat2)
}

Results

# get country names
cntry <- map("world",fill=F,plot=F)$names

# implement user-defined function
tic()
location <- getBounds(vec=cntry)
toc()


Stashing object.
Error in eval(a, envir = extra, enclos = obj$evalenv) :
  object 'vec' not found
> toc()
5.38 sec elapsed

If I can get the above to work, then I can use the location variable to create the
desired data frame.

df <- data.frame(cntry,location)
head(df)

Logically, vec should be recognized, but for unknown reasons this is not the case.
How can I fix this issue?

CRAN submission comments

The following are the comments from CRAN after the first submission attempt:

  1. Please omit the redundant "in R" from the title.
  2. Please add \value to .Rd files regarding exported methods and explain the functions results in the documentation. Please write about the structure of the output (class) and also what the output means. (If a function does not return a value, please document that too, e.g. \value{No return value, called for side effects} or similar)
  3. \dontrun{} should only be used if the example really cannot be executed (e.g. because of missing additional software, missing API keys, ...) by the user. That's why wrapping examples in \dontrun{} adds the comment ("# Not run:") as a warning for the user. Does not seem necessary. Please replace \dontrun with \donttest.
  4. Please do not modify the .GlobalEnv. This is not allowed by the CRAN policies.

feature suggestion: option to save stashed objects to memory

I have a large dataframe that is built by doing computationally expensive joins of other dataframes. I have a function that returns this dataframe that looks like this:

data_df <- function(options) {
  df <- mustashe::stash(
    "stash_dataDF",
    {
      df <- data.frame() # lots of expensive joins here
    },
    depends_on = c(getOption("DFs")),
    functional = TRUE
  )
  if (missing(options)) {
    return(df)
  } else {
    # do some post-processing of df based on options
    return(df)
  }
}

When I run this and the stash lookup finds that stash_dataDF can be loaded, it has to read stash_dataDF.qs from disk. In my case, this takes between 0.25 and 0.5 seconds, which isn't much but adds up when data_df() is called multiple times within a larger workflow. Would it be possible to add an option to keep stash_dataDF as an in-memory object and then use the same hash checking as is currently done? stash_dataDF would still need to be either built fresh or loaded from stash_dataDF.qs at the beginning of each session, but after that, access should be much faster than reading it from disk each time.
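
As a rough illustration of the idea, the sketch below keeps values in a session-level environment and only falls back to a slow loader when a supplied key changes. Everything here is an assumption made for the example: mem_cached(), .mem_cache, and the key argument are hypothetical and not part of mustashe, and the key stands in for whatever hash check the package performs.

```r
# Hypothetical in-memory layer (not part of mustashe): values live in a
# session-level environment and are reloaded only when the key changes.
.mem_cache <- new.env(parent = emptyenv())

mem_cached <- function(name, key, load) {
  entry <- get0(name, envir = .mem_cache, inherits = FALSE)
  if (!is.null(entry) && identical(entry$key, key)) {
    return(entry$value)             # fast path: no disk access
  }
  value <- load()                   # slow path: e.g. read from disk or rebuild
  assign(name, list(key = key, value = value), envir = .mem_cache)
  value
}

loads <- 0
slow_load <- function() {
  loads <<- loads + 1               # count how often the slow path runs
  data.frame(id = 1:3)
}

first  <- mem_cached("stash_dataDF", key = "v1", load = slow_load)
second <- mem_cached("stash_dataDF", key = "v1", load = slow_load)  # from memory
third  <- mem_cached("stash_dataDF", key = "v2", load = slow_load)  # key changed
```

In this sketch the slow loader runs twice: once for the first request and once after the key changes. The trade-off is memory use, since every cached object stays resident for the whole session.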

Cannot add functions as dependencies

Due to how functions are compiled in R (beyond my understanding of R internals), functions cannot be included as dependencies. When they are called, their digest() codes change.

library(digest)

fxn <- function(x) {
    x ** 2
}

print(digest(fxn))
#> "096a999044001f53680df00e72c01df4"
print(digest(fxn))
#> "096a999044001f53680df00e72c01df4"
a <- fxn(2)
print(digest(fxn))
#> "468fa787fc71aa60fe373bd0bf6f0fc7"
a <- fxn(2)
print(digest(fxn))
#> "d9aef36674c50dcfb03a6fae3f20ce8c"
a <- fxn(4)
print(digest(fxn))
#> "d9aef36674c50dcfb03a6fae3f20ce8c"

This is because of the following difference:

fxn <- function(x) {
    x ** 2
}

fxn
#> function(x) {
#>        x ** 2
#>    }

a <- fxn(2)
a <- fxn(2)

fxn
#> function(x) {
#>         x ** 2
#>     }
#> <bytecode: 0x7f80b3f24ee8>

This may be able to be fixed by accounting for the bytecode with class(var) == "function" in mustashe::make_hash().
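
One possible direction, sketched in base R under stated assumptions, is to describe a function by its arguments and body only, so that attached bytecode does not affect the result. function_signature() is a hypothetical helper, not mustashe's make_hash(), and it deliberately ignores the function's enclosing environment, so changes to variables captured by a closure would go unnoticed.

```r
# Hypothetical helper (not mustashe's make_hash()): describe a function by
# its arguments and body only, ignoring bytecode and environment.
function_signature <- function(f) {
  stopifnot(is.function(f))
  paste(c(deparse(formals(f)), deparse(body(f))), collapse = "\n")
}

fxn <- function(x) {
  x ** 2
}

before <- function_signature(fxn)
for (i in 1:5) fxn(i)               # repeated calls give the JIT a chance to compile fxn
after <- function_signature(fxn)

# A function with a different body yields a different signature.
fxn2 <- function(x) {
  x ** 3
}
changed <- function_signature(fxn2)
```

The resulting string could then be hashed in place of the function object itself, which would keep the dependency check stable across calls while still detecting edits to the function's code.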

Integration with 'log4r'

Integration with the 'log4r' package so that the messages can go to the logger.
It would also be nice to have the name of the object in the output message.

Add pre-commit

Add pre-commit to check for styling of vignettes, R code, and MD.

Configuration option to set default for the `functional=` argument in `stash()`?

Would it be useful to be able to set the default for the functional argument in stash("var", { ... }, functional = .default_functional)? The implementation would be easy enough and based off of that for the 'here' integration. It seems like some people generally want it to be TRUE while others want the original default behavior with default FALSE.

Thoughts @torfason?

CRAN submission issue

@torfason I tried to submit the latest version of 'mustashe' to CRAN a while back, but it got rejected because during the test-running process, it was leaving behind a .mustashe directory somewhere. I couldn't reproduce the issue on my computer and didn't have the bandwidth to pursue it further at the time. If memory serves, it was only an issue on CRAN's Linux servers, so I've been meaning to try the R CMD process on a Linux computer I have access to, but just haven't gotten around to it yet.

Just wanted to update you on this because I feel bad that the latest features you've contributed haven't been distributed yet. It is on my list of things to do though. I apologize it has taken so long.
