Comments (10)
But the original issue in this thread is more serious: different tasks in the same `future_map()` somehow end up with the same seeds.
from furrr.
Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment).
At least the part about why `targets` is different. Keeping open the original issue about seeds in the same `future_map()` call.
FYI @HenrikBengtsson
"The winter is coming" ... or parallel RNG is really hard when it comes to covering all scenarios. I've got a bit of a backlog and this is really hard, so I don't have time to dive into it right now. It could be a simple reason and fix for your case, or it could be something bigger that requires design changes.
FWIW, there's also HenrikBengtsson/future.apply#108, which may or may not be related.
> ... there is no such thing as a "next stream" because tasks run in a DAG instead of a linear sequence.
There's a root in the DAG, correct? If so, could you define a unique walk-through of the DAG from the root and outwards? That would allow you to order the nodes in a deterministic way (as long as the DAG does not change). With that, you could generate a unique set of RNG streams for your DAG so that they are assigned to the nodes in a deterministic way. In this sense, `lapply()`/`map()`/`foreach()` use a linear DAG where "next" is obvious (but one could come up with other walk-throughs that would also work, e.g. reverse).
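For concreteness, the walk-through idea can be sketched in a few lines of R. This is only a hedged sketch of the proposal, not something `targets` actually does; it assumes `igraph` is installed and uses a topological order as the deterministic walk-through.

```r
## Sketch: assign one L'Ecuyer stream per DAG node by walking the
## nodes in a deterministic topological order. Assumes igraph is installed.
library(igraph)
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
g <- graph_from_literal(data -+ model, data -+ summary)
nodes <- names(topo_sort(g))
streams <- vector("list", length(nodes))
streams[[1]] <- .Random.seed
for (k in seq_along(nodes)[-1]) {
  streams[[k]] <- parallel::nextRNGStream(streams[[k - 1]])
}
names(streams) <- nodes  # one stream per node, in topological order
```

As long as the DAG (and therefore the ordering) never changes, each node gets the same stream on every run.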
Yes, `igraph::topo_sort()` can easily create a sequence from a DAG.
To rephrase: the recursiveness of `nextRNGStream()` is in direct conflict with the most fundamental goals and guarantees of `targets`. `targets` is all about cache invalidation and computational reproducibility. If streams are recursive and one stream changes, the streams of half the DAG could change, and then all those targets would need to rerun.
Take a simple `targets` pipeline that begins with a dataset and produces a downstream summary:
```mermaid
graph LR
data --> summary
```
The topological sort is trivial:
```r
library(igraph)
graph <- graph_from_literal(data-+summary)
names(topo_sort(graph))
#> [1] "data"    "summary"
```
To use recursive L'Ecuyer streams, the stream of `summary` would be the `nextRNGStream()` of `data`. So far so good.
But then what if the user adds a new `model` target to the pipeline?
```mermaid
graph LR
data --> model
data --> summary
```
The topological sort changes:
```r
graph <- graph_from_literal(data-+model, data-+summary)
names(topo_sort(graph))
#> [1] "data"    "model"   "summary"
```
If streams are assigned recursively in topological order, then `summary` gets the `nextRNGStream()` of `model`, not `data`. The initial RNG state of `summary` changes, and the target needs to rerun. This is counterintuitive and inefficient because `summary` does not actually depend on the results of `model`.
In addition, there are two DAGs now: an explicit DAG for the intended dependency relationships and an implicit DAG for the extra dependency relationships induced by the RNG streams. `targets` would need to mash together both sets of edges like this:
```mermaid
graph LR
data --> model
data --> summary
model --> summary
```
In the general case, even medium-sized DAGs would contort into bizarre, unpredictable, disruptive abominations.
To prevent the whole paradigm of `targets` from breaking down, the package uses the behaviors documented in https://books.ropensci.org/targets/random.html: the seed of each target is a deterministic function of its name and a fixed global seed. And at the end of that chapter, I explain a way (suggested by @shikokuchuo) to measure overlap empirically.

If at some point there is a way to generate safer deterministic seeds independently of one another, I will switch `targets` to that.
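For illustration, a name-based scheme might look like the following. This is a hedged sketch only, not the exact function `targets` uses (see the linked chapter for the real scheme); `digest::digest2int()` is one way to hash a string to an integer deterministically.

```r
## Sketch: a deterministic seed from a target's name and a global seed.
## NOT the exact targets implementation -- see the linked chapter.
name_seed <- function(name, global_seed = 0L) {
  digest::digest2int(name, seed = global_seed)  # string -> integer, deterministic
}

set.seed(name_seed("summary"))
rnorm(1)  # same draw on every run, no matter how the pipeline changes shape
```

Because the seed depends only on the name and the global seed, adding or removing other targets cannot invalidate it.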
Just realized #265 (comment) had typos in key places. Now fixed.
A bit more context for others who might jump in: `targets` is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. `targets` does the same thing, but for data analysis. To decide whether to skip or rerun a task, `targets` looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to choose an RNG state in a permanent and independent way. The problem with recursive L'Ecuyer streams is that the RNG state would have to depend on some kind of ordering. No matter what that ordering turns out to be initially, it will have to change when the pipeline changes. And when that happens, the target-specific RNG streams change, which invalidates tasks that would otherwise be up to date.
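In pseudocode, that skip/rerun decision reduces to a handful of comparisons. This is a sketch of the idea only, not the actual `targets` internals, and the field names are made up for illustration:

```r
## Hypothetical metadata comparison; field names are illustrative only,
## not the real targets data structures.
needs_rerun <- function(old, new) {
  !identical(old$code_hash,  new$code_hash)  ||
  !identical(old$input_hash, new$input_hash) ||
  !identical(old$settings,   new$settings)   ||
  !identical(old$seed,       new$seed)  # a changed seed alone forces a rerun
}
```

The last comparison is the crux: any seed scheme that shifts when the pipeline's shape changes would flip that check and trigger spurious reruns.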
(sorry @DavisVaughan for adding noise here; @wlandau , feel free to move this over to another issue of yours, if you think there's a better place to discuss this)
I might be missing something, but the idea we use for map-reduce calls in the Futureverse is to pre-generate the RNG streams for all "tasks". This is expensive for numerous tasks, but I don't think there's another way to achieve this.
Here's the gist:
```r
## Imaginary tasks
X <- 1:20

## Tasks are processed in random order.
## Can also skip already-done tasks.
## Result will be the same regardless.
idxs <- sample.int(length(X), size = length(X))

## Pre-generate deterministic RNG streams for _all_ tasks
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk - 1]])

## Process tasks in the order given above with fully deterministic RNG seeds
y <- rep(NA_real_, times = length(X))
for (kk in idxs) {
  ## Use the deterministic RNG stream for this task
  seed <- seeds[[kk]]
  assign(".Random.seed", value = seed, envir = globalenv(), inherits = FALSE)
  y[kk] <- rnorm(n = 1L)
}
```
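One way to convince yourself the gist really is order-independent (an added check, not part of the original snippet): run the processing loop with two different permutations and compare the results.

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
X <- 1:20

## Pre-generate one stream per task, as in the gist above
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk - 1]])

run_in_order <- function(idxs) {
  y <- rep(NA_real_, times = length(X))
  for (kk in idxs) {
    assign(".Random.seed", value = seeds[[kk]], envir = globalenv(), inherits = FALSE)
    y[kk] <- rnorm(n = 1L)
  }
  y
}

y1 <- run_in_order(seq_along(X))       # forward order
y2 <- run_in_order(rev(seq_along(X)))  # reverse order
identical(y1, y2)  # TRUE: per-task streams make the processing order irrelevant
```

Each task draws only from its own pre-assigned stream, so the schedule cannot affect the numbers.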
> feel free to move this over to another issue of yours, if you think there's a better place to discuss this
Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment).