Comments (10)
But the original issue in this thread is more serious: different tasks in the same `future_map()` somehow end up with the same seeds.
from furrr.
Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment).
At least the part about why `targets` is different. Keeping open the original issue about seeds in the same `future_map()` call.
FYI @HenrikBengtsson
"The winter is coming" ... or parallel RNG is really hard when it comes to covering all scenarios. I've got a bit of a backlog and this is really hard, so I don't have time to dive into it right now. It could be a simple reason and fix for your case, or it could be something bigger that requires design changes.
FWIW, there's also HenrikBengtsson/future.apply#108, which may or may not be related.
> ... there is no such thing as a "next stream" because tasks run in a DAG instead of a linear sequence.
There's a root in the DAG, correct? If so, could you define a unique walk-through of the DAG from the root and outwards? That would allow you to order the nodes in a deterministic way (as long as the DAG does not change). With that, you could generate a unique set of RNG streams for your DAG so that they are assigned to the nodes in a deterministic way. In this sense, `lapply()`/`map()`/`foreach()` use a linear DAG where "next" is obvious (but one could come up with other walk-throughs that would also work, e.g. reverse).
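For concreteness, the walk-through idea can be sketched in a few lines of R. This is only a hedged sketch of the proposal, not something `targets` actually does; it assumes `igraph` is installed and uses a topological order as the deterministic walk-through.

```r
## Sketch: assign one L'Ecuyer stream per DAG node by walking the
## nodes in a deterministic topological order. Assumes igraph is installed.
library(igraph)
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
g <- graph_from_literal(data -+ model, data -+ summary)
nodes <- names(topo_sort(g))
streams <- vector("list", length(nodes))
streams[[1]] <- .Random.seed
for (k in seq_along(nodes)[-1]) {
  streams[[k]] <- parallel::nextRNGStream(streams[[k - 1]])
}
names(streams) <- nodes  # one stream per node, in topological order
```

As long as the DAG (and therefore the ordering) never changes, each node gets the same stream on every run.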
Yes, `igraph::topo_sort()` can easily create a sequence from a DAG.
To rephrase: the recursiveness of `nextRNGStream()` is in direct conflict with the most fundamental goals and guarantees of `targets`. `targets` is all about cache invalidation and computational reproducibility. If streams are recursive and one stream changes, the streams of half the DAG could change, and then all those targets would need to rerun.
Take a simple `targets` pipeline that begins with a dataset and produces a downstream summary:
```mermaid
graph LR
data --> summary
```
The topological sort is trivial:
```r
library(igraph)
graph <- graph_from_literal(data-+summary)
names(topo_sort(graph))
#> [1] "data"    "summary"
```
To use recursive L'Ecuyer streams, the stream of `summary` would be the `nextRNGStream()` of `data`. So far so good.
But then what if the user adds a new `model` target to the pipeline?
```mermaid
graph LR
data --> model
data --> summary
```
The topological sort changes:
```r
graph <- graph_from_literal(data-+model, data-+summary)
names(topo_sort(graph))
#> [1] "data"    "model"   "summary"
```
If streams are assigned recursively in topological order, then `summary` gets the `nextRNGStream()` of `model`, not `data`. The initial RNG state of `summary` changes, and the target needs to rerun. This is counterintuitive and inefficient because `summary` does not actually depend on the results of `model`.
In addition, there are two DAGs now: an explicit DAG for the intended dependency relationships and an implicit DAG for the extra dependency relationships induced by the RNG streams. `targets` would need to mash together both sets of edges like this:
```mermaid
graph LR
data --> model
data --> summary
model --> summary
```
In the general case, even medium-sized DAGs would contort into bizarre, unpredictable, disruptive abominations.
To prevent the whole paradigm of `targets` from breaking down, the package uses the behaviors documented in https://books.ropensci.org/targets/random.html: the seed of each target is a deterministic function of its name and a fixed global seed. And at the end of that chapter, I explain a way (suggested by @shikokuchuo) to measure overlap empirically.

If at some point there is a way to generate safer deterministic seeds independently of one another, I will switch `targets` to that.
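For illustration, a name-based scheme might look like the following. This is a hedged sketch only, not the exact function `targets` uses (see the linked chapter for the real scheme); `digest::digest2int()` is one way to hash a string to an integer deterministically.

```r
## Sketch: a deterministic seed from a target's name and a global seed.
## NOT the exact targets implementation -- see the linked chapter.
name_seed <- function(name, global_seed = 0L) {
  digest::digest2int(name, seed = global_seed)  # string -> integer, deterministic
}

set.seed(name_seed("summary"))
rnorm(1)  # same draw on every run, no matter how the pipeline changes shape
```

Because the seed depends only on the name and the global seed, adding or removing other targets cannot invalidate it.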
Just realized #265 (comment) had typos in key places. Now fixed.
A bit more context for others who might jump in: `targets` is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. `targets` does the same thing, but for data analysis. To decide whether to skip or rerun a task, `targets` looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to choose an RNG state in a permanent and independent way. The problem with recursive L'Ecuyer streams is that the RNG state would have to depend on some kind of ordering. No matter what that ordering turns out to be initially, it will have to change when the pipeline changes. And when that happens, the target-specific RNG streams change, which invalidates tasks that would otherwise be up to date.
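In pseudocode, that skip/rerun decision reduces to a handful of comparisons. This is a sketch of the idea only, not the actual `targets` internals, and the field names are made up for illustration:

```r
## Hypothetical metadata comparison; field names are illustrative only,
## not the real targets data structures.
needs_rerun <- function(old, new) {
  !identical(old$code_hash,  new$code_hash)  ||
  !identical(old$input_hash, new$input_hash) ||
  !identical(old$settings,   new$settings)   ||
  !identical(old$seed,       new$seed)  # a changed seed alone forces a rerun
}
```

The last comparison is the crux: any seed scheme that shifts when the pipeline's shape changes would flip that check and trigger spurious reruns.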
(sorry @DavisVaughan for adding noise here; @wlandau , feel free to move this over to another issue of yours, if you think there's a better place to discuss this)
I might be missing something, but the idea we use for map-reduce calls in the Futureverse is to pre-generate the RNG streams for all "tasks". This is expensive for numerous tasks, but I don't think there's another way to achieve this.
Here's the gist:
```r
## Imaginary tasks
X <- 1:20

## Tasks are processed in random order.
## Can also skip already-done tasks.
## Result will be the same regardless.
idxs <- sample.int(length(X), size = length(X))

## Pre-generate deterministic RNG streams for _all_ tasks
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk - 1]])

## Process tasks in the order given above with fully deterministic RNG seeds
y <- rep(NA_real_, times = length(X))
for (kk in idxs) {
  ## Use the deterministic RNG stream for this task
  seed <- seeds[[kk]]
  assign(".Random.seed", value = seed, envir = globalenv(), inherits = FALSE)
  y[kk] <- rnorm(n = 1L)
}
```
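One way to convince yourself the gist really is order-independent (an added check, not part of the original snippet): run the processing loop with two different permutations and compare the results.

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
X <- 1:20

## Pre-generate one stream per task, as in the gist above
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk - 1]])

run_in_order <- function(idxs) {
  y <- rep(NA_real_, times = length(X))
  for (kk in idxs) {
    assign(".Random.seed", value = seeds[[kk]], envir = globalenv(), inherits = FALSE)
    y[kk] <- rnorm(n = 1L)
  }
  y
}

y1 <- run_in_order(seq_along(X))       # forward order
y2 <- run_in_order(rev(seq_along(X)))  # reverse order
identical(y1, y2)  # TRUE: per-task streams make the processing order irrelevant
```

Each task draws only from its own pre-assigned stream, so the schedule cannot affect the numbers.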
> feel free to move this over to another issue of yours, if you think there's a better place to discuss this
Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment).