Giter Site home page Giter Site logo

Comments (10)

wlandau avatar wlandau commented on July 29, 2024 1

But the original issue in this thread is more serious: different tasks in the same future_map() somehow end up with the same seeds.

from furrr.

wlandau avatar wlandau commented on July 29, 2024 1

Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment).

At least the part about why targets is different. Keeping open the original issue about seeds in the same future_map() call.

from furrr.

wlandau avatar wlandau commented on July 29, 2024

FYI @HenrikBengtsson

from furrr.

HenrikBengtsson avatar HenrikBengtsson commented on July 29, 2024

"The winter is coming" ... or parallel RNG is really hard when it comes to cover all scenario. I've got a bit of a backlog and this is really hard, so I don't have time to dive into it right now. It could be a simple reason and fix for your case, or it could be something bigger that requires design changes.

FWIW, there's also HenrikBengtsson/future.apply#108, which may or may not be related.

from furrr.

HenrikBengtsson avatar HenrikBengtsson commented on July 29, 2024

... there is no such thing as a "next stream" because tasks run in a DAG instead of a linear sequence.

There's a root in the DAG, correct? If so, could you define a unique walk-through of the DAG from the root and outwards? That would allow you to order the nodes in a deterministic way (as long as the DAG does not change). With that, you could generate a unique set of RNG streams for your DAG so that they are assigned to the nodes in a deterministic way. In this sense, lapply/map/foreach uses a linear DAG where "next" is obvious (but one could come up with other walk-throughs that would also work, e.g. reverse)

from furrr.

wlandau avatar wlandau commented on July 29, 2024

Yes, igraph::too_sort() can easily create a sequence from a DAG.

To rephrase: the recursiveness of nextRNGStream() is in direct conflict with the most fundamental goals and guarantees of targets. targets is all about cache invalidation and computational reproducibility. If streams are recursive and one stream changes, the streams of half the DAG could change, and then all those targets would need to rerun.

Take a simple targets pipeline that begins with a dataset and produces a downstream summary:

graph LR
  data --> summary
Loading

The topological sort is trivial:

library(igraph)
graph <- graph_from_literal(data-+summary)
names(topo_sort(graph))
#> [1] "data"    "summary"

To use recursive L'Ecuyer streams, the stream of summary would be the nextRNGStream() of data. So far so good.

But then what if the user adds a new model target to the pipeline?

graph LR
  data --> model
  data --> summary
Loading

The topological sort changes:

graph <- graph_from_literal(data-+model, data-+summary)
names(topo_sort(graph))
#> [1] "data"    "model"   "summary"

If streams are assigned recursively in topological order, then summary gets the nextRNGStream() of model, not data. The initial RNG state of summary changes, and the target needs to rerun. This is counterintuitive and inefficient because summary does not actually depend on the results of model.

In addition, there are two DAGs now: an explicit DAG for the intended dependency relationships and an implicit DAG for the extra dependency relationships induced by the RNG streams. targets would need to mash together both sets of edges like this:

graph LR
  data --> model
  data --> summary
  model --> summary
Loading

In the general case, even medium-sized DAGs would contort into bizarre, unpredictable, disruptive abominations.

To prevent the whole paradigm of targets from breaking down, the package uses the behaviors documented in https://books.ropensci.org/targets/random.html: the seed of each target is a deterministic function of its name and a fixed global seed. And at the end of that chapter, I explain a way (suggested by @shikokuchuo) to measure overlap empirically.

If at some point there is a way to generate safer deterministic seeds independently of one another, I will switch targets to that.

from furrr.

wlandau avatar wlandau commented on July 29, 2024

Just realized #265 (comment) had typos in key places. Now fixed.

from furrr.

wlandau avatar wlandau commented on July 29, 2024

A bit more context for others who might jump in: targets is essentially GNU Make for R. When you compile a project in Make, if the time stamp of a source file is earlier than that of its object file, then Make skips the target. targets does the same thing, but for data analysis. To decide whether to skip or rerun a task, targets looks at the R code, the input data, the output data, other settings, and the RNG seed. If any of these things change, the target reruns. Otherwise, the pipeline skips the target to save time. So it's really important that a target be able to choose an RNG state in a permanent and independent way. The problem with recursive L'Ecuyer streams is that the RNG state would have depend on some kind of ordering. No matter what that ordering turns out to be initially, it will have to change when the pipeline changes. And when that happens, the target-specific RNG streams change, which invalidates tasks that would otherwise be up to date.

from furrr.

HenrikBengtsson avatar HenrikBengtsson commented on July 29, 2024

(sorry @DavisVaughan for adding noise here; @wlandau , feel free to move this over to another issue of yours, if you think there's a better place to discuss this)

I might miss something, but the idea that we use for map-reduce calls in Futureverse is to pre-generate the RNG streams for all "tasks". This is expensive for numerous tasks, but I don't think there's another way to achieve this.

Here's the gist:

## Imaginary tasks
X <- 1:20

## Tasks are processed in random order.
## Can also skip already done tasks.
## Result will be the same regardless.
idxs <- sample.int(length(X), size = length(X))

## Pre-generate deterministic RNG streams for _all_ tasks
RNGkind("L'Ecuyer-CMRG")
set.seed(42)
seeds <- list()
seeds[[1]] <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
for (kk in 2:length(X)) seeds[[kk]] <- parallel::nextRNGStream(seeds[[kk-1]])


## Process tasks order give above with fully deterministic RNG seeds
y <- rep(NA_real_, times = length(X))
for (kk in idxs) {
  ## Use deterministic RNG stream for this task
  seed <- seeds[[kk]]
  assign(".Random.seed", value = seed, envir = globalenv(), inherits = FALSE)

  y[kk] <- rnorm(n = 1L)
}

from furrr.

wlandau avatar wlandau commented on July 29, 2024

feel free to move this over to another issue of yours, if you think there's a better place to discuss this

Moved to ropensci/targets#1139, starting with ropensci/targets#1139 (comment).

from furrr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.