ropensci-books / drake

The user manual for the drake R package

Home Page: https://books.ropensci.org/drake/

License: GNU General Public License v3.0

R 27.19% Shell 9.95% TeX 12.37% HTML 13.43% CSS 37.05%

reproducibility high-performance-computing r data-science drake makefile pipeline workflow reproducible-research rstats

drake's Introduction

Consider targets

drake is superseded. The targets R package is the long-term successor of drake, and it is more robust and easier to use. Please visit https://books.ropensci.org/targets/drake.html for full context and advice on transitioning.

The drake R package user manual

This is the development repository of the drake R package user manual, hosted at https://books.ropensci.org/drake/. Please feel free to open discussions on the issue tracker and submit pull requests that add new examples or update old ones. The environment for collaboration should be friendly, inclusive, respectful, and safe for everyone, so all participants must obey this repository's code of conduct.

drake's Issues

Reproducible random numbers

Ref: https://stackoverflow.com/questions/53268458/halting-drake-plan-makes-it-rebuild-targets-it-already-had-built-previously
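
A minimal sketch of the behavior this issue asks the manual to cover, assuming a drake 7.x installation and a throwaway working directory. drake sets a seed for each target based on the target's name and a project-level seed (the seed argument of make()), so a target that draws random numbers should come back identical after it is cleaned and rebuilt. The plan and the target name draws are made up for illustration.

```r
library(drake)

# Work in a scratch directory so the .drake/ cache does not touch a real project.
scratch <- file.path(tempdir(), "drake-seed-demo")
dir.create(scratch, showWarnings = FALSE)
old_wd <- setwd(scratch)

# Hypothetical one-target plan that depends on the random number generator.
plan <- drake_plan(draws = rnorm(3))

make(plan, seed = 2018)
first <- readd(draws)

# Remove the target from the cache and build it again with the same seed.
clean(draws)
make(plan, seed = 2018)
second <- readd(draws)

# TRUE if the rebuilt target reproduces the original draws.
same_draws <- identical(first, second)
print(same_draws)

setwd(old_wd)
```

Because the seed belongs to the target rather than to the session, calling set.seed() before make() should not be needed for this to hold.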

CI with netlify and travis

Update static image functionality in the visualization chapter

ropensci/drake#481

Document drake_debug()

Just a mention next to drake_build() would suffice.

Authorship

Many of you have had a hand in writing and/or reviewing the documentation that became the user manual.

I have included you all in the DESCRIPTION file of the user manual. If you would like to be removed, please let me know or submit a pull request.

Dedicated vignette on wildcard templating

Ref: ropensci/drake#388. cc @lorenzwalthert

Guidance on knitr files

1. Using drake inside an R notebook.
2. knitr files inside targets.
3. knitr dependency detection:
   - loadd() and readd() for ordinary objects
   - file_in() and file_out() for ordinary input/output files
   - knitr_in() for nested input files.

I have neglected to document most of (2), and I think people would be interested.

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")

writeLines(con = "master.Rmd", text = paste(
  '---',
  'title: "master"',
  'output: html_document',
  '---\n',

  '```{r master_chunk}',
  'loadd(loadd_master)',
  'readd(readd_master)',
  'file_in("in_master.txt")',
  'file_out("out_master.txt")',
  'knitr_in("subordinate.Rmd")',
  '```',
  sep = "\n"
))

writeLines(con = "subordinate.Rmd", text = paste(
  '---',
  'title: "subordinate"',
  'output: html_document',
  '---\n',
  
  '```{r subordinate_chunk}',
  'loadd(loadd_subordinate)',
  'readd(readd_subordinate)',
  'file_in("in_subordinate.txt")',
  'file_out("out_subordinate.txt")',
  '```',
  sep = "\n"
))

tmp <- file.create(c(
  "in_master.txt",
  "out_master.txt",
  "in_subordinate.txt",
  "out_subordinate.txt"
))

plan <- drake_plan(
  x = render(knitr_in("master.Rmd")),
  loadd_master = 1,
  readd_master = 2,
  loadd_subordinate = 3,
  readd_subordinate = 4
)
config <- drake_config(plan)
vis_drake_graph(config)

Created on 2018-12-20 by the reprex package (v0.2.1)

OOO

Thread for marking OOO time.

Do not refresh the FAQ if GITHUB_PAT env var is missing

Ref: #44, https://travis-ci.org/ropenscilabs/drake-manual/builds/442146239#L1260

New example chapter: remote data sources + OSF + triggers

Refs:

cc: @tiernanmartin

Document how to set template options for clustermq parallelism

Ref: ropensci/drake#543 (comment)

Explain how to find out why targets are out of date

Ref: ropensci/drake#496

Organize the chapters

Some chapters are example data analysis projects, and some focus on the technical details of drake. I think we should divide chapters into examples and non-examples. Is it possible to divide chapters into sections? In bookdown, it seems like sections are one level lower than chapters. As more examples accumulate (ref: #19, #20), we may need to have one bookdown repo for the manual and another for the examples.

Node clustering in graph visuals

Ref: ropensci/drake#463, ropensci/drake#229

Show code_to_plan(), plan_to_code(), and plan_to_notebook()

In the chapter on drake plans, these functions could help demonstrate the relationships between plans and traditional code files. Ref: ropensci/drake#547, #41.
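
As a rough illustration of that relationship, the sketch below round-trips a small plan through an ordinary R script. It assumes a drake version that exports plan_to_code() and code_to_plan() (6.2.0 or later); the plan itself and the temporary file are made up for the example.

```r
library(drake)

# Hypothetical two-target plan: `fit` depends on `raw`.
plan <- drake_plan(
  raw = mtcars,
  fit = lm(mpg ~ wt, data = raw)
)

# Write the plan out as a plain script of top-level assignments.
script <- tempfile(fileext = ".R")
plan_to_code(plan, con = script)
cat(readLines(script), sep = "\n")

# Read the script back in: each top-level assignment becomes a target.
plan_from_script <- code_to_plan(script)
print(plan_from_script)
```

plan_to_notebook() works along the same lines but writes an R notebook instead of a script.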

Document new feature: multiple file outputs per target

ropensci/drake#469

Guidance on parallelism within targets

At the end of the HPC vignette. Use the from_plan() function from ropensci/drake#677.

Update caching options for clustermq parallelism

Stay tuned for ropensci/drake#531.

Demonstrate how to show/hide file_out() files in the graph visuals

Ref: ropensci/drake#486, ropensci/drake#487

Use drakepkg instead of digest in the section on workflows as packages

After @tiernanmartin and I work on drakepkg some more, we can write about it here.

database best practices

For #20, it would be great to align on best practices for working with database connection objects. @aedobbyn and @AlexAxthelm have much more experience with databases than I do. From @aedobbyn in ropensci/drake#552:

Hello! We were recently discussing the unadvisable (?) approach of saving a database connection as a target and using it in a drake_plan. I had thought that this approach didn't work at all, but it seems it in fact does, at least in some circumstances.

In case they're useful, here are a couple examples demonstrating the same outcome when a connection is an object in the global environment (here, con) or a target (this_con).

library(drake)
library(DBI)
library(tidyverse)

# Funs
connect_to_db <- function() {
  dbConnect(RSQLite::SQLite(), "db")
}

seed_db <- function(conn) {
  dbWriteTable(conn, "mtcars", mtcars, overwrite = TRUE)
}

get_res <- function(conn) {
  dbGetQuery(conn, "SELECT * FROM mtcars") %>%
    as_tibble()
}

update_db_records <- function(tbl, conn) {
  new <- tbl %>%
    map_dfr(log)

  dbWriteTable(conn,
    "mtcars",
    new,
    append = FALSE,
    overwrite = TRUE
  )
}

test_is_original <- function() {
  testthat::expect_equal(
    mtcars %>%
      as_tibble(),
    get_res(con)
  )
}

test_is_log <- function() {
  testthat::expect_equal(
    mtcars %>%
      map_dfr(log),
    get_res(con)
  )
}


# Create db and write mtcars table
con <- connect_to_db()
seed_db(con)
test_is_original()

# Connection for updating is a target in the plan
plan_1 <- drake_plan(
  this_con = connect_to_db(),

  seeded = get_res(this_con),

  processed_data =
    update_db_records(seeded, this_con),

  strings_in_dots = "literals"
)

# Check that we updated mtcars
clean()
make(plan_1)
#> target this_con
#> target seeded
#> target processed_data
test_is_log()

# Overwrite table to original mtcars
seed_db(con)
test_is_original()

# Connection for updating is a global object
plan_2 <- drake_plan(
  seeded = get_res(con),
  
  processed_data =
    update_db_records(seeded, con),

  strings_in_dots = "literals"
)

# Check that we updated mtcars
clean()
make(plan_2)
#> target seeded
#> target processed_data
test_is_log()

Stop using system.file() to get code and data from the examples

ropensci/drake#490

More explicit integration with the workflow management paradigm by Jenny Bryan et al.

The "What they forgot to teach about R" paradigm is friendly, popular, and impactful, and much of its success comes from its limited scope. From a recent PLOS Computational Biology article by @jennybc et al.:

We have deliberately left many good tools and practices off our list, including some that we use daily, because they only make sense on top of the core practices described above or because it takes a larger investment before they start to pay off.

and

Tools like Make were originally developed to recompile pieces of software that had fallen out of date... However, newcomers can achieve the same behavior by writing shell scripts that rerun everything; these may do unnecessary work, but given the speed of today's machines, that is unimportant for small projects.

Even so, drake adds value to this space.

- It is fully compatible with the authors' recommendations (example: numbered scripts).
- It is friendly to new users (example discussion), especially in the space of Make-like tools.
- Even for small projects, it helps by encouraging tidiness and readability (mentions here and here).

I plan to write a special chapter to go through the best practices in the article and spell out all the parallels point-by-point.

Document the changes in hpc support functions

ropensci/drake@d079ba8

Put the PNG images in a separate folder

...and make sure they connect to the relevant chapters.

Add detailed guidance on code analysis magic

General dependency detection. How does drake know which targets depend on other targets just by looking at the commands? How do we check with deps_code() and deps_targets()?
file_in(), file_out(), knitr_in(), and ignore() probably belong in the chapter on workflow plan data frames.
We probably need a chapter on how drake deals with R Markdown reports, dependency detection via loadd() and readd() in active code chunks, etc.
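
To make the first point concrete, here is a small sketch of asking drake what it detects in a command. The command is hypothetical: raw_data stands for another target, clean_up() for a user-defined function, and "survey.csv" for an input file. The shape of the result differs across drake versions (a data frame of names and types in 7.x), so the sketch only prints it.

```r
library(drake)

# A made-up command. drake inspects the code statically; nothing is run,
# so neither clean_up() nor survey.csv needs to exist.
command <- quote(clean_up(raw_data, read.csv(file_in("survey.csv"))))

# List the symbols and files drake would treat as dependencies.
deps <- deps_code(command)
print(deps)
```

deps_target() (the singular successor of deps_targets()) reports the same information for a target that already sits in a plan. Its arguments changed between releases, so ?deps_target is the place to check the calling convention for the installed version.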

evaluate_plan(trace = TRUE)

ropensci/drake#461

Finish drake#332

ropensci/drake#332

Add repo topics like for the drake repo?

Document the DSL

Focus on ropensci/drake#674, ropensci/drake#680, and drake_plan(trace = TRUE).

Document the new clustermq_staged and future_lapply_staged backends

Add Cautionary note on generic functions.

I hope this is easy to follow. Below, I install a minimal package, run make(), then install a different version that changes a generic function, then rerun make(). The second make() thinks everything is up to date. Afterwards, I show that make() notices the change when the plan calls the "subgeneric" (not sure of the correct term; here, hello.character()) directly.

This is likely an issue with the upstream package but it's worth adding to the Cautionary Notes section of the manual.

devtools::install_github("yihui/rmini")
library(drake)
library(rmini)
detach("package:rmini", unload = TRUE)
my_plan <- drake_plan(show_hello = rmini::hello(x = "Will"),
                      strings_in_dots = "literals")
make(my_plan)
#> target show_hello
#> Hi! I love characters!

devtools::install_github("kendonb/rmini")
library(rmini)
detach("package:rmini", unload = TRUE)

# The function output has changed:
rmini::hello(x = "Will")
#> Hi! I love characters! You didn't notice me change!

# make hasn't noticed the change
make(my_plan)
#> All targets are already up to date.

devtools::install_github("yihui/rmini")
library(drake)
library(rmini)

detach("package:rmini", unload = TRUE)
clean(show_hello)
my_plan <- drake_plan(show_hello = rmini:::hello.character(x = "Will"),
                      strings_in_dots = "literals")
make(my_plan)
#> target show_hello
#> Hi! I love characters!

devtools::install_github("kendonb/rmini")
library(rmini)
detach("package:rmini", unload = TRUE)
rmini:::hello.character(x = "Will")
#> Hi! I love characters! You didn't notice me change!

# Now, when using the "subgeneric", make notices the change 
make(my_plan)
#> target show_hello
#> Hi! I love characters! You didn't notice me change!

Created on 2018-12-24 by the reprex package (v0.2.1.9000)
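
One plausible reading of the reprex above, sketched in base R with no drake involved: an S3 generic is only a thin dispatcher, so its own code stays the same when one of its methods changes. If a tool fingerprints the generic's code (an assumption about what happens here, not something the reprex proves), it has nothing new to see. The functions below are local stand-ins that mimic rmini's names.

```r
# Stand-in generic: its body is just the dispatch call.
hello <- function(x) UseMethod("hello")

# First version of the character method.
hello.character <- function(x) "Hi! I love characters!"
generic_before <- deparse(hello)

# "Reinstall" with a changed method; the generic itself is untouched.
hello.character <- function(x) {
  "Hi! I love characters! You didn't notice me change!"
}
generic_after <- deparse(hello)

# The generic's code is identical before and after...
generic_unchanged <- identical(generic_before, generic_after)
print(generic_unchanged)

# ...even though calling it now gives a different answer.
greeting <- hello("Will")
print(greeting)
```

Calling the method directly, as the second half of the reprex does, puts the changed code where it can be seen.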

Guidance on drake with R Markdown / R notebooks

For a long time, I have tried to convince people that they should run knitr reports inside drake targets rather than use an R Markdown report as a workflow manager that contains drake projects. This was mostly born out of my frustration at seeing people (including my past self) use knitr for larger computations than it was designed to handle. Beyond that, drake::make() for seriously intense pipelines should really run as an unobtrusive persistent background process or remote job (hopefully using drake's existing HPC support). From talking with @lawremi, Joseph Gerrein, and @rpayne-lilly, my position is softening a bit. I think drake's official documentation should accommodate usage inside reports and notebooks.

Document persistent clustermq workers

make(parallelism = "clustermq"). Ref: ropensci/drake#425, ropensci/drake#501, mschubert/clustermq#86

New chapter: example file-based data analysis project

This chapter should

Show how drake can handle multiple file_out() files per target, and
Demonstrate dependency relationships where one target's file_out() is another target's file_in().

I would like to anchor on @tiernanmartin's code from ropensci/drake#257 (comment). The real work will be to grow this seed into a complete story, preferably structured like the gsp chapter. I think we should begin with a believable scientific problem we want to solve, explain the methods (I do not expect everyone to know spatial statistics or sf), proceed with the analysis, and see what we can conclude. We should iterate on the analysis to demonstrate what happens if we

Change a function dependency or a command for a target with an intermediate file_out().
Corrupt an intermediate file_out() and watch make() repair it.
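
A toy version of the file-linked pattern and the repair step described above, offered as a sketch rather than the planned chapter. It assumes drake 7.x (no strings_in_dots argument) and a scratch directory; the target names and cars.csv are invented for the example.

```r
library(drake)

# Scratch directory so the example file and .drake/ cache stay out of the way.
scratch <- file.path(tempdir(), "drake-file-demo")
dir.create(scratch, showWarnings = FALSE)
old_wd <- setwd(scratch)

plan <- drake_plan(
  # Upstream target: writes cars.csv and declares it with file_out().
  exported = {
    write.csv(mtcars, file_out("cars.csv"), row.names = FALSE)
    nrow(mtcars)
  },
  # Downstream target: reads the same file and declares it with file_in(),
  # which is what links it to `exported` in the dependency graph.
  n_rows = nrow(read.csv(file_in("cars.csv")))
)

make(plan)

# Corrupt the intermediate file, then let make() notice and rebuild it.
writeLines("corrupted", "cars.csv")
make(plan)

# The file should be back to its original 32 rows.
repaired_rows <- nrow(read.csv("cars.csv"))
print(repaired_rows)

setwd(old_wd)
```

The second make() should rebuild exported because its declared output file no longer matches what the cache recorded.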

Include tip about spoofing commands from #615

For debugging issues with the dependency structure: ropensci/drake#615 (comment)

Overhaul the HPC chapter

Just feature the "clustermq", "future", and "hasty" backends. The others will be removed later on via ropensci/drake#561.
Document the optional resources list column in the plan. (Also update the list of possible columns in the chapter on plans.)

Targets in the plan.
Functions and objects in the environment.
Objects and functions from packages that are explicitly namespaced with :: and :::.

In other words, there is a clearer line between what drake detects and what it does not. And it no longer dives into packages or parent environments automatically by default. The old approach

Made workflows more brittle (likely to fall out of date).
Was categorically inferior to packrat in terms of package reproducibility.

Unfortunately, the change also puts old workflows out of date. Sorry for the inconvenience.

Anyway, the manual needs to document the new behavior.
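
A short sketch of how the manual could show the third category, assuming a drake version with the behavior described above. The helper standard_error() is hypothetical; the point is only that the explicitly namespaced call stats::sd() is written in a form drake can pick up from the code.

```r
library(drake)

# Hypothetical helper that calls a package function with an explicit namespace.
standard_error <- function(x) {
  stats::sd(x) / sqrt(length(x))
}

# Ask drake what it detects in the function's code.
deps <- deps_code(standard_error)
print(deps)
```

By the same rule, a function reached only through library() and an unqualified call would not be tracked as a package dependency.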
