ropensci-books / drake

The user manual for the drake R package

Home Page: https://books.ropensci.org/drake/

License: GNU General Public License v3.0

R 27.19% Shell 9.95% TeX 12.37% HTML 13.43% CSS 37.05%
reproducibility high-performance-computing r data-science drake makefile pipeline workflow reproducible-research rstats

drake's Introduction

Consider targets

drake is superseded. The targets R package is the long-term successor of drake, and it is more robust and easier to use. Please visit https://books.ropensci.org/targets/drake.html for full context and advice on transitioning.

The drake R package user manual

This is the development repository of the drake R package user manual, hosted at https://books.ropensci.org/drake/. Please feel free to discuss on the issue tracker and submit pull requests to add new examples and update old ones. The environment for collaboration should be friendly, inclusive, respectful, and safe for everyone, so all participants must obey this repository's code of conduct.

drake's People

Contributors

alperyilmaz, atusy, boshek, edavidaja, erictleung, gadenbuie, gkampolis, idavydov, johnbaums, kendonb, krlmlr, lorenzwalthert, maelle, maurolepore, pat-s, psychobas, raffertyp, strazto, thebioengineer, vkehayas, wlandau-lilly

drake's Issues

Authorship

Many of you have had a hand in writing and/or reviewing the documentation that became the user manual.

I have included you all in the DESCRIPTION file of the user manual. If you would like to be removed, please let me know or submit a pull request.

Guidance on knitr files

  1. Using drake inside an R notebook.
  2. knitr files inside targets.
  3. knitr dependency detection:
    • loadd() and readd() for ordinary objects
    • file_in() and file_out() for ordinary input/output files
    • knitr_in() for nested input files.

I have neglected to document most of (2), and I think people would be interested.

library(drake)
pkgconfig::set_config("drake::strings_in_dots" = "literals")

writeLines(con = "master.Rmd", text = paste(
  '---',
  'title: "master"',
  'output: html_document',
  '---\n',

  '```{r master_chunk}',
  'loadd(loadd_master)',
  'readd(readd_master)',
  'file_in("in_master.txt")',
  'file_out("out_master.txt")',
  'knitr_in("subordinate.Rmd")',
  '```',
  sep = "\n"
))

writeLines(con = "subordinate.Rmd", text = paste(
  '---',
  'title: "subordinate"',
  'output: html_document',
  '---\n',
  
  '```{r subordinate_chunk}',
  'loadd(loadd_subordinate)',
  'readd(readd_subordinate)',
  'file_in("in_subordinate.txt")',
  'file_out("out_subordinate.txt")',
  '```',
  sep = "\n"
))

tmp <- file.create(c(
  "in_master.txt",
  "out_master.txt",
  "in_subordinate.txt",
  "out_subordinate.txt"
))

plan <- drake_plan(
  x = render(knitr_in("master.Rmd")),
  loadd_master = 1,
  readd_master = 2,
  loadd_subordinate = 3,
  readd_subordinate = 4
)
config <- drake_config(plan)
vis_drake_graph(config)

Created on 2018-12-20 by the reprex package (v0.2.1)

OOO

Thread for marking OOO time.

Organize the chapters

Some chapters are example data analysis projects, and some focus on the technical details of drake. I think we should divide chapters into examples and non-examples. Is it possible to divide chapters into sections? In bookdown, it seems like sections are one level lower than chapters. As more examples accumulate (ref: #19, #20), we may need to have one bookdown repo for the manual and another for the examples.

database best practices

For #20, it would be great to align on best practices for working with database connection objects. @aedobbyn and @AlexAxthelm have much more experience with databases than I do. From @aedobbyn in ropensci/drake#552:

Hello! We were recently discussing the unadvisable (?) approach of saving a database connection as a target and using it in a drake_plan. I had thought that this approach didn't work at all, but it seems it in fact does, at least in some circumstances.

In case they're useful, here are a couple examples demonstrating the same outcome when a connection is an object in the global environment (here, con) or a target (this_con).

library(drake)
library(DBI)
library(tidyverse)

# Funs
connect_to_db <- function() {
  dbConnect(RSQLite::SQLite(), "db")
}

seed_db <- function(conn) {
  dbWriteTable(conn, "mtcars", mtcars, overwrite = TRUE)
}

get_res <- function(conn) {
  dbGetQuery(conn, "SELECT * FROM mtcars") %>%
    as_tibble()
}

update_db_records <- function(tbl, conn) {
  new <- tbl %>%
    map_dfr(log)

  dbWriteTable(conn,
    "mtcars",
    new,
    append = FALSE,
    overwrite = TRUE
  )
}

test_is_original <- function() {
  testthat::expect_equal(
    mtcars %>%
      as_tibble(),
    get_res(con)
  )
}

test_is_log <- function() {
  testthat::expect_equal(
    mtcars %>%
      map_dfr(log),
    get_res(con)
  )
}


# Create db and write mtcars table
con <- connect_to_db()
seed_db(con)
test_is_original()

# Connection for updating is a target in the plan
plan_1 <- drake_plan(
  this_con = connect_to_db(),

  seeded = get_res(this_con),

  processed_data =
    update_db_records(seeded, this_con),

  strings_in_dots = "literals"
)

# Check that we updated mtcars
clean()
make(plan_1)
#> target this_con
#> target seeded
#> target processed_data
test_is_log()

# Overwrite table to original mtcars
seed_db(con)
test_is_original()

# Connection for updating is a global object
plan_2 <- drake_plan(
  seeded = get_res(con),
  
  processed_data =
    update_db_records(seeded, con),

  strings_in_dots = "literals"
)

# Check that we updated mtcars
clean()
make(plan_2)
#> target seeded
#> target processed_data
test_is_log()

More explicit integration with the workflow management paradigm by Jenny Bryan et al.

The "What they forgot to teach about R" paradigm is friendly, popular, and impactful, and much of its success comes from its limited scope. From a recent PLOS Computational Biology article by @jennybc et al.:

We have deliberately left many good tools and practices off our list, including some that we use daily, because they only make sense on top of the core practices described above or because it takes a larger investment before they start to pay off.

and

Tools like Make were originally developed to recompile pieces of software that had fallen out of date... However, newcomers can achieve the same behavior by writing shell scripts that rerun everything; these may do unnecessary work, but given the speed of today's machines, that is unimportant for small projects.

Even so, drake adds value to this space.

  • It is fully compatible with the authors' recommendations (example: numbered scripts).
  • It is friendly to new users (example discussion) especially in the space of Make-like tools.
  • Even for small projects, it helps by encouraging tidiness and readability (mentions here and here).

I plan to write a special chapter to go through the best practices in the article and spell out all the parallels point-by-point.

Add detailed guidance on code analysis magic

  • General dependency detection. How does drake know which targets depend on other targets just by looking at the commands? How do we check with deps_code() and deps_targets()?
  • file_in(), file_out(), knitr_in(), and ignore() probably belong in the chapter on workflow plan data frames.
  • We probably need a chapter on how drake deals with R Markdown reports, dependency detection via loadd() and readd() in active code chunks, etc.
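
As a minimal sketch of the first bullet, deps_code() can report what drake's static code analysis finds in a command without running the command. The function add_one(), the object dataset, and the file input.csv are hypothetical names used only for illustration, and the shape of the return value may differ between drake versions.

```r
library(drake)

# Hypothetical user-defined function, defined only for this illustration.
add_one <- function(x) x + 1

# Analyze a command without evaluating it. Neither `dataset` nor
# "input.csv" needs to exist, because the analysis is purely static.
deps <- deps_code(
  quote(add_one(dataset) + nrow(read.csv(file_in("input.csv"))))
)
print(deps)
```

Printing the result is a quick way to check, before calling make(), whether drake sees the functions, objects, and files you expect it to track.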

Add Cautionary note on generic functions.

I hope this is easy to follow. Below, I install a minimal package, run make(), reinstall a different version that changes a generic function, and rerun make(). The second make() thinks everything is up to date. Afterwards, I show that make() does notice the change when the command calls the "subgeneric" (not sure of the correct term) directly.

This is likely an issue with the upstream package but it's worth adding to the Cautionary Notes section of the manual.

devtools::install_github("yihui/rmini")
library(drake)
library(rmini)
detach("package:rmini", unload = TRUE)
my_plan <- drake_plan(show_hello = rmini::hello(x = "Will"),
                      strings_in_dots = "literals")
make(my_plan)
#> target show_hello
#> Hi! I love characters!

devtools::install_github("kendonb/rmini")
library(rmini)
detach("package:rmini", unload = TRUE)

# The function output has changed:
rmini::hello(x = "Will")
#> Hi! I love characters! You didn't notice me change!

# make hasn't noticed the change
make(my_plan)
#> All targets are already up to date.

devtools::install_github("yihui/rmini")
library(drake)
library(rmini)

detach("package:rmini", unload = TRUE)
clean(show_hello)
my_plan <- drake_plan(show_hello = rmini:::hello.character(x = "Will"),
                      strings_in_dots = "literals")
make(my_plan)
#> target show_hello
#> Hi! I love characters!

devtools::install_github("kendonb/rmini")
library(rmini)
detach("package:rmini", unload = TRUE)
rmini:::hello.character(x = "Will")
#> Hi! I love characters! You didn't notice me change!

# Now, when using the "subgeneric", make notices the change 
make(my_plan)
#> target show_hello
#> Hi! I love characters! You didn't notice me change!

Created on 2018-12-24 by the reprex package (v0.2.1.9000)

Guidance on drake with R Markdown / R notebooks

For a long time, I have tried to convince people that they should run knitr reports inside drake targets rather than use an R Markdown report as a workflow manager to contain drake projects. This was mostly born out of my frustration at seeing people (including my past self) use knitr for larger computations than it was designed to handle. In addition, drake::make() for seriously intense pipelines should really run as an unobtrusive persistent background process or remote job (hopefully using drake's existing HPC support). From talking with @lawremi, Joseph Gerrein, and @rpayne-lilly, my position is softening a bit. I think drake's official documentation should accommodate usage inside reports and notebooks.

New chapter: example file-based data analysis project

This chapter should

  1. Show how drake can handle multiple file_out() files per target, and
  2. Demonstrate dependency relationships where one target's file_out() is another target's file_in().

I would like to anchor on @tiernanmartin's code from ropensci/drake#257 (comment). The real work will be to grow this seed into a complete story, preferably structured like the gsp chapter. I think we should begin with a believable scientific problem we want to solve, explain the methods (I do not expect everyone to know spatial statistics or sf), proceed with the analysis, and see what we can conclude. We should iterate on the analysis to demonstrate what happens if we

  1. Change a function dependency or a command for a target with an intermediate file_out().
  2. Corrupt an intermediate file_out() and watch make() repair it.
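
The file relationships described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the proposed chapter's analysis: the target names and the files cars.csv and flowers.csv are hypothetical, and the built-in mtcars and iris datasets stand in for real data.

```r
library(drake)

# Hypothetical file and target names, chosen only for this illustration.
plan <- drake_plan(
  # One target that writes two tracked output files.
  exported = {
    write.csv(mtcars, file_out("cars.csv"), row.names = FALSE)
    write.csv(iris, file_out("flowers.csv"), row.names = FALSE)
  },
  # This target's file_in() is the previous target's file_out(),
  # so drake builds `exported` first.
  n_cars = nrow(read.csv(file_in("cars.csv")))
)

make(plan)
row_count <- readd(n_cars)

# Corrupt an intermediate output file, then let make() rebuild it.
writeLines("corrupted", "cars.csv")
make(plan)
n_lines_after_repair <- length(readLines("cars.csv"))
```

The second make() should rerun the target that owns cars.csv because the file on disk no longer matches what drake recorded when it built that target.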

Overhaul the HPC chapter

  • Just feature the "clustermq", "future", and "hasty" backends. The others will be removed later on via ropensci/drake#561.
  • Document the optional resources list column in the plan. (Also update the list of possible columns in the chapter on plans.)

Be totally clear on the dependencies detected by drake 6.0.0

For the sake of reproducibility and speed, drake version 6.0.0 is more discerning about dependencies. It now detects the following:

  1. Targets in the plan.
  2. Functions and objects in the environment.
  3. Objects and functions from packages that are explicitly namespaced with :: and :::.

In other words, there is a clearer line between what drake detects and what it does not, and drake no longer dives into packages or parent environments by default. The old approach

  1. Made workflows more brittle (likely to fall out of date).
  2. Was categorically inferior to packrat in terms of package reproducibility.

Unfortunately, the change also puts old workflows out of date. Sorry for the inconvenience.

Anyway, the manual needs to document the new behavior.
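
As a small illustration of the second category in the list above (functions and objects in the environment), the sketch below changes a function that a target calls and reruns make(). The function scale_up() and the target result are hypothetical names introduced only for this example.

```r
library(drake)

# Hypothetical helper function defined in the calling environment.
scale_up <- function(x) x * 2

plan <- drake_plan(result = scale_up(10))
make(plan)
first_value <- readd(result)

# Changing the function invalidates the target that calls it,
# so the next make() rebuilds `result`.
scale_up <- function(x) x * 3
make(plan)
second_value <- readd(result)
```

A function from a package would be tracked in the same way only if the command named it explicitly with :: or :::, as the third item in the list describes.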
