
drake's Introduction

Drake

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.
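
Here is a minimal sketch of what a workflow file looks like (the file names and commands are illustrative):

filtered.csv <- raw.csv
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sorted.csv <- filtered.csv
  sort $INPUT > $OUTPUT

If raw.csv is newer than filtered.csv, Drake re-runs both steps in order; if only filtered.csv is newer than sorted.csv, only the second step runs.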

Drake walk-through

If you like screencasts, check out the Drake walk-through video recorded by Artem Boytsov, Drake's primary designer.

Installation

Drake has been tested under Linux, Mac OS X and Windows 8. We've not tested it on other operating systems.

Drake installs itself on the first run of the drake shell script; there is no separate install script. Follow these steps to install Drake manually (a shell sketch of the same steps follows the list):

  1. Make sure you have Java version 6 or later.
  2. Download the drake script from the master branch of this project.
  3. Place the drake script on your $PATH. (~/bin is a good choice if it is on your path.)
  4. Set it to be executable. (chmod 755 ~/bin/drake)
  5. Run it (drake)
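
A sketch of steps 2-5 as shell commands, assuming ~/bin is on your PATH and using a placeholder for the script's download URL:

mkdir -p ~/bin
curl -L -o ~/bin/drake <raw URL of the drake script on this project's master branch>
chmod 755 ~/bin/drake
drake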

Homebrew

If you're on a Mac you can alternatively use Homebrew to install Drake:

brew install drake

Upgrade Drake

Starting with Drake version 1.0.0, once you have Drake installed you can easily upgrade your version of Drake by running drake --upgrade. The latest version of Drake will be downloaded and installed for you.

Download or build the uberjar

You can build Drake from source or run it from a prebuilt jar; detailed instructions are available in the project's documentation.

Use Drake as a Clojure library

You can programmatically use Drake from your Clojure project by using Drake's Clojure front end. Your project.clj dependencies should include the latest Drake library, e.g.:

[factual/drake "1.0.3"]

Faster startup time

The JVM startup time can be a nuisance. To reduce startup time, we recommend using the way cool Drip. Please see the Drake with Drip wiki page.

Basic Usage

The wiki is the home for Drake's documentation, but here are simple notes on usage:

To build a specific target (and any out-of-date dependencies, if necessary):

$ drake mytarget

To build a target and everything that depends on it (a.k.a. "down-tree" mode):

$ drake ^mytarget

To build a specific target only, without any dependencies, up or down the tree:

$ drake =mytarget

To force build a target:

$ drake +mytarget

To force build a target and all its downtree dependencies:

$ drake +^mytarget

To force build the entire workflow:

$ drake +...

To exclude targets:

$ drake ... -sometarget -anothertarget

By default, Drake will look for ./Drakefile. The simplest way to run your workflow is to name your workflow file Drakefile, and make sure you're in the same directory. Then, simply:

$ drake

To specify the workflow file explicitly, use -w or --workflow. E.g.:

$ drake -w /myworkflow/my-workflow.drake

Use drake --help for the full list of options.

Documentation, etc.

The wiki is the home for Drake's documentation.

A lot of work went into designing and specifying Drake. To prove it, here's the 60-page specification and user manual. It's stored in Google Docs, and we encourage everyone to use its superb commenting feature to provide feedback. Just select the text you want to comment on, and click Insert -> Comment (Ctrl + Alt + M on Windows, Cmd + Option + M on Mac). It can also be downloaded as a PDF.

There are annotated workflow examples in the demos directory.

There's a Google Group for Drake where you can ask questions. And if you found a bug or want to submit a feature request, go to Drake's GitHub issues page.

Visualize your workflow

See the wiki for more detail.

Asynchronous Execution of Steps

Please see the wiki page on async.

Plugins

Drake has a plugin mechanism, allowing developers to publish and use custom plugins that extend Drake. See the Plugin wiki page for details.

HDFS Compatibility

Drake provides HDFS support by allowing you to specify inputs and outputs like hdfs:/my/big_file.txt.
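
For example, a step that pushes a local file to HDFS might look like this (a sketch; the paths and the use of the hadoop CLI in the step body are illustrative):

hdfs:/data/words.txt <- words.txt
  hadoop fs -put $INPUT $OUTPUT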

If you plan to use Drake with HDFS, please see the wiki page on HDFS Compatibility.

Amazon S3 Compatibility

Thanks to Chris Howe, Drake now has basic compatibility with Amazon S3 by allowing you to specify inputs and outputs like s3://bucket/path/to/object.
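
For example (a sketch; the bucket name and the use of the AWS CLI in the step body are illustrative):

s3://my-bucket/data/summary.csv <- summary.csv
  aws s3 cp $INPUT $OUTPUT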

If you plan to use Drake with S3, please see the wiki doc on S3 Compatibility.

Drake on the REPL

You can use Drake from your Clojure REPL, via drake.core/run-workflow. Please see the Drake on the REPL wiki page for more details.

Stuff outside this repo

Thanks to Lars Yencken, we now have Vim syntax support for Drake.

Also thanks to Lars Yencken, utilities for making life easier in Python with Drake workflows.

Courtesy of @daguar, an alternative approach to installing Drake on Mac OS X.

Original blog post announcing Drake's open source release

An epic knock-down-drag-out set of threads on Hacker News discussing the design merits of Drake

License

Source Copyright © 2012-2015 Factual, Inc.

Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.

drake's People

Contributors

aboytsov, agate, amalloy, ash211, calfzhou, chen-factual, chenguo, guillaume, manuel-factual, morrifeldman, myronahn, reckbo, sjackman, stanistan


drake's Issues

Cygwin/Windows support

Perhaps I'm the first one to try running on Windows? It looks like filename processing isn't going well when I run under Cygwin:

% cat `which drake`
#!/usr/bin/bash
java -cp `cygpath -w ~/Downloads/drake.jar` drake.core $@

% cygpath -w $PWD
C:\Users\me\mydirectory

% drake --version
Drake Version 0.1.0

% cat workflow.d
startdat.csv <- [R]
  x <- runif(10)
  write.csv(data.frame(x=x))

% drake
The following steps will be run, in order:
  1: startdat.csv <-  [missing output]
Confirm? [y/n] y
Running 1 steps...
Invalid filename: file:C:\Users\me\mydirectory\startdat.csv

According to http://blogs.msdn.com/b/ie/archive/2006/12/06/file-uris-in-windows.aspx, the proper URI for that path would be file:///C:/Users/me/mydirectory/startdat.csv.

--version should halt

One user reported this behavior:

drake --version
Drake Version 0.1.0
Target not found: ...

This is weird. --version should not try to run any targets.

Support output and input directories, not just part-????? files

In the form described in the spec, or some other form. Maybe it should just be the default behavior if a directory is specified instead of a file. See also the outstanding comment in the Filenames section.

I'm not sure I have the bandwidth to take it on right now. Any takers? I will gladly review the code.

This feature seems to be required for using Drake with Hive.

address slow startup time

Nailgun won't work for multiple runs, unless we use --auto to avoid cli interaction. Related to how we're dealing with stdin.

java.lang.NullPointerException
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:296)
at d.core$user_confirms_QMARK_.invoke(core.clj:54)

That's where we call read-line-stdin to get user confirmation. The first run under Nailgun works fine. Any subsequent run immediately gets a nil returned from read-line-stdin. As a test, I tried calling read-line-stdin multiple times at that stage, and all calls get a nil back immediately, without the user ever having a chance to enter anything.

Handle conflicting command-line options

Some options conflict with each other and cannot be used together. Should be easy to handle by adding something like:

(def conflicting-options [
  ;; set - these options cannot be used with each other
  #{:preview :print}
  ;; tuple - no option on the left can be used with any option on the right
  [ #{:help :version} #{:branch :vars :auto} ]
])

It should be possible to specify method-mode in method definition

This works:

filter-with-grep() [eval]
  grep -v "$CODE" $INPUT > $OUTPUT

output <- input [method:filter-with-grep method-mode:append]
  regexp-matching-bad-entries-to-be-removed

But it should be possible to do this:

filter-with-grep() [eval method-mode:append]
  grep -v "$CODE" $INPUT > $OUTPUT

output <- input [method:filter-with-grep]
  regexp-matching-bad-entries-to-be-removed

In other words, all the step's checks should happen after options and variables are merged with the method's definition, not before.

Switch to 1ms timestamp resolution?

It seems that, at least on OS X, we're using 1s timestamp resolution. We're requesting milliseconds, but getting back values rounded to whole seconds (note that all the numbers end in 000):

Timestamp checking, inputs: [{:path "/tmp/drake-test/hdfs_1", :mod-time 1359528216000, :directory false} {:path "/tmp/drake-test/hdfs_2", :mod-time 1359528237000, :directory false}], outputs: [{:path "/tmp/drake-test/merged_hdfs", :mod-time 1359528242000, :directory false}]
Newest input: 1359528237000, oldest output: 1359528242000
Running 2 steps...
Timestamp checking, inputs: [{:path "/Users/artem/drake/resources/regtest/local_1", :mod-time 1359528216000, :directory false} {:path "/Users/artem/drake/resources/regtest/local_2", :mod-time 1359528245000, :directory false}], outputs: [{:path "/Users/artem/drake/resources/regtest/merged_local", :mod-time 1359528224000, :directory false}]
Newest input: 1359528245000, oldest output: 1359528224000

I'm pretty sure HFS+ is capable of much higher resolution, so I'm not sure what's going on.

I've added a --step-delay flag in the feature/vvv branch (ee833c5) to make the regression tests pass.

output file is left standing on an errored step

This is a problem if Drake is rerun after a step hard-crashes in the middle of writing output. Drake will think the errored step actually completed (since there's a recent output file).

The best solution may be that, for all in-process output files, we write to a temporary output file, then mv it to the final output file on success.
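
That approach can also be sketched by hand in a step body today (the grep filter is just an illustrative command):

output.csv <- input.csv
  grep -v BAD_ENTRY $INPUT > $OUTPUT.tmp && mv $OUTPUT.tmp $OUTPUT

If the command fails part-way, output.csv is never created, so a rerun will not mistake the step for complete.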

Cannot get drake to build due to missing jlk/time dependency

I am trying to build drake behind a firewall that has a Sonatype nexus proxying Clojars, Central, et al. For some reason, I can build the uberjars and run drake fine, but it barfs when I try to start a repl:

Could not find metadata jlk:time:0.1-SNAPSHOT/maven-metadata.xml in clojars-snapshots (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars-snapshots)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in Internal central (http://hsdgrnbrg.XXXX/nexus/content/repositories/central)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in Internal clojars (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in clojars-snapshots (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars-snapshots)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in internal-nexus (http://hsdgrnbrg.XXXX/nexus/content/repositories/releases)
Could not find artifact jlk:time:pom:0.1-SNAPSHOT in foursquareapijava (http://foursquare-api-java.googlecode.com/svn/repository)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in Internal central (http://hsdgrnbrg.XXXX/nexus/content/repositories/central)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in Internal clojars (http://hsdgrnbrg.XXXX/nexus/content/repositories/clojars)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in clojars-snapshots (http://hsdgrnbrg.XXXXnexus/content/repositories/clojars-snapshots)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in internal-nexus (http://hsdgrnbrg.XXXX/nexus/content/repositories/releases)
Could not find artifact jlk:time:jar:0.1-SNAPSHOT in foursquareapijava (http://foursquare-api-java.googlecode.com/svn/repository)
Check :dependencies and :repositories for typos.
It's possible the specified jar is not in any repository.
If so, see "Free-floating Jars" under http://j.mp/repeatability
Exception in thread "Thread-1" clojure.lang.ExceptionInfo: Could not resolve dependencies {:exit-code 1}
    at clojure.core$ex_info.invoke(core.clj:4227)
    at leiningen.core.classpath$get_dependencies.doInvoke(classpath.clj:128)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at clojure.lang.AFn.applyToHelper(AFn.java:163)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.core$apply.invoke(core.clj:605)
    at leiningen.core.classpath$resolve_dependencies.doInvoke(classpath.clj:144)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at leiningen.core.eval$prep.invoke(eval.clj:60)
    at leiningen.core.eval$eval_in_project.invoke(eval.clj:220)
    at leiningen.repl$start_server.doInvoke(repl.clj:65)
    at clojure.lang.RestFn.invoke(RestFn.java:470)
    at leiningen.repl$repl$fn__1788.invoke(repl.clj:145)
    at clojure.lang.AFn.applyToHelper(AFn.java:159)
    at clojure.lang.AFn.applyTo(AFn.java:151)
    at clojure.core$apply.invoke(core.clj:601)
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1771)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at clojure.lang.AFn.applyToHelper(AFn.java:163)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.core$apply.invoke(core.clj:605)
    at clojure.core$bound_fn_STAR_$fn__3984.doInvoke(core.clj:1793)
    at clojure.lang.RestFn.invoke(RestFn.java:397)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:722)

Consider using () for method invocation

Need your feedback, guys. I'm thinking we should have an alternative way of invoking methods just by using (), similar to how we match methods in target selection from the command line, e.g.:

my-method() [eval]
  ...

; Not just this:

output <- input [method:my-method some-option:55]

; But also this:

output <- input [my-method() some-option:55]

Why not?

Individual $BASE for every filesystem

We should probably extend the notion of $BASE to every filesystem. It's convenient to have a separate working directory on the local filesystem as well as on HDFS, for example.
Something like this, maybe:

  1. Global BASE can be specified along with filesystem-specific BASEs
  2. If the filename starts with /, BASE is not used, regardless of whether the filesystem prefix is given. If no filesystem prefix is given, the default filesystem is used (now local, but we can also add a command-line flag to specify which).
  3. Otherwise:
    1. If the filename has a filesystem prefix, global BASE is ignored and only filesystem-specific BASE is looked for. If not given, it's an error for filesystems which don't have the notion of current directory (e.g. HDFS). For local filesystem, the file is relative to the directory of the master workflow file.
    2. If the filesystem prefix is not given, either global BASE or the default filesystem's BASE is used. If both are specified, it's an error.

Example:

hdfs:BASE=/tmp
file:BASE=/tmp

hdfs:a <- file:b       ; hdfs:/tmp/a <- /tmp/b
hdfs:a <- b            ; hdfs:/tmp/a <- /tmp/b
a <- b                 ; /tmp/a <- /tmp/b

file:BASE=
BASE=s3:/tmp
a <- b                 ; s3:/tmp/a <- s3:/tmp/b
a <- /b                ; s3:/tmp/a <- /b

hdfs:/a <- s3:/a       ; hdfs:/a <- s3:/a
hdfs:/a <- /a          ; hdfs:/a <- /a

file:BASE=/tmp
a <- b                 ; Error, ambiguous: s3:/tmp/a or file:/tmp/a?
/a <- /b               ; /a <- /b

StackOverflowError

When I run with my large workflow.d file I get a stack overflow exception. If need be, I can send/post the workflow.d file.

Exception in thread "main" java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.flatland.drip.Main.invoke(Main.java:117)
  at org.flatland.drip.Main.start(Main.java:88)
  at org.flatland.drip.Main.main(Main.java:64)
Caused by: java.lang.StackOverflowError
  at clojure.core$concat$fn__3804.invoke(core.clj:662)
  at clojure.lang.LazySeq.sval(LazySeq.java:42)
  at clojure.lang.LazySeq.seq(LazySeq.java:60)
  at clojure.lang.RT.seq(RT.java:473)
  at clojure.core$seq.invoke(core.clj:133)
  at clojure.core$concat$fn__3804.invoke(core.clj:662)

...

Checksum-based dependency evaluation

There was a request for this.

Related to #11 (support general evaluator hooks).

It could be baked into Drake directly. This evaluator would ignore the timestamps of the input and output files, and only re-run the step if the MD5s of the step's inputs have changed since the last run. The MD5s would probably have to be created alongside the files (input.md5-drake or something like that), and would need to be moved/renamed with the files when branching, making backups, etc. A forced rebuild should probably update the MD5s?
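
Until such an evaluator exists, a rough hand-rolled approximation can live in a step body (a sketch; expensive-transform is a hypothetical command and the .md5 file name is arbitrary):

output.csv <- input.csv
  # re-run the expensive work only if input.csv's checksum changed since the last run
  if md5sum -c --status input.csv.md5 2>/dev/null; then
    touch $OUTPUT
  else
    expensive-transform $INPUT > $OUTPUT
    md5sum $INPUT > input.csv.md5
  fi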

inline R code

This is more of a feature request than a bug:
any chance of inline R code?

Unable to compile jar from cloned repo

I encounter the following error about a missing dependency when I try to run lein uberjar, as instructed in the README:

: Missing:
----------
1) com.google.oauth-client:google-oauth-client:jar:${project.oauth.version}

  Try downloading the file manually from the project website.

  Then, install it using the command:
      mvn install:install-file -DgroupId=com.google.oauth-client -DartifactId=google-oauth-client -Dversion=${project.oauth.version} -Dpackaging=jar -Dfile=/path/to/file
...
  Path to dependency:
        1) org.apache.maven:super-pom:jar:2.0
        2) com.google.api-client:google-api-client:jar:1.8.0-beta
        3) com.google.oauth-client:google-oauth-client:jar:${project.oauth.version}

I've tried to manually download the jar from here, but to no avail. Any ideas how I might solve this issue?

Thanks

Defining BASE with := is ignored.

;This works
BASE=hdfs://user/alexr/resolve-ml

;This appears to be completely ignored.
BASE:=hdfs://user/alexr/resolve-ml

The user manual recommends the second form, so that it can be overridden from the command line.

Automatic filename generation and making Drake even more cool

Would like to hear everyone's thoughts on this one.

Design, spec out, and implement automatic filename generation for cases where filenames are not important. We can use the _ symbol to specify it. The filenames would still be persistent - they should be a function of information in the step, for example (probably in that order): the method used, other (named) outputs, tags used, or the step's numeric position (worst). Even though such a scheme can never guarantee that changing the workflow won't change the filenames, we should try to minimize those cases. Example:

_ <- input
  grep -v BAD_ENTRY $INPUT > $OUTPUT

_ <- _
  sort $INPUT > $OUTPUT

output <- _
  uniq $INPUT > $OUTPUT

Or in combination with methods:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

uniq()
  uniq $INPUT > $OUTPUT

_ <- input            filter()
_ <- _ [retries:5]    sort()
_ <- _ [my-option:66] uniq()
output <- _           filter()          ; can be used several times, why not?

It is mostly useful for very simple relationship (single input, single output), but can be used in a more complicated context as well:

output1, _ <- input        ; two outputs, don't much care about naming of the second one
   ....

_ <- _
   ....

result <- output1, _       ; referring to the output1 directly
   ....

We could even add a special symbol (+) as a shortcut for (_ <- _):

+
  grep -v BAD_ENTRY $INPUT > $OUTPUT

+ 
  sort $INPUT > $OUTPUT

+ 
  unique $INPUT > $OUTPUT

And if we relax the requirement for each step to begin with a new line (which only matters when a body is defined), in combination with methods we could arrive at the following equivalent:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

unique()
  unique $INPUT > $OUTPUT

+ filter + sort + unique

And if we also introduce a rule that the very first input _ is replaced with the $in environment variable, and the very last output _ with an (optional) $out environment variable, then the script above could be invoked as:

drake -v in=my_input,out=my_output

and we can use Drake to create quick ad-hoc data processing pipelines without caring about naming intermediate data files.

For truly temporary files that should be deleted, we can use _?. The benefit of this is less obvious, because if the file is truly temporary, Drake will always run steps linked through such files together (there would never be a state where only one of them is up to date). It could still be convenient if you want a temporary file anyway and just want something else (Drake) to take care of its creation and deletion.

+1 if you like. Your feedback is appreciated.

revisit no-output targets and -check

Artem: "No-output targets are not run by default unless [-check] is used. This is logically consistent, but a bit inconvenient. We might need to rethink it, if a lot of people are confused with that. The default running behavior of no-input and no-output steps are described in the spec. Would love to hear your thoughts on it."

Support passing in "--" args in run-interpreter

We have some interpreters where we'd like to pass several options to them before the script to configure them for running a script off disk, and several options after a -- to allow the user to pass extra options to their script.

I think that adding an :args key to the step map in run-interpreter would be a fine way to communicate this, and then changing the (apply shell ...) call to look like:

    (apply shell (concat [interpreter]
                         args
                         [script-filename]
                         (:args step)   ; CHANGE HERE
                         [:env vars
                          :die true
                          :out [System/out (writer (log-file step "stdout"))]
                          :err [System/err (writer (log-file step "stderr"))]]))

Then, it would be up to the particular handler for the language to decide how to support args, whether it needs to include a -- form, and any other of that kind of decision.

BUG: Long step definition -> filename too long (linux fc 13)

As an example, this step fails for me:

dwi_found, t1_found, t2_found, flair_found, swi_found  <- find_dicom_folders

with

BASE=/tmp/case_id_020_20130104_1_23_2013_14_20_6.zip.dir

java.io complains "(File name too long)" when trying to write to .drake/.

(2.6.34.9-69.fc13.x86_64 #1 SMP Tue May 3 09:23:03 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux)

stdout and stderr of task is being interleaved

This is what it should look like (when I run it standalone):

Hist('p0iso_tot') ph_analysis_isolation[0] / 1e3
Traceback (most recent call last):
  File "./histograms.py", line 69, in <module>
    main()
  File "./histograms.py", line 66, in main
    make_iso_plots(t, 0, sel)
  File "./histograms.py", line 35, in make_iso_plots
    print hist, "mgg", sel(*s)
NameError: global name 'hist' is not defined

This is what I get from drake (it's reproducible):

--- 0. Running (missing output): histograms.root <- data.root, histograms.py
�T[r?a1c0e3b4ahcHki s(tm(o'spt0 irseoc_etnott ')c aplhl_ alnaasltys):i
s _ isFoillaet i"o.n/[h0i]s t/o g1rea3m
s.py", line 69, in <module>
    main()
  File "./histograms.py", line 66, in main
    make_iso_plots(t, 0, sel)
  File "./histograms.py", line 35, in make_iso_plots
    print hist, "mgg", sel(*s)
NameError: global name 'hist' is not defined
drake: shell command failed with exit code 1

Somehow the output is getting interleaved. I also see unprintable characters on the terminal.

File-level evaluators

Related to #30, #11.

We might consider specifying evaluators at the individual file (group) level rather than at the step level. Use cases:

  • Some file should be included as step input, but should not be evaluated
  • Same about output - some side-effect output is created (report?) but should be ignored as far as the evaluation goes

These can be solved by excluding the files from the list of inputs/outputs and hardcoding their names into the step's body, but it complicates workflow management and goes against Drake's philosophy. Also:

  • Combination of timestamp and MD5 evaluators: should rebuild if the output is older than the input OR the input's checksum has changed

Proposal:

  1. Specify evaluators for any combination of named inputs/outputs (#39).
  2. Specify evaluator groups by using prefixes - all filenames starting with this prefix would share the same evaluator group and the same evaluator.
  3. Files can be part of multiple evaluators.
  4. The default evaluator is applied to the remaining (not named) files.
  5. The end result is an OR of all evaluators used.

Example:

a, b <- c [eval:timestamp] 
  ; Standard ("timestamp") evaluator is called on 2 outputs and 1 input

a, b <- c [eval:md5]
  ; MD5 evaluator is called on 2 outputs (which it ignores) and 1 input which it 
  ; verifies for MD5 change

a, b <- c, d(x) [eval:ignore(x)]
  echo $x        # "d"
  ; Built-in "ignore" evaluator, which always returns false, is called with 0 outputs 
  ;   and 1 input
  ; Standard ("timestamp") evaluator is called on remaining 2 outputs and 1 input

a, b <- c, d(x1), e(x2) [eval:ignore,md5(x)]
  echo $x1    # "d"
  echo $x2    # "e"
  ; MD5 evaluator is called on 0 outputs and 2 inputs
  ; Remaining 2 outputs and 1 input are processed through "ignore" evaluator 
  ;   and ignored

a(t) <- b(t), c(t,x) [eval:timestamp(t),md5(x)]
  echo $x      # "c"
  echo $t      # "a b c"
  ; MD5 evaluator is called on 0 outputs and 1 input
  ; Timestamp-based evaluator is called on 1 output and 2 inputs
  ; The step will run either if c's checksum has changed, or if b, c or both are 
  ;   fresher than a

We can also add syntactic sugar to specify evaluators directly in filenames without assigning them variables:

a <- b, c(eval:md5)
  ; MD5 will be run on c; a and b will be compared by timestamps

I'm sure there's more to it and I've just scratched the surface. For example, options that alter the behavior of evaluators (check, timecheck, and #38) should be applied to groups instead. One idea is to get rid of these options altogether and specify different evaluator flavors instead, which could be more consistent.

No blank lines allowed in code blocks

I've been working with Drake on a few projects, and ran into this issue. This works:

%hello <-
    x=1
    echo $x

but not this

%hello <-
    x=1

    echo $x

Adding that newline gives a largish syntax error when we run drake -a %hello

java.lang.IllegalStateException: drake parse error at line 4, column 1: Illegal syntax starting with "EOF" for workflow
    at drake.parser_utils$throw_parse_error.invoke(parser_utils.clj:47)
    at drake.parser_utils$illegal_syntax_error_fn$fn__3010.invoke(parser_utils.clj:66)
    at drake.parser$parse_state$fn__787.invoke(parser.clj:594)
    at name.choi.joshua.fnparse$rule_match.invoke(fnparse.clj:433)
    at drake.parser$parse_state.invoke(parser.clj:590)
    at drake.parser$parse_str.invoke(parser.clj:600)
    at drake.parser$parse_file.invoke(parser.clj:605)
    at drake.core$with_workflow_file.invoke(core.clj:456)
    at drake.core$_main.doInvoke(core.clj:659)
    at clojure.lang.RestFn.applyTo(RestFn.java:137)
    at drake.core.main(Unknown Source)

If it's not too hard to support, it'd be great to allow newlines in the code blocks, since they help so much with readability whenever you have a block that's more than a few lines.

Parse error when filenames contain equals character

Here's a minimal example:

spain.today <- spain.dt=2013-01-29.in
    cat $INPUT >$OUTPUT

You might think having an equals sign in a filename is rare, but it's how Hive names each folder for a partition.

I couldn't find a way to escape the equals character. My natural guesses, using a backslash or enclosing the whole filename in quotes, didn't work.

Support parameter passing to methods

At the moment, the only way to get variables/parameters into a method is something like:

my_method() [shell]
  echo $FOO

FOO=bar
my_output <- my_input [method:my_method]

This is really non-obvious and it would be much cleaner to do this in a more standard way such as:

my_method(arg0) [shell]
  echo $$arg0

my_output <- my_input [method:my_method("foo")]

...in this case using $$ to denote a method parameter and not a variable. I think I saw some mention of this in the Google Doc spec.

Support make "suffix rules" (aka template rules)

Assume a bunch of files in a directory whose names all follow the same pattern (for example "[0-9]+.html"). Each file name is basically a string of characters followed by some dot-delimited suffix.

I want to run the same set of steps for all the dot-delimited files (i.e. like a meta-workflow). In Make there is the concept of a suffix rule, where you can use a suffix to define a general set of actions to run. For example:
.cc.o:
$(CXX) $(CXXFLAGS) -c $<

tells make how to build .o files from .cc files (where "$<" is a special macro that stands for the .cc source file).

Something similar for Drake would be useful.
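
Until Drake has something like that, the closest approximation with existing syntax is a method applied to each file explicitly (a sketch; the compile command is illustrative):

compile()
  g++ -c $INPUT -o $OUTPUT

foo.o <- foo.cc [method:compile]
bar.o <- bar.cc [method:compile]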

Can't pull dependencies

I'm new to Drake and Clojure and Leiningen, so I'm not sure how to troubleshoot this. Here's what happens when I try to build Drake:

% lein deps
Could not transfer artifact clj-logging-config:clj-logging-config:pom:1.9.6 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact fs:fs:pom:1.3.2 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:jlk-time:pom:0.1 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact digest:digest:pom:1.4.0 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact slingshot:slingshot:pom:0.10.2 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:fnparse:pom:2.3.0 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:sosueme:pom:0.0.15 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact factual:c4:pom:0.0.8 from/to clojars (https://clojars.org/repo/): peer not authenticated
Could not transfer artifact hdfs-clj:hdfs-clj:pom:0.1.0 from/to clojars (https://clojars.org/repo/): peer not authenticated
This could be due to a typo in :dependencies or network issues.
Could not resolve dependencies

Does this just mean that https://clojars.org is down? If so, is there anything I can do about it, e.g. grab stuff from another server?

Add a step option to require force-rebuild

We should probably have an option that would require a force-rebuild, i.e.:

output <- input [+force]
   ...

which could be equivalent to specifying "force" evaluator (see #11):

output <- input [evaluator:force]
  ...

Also, it seems like the timecheck and check option names could be a bit confusing, since one might assume they accomplish exactly that.

Named input and output files

The way it stands now, multiple inputs are put into INPUT1, INPUT2, ... etc., which is convenient for simple steps, but can get complicated with more than a couple of inputs, and also makes editing and re-using steps' code harder. It would be nice if users were able to give a step's files names. Named files will be excluded from the automatic (INPUTX) variables and put in separate environment variables. Several files can share the same name - they are then concatenated with spaces and put into one variable. Example:

a(y), b(x), c, d <- e(y), f(z), g(z), h
  echo $INPUTN       ;; "1"
  echo $INPUT1       ;; "h"
  echo $OUTPUTN      ;; "2"
  echo $OUTPUT1      ;; "c"
  echo $OUTPUT2      ;; "d"
  echo $x            ;; "b"
  echo $y            ;; "a e"
  echo $z            ;; "f g"

"#" in branch hdfs path gets encoded if path already exists, causing hadoop rm command to fail

When using the --branch option and hdfs paths, it looks like the "#" symbol is encoded under the shell protocol if the path already exists, causing the hadoop rm command to fail.

Given the following workflow.d:

BASE=hdfs:/user/$[USER]/tmp

input <- !in.csv
  hadoop fs -rm -r $OUTPUT
  hadoop fs -copyFromLocal $INPUT $OUTPUT

run:
$ drake --auto --branch test +...

Running 1 steps...

--- 0. Running (forced): hdfs:/user/raronson/tmp/input#test <- in.csv
rm: `hdfs:/user/raronson/tmp/input#test': No such file or directory
--- done in 2.89s

Done (1 steps run).

$ drake --auto --branch test +...

Running 1 steps...

--- 0. Running (forced): hdfs:/user/raronson/tmp/input#test <- in.csv
rm: File does not exist: hdfs://namenode/user/raronson/tmp/input%23test
copyFromLocal: `hdfs:///user/raronson/tmp/input%23test': File exists
drake: shell command failed with exit code 1

verify BASE esp with c4

A user reported that BASE=./ was breaking things. This might be c4-specific. Also revisit the docs and ensure clarity.

HDFS file existence check failing

I'm using Hadoop 1.0.3 on an AWS Elastic-MapReduce cluster. I compiled drake with [org.apache.hadoop/hadoop-core "1.0.3"] set in project.clj.

When I set about trying to write this minimal recipe to copy a file to HDFS, it doesn't recognise that the file exists after the first successful run.

hdfs:///user/hadoop/myfile.txt <- myfile.txt
    if hadoop fs -test -e $OUTPUT; then
        hadoop fs -rm $OUTPUT
    fi
    hadoop fs -put $INPUT $OUTPUT

So, no matter how many times this is run, each run gives:

The following steps will be run, in order:
  1: hdfs:///user/hadoop/myfile.txt <- myfile.txt [missing output]
Confirm? [y/n]

Even though the output is there. This looks related to #15.

Timestamps appear to be reliable and in sync. That is, HDFS is reporting the timestamp of the output as being fresher than the input (at least, on the command-line).

hadoop@hadoop-master:~$ hadoop fs -ls myfile.txt
Found 1 items
-rw-r--r--   2 hadoop supergroup          6 2013-01-30 01:16 /user/hadoop/myfile.txt
hadoop@hadoop-master:~$ ls -l myfile.txt
-rw-r--r-- 1 hadoop hadoop 6 Jan 30 01:05 myfile.txt
hadoop@hadoop-master:~$ drake -a
Running 1 steps...

--- 0. Running (missing output): hdfs:///user/hadoop/myfile.txt <- myfile.txt
Deleted hdfs://10.117.143.22:9000/user/hadoop/myfile.txt
Step Duration Secs: 11

Done (1 steps run).

Any suggestions for a workaround?

Hook to detect dataset changes (MD5, etc...)

Rather than forcing an update when you know there is a change to a data source, it would be nice if it could use a hook to detect that automatically.

If using a database, you could use various methods to detect changes -- with postgres, perhaps using the WAL position would be enough, and that has no overhead.

Drake doesn't need double slashes when referring to HDFS

Our documentation suggests it does. Aaron, can you confirm that it's needed? I'm looking at the regression tests for HDFS and they don't use a double slash, but they still run OK.

What's going on here? We either need to fix Drake or fix the docs.

HDFS targets show in confirmation even if not needed

Terminal output here: http://pastebin.com/J08GAk1Y

drake has already run once, to completion. No files have been modified. drake correctly notices this and skips all the steps. Why does it still say it is going to do the steps?

All the steps involve at least one hdfs location. A very similar workflow that was all local didn't exhibit this same behavior.
