stephens999 / dscr Goto Github PK

View Code? Open in Web Editor NEW

16.0 16.0 10.0 821 KB

Dynamic statistical comparisons in R

R 100.00%

dscr's People

Contributors

Stargazers

Watchers

Forkers

nkweiwang surbut mengyin zrxing ramanshah benjaminpeter jhsiao999 fboehm pcarbo

dscr's Issues

Revisit structure of output parsers

The way that output parsers are implemented via a Sys.glob call that changes outputs to other outputs is quite complicated. The logic of the execution engine is very convoluted, and it may continue to block the implementation of parallelized run_output_parsers or run_scores as well as a correct/safe reset_output_parsers.

It might be worth revisiting the goals and desired functionality of output parsers and to re-implement them in a more careful/rigorous way that gives the overall dsc workflow an execution path that is easier to introspect.

browseVignette names are not very informative

when I browseVignettes("dscr") I get
A Vignette to illustrate dscr - HTML source R code
A Vignette to illustrate dscr - HTML source R code

it would be better for the vignettes to have more informative titles!
(and for the elementary vignette to be listed first)

using system call to run external programs eg matlab

when using system() R proceeds to next command once external program has opened (ie completion of system() command equals to opening the program, not when the program finishes running). This affects windows systems, but not linux systems.

todo for release v0.1

turn parsers to outputparsers
make default_score and default_data directories
write second vignette
switch to using RDS files
give scores a datatype=default_data as well as an outputtype

problem with single score?

i have a dsc where I'm outputting the score as a list with one element, score returns
list(diff1=x) say.
The resulting scorename appears to be NA, and this causes a crash.

When I modify the score to return
list(diff1=x, diff2=0)
everything works ok.

Integration with batchJobs

We've had various discussions about how to provide better support for long-running jobs.
To me it seems that by making use of batchJobs, and particularly its waitForJobs function,
we should be able to get something that works with relatively little code.

Currently we have, in run_dsc, the code:

runScenarios(dsc,scenariosubset,seedsubset)
runMethods(dsc,scenariosubset,methodsubset,seedsubset)
runOutputParsers(dsc)
runScores(dsc,scenariosubset,methodsubset)

The simplest approach that I can see would involve submitting jobs to do each
of these functions, and using waitForJobs to wait between each job set.

runScenarios (by submitting to batchJobs)
waitForJobs()
runMethods (again through batchJobs, a second registry of jobs this)
waitForJobs()
runOutputparsers (again through batchJobs, a third registry)
waitForJobs()
runScores (batchJobs, a fourth registry)
waitForJobs()

@ramanshah is there a reason you can see that this would not work?

we're being schizophrenic about naming of scores

currenty, in runScore we have
results=c(score=score$fn(data,output),as.list(timedata))

this means that if the output of score$fn is a named list, eg list(MSE=x)
then the score becomes names score.MSE instead of MSE.

This is nice if score$fn returns a vector c(1,2,3) as the scores are then
names score.1, score.2, etc
but it seems we want this renaming to occur only in this case.

This behaviour (which is a relatively recent change) has broken the vignette.

create gold standard methods

sometimes it is helpful to compare a method with a "gold standard" method that could not
actually be run in practice. To do this is would be good to add a new type of method, that instead of taking input, takes data=(meta,input). The idea is that the gold method has access not only to the input, but also to meta. This allows the method to "cheat". The only difference in implementation between a gold_method and a method would be that it is passed data instead of input.

I guess I envisage adding a flag for each method saying whether or not it is a "gold" method.
Then we can add a function
add_gold_method
or overload add_method with a new parameter gold=TRUE/FALSE.
so that the user can add a gold method.
Finally when we run a method, we would have to check whether it is flagged as gold. If so then
it gets passed data=(meta,input); otherwise it gets passed just input.

thoughts @ramanshah ?

store time and memory of each application

would be good to have dscr store the time and memory it took for each analysis

vignette does not provide good overview of work flow

problem is that vignette is interactive, but preferred workflow is very much not. Need to emphasise this in the vignette, and probably provide a pointer to a github repo (dsc-osl?) where a dsc repo with recommended structure and workflow is provided.

keep dscr going with some methods fail in some senarios.

I have a problem in using dscr package, I have one method called "PCS" report error during runing with the senario called "real_data". But I still have other methods which work well on this senario. But in dscr 0.0 version I used, when one methods report error, the whole procedure is force to stopped. It would better to have dscr keep running with some methods failing during the process and reporting error in the end.

input parser

it is intended that we also allow "input parser" as well as output parser. This is a to-do item... just documenting it here.
Something as an aside/motivation, it might help with @mengyin 's use where she is actually
comparing combinations of methods with different pre-processing steps. She is
currently doing all the different pre-processing steps in the datamaker, which is rather messy.
If we had input parser functionality then this could deal better with this - each
pre-processing step could be implemented as an input parser.

This raises the question of how to deal more generally with multi-step pipelines... the idea of a parser whcih takes one type of file and changes it to another is basically rather general...

possible new option: save all output from applying a method, not just those in the output requirement

currently just save output in "output" format. Sometimes it is useful to store all the results of a method as well as the "output" part for the dsc. Eg to make extending the dsc easier in the future.

implement dsc, AddMethod and AddScenario

would be cleaner to have dsc=(Methods,Scenarios) and to have an addMethod function and addScenario function, [along the lines of BatchExperiments]

Add code to track progress of dscr

woudl be nice if it said a bit about what it was doing. Maybe add a verbose option?

Convert dsc to BatchExperiment to allow easy submission of jobs to cluster

Seed subsets

The engine currently has seed subsetting appear in a couple of places with a tacit assumption that the seeds are the same between scenarios, i.e. that a dsc (ignoring output parsers) is a Cartesian product:

scenarios x seeds x methods x scores.

However, the dsc itself allows the user to specify an arbitrary set of seeds for each scenario, leading to a situation that could look like this:

scenario seeds
scen1    1 2 3 4
scen2    5 6
scen3    1 2 3 4

(Actually I wonder if users are already depending on this behavior and using seed subsets to do funny things with their workflows.)

It's worth discussing what functionality we want to provide in terms of seed subsetting. At the coarsest level of control, we could allow no seed subsetting at all. At the finest, we could allow the user to pass a data frame of the exact scenario/seed combinations that he or she wants to execute. The current state is somewhere between these two and encourages whimsical, opaque hacks.

Another thing I've noticed in my refactoring is that one cannot currently subset run_scores according to a seed subset.

Confusion with tests/one_sample_location_longoutput

In the dscr/tests/one_sample_location_longoutput, there is a non-current version of adding methods and scenarios. It uses the list format instead of the addScenarios and addMethods. I stumbled upon this when writing my dsc and accidentally used this format at first. I think that the presence of this on the general site may be confusing to other first-time users. Thanks!

problem with way path to output files is stored

the problem is that the output file path is stored relative to the directory
in which the dsc is run. This means we have to specify this directory when loading
examples etc later. I've modified load_example to take a paprameter home.dir to allow this to be specified. but maybe this can be done better?

Remove methods/scenarios? List current methods/scenarios?

It might be desired to have a function to list all methods/scenarios of the current dsc object?

There are no functions to directly remove some methods/scenarios from a dsc object neither....

automatic makefile

I'm thinking the next step is to make it easy to only update the things that need updating. This seems like it would most naturally be accomplished using make. To make it easier I'm thinking we could maybe use R to create the makefile. So have a function update_makefile that creates/updates a makefile for the project. The makefile could have targets like "make_params", "run_methods" etc of the form "Rscript xxx" that allow the most common actions to be easily perfomed.

Vignette depends on ashr

The vignette dsc_shrink.rmd depends on ashr. This causes installation failure for any user who does not previously install ashr.

Function as method output

I am working on a dscr and thought it would be best for each method to return a function rather than a value or set of values. I am running into a problem when the dscr calls get_results_singletrial. This function attempts to recast the results list as a dataframe, but I don't think this works when one of the elements of the list is a function. The in progress code is here https://github.com/jfdegner/BlackJack

Set up Travis-CI to run automatic build and check

Another thing appears in my mind is maybe using Travis-CI (https://travis-ci.org) to build and test the package remotely and dynamically - every time we push the commits to GitHub, Travis will rebuild and retest the package. We can run any shell scripts to generate and test the results we want - on their virtual machines. Travis would be useful for team collaboration, and free us from the tedious manual testing.

A short tutorial of Travis-CI:

http://jtleek.com/protocols/travis_bioc_devel/

"run_dsc" runs slow if score function returns a long row vector

The function "run_dsc" runs very slow when aggregating results if the score function returns a long row vector/list/array (i.e. "results" is a long row vector/list/array).

It might be due to line 140 in "main.R":
return(data.frame(seed=seed, scenario=scenario$name, method=method$name, results))

The "data.frame" function runs slow when combining high dimension row vector, but pretty fast for high dimension column vector. For example, the second command runs much faster than the first command:

system.time(data.frame(seed=1,t(data.frame(1:10000)),check.names=FALSE))
user system elapsed
0.035 0.003 0.037
system.time(data.frame(seed=1,data.frame(1:10000),check.names=FALSE))
user system elapsed
0.001 0.000 0.001

P.S. It is interesting that in this case the "cbind" function runs much faster than "data.frame":

system.time(cbind(seed=1,t(data.frame(1:10000)),check.names=FALSE))
user system elapsed
0.001 0.000 0.001

Results always get overwritten

Although we check before running scenarios and methods, the results/scores always get
recomputed even if they already exist. I'm not sure whether there is a good reason for this..

should be more flexible about allowed form for score function

From jfdgner: run_dsc failed around the aggregate_results function call. It turned out to be fixed by changing my score function so that it returned a data.frame instead of a numeric vector. I don't know if it is worth adding some flexibility here.

would be good to have a way to check output formats of everything match required specifications.

also might be helpful to have a function that runs the dsc on just 1 rep per scenario, so people can test before running the whole thing.

possible enhancement: option to save results as text instead of rdata?

run_dsc() still aggregates all scores when using "seedsubset=..." option?

When using "seedsubset=..." option, the function run_dsc(...) still aggregates scores of all seeds.

error in creating vignettes: dsc_shrink.rmd

when i installed the latest dscr package, i got an error message about dsc_shrink vignettes; one sample location vignette seems based on the previous version of dscr

devtools::install_github("stephens999/dscr",build_vignettes=TRUE)
browseVignettes("dscr")

The downloaded source packages are in
    ‘/tmp/RtmpAAt6QJ/downloaded_packages’
'/usr/lib/R/bin/R' --vanilla CMD build '/tmp/RtmpAAt6QJ/devtools165a6d425053/stephens999-dscr-92cd44e'  \
  --no-resave-data --no-manual 

* checking for file ‘/tmp/RtmpAAt6QJ/devtools165a6d425053/stephens999-dscr-92cd44e/DESCRIPTION’ ... OK
* preparing ‘dscr’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
Quitting from lines 166-168 (dsc_shrink.rmd) 
Error: processing vignette 'dsc_shrink.rmd' failed with diagnostics:
could not find function "ncomp"
Execution halted
Error: Command failed (1)

possible enhancement: load data only once and apply all methods?

however, this wouldn't work well for distributed cluster computing....

Vignettes Browsing

There are two ways of installing dscr packages: i) devtools::install_github("stephens999/dscr") in R; ii) make in terminal. Only approach (ii) can produce the vignettes in my machine.

Here is the output of approach (i) in Rstudio:

list methods and sceniarios

would be nice to have functions to list all methods and scenarios defined

install dependencies

some users are reporting problems with make deps, and particularly installing devtools

Big if-else datamaker in one_sample_location

The big set of if-else blocks in the one_sample_location vignette violates the single responsibility principle, and it is encouraging students to write a single, enormous (hundreds of lines) datamaker function with similar if-else logic for their own complicated statistical cases. This is digging them in a hole in terms of testing and debugging.

From the codebase it seems that dscr allows one to have several simple datamakers that get specified at the addScenario level. This seems to me to be better software practice, and we should teach this pattern via the vignette.

(Matthew - I'm filing this issue basically as a todo for myself. Feel free to just wait for me to implement it, unless you disagree or otherwise want to comment.)

Reset a scenario

There is no simple way to delete the output files (data, output, results) from a dsc within R. Add reset functions so that all files can be recreated if dependencies have changed. For example, running reset_scenarios(scenarios) and reset_methods(methods) followed by res=run_dsc(scenarios, methods, score) would run the dsc completely from scratch. Alternatively, a function rerun_dsc(scenarios, methods, score) would do the work of reset followed by run.

Make dscr minimally ready for broader collaboration

Strip the dscr API down to as few exported commands as possible, and rename them in snake_case (#44). This is a serious breaking change, and I'd have to assist students in getting all their dscs back on the rails subsequently.
Replace the guts of the execution engine with BatchJobs registries for parallel execution (#23, #46)
Complete some vignette work to illustrate the tooling for fixing mistakes and working noninteractively (#35, #47, #48)
Flesh out the reset functionalities so that other phases of the computation can be reset (#43)
Assist in the process of packaging only the code and .RDS scores (as opposed to raw data or the other cached results of computation) as a Git repo such that the repo size is GitHub-friendly (this is a new one)

At that point, I think that dscr repositories (versioned under Git and hosted by GitHub) should be effective enough to facilitate new methodological collaborations, where new scenarios or methods could be shared via GitHub pull requests. Rerunning other people's computation (to get the non-score stuff) could be done on an as-needed basis, but the idea is that such "auditing" tasks would be rarer than adding code and scores to the repo.

This specifically punts on some of the other directions in the interest of time:

CRAN-ready or otherwise cleanly engineered build (#38)
Input parsers or any more depth/complexity to the workflow hierarchy (#42)
Smarter handling of the dependency graph (discussion in #43)
Tooling for safe partial reruns of a project as the code itself changes (discussion in #43)

@stephens999 This is fully open to debate.

Possible change of license

Hi Matthew, great project!

I saw that you wrote "Creative Commons License" in the DESCRIPTION file.

First of all, I don't think this is enough because there are many different types of CC licenses.

Furthermore, they don't recommend using CC for software, instead they mention the GPL. As an example, Hadley Wickham chose GPL (>= 2) for devtools.

Given the aim of the project, I think the GPL makes sense.

Best,
Tim

better error catching for methods that might fail

perhaps we should use try() on every method, so that if it fails we don't crash...

Show which trials went wrong when "running Scores" or "running outputParser"?

If an error occurs in the "running Scores" or "running outputParser" step, there is just an error message but it doesn't tell me which trial caused that error. It'll be helpful for debugging if dscr can show which trials went wrong...

Warning on load

When I load dscr, I get a warning:

Warning message:
replacing previous import by ‘psych::%+%’ when loading ‘dscr’

Not a functional problem, but we'll want to fix it before going public with this package.

Spirit of dscr

Talking today with @xiangzhu and bringing his dsc up to date was illuminating. Like most others who use dscr in practice, he "cheats" dscr by having his method wrappers point to hand-prepared datasets already pre-processed to drop into the various methods he want to test. His input objects are just fragments of filenames, and his wrapper functions simply assemble an input filename from the input fragment and the name of the method to be tested, and load the hand-prepared data residing in this file. This practice struck me as kind of smelly, and our discussion clarified my thoughts on why we claim that dscr is a tool for making research more reproducible. It boils down to users obeying and implementing a single kind of interface:

Each scenario emits an actual dataset, not a pointer to a dataset, in a standardized format
Each method, via a "wrapper," ingests an actual dataset in a standardized format and prepares it for the idiosyncratic needs of a specific method.

To me it seems that the spirit of dscr is that each method gets handed a bitwise identical copy of actual data to process, and that a scientist new to a project will be able to audit for preprocessing errors, configuration choices, etc., by tracing the execution path preceding the run_dsc verb. To date, I've yet to see a student do a real scientific application with dscr that actually implements the above interface with rigor. And worse, the dark "off-the-books" portion of the benchmarking study is usually convoluted and almost never documented. My feeling is that mixing undocumented data-preparation with dscr results in something that is even less reproducible than performing a benchmarking study by hand but with a moderate quality of lab notes describing the process.

Thoughts? Suggestions for application with which we can convey the above somehow with a third vignette?

add reset scores, reset parser

analogous to reset_dsc (method, scenario) need way to reset scores and parsers

vignette does not illustrate reset_dsc

need to have some discussion of features like this somewhere. perhaps in vignette.

Series of simulations

in some cases one wants to vary simulations - eg do 100 simulations with a parameter p varying
from 0.01 to 1.00 by steps of 0.01. Would be nice to have a way to do this.

Naming conventions

It would be nice to keep function names consistent.

Currently, there are functions that are dot separated (new.dsc), camelCase (addMethod) and underscore delimited (run_dsc). It would be nice to make the names consistent in one way or the other.

Clean build

This will require a fairly major effort; I'm gathering my thoughts for discussion/debate.

A goal, probably a prerequisite for wide release such as on CRAN, is for the GitHub repo to contain a clean specification of the package that is as non-repetitive as possible while remaining useful. The R tooling (e.g., devtools and roxygen2) causes a lot of duplication of information by compiling code from one place into code in other places. There's controversy about which duplication to check into git: some package developers, for example, go for maximum parsimony, excluding NAMESPACE or any documentation .Rd files because they can be built by roxygen2. Others (including Hadley) recommend keeping them in git so that devtools::install_github will ship documentation to users.

Currently, a lot of stuff is checked into this repo, and it has gotten out of sync in various ways. Many of the devtools verbs shred up the repository in surprising ways. For example, devtools::build_vignettes() deletes vignettes/dsc_shrink.html and puts a fresh vignette output in /inst/doc which is .gitignored. The package developer has to manually move files around to undo this.

In any case, we need to document and standardize a build process (it will likely consist of just a few devtools magic words that correspond to a known sequence of actions in RStudio) and automate enforcement that all duplicated/cached artifacts in the repo are fresh: that, for instance, all .Rmd and other roxygen2-generated files are consistent with the roxygen2 comments in the main codebase. I believe I can build Travis-CI tooling for this.

There are two ways to go in my mind, depending on priorities:

If we are hoping to get dscr onto CRAN substantially as is, I could put my efforts into incrementally achieving a clean build for the project.
If instead we are hoping to make major changes (e.g., rebuild a rather different dscr on top of BatchJobs for seamless parallelization) for a later release, it might be less work to start from the bottom, with a fresh package, and document the build process and all of its artifacts step by step. We'd graft the essential code into the new package piece by piece.

possible bug: reset_dsc doesn't seem to remove scores?

could be more flexible when method doesn't return certain parts of the output

Suppose a method only returns some aspects of the formal dsc output - eg a method for doing FDR estimation only outputs q values and not an estimate of pi0, but you want to compare methods on both. Ideally we want that method to be scored for the parts of the output it produces and not for others. Currently you need to explicitly make the method return NA for that part. It might be nice to automatically make it return NA for parts of output that it doesn't compute?