
bpipe's Introduction

Welcome to Bpipe


Bpipe provides a platform for running data analysis workflows that consist of a series of processing stages, known as 'pipelines'. Bpipe has special features to help with specific challenges in bioinformatics and computational biology.

Bpipe has been published in Bioinformatics! If you use Bpipe, please cite:

Sadedin S, Pope B & Oshlack A, Bpipe: A Tool for Running and Managing Bioinformatics Pipelines, Bioinformatics

Example

 hello = {
    exec """
        echo "hello world" > $output.txt
    """
 }
 run { hello }

Why Bpipe?

Many people working in data science end up running jobs as custom shell (or similar) scripts. While this makes them easy to run, it has a lot of limitations. By turning your shell scripts into Bpipe scripts, here are some of the features you get:

  • Dependency Tracking - Like make and similar tools, Bpipe knows what you already did and won't do it again
  • Simple definition of tasks to run - Bpipe runs shell commands almost as-is: very low friction between what works on your command line and what you need to put into your script
  • Transactional management of tasks - commands that fail get their outputs cleaned up, log files saved, and the pipeline cleanly aborted. No runaway jobs going crazy.
  • Automatic Connection of Pipeline Stages - Bpipe manages the file names for input and output of each stage in a systematic way so that you don't need to think about it. Removing or adding new stages "just works" and never breaks the flow of data.
  • Job Management - know what jobs are running, start, stop, manage whole workflows with simple commands
  • Easy Parallelism - split jobs into many pieces and run them all in parallel whether on a cluster, cloud or locally. Separate configuration of parallelism from the definition of the tasks.
  • Audit Trail - keeps a journal of exactly which commands executed, when and what their inputs and outputs were.
  • Integration with Compute Providers - pure Bpipe scripts can run unchanged whether locally, on your server, or in cloud or traditional HPC back ends such as Torque, SLURM, GridEngine and others.
  • Deep Integration Options - Bpipe integrates well with other systems: receive alerts to tell you when your pipeline finishes or even as each stage completes, call REST APIs, send messages to queueing systems and easily use any type of integration available within the Java ecosystem.
  • See how Bpipe compares to similar tools
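For instance, the parallelism described above can be expressed directly in the pipeline definition. A minimal sketch (the stage name and command are hypothetical, not from the Bpipe docs), splitting the inputs so each .txt file is processed in its own parallel branch:

```groovy
// Hypothetical stage: compress whatever input it receives
compress = {
    exec "gzip -c $input.txt > $output.gz"
}

// "%.txt" asks Bpipe to split the matching inputs into one parallel
// branch per file; the same definition runs locally or on a cluster,
// depending only on configuration
run { "%.txt" * [ compress ] }
```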

Ready for More?

Take a look at the Overview to see Bpipe in action, work through the Basic Tutorial for simple first steps, see a step-by-step example of a realistic pipeline made using Bpipe, or take a look at the Reference to see all the documentation.

bpipe's People

Contributors

bw2, dad, david-ma, doctormo, drtconway, druvus, gdevenyi, lonsbio, pditommaso, sajp, slugger70, ssadedin, ssayols, tommyli, tucano


bpipe's Issues

Bpipe should support OAuth for Accessing Google Services

From [email protected] on 2012-07-17T09:32:06Z

Bpipe's ability to send email and instant messages via Google Talk and Gmail currently requires the user to enter their Google password into a bpipe.config file. This is insecure for many users. Since Google offers OAuth for these services, Bpipe should support it for users who do not want to put their passwords in plain text.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=46

Support for Child Directories for Outputs of Pipeline Stages

From [email protected] on 2012-07-12T10:04:15Z

Currently Bpipe relies on all the outputs generated by your pipeline appearing in the same directory as the pipeline script (it's not compulsory, but some Bpipe features don't work, or you can get odd results, if you put output files elsewhere).

This really starts to break down when you have a huge number of files in a project or job, especially intermediate files that are just computational byproducts.

Bpipe should fully support tasks creating outputs in child directories.
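A sketch of how such support might look in a pipeline stage, assuming a hypothetical output.dir property that redirects a stage's outputs into a child directory (stage name and command are illustrative):

```groovy
sort_data = {
    // Hypothetical: route this stage's outputs into a subdirectory
    // instead of the pipeline script's directory
    output.dir = "work/sorted"
    exec "sort $input.txt > $output.txt"
}

run { sort_data }
```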

Original issue: http://code.google.com/p/bpipe/issues/detail?id=43

Deadlock and Corrupted Command Log when running huge number of Parallel Stages (>100)

From [email protected] on 2012-02-08T12:49:12Z

What steps will reproduce the problem?
1. Create a pipeline with 200 input files feeding into a single parallel stage
2. Execute the pipeline

What is the expected output? What do you see instead?
All 200 files should be processed and the script should complete.

Actual behavior - sometimes Bpipe hangs, and the command log contains statements overwriting each other and even misses some commands.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=12

bpipe log executed as a different user from the one who started the pipeline incorrectly reports that it is finished

From [email protected] on 2012-07-17T09:37:52Z

What steps will reproduce the problem?
1. Start a long running pipeline
2. Log in as a different user
3. Go to the pipeline directory and execute "bpipe log"

What is the expected output? What do you see instead?
Expect to see the ongoing log. Instead, Bpipe says it finished.

Guesswork: Bpipe is querying the status of the process id from the OS and is confusing a permission denied error with the process not existing.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=47

Merged job logging of parallel commands.

From [email protected] on 2012-07-10T07:53:45Z

What steps will reproduce the problem?
1. Run a pipeline with parallel commands.
2. Look at the screen log output.

What is the expected output? What do you see instead?
I expect to see the output of each command in its own file somewhere in the .bpipe directory. Instead, I see only one log file for the entire run, with the output of all of the commands interleaved. This is fine for sequential pipelines, but for pipelines that have jobs running in parallel it is a problem. For example, in a pipeline that runs Cufflinks as well as OverdispersedRegionScanSeqs (USeq package) in parallel, I see this type of thing when the outputs intersect with each other:

[16:18:02] Modeling fragment count overdispersion.
[16:18:02] Calculating initial abundance estimates for bias correction.<----- [From Cufflinks]

Processed 16024 loci. [*************************] 100% chr17random chr14random chrY chrX<----- [From OverdispersedRegionScanSeqs]
[16:38:19] Learning bias parameters.<----- [From Cufflinks]
chr16 chr19 chr5_random chr18<----- [From OverdispersedRegionScanSeqs]

There are many situations where this is problematic. For example, if a particular application gives similar output to another application running in parallel, it is impossible to distinguish them. Or, if multiple commands of the same application are running, it looks something like this:
Calculating read count stats...
444621 Queried Treatment Observations
Sample_43_CNT_24hr.bam
313186 Queried Control Observations
Sample_58_Hep_cnt_72hr.bam

Calculating negative binomial p-values and FDRs in R using DESeq ( http://www-huber.embl.de/users/anders/DESeq/).. .
chr19 chr5_random chr18

Calculating read count stats...
396066 Queried Treatment Observations
Sample_43_CNT_24hr.bam
279005 Queried Control Observations
Sample_58_Hep_cnt_72hr.bam

Calculating negative binomial p-values and FDRs in R using DESeq ( http://www-huber.embl.de/users/anders/DESeq/).. .
chr5_random chr18 chrM

Calculating read count stats...
445143 Queried Treatment Observations
Sample_43_CNT_24hr.bam
308133 Queried Control Observations
Sample_60_3T3_cnt_72hr.bam

There is no way to tell which initial command produced which particular output stats (except that in this case, the application was kind enough to provide us with the input file names, but not all applications will do this). Likewise, if multiple commands are running in parallel and one reports an error, it is difficult to know which command caused the error.

Ideally, the output could be saved in separate files for each command that is issued; then errors, etc. can be associated with their originating commands.

What version of the product are you using? On what operating system?
bpipe-0.9.5, Linux

Please provide any additional information below.
Maybe one way to do this would be to have a "command id" in the commandlog.txt file, after each command. Then within the .bpipe directory, you could have a directory containing files named .something containing 1) the present working directory, 2) the command, as well as 3) the stderr and stdout for the command. Furthermore, you could use this command id as a job name for submissions to the cluster; then, when we get an email notification of an error (e.g., non-zero exit status), we can go directly to this command output file to see the full screen output/error message.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=42

Simple stage execution makes Bpipe crash when launched twice in a row

From paolo.ditommaso on 2012-07-13T23:03:15Z

What steps will reproduce the problem?
1. Use the simple stage attached
2. Execute it with: "bpipe testcase.bpipe"
3. Without deleting the produced file 'foo.txt', execute a second time: "bpipe testcase.bpipe"

Bpipe crashes with the following stack trace:

$ bpipe testcase.bpipe
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.groovy.tools.GroovyStarter.rootLoader(GroovyStarter.java:108)
at org.codehaus.groovy.tools.GroovyStarter.main(GroovyStarter.java:130)
Caused by: java.lang.NullPointerException: Cannot get property 'class' on null object
at org.codehaus.groovy.runtime.NullObject.getProperty(NullObject.java:56)
at org.codehaus.groovy.runtime.InvokerHelper.getProperty(InvokerHelper.java:156)
at org.codehaus.groovy.runtime.callsite.NullCallSite.getProperty(NullCallSite.java:44)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callGetProperty(AbstractCallSite.java:227)
at bpipe.Utils$_isNewer_closure3.doCall(Utils.groovy:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:272)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:877)
at groovy.lang.Closure.call(Closure.java:412)
at groovy.lang.Closure.call(Closure.java:425)
at org.codehaus.groovy.runtime.DefaultGroovyMethods.every(DefaultGroovyMethods.java:1502)
at org.codehaus.groovy.runtime.dgm$215.doMethodInvoke(Unknown Source)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1047)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:877)
at org.codehaus.groovy.runtime.callsite.PojoMetaClassSite.call(PojoMetaClassSite.java:44)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
at bpipe.Utils.isNewer(Utils.groovy:70)
at bpipe.Utils$isNewer.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
at bpipe.PipelineContext.produce(PipelineContext.groovy:591)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1047)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:877)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:697)
at bpipe.PipelineContext.invokeMethod(PipelineContext.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1047)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:877)
at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:39)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
at bpipe.PipelineDelegate.methodMissing(PipelineDelegate.groovy:41)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaClassImpl.invokeMissingMethod(MetaClassImpl.java:804)
at groovy.lang.MetaClassImpl.invokePropertyOrMissing(MetaClassImpl.java:1096)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1049)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:877)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:697)
at bpipe.PipelineDelegate.invokeMethod(PipelineDelegate.groovy)
at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeOnDelegationObjects(ClosureMetaClass.java:423)
at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:346)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:877)
at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:66)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:46)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:133)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:145)
at script134217889892558917682$_run_closure1.doCall(script134217889892558917682.groovy:2)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:233)
at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:272)
at groovy.lang.MetaClassImpl.invokePropertyOrMissing(MetaClassImpl.java:1086)
at groovy.lang.MetaClassImpl.invokeMethod(...

Attachment: testcase.bpipe

Original issue: http://code.google.com/p/bpipe/issues/detail?id=44

Returning an Object from a Pipeline Stage Should not be Treated as an Output

From [email protected] on 2012-05-06T12:52:43Z

This behavior produces surprising results and forces confusing workarounds.

For example, this pipeline stage:

 hello = {
    exec "cp $input $output"
    x = "foo"
 }

Produces an error:

Pipeline failed!

Expected output file ./foo could not be found

This happens because the expression 'x = "foo"' evaluates to a String object, which, being the last expression in the pipeline stage, is treated by Groovy as a return value. Bpipe then sees it and considers it an output.

The "return value as output" behavior is in fact a relic from when Bpipe had no "produce", "transform" or "filter" constructs, and is now really needed only by internal parts of Bpipe. Removing it will make Bpipe clearer.

Note: there will still be occasions when the user needs to specify that a different output from the default one assumed by Bpipe should be forwarded to the next stage. This should be a separate construct, called "forward".
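A sketch of how the proposed "forward" construct might read (hypothetical usage based on the note above; the samtools command is illustrative):

```groovy
index_bam = {
    exec "samtools index $input.bam"
    // Instead of relying on the stage's return value, explicitly name
    // what the next stage should receive as its input
    forward input.bam
}
```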

Original issue: http://code.google.com/p/bpipe/issues/detail?id=22

Errors printed out when Bpipe run in Directory for First Time

From [email protected] on 2012-04-13T11:13:35Z

What steps will reproduce the problem?
1. Make a new directory
2. Create a dummy (empty) bpipe script
3. Run Bpipe using "bpipe run dummy.groovy"

What is the expected output? What do you see instead?
The script should run normally.

Instead, see errors before the normal output:

basename: missing operand
Try `basename --help' for more information.

These errors don't appear after the first run.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=20

Support for an easy way to parallelize based on region

From [email protected] on 2012-04-04T11:16:49Z

Currently Bpipe lets you easily parallelize if you split processing by sample or by splitting your input files up into pieces. However many tools can operate on regions independently without splitting files up. This is kind of tricky to do in Bpipe right now. It would be nice to have some support for a direct syntax to say "run this pipeline with every chromosome in parallel".

Syntax:

chr(1..22) * [ call_variants ]

Will create a $chr variable that the call_variants stage can use.
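Under that proposal a stage could then reference $chr directly; a hedged sketch (the variant caller invocation is hypothetical):

```groovy
call_variants = {
    // $chr would be injected by the chr(1..22) parallel split
    exec """
        java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper
            -R $REFERENCE -I $input.bam -L $chr -o $output.vcf
    """
}

run { chr(1..22) * [ call_variants ] }
```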

More detailed regions could be created using an organism specific database:

hg19.split(60) * [ calculate_coverage_depth ]

This latter would figure out how to split the human genome into 60 roughly even parts for you and pass variables $chr, $start, $end to the calculate_coverage_depth pipeline stage, making it really easy to parallelize data processing.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=18

Job id missing from commandlog.txt

From [email protected] on 2012-07-10T07:24:47Z

What steps will reproduce the problem?
1. Run the "helloworld" pipeline from the tutorial in the documentation.
2. Look in the .bpipe/logs directory.
3. Look in the commandlog.txt file.

What is the expected output? What do you see instead?
In the documentation (https://code.google.com/p/bpipe/wiki/Logging), it states: "If you wish to see the output log for an older job you need to find its Job Id, which you can see in the command log. Then you can find the log archived in directory called '.bpipe/logs/[Job Id].log'."

An example of a commandlog.txt file is shown at that website, with these lines:

Starting pipeline at Wed Feb 01 15:22:01 EST 2012 as Job 4225

Input files: test.txt

However, my commandlog.txt file looks like this:

Starting pipeline at Mon Jul 09 17:12:01 EDT 2012

Input files: []

Stage hello

echo Hello

Stage world

echo World

There is no "as Job " in the line beginning "# Starting pipeline ...".

What version of the product are you using? On what operating system?
bpipe-0.9.5.3

Please provide any additional information below.
Also, there is no single job id associated with this job; apparently, there are two job ids. There are two files in .bpipe/logs associated with the single job that I ran:

 ls .bpipe/logs/
 20038.log 20045.bpipe.log

I expected to see two files with the same id, e.g.,
20038.log, 20038.bpipe.log

Attachment: 20045.bpipe.log 20038.log commandlog.txt

Original issue: http://code.google.com/p/bpipe/issues/detail?id=41

Bpipe configuration defaults

From [email protected] on 2012-05-20T16:37:04Z

At the moment you can put a bpipe.config file in the local pipeline directory to control certain behaviors of Bpipe. Often, however, these settings are shared across pipelines, so it would be helpful to have Bpipe load defaults from a common location (e.g. the home directory) and then merge those with any bpipe.config found in the local directory.
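For example, shared defaults such as the executor could live in a common file (the ~/.bpipeconfig name below is an assumption, not an existing feature), with the local bpipe.config overriding individual values:

```groovy
// ~/.bpipeconfig (hypothetical shared defaults, loaded first)
executor = "torque"
queue    = "batch"

// ./bpipe.config (local file, merged over the defaults)
executor = "local"   // this pipeline's jobs run locally instead
```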

Original issue: http://code.google.com/p/bpipe/issues/detail?id=26

Error in Parallelizing same task with different inputs

From [email protected] on 2012-05-11T06:39:33Z

I am trying Bpipe with one of our in-house programs. The program runs fine when I run a single version of the "exec" command, but when I try to run the same command while changing one input parameter and parallelizing the two shell commands, I get an error:

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
script_from_command_line: 4: expecting anything but ''\n''; got it anyway @ line 4, column 192.

1 error

Any help would be appreciated.

Thanks
Jess

Original issue: http://code.google.com/p/bpipe/issues/detail?id=24

RealPipelineTutorial should include Bpipe.run line in each example

From [email protected] on 2012-02-07T17:36:40Z

I naively copied the last version of the example pipeline from https://code.google.com/p/bpipe/wiki/RealPipelineTutorial. It does not include the Bpipe.run line (which is mentioned earlier on the same page).

Forgetting to put this in the script means it will not do anything.

So at the expense of extra redundancy it might be worth repeating the run line in each iteration of the running example.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=7

Transform and Filter should accept multiple arguments

From [email protected] on 2012-05-06T12:55:32Z

Currently you can specify that a section of a pipeline stage transforms an input like so:

 foo = {
    transform("csv") {
        ....
    }
 }

However if the code performs multiple transformations then you can't easily specify them both together. Bpipe should support syntax such as:

 foo = {
    transform("csv","xml") {
        ...
    }
 }

The same would go for filtering operations.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=23

Pipeline file can't have same name as pipeline stage

From [email protected] on 2012-02-07T17:42:09Z

I had a simple pipeline with a stage called "hello" and I saved the whole thing in a file called hello.pipeline.

This gave me an error about assigning to the variable called hello:

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
/vlsci/VLSCI/bjpop/code/bpipe-0.2/test/hello.pipeline: 1: you tried to assign a value to the class 'hello'. Do you have a script with this name?
@ line 1, column 1.
hello = {
^

1 error

Original issue: http://code.google.com/p/bpipe/issues/detail?id=9

'bpipe log' shows output from runs even when Bpipe didn't successfully start

From [email protected] on 2012-02-08T10:46:47Z

What steps will reproduce the problem?
1. Execute 'bpipe run some_pipeline.groovy'
2. Execute 'bpipe log'
3. After the run finishes, execute 'bpipe' (no arguments)
4. Now execute 'bpipe log'

What is the expected output? What do you see instead?
Expect to see the output from the last run pipeline.

Instead see help output.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=11

Transitive Dependencies / Dealing with Large Intermediate Files

From [email protected] on 2012-01-30T20:55:59Z

Bpipe needs to have a way of allowing intermediate results to be deleted or dispensed with that won't trigger recomputing of them if the pipeline is re-executed.

Suppose you have 3 files in a dependency chain where A is used to compute B and B is used to compute C.

Bpipe knows that A produces B and that B produces C. However, it does not know that, at a higher level, A produces C. Thus if one deletes file B, Bpipe will recompute B and then C, even if C was newer than A to start with.

This problem manifests when dealing, for example, with SAM files, which are much larger than, but equivalent to, a BAM file. After you have the BAM file you really don't need the SAM file, but with Bpipe you have to keep it anyway, wasting disk space, because otherwise Bpipe will try to recompute it.

Workarounds:

  1. Don't allow such intermediate files to be recognised as outputs (that is, don't put them in produce, transform, or filter statements, etc.). The downside of this is that Bpipe is less likely to clean them up if the pipeline fails.
  2. After making the final file, remove the intermediate file (e.g. the SAM file) and create a 'dummy' intermediate file to trick Bpipe. Then touch the final file to make it "newer" than the intermediate file. This is a hack.
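Workaround 1 amounts to never declaring the intermediate file. A sketch (stage and commands are hypothetical): the SAM file is written to a fixed name that Bpipe never records as an output, so deleting it later does not invalidate the BAM:

```groovy
align = {
    // tmp.sam is deliberately NOT declared via produce/transform/filter,
    // so Bpipe tracks only the BAM and won't recompute when tmp.sam is
    // deleted. Downside: Bpipe won't clean tmp.sam up if the stage fails.
    exec """
        my_aligner ref.fa $input.fastq > tmp.sam &&
        samtools view -bS tmp.sam > $output.bam
    """
}
```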

Original issue: http://code.google.com/p/bpipe/issues/detail?id=2

Recompute from Arbitrary Stage

From [email protected] on 2012-01-30T21:04:59Z

At the moment if you want Bpipe to recompute all results from a particular stage forward when Bpipe thinks they are up to date you need to physically delete or touch a file that will make Bpipe think a dependency is out of date. This makes it hard to reliably ensure results are properly recomputed when you change something that Bpipe is unaware of that affects your pipeline.

This could be implemented in the form of a 'clean' command that allowed results to be purged from a specified stage forward, or it could simply be an explicit

 bpipe rerun <input files ...>

Original issue: http://code.google.com/p/bpipe/issues/detail?id=3

Support for Report of Run in HTML Form

From [email protected] on 2012-05-20T16:33:55Z

It would be very useful in multiple scenarios to be able to get a report out of Bpipe that is readable by normal humans (eg: that you can email to somebody, etc.).

Ideally this will be HTML form and will show all the stages that ran, whether they succeeded or failed, the timings, the inputs and outputs, and will also allow the user to add other documentation.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=25

Nohup Issue on Slackware < 13.0

From [email protected] on 2012-01-27T09:53:46Z

Information from user:


Bpipe depends on nohup. But the "nohup" command behaves differently on some Slackware versions (e.g. Slackware 12.1) than on other Linux distros. The "nohup" command is used to execute each section of the pipe script in the background. If you edit the bpipe script in the bin/ directory and search for "nohup" you'll see how it is used by bpipe.

The "nohup" command insists on having everything directed to a file (either specified by "> filename" or, by default, a file "nohup.out"), so nothing gets sent to the terminal. The funny thing is that this behavior is limited to Slackware versions < 13.0.

If you edit the bpipe script, add the following lines at line 184 (just before the nohup line) and comment out the nohup line:

 j=$$
 touch .bpipe/logs/$j.log
 java -classpath "$CP" -Dbpipe.pid=$j bpipe.Runner $TESTMODE $* 2>&1 &
 disown

then you'll get the output to the screen and it will still run in the background if you log out. I'm sure if we spend more time researching "nohup" we'll find a way to get it to display to stdout, but I'm not in a position to do that.

That being said, since you may be running complex, long jobs that won't finish after a few seconds, you can just use the original bpipe script with the "nohup" command as it is. The output will be appended to bin/.bpipe/logs/$$.log, which you can tail to see the progress output of your pipe script.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=1

Bpipe log command hangs requiring Ctrl-C when Job is Finished

From [email protected] on 2012-01-31T09:28:22Z

After Bpipe runs a pipeline, one can see the running log of outputs using the 'bpipe log' command, which is executed using 'tail -f'.

However, Bpipe still uses 'tail -f' even when a pipeline is finished, creating a confusing situation for the user, who may keep waiting for the command to finish.

It would be better to execute tail without '-f' when the pipeline has finished running.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=4

Does Not Work for a Symbolic Link to bpipe

From yanlinlin82 on 2012-06-08T01:03:36Z

What steps will reproduce the problem?
1. PATH is set to a symbolic link to the unzipped bpipe.
2. 'bpipe run some.pipe' does not work.

What is the expected output? What do you see instead?
It should work for a symbolic link.

What version of the product are you using? On what operating system?
I was using bpipe-0.9.5 on a Gentoo-Linux-3.2.12 system.

Please provide any additional information below.
Changing line 287 of the bpipe file to:

 BPIPE_HOME=$(dirname $(realpath $0))/..

should fix the problem.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=34

Execute multiline commands

From [email protected] on 2012-07-25T17:44:42Z

Hi Simon,

I have started using Bpipe.

When I initially attempted a multiline command with exec, e.g.

 exec "some
 command"

I got the error:

 org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
 script_from_command_line: 10: expecting anything but ''\n''; got it anyway @ line 10, column 90.

The problem was easily fixable by rewriting the command with triple double-quotes:

 exec """some
 command"""

I believe that the documentation only mentions triple double-quotes in the context of commands that contain quotes, so I am not sure whether this is a bug or not.

Cheers,

Florent

Original issue: http://code.google.com/p/bpipe/issues/detail?id=50

RealPipelineTutorial should make use of variables to define reference.fa and MarkDuplicates.jar

From [email protected] on 2012-02-07T17:39:04Z

https://code.google.com/p/bpipe/wiki/RealPipelineTutorial refers to reference.fa multiple times and uses an absolute path for MarkDuplicates.jar.

In both of these cases it might be valuable to assign them to variables at the top of the script.

It would create a single point of control and show the user how to use variables in their script.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=8

Output from commands like 'bpipe test' or printing help should not become the most recent log

From [email protected] on 2012-02-26T17:55:50Z

Say you have a long running pipeline, and you look at the output with

bpipe log

Then if you execute

bpipe

The help is printed out. But then after that

bpipe log

Shows the help as the log output from the previous command. This is very annoying because you can't see the actual log file from your long running command.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=15

More generalized form of Parallelization than Splitting by Chromosome

From [email protected] on 2012-04-20T23:49:11Z

Currently Bpipe supports splitting up work in a pipeline into parallel execution threads separated by chromosome. However, there are numerous other ways to split up work, so this would make sense as a generalised feature. For example:

    @Transform("sam")
    align_stampy = {
        exec """
            python $STAMPY_HOME/stampy.py
                --bwaoptions="-q10 $REFERENCE"
                -g $STAMPY_GENOME_INDEX
                -h $STAMPY_HASH_FILE
                -M $input1,$input2
                -o $output
                --readgroup=ID:$rg_id,LB:$rg_lb,PL:$rg_pl,PU:$rg_pu,SM:$rg_sm
                --processpart=$part
        """
    }

    Bpipe.run {
        part("1/3", "2/3", "3/3") * [ align_stampy ]
    }
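The generalised split proposed above can be sketched in language-agnostic terms. The following is an illustrative sketch only, not Bpipe syntax or its implementation; `align` and `run_split` are hypothetical stand-ins for the aligner stage and the split operator:

```python
# Illustrative sketch only (not Bpipe code): a generalised split fans one
# task out over arbitrary part specifiers and runs each branch in parallel.
from concurrent.futures import ThreadPoolExecutor

def align(part):
    # Hypothetical stand-in for the aligner invocation; in the proposal
    # above, $part would be substituted into --processpart per branch.
    return f"aligned part {part}"

def run_split(parts, task):
    # Each part specifier becomes an independent parallel branch,
    # analogous to part("1/3", "2/3", "3/3") * [align_stampy].
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, parts))

results = run_split(["1/3", "2/3", "3/3"], align)
```

The key point is that the part specifiers need not be chromosomes; any list of work descriptors can drive the fan-out.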

Original issue: http://code.google.com/p/bpipe/issues/detail?id=21

Pause Command

From [email protected] on 2012-07-20T11:27:16Z

Currently you can use 'bpipe stop' to stop a running pipeline, which interrupts all jobs, i.e. kills them, and cleans up their outputs.

Sometimes I want to "nicely" stop a pipeline without abandoning the tasks in progress. It should let all running commands continue to completion, but not launch anything new, and exit when the last command finishes.

This would allow me to adjust pipelines, interleave a different job I forgot on the same computer, etc. without losing lots of work every time I stop a Bpipe job.

Original issue: http://code.google.com/p/bpipe/issues/detail?id=49

Ability to Pass Command Line Options to Script

From [email protected] on 2012-07-05T20:58:51Z

Bpipe now supports parameters inside a pipeline, e.g.:

    foo = {
        exec """echo "$message" """
    }
    run { foo.using(message: "hello world") }

However this forces you to hard code the parameter in your pipeline. It would be nice to be able to pass that from the command line when you run your Bpipe script.

    message = "hello there"
    foo = {
        exec """echo "$message" """
    }
    run { foo }

Now "hello there" is a default, but (ideally) we could run Bpipe like this:

    bpipe run -p message="hello world" pipeline.groovy

And this would override the value of the message variable.
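The requested behaviour can be sketched generically: a default declared in the script that a `-p name=value` command-line option overrides. This is an illustrative sketch, not Bpipe's implementation; `resolve_params` is a hypothetical helper:

```python
# Illustrative sketch only (not Bpipe's implementation): script defaults
# that -p name=value options on the command line can override.
import argparse

def resolve_params(argv, defaults):
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", action="append", default=[], metavar="NAME=VALUE")
    args, _ = parser.parse_known_args(argv)
    params = dict(defaults)          # defaults declared in the script
    for pair in args.p:
        name, _, value = pair.partition("=")
        params[name] = value         # the command line wins
    return params

# message = "hello there" in the script is the default ...
params = resolve_params(["-p", "message=hello world"], {"message": "hello there"})
# ... but -p message="hello world" overrides it.
```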

Original issue: http://code.google.com/p/bpipe/issues/detail?id=38

Support for Detecting when Commands have Changed in Rerunning Pipeline

From [email protected] on 2012-02-08T12:55:25Z

Currently Bpipe just uses timestamps to detect whether it needs to re-run a command in a pipeline stage.

If, however, the command itself has changed, then the outputs will be invalid even though they are newer than the input files. Bpipe should therefore support a way to detect whether the command has changed since the outputs were created.
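One common way to implement this, sketched here as an assumption rather than Bpipe's actual approach, is to record a hash of each command next to its outputs and compare it on the next run, in addition to the timestamp check:

```python
# Illustrative sketch only (not Bpipe's implementation): store a hash of
# the command alongside its outputs and compare it on the next run.
import hashlib

def command_hash(command):
    return hashlib.sha256(command.encode()).hexdigest()

def needs_rerun(command, recorded_hash, inputs_newer_than_outputs):
    # Re-run when the timestamp check fires OR the command text differs
    # from the one that originally produced the outputs.
    return inputs_newer_than_outputs or command_hash(command) != recorded_hash

recorded = command_hash("bwa aln ref.fa reads.fq")
```

With this scheme, editing the command in the script (e.g. adding a flag) triggers a re-run even when all timestamps say the outputs are up to date.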

Original issue: http://code.google.com/p/bpipe/issues/detail?id=14
