gwf's People

Contributors

birc-aeh, dansondergaard, gregoryleeman, hogfeldt, jakobjn, jbethune, kaspermunch, mailund, micknudsen, tobiasmadsen

gwf's Issues

Local execution

Add an option for running a workflow locally, i.e. without submitting to a computer grid. This can be useful for testing a workflow with smaller data sets before running it on the larger data sets where the full power of a compute grid is needed.
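
A rough sketch of what local mode amounts to, assuming each target is an (options, spec) pair as in the examples further down and that the list is already in dependency order; none of this is existing gwf code:

import os
import subprocess

def run_locally(targets):
    """targets: list of (options, spec) pairs, already in dependency order.
    Run each spec as a shell snippet if any of its output files are missing."""
    for options, spec in targets:
        outputs = options['output']
        if isinstance(outputs, str):
            outputs = [outputs]
        if all(os.path.exists(path) for path in outputs):
            continue  # all outputs present, nothing to do
        subprocess.check_call(spec, shell=True)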

Function templates

Specify template functions as Python functions.

Write a Python function and have it called as a target.
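
A rough sketch of the idea; the python_template decorator and its wiring are made up for illustration and are not part of the current gwf API:

# Hypothetical sketch: turn an ordinary Python function into a template,
# so calling it returns (options, thunk) instead of (options, shell spec).
def python_template(**grid_options):
    def decorator(func):
        def make_target(input, output, **kwargs):
            options = dict(grid_options, input=input, output=output)
            thunk = lambda: func(input=input, output=output, **kwargs)
            return options, thunk
        return make_target
    return decorator

@python_template(nodes=1, cores=1, memory="1g", walltime="00:10:00")
def count_lines(input, output):
    with open(input[0]) as src, open(output[0], 'w') as dst:
        dst.write("%d\n" % sum(1 for _ in src))

# target('CountLines') << count_lines(input=['data.txt'], output=['data.count'])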

No errors on missing files that cannot be generated

Files that are not generated by any target are identified as "system files", and we can test whether they are present. If they are missing, we should report an error somewhere; right now, we simply ignore it.

Poor testing

The testing is pretty ad hoc as well; I have just been testing while writing the code. I need some properly thought-out tests to make sure this doesn't bite me at some point...

Show H jobs in slurm

The Slurm status right now shows held jobs (status H in Torque) as just queued.

Add example workflows for documentation

The test workflows are not that useful for documentation examples.

I have better examples on our grid that actually do something useful, but they manipulate rather large files of genomic and sequencing data, so they are too large to be included in the distribution.

We should have some small but realistic examples with the distribution.

Synchronous targets?

Is it possible to make a target that runs synchronously? If you need to run a target that provides information that actually affects the workflow -- such as determining the number of targets later in the workflow based on the data being analysed -- you will want to run it when calling the workflow, but you probably don't want it to run on the front-end node.

It is easy enough to wrap a function so it is only run when we haven't saved its output, so it doesn't have to cost much to have a function that produces information that guides the workflow. I don't know if it is easy to submit a job and wait for its completion, though.

Pattern matching / template targets

It is not uncommon to have to run the same commands on a number of files. In gmake, at least, that can be done with patterns such as

%.o: %.c
	gcc ...

which matches all *.c files and creates a target that generates the corresponding *.o files.

We need something similar here, although I am not sure whether it is really a single feature or two: one for specifying a script with parameters and one for creating targets based on globs or regexps.
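
Since the workflow file is just Python, the glob half can already be approximated by hand. A minimal sketch in the style of the script below; the compile_c template is made up for illustration:

from gwf import *
import glob
import os

# Made-up template: compile one C file into the corresponding object file.
compile_c = template(input='{src}', output='{obj}',
                     nodes=1, cores=1, memory="1g", walltime="00:10:00") << '''
gcc -c {src} -o {obj}
'''

# One target per *.c file in the working directory.
for src in glob.glob('*.c'):
    obj = os.path.splitext(src)[0] + '.o'
    target(obj) << compile_c(src=src, obj=obj)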

Gwf resubmits running jobs when checking a large workflow

Running 20,000 jobs, gwf submits them as it should, but gwf -s shows hundreds of jobs as not submitted...

Probably the interface with squeue either times out or truncates the output(?)

Example script that will produce the error (20 + 20*10 + 20*10*100 = 20220 jobs):

from gwf import *

hello_world = template(input="",
                       output='{filename}',
                       nodes=1, cores=1, memory="4g", walltime="2:00:00") << '''
echo "Hello world" > {filename}
'''

def concat_files(inbams, outbam):
    options = {'nodes': 1, 'cores':1, 'memory': "2g", 'walltime':"24:00:00",
               'output': outbam, 'input': inbams}
    spec = """
cat {inbams} >{outbam}
    """.format(outbam=outbam, inbams=' '.join(inbams))
    return (options, spec)

jobname = "hwhw"
for i in range(1,21):
    outs2 = []
    for j in range(1,11):
        outs1 = []
        for k in range(1,101):
            outfile=jobname + "hw_" + str(i)+"_"+str(j)+"_"+str(k) 
            outs1.append(outfile)
            target(outfile)          << hello_world(filename = outfile)
        # Merge the 100 outputs from the inner loop
        outfile = jobname +"merge1_"+str(i)+"_"+str(j) 
        outs2.append(outfile)
        target(outfile)              << concat_files(inbams = outs1, outbam=outfile)
    # Merge the 10 outputs from the middle loop
    outfile = jobname+"merge2_"+str(i) 
    target(outfile)                  << concat_files(inbams = outs2, outbam=outfile)


Rethink the template specifications

The template function is a horrible way to specify templates, but it is slightly easier than using plain functions and keeping track of options and specs in them.

It should be possible to do something a bit smarter for specifying templates, e.g. a small language of some sort for specifying options and specs.

It would also be good to make sure that templates have doc-strings, to the extent that this can be guaranteed.

Redo grid options so they work with different backends

We need to change the way options are passed to the grid backend to ease the migration to supporting Slurm as well. Using PBS options won't work there, and we shouldn't have to specify the same thing as both PBS and SBATCH options.

For templates that use environment variables I am not completely sure how to deal with this, but renaming the backend variables is a possibility.
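
One possible direction, sketched rather than designed: keep backend-neutral option names in the templates and let each backend translate them into its own header lines. The flag lists below are purely illustrative:

# Illustrative sketch: translate backend-neutral options into
# scheduler-specific header lines so templates never mention PBS or SBATCH.
PBS_HEADER = [
    lambda o: '#PBS -l nodes=%(nodes)s:ppn=%(cores)s' % o,
    lambda o: '#PBS -l mem=%(memory)s' % o,
    lambda o: '#PBS -l walltime=%(walltime)s' % o,
]

SLURM_HEADER = [
    lambda o: '#SBATCH --nodes=%(nodes)s --cpus-per-task=%(cores)s' % o,
    lambda o: '#SBATCH --mem=%(memory)s' % o,
    lambda o: '#SBATCH --time=%(walltime)s' % o,
]

def script_header(options, backend_header):
    return '\n'.join(make_line(options) for make_line in backend_header)

options = {'nodes': 1, 'cores': 16, 'memory': '4g', 'walltime': '2:00:00'}
print(script_header(options, SLURM_HEADER))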

Workflow visualisation

It would be nice to have a script that displays the workflow and which jobs need to be run for each target.
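
A sketch of one way to do it: emit a Graphviz dot file from the dependency structure. Here dependencies is assumed to map each target name to the names it depends on, and needs_run is the set of targets whose jobs still need to be run:

def workflow_to_dot(dependencies, needs_run):
    """Render the workflow as a Graphviz dot graph, colouring targets
    that still need to be run."""
    lines = ['digraph workflow {']
    for name, deps in dependencies.items():
        colour = 'red' if name in needs_run else 'black'
        lines.append('  "%s" [color=%s];' % (name, colour))
        for dep in deps:
            lines.append('  "%s" -> "%s";' % (dep, name))
    lines.append('}')
    return '\n'.join(lines)

print(workflow_to_dot({'align': ['fetch'], 'fetch': []}, needs_run={'align'}))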

Include sub-workflows

This was implemented before the spec language was Python and should be implemented again.

Keep track of running jobs

When a target is already running, its output files may or may not exist yet, but in either case dependent jobs that are submitted should wait for the running job to complete.

memorize function

If a thunk takes a little while to run, we should remember its output from previous calls.

Ideally we should also be running the function on a compute node, but that is another issue; this is at least a step in the direction of solving it.
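
A minimal sketch of what such a helper could look like, caching results on disk between invocations of the workflow script. This is a from-scratch illustration, not gwf's actual @memorize:

import os
import pickle

def memorize(cache_file):
    """Cache a function's return value in cache_file, keyed by its
    arguments, so repeated workflow runs don't redo slow work."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            cache = {}
            if os.path.exists(cache_file):
                with open(cache_file, 'rb') as fh:
                    cache = pickle.load(fh)
            key = (args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = func(*args, **kwargs)
                with open(cache_file, 'wb') as fh:
                    pickle.dump(cache, fh)
            return cache[key]
        return wrapper
    return decorator

@memorize('chunk_count.cache')
def number_of_chunks(datafile):
    # Slow scan of the data that decides how many targets to create later.
    with open(datafile) as fh:
        return sum(1 for _ in fh) // 1000 + 1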

Need to change path to import template file

I followed the documentation and split my workflow into workflow.py and templates.py, and then put an import in workflow.py:

from templates import *

However, this doesn't work.

Traceback (most recent call last):
  File "/home/das/.local/bin/gwf", line 5, in <module>
    pkg_resources.run_script('gwf==0.6.1', 'gwf')
  File "/com/extra/python/2.7/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/com/extra/python/2.7/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 1239, in run_script
    execfile(script_filename, namespace, namespace)
  File "/home/das/.local/lib/python2.7/site-packages/gwf-0.6.1-py2.7.egg/EGG-INFO/scripts/gwf", line 57, in <module>
    execfile(args.workflow_file)
  File "workflow.py", line 5, in <module>
    from templates import *
ImportError: No module named templates

To fix this, I manually added the cwd to the path at the top of workflow.py:

import os, sys
sys.path.append(os.getcwd())

Is this a bug? If not, I think it should be added to the documentation.

Poor error handling

There's currently no proper error handling. If you are lucky, an assertion fails and the script is terminated. To make this useful, a lot more error checking and error handling is needed.

Document @memorize

There is a bit of documentation in the news posts on the home page, but documentation should be added to the main manual.

Add dummy targets

Have an option to make targets that are used for computing dependencies and submitting jobs, but are never actually submitted to the grid themselves.

Document @function_template

There is a bit of documentation in the news posts on the home page, but documentation should be added to the main manual.

Overriding cores argument doesn't make sense

Disclaimer: I don't really know if this is the intended behaviour.

I created a template for aligning sequences with clustalo:

clustalo = template(input=['{sequence_file}', '{hmm_file}'], output=['{alignment_file}', '{distmat_file}'], cores=16) << """
clustalo --in {sequence_file} --outfmt fa --wrap=80 --out {alignment_file} --hmm-in {hmm_file} --threads 16 --distmat-out {distmat_file}
"""

As you see, the template sets the cores argument to 16 and I have --threads 16 in the command. But the "user" of my template is able to override this argument.

target('AlignSequences') << clustalo(sequence_file='a_domains_all.fa',
                                     hmm_file='a_domain.hmm',
                                     alignment_file='a_domain_aligned.fa',
                                     distmat_file='a_domain_distmat.txt',
                                     cores=8)

This is misleading since clustalo still starts 16 threads (this may also hurt performance if the job is restricted to 8 cores by SLURM, but I'm not sure if this is the case).

cores could be added to the formatting of the command string, such that one can write:

clustalo = template(input=['{sequence_file}', '{hmm_file}'], output=['{alignment_file}', '{distmat_file}'], cores=16) << """
clustalo --in {sequence_file} --outfmt fa --wrap=80 --out {alignment_file} --hmm-in {hmm_file} --threads {cores} --distmat-out {distmat_file}
"""

Cheers,
Dan

Possibility to set queue

Would like to be able to specify the queue in a template:

allpath = template(..., queue="fat1", memory="380g") << ''' ... '''

target(..., queue="fat1",...)

Make the coupling between the dependency graph and the workflow tasks looser

The nodes in the graph are tightly coupled to the tasks in the workflow, but most, if not all, of the graph code doesn't really need to know what a node contains, so this coupling should be removed. The graph should only worry about the graph structure, and any information it needs from the workflow should be handled through some polymorphic interface.

MapReduce

Make a map-reduce-like framework to cover this typical design pattern.

Speed up gwf -s

When there are a lot of jobs running, getting their status is a bit slow. Can we optimise how we get the status so we can speed it up on some back-ends at least?
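
On the Slurm backend, one obvious candidate is asking squeue once for all of the user's jobs instead of once per job. A sketch, assuming the jobs run under the current user; this is not the code gwf currently runs:

import getpass
import subprocess

def slurm_job_states():
    """Return {job_id: state} for all of the current user's jobs,
    using a single squeue call."""
    out = subprocess.check_output(
        ['squeue', '-h', '-u', getpass.getuser(), '-o', '%i;%T'])
    states = {}
    for line in out.decode().splitlines():
        job_id, state = line.split(';')
        states[job_id] = state
    return states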

Better status output

When running "gwf -s", the jobs are sorted according to their status only. Would it be better to sort them topologically, so the jobs are grouped according to the order in which they will need to be run?
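
For reference, the grouping could be as simple as a depth-first topological ordering. A sketch, where dependencies is assumed to map each target name to the names it depends on:

def topological_order(dependencies):
    """Order targets so every target appears after the ones it depends on."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in dependencies.get(name, []):
            visit(dep)
        order.append(name)

    for name in dependencies:
        visit(name)
    return order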
