gwf's People

Contributors

birc-aeh, dansondergaard, gregoryleeman, hogfeldt, jakobjn, jbethune, kaspermunch, mailund, micknudsen, tobiasmadsen

gwf's Issues

Local execution

Add an option for running a workflow locally, i.e. without submitting to a computer grid. This can be useful for testing a workflow with smaller data sets before running it on the larger data sets where the full power of a compute grid is needed.
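
A rough sketch of what local mode amounts to, assuming each target is an (options, spec) pair as in the examples further down and that the list is already in dependency order; none of this is existing gwf code:

import os
import subprocess

def run_locally(targets):
    """targets: list of (options, spec) pairs, already in dependency order.
    Run each spec as a shell snippet if any of its output files are missing."""
    for options, spec in targets:
        outputs = options['output']
        if isinstance(outputs, str):
            outputs = [outputs]
        if all(os.path.exists(path) for path in outputs):
            continue  # all outputs present, nothing to do
        subprocess.check_call(spec, shell=True)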

Function templates

Specify template functions as Python functions.

Write a Python function and have it called as a target.
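
A rough sketch of the idea; the python_template decorator and its wiring are made up for illustration and are not part of the current gwf API:

# Hypothetical sketch: turn an ordinary Python function into a template,
# so calling it returns (options, thunk) instead of (options, shell spec).
def python_template(**grid_options):
    def decorator(func):
        def make_target(input, output, **kwargs):
            options = dict(grid_options, input=input, output=output)
            thunk = lambda: func(input=input, output=output, **kwargs)
            return options, thunk
        return make_target
    return decorator

@python_template(nodes=1, cores=1, memory="1g", walltime="00:10:00")
def count_lines(input, output):
    with open(input[0]) as src, open(output[0], 'w') as dst:
        dst.write("%d\n" % sum(1 for _ in src))

# target('CountLines') << count_lines(input=['data.txt'], output=['data.count'])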

No errors on missing files that cannot be generated

Files that are not generated by any target are identified as "system files", and we can test whether they are present. If they are missing, we should report an error somewhere; right now, we simply ignore it.

Poor testing

The testing is pretty ad hoc as well; I have just been testing while writing the code. I need some properly thought-out tests to make sure this doesn't bite me at some point...

Show H jobs in slurm

The Slurm status right now shows held jobs (status H in Torque) as just queued.

Add example workflows for documentation

The test workflows are not that useful for documentation examples.

I have better examples on our grid that actually do something useful, but they manipulate rather large files of genomic and sequencing data, so they are too large to be included in the distribution.

We should have some small but realistic examples with the distribution.

Synchronous targets?

Is it possible to make a target that runs synchronously? If you need to run a target that provides information that actually affects the workflow -- such as determining the number of targets later in the workflow based on the data being analysed -- you will want to run it when calling the workflow, but you probably don't want it to run on the front-end node.

It is easy enough to wrap a function so it is only run when we haven't saved its output, so it doesn't have to cost much to have a function that produces information that guides the workflow. I don't know if it is easy to submit a job and wait for its completion, though.

Pattern matching / template targets

It is not uncommon to have to run the same commands on a number of files. In gmake, at least, that can be done with patterns such as

%.o: %.c
	gcc ...

which matches all *.c files and creates a target that generates the corresponding *.o files.

We need something similar here, although I am not sure whether it is really a single feature or two: one for specifying a script with parameters and one for creating targets based on globs or regexps.
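
Since the workflow file is just Python, the glob half can already be approximated by hand. A minimal sketch in the style of the script below; the compile_c template is made up for illustration:

from gwf import *
import glob
import os

# Made-up template: compile one C file into the corresponding object file.
compile_c = template(input='{src}', output='{obj}',
                     nodes=1, cores=1, memory="1g", walltime="00:10:00") << '''
gcc -c {src} -o {obj}
'''

# One target per *.c file in the working directory.
for src in glob.glob('*.c'):
    obj = os.path.splitext(src)[0] + '.o'
    target(obj) << compile_c(src=src, obj=obj)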

Gwf resubmits running jobs when checking a large workflow

Running 20,000 jobs, gwf submits them as it should, but gwf -s shows hundreds of jobs as not submitted...

Probably the interface with squeue either times out or truncates the output(?)

Example script that will produce the error (20 + 20*10 + 20*10*100 = 20220 jobs):

from gwf import *

hello_world = template(input="",
                       output='{filename}',
                       nodes=1, cores=1, memory="4g", walltime="2:00:00") << '''
echo "Hello world" > {filename}
'''

def concat_files(inbams, outbam):
    options = {'nodes': 1, 'cores':1, 'memory': "2g", 'walltime':"24:00:00",
               'output': outbam, 'input': inbams}
    spec = """
cat {inbams} >{outbam}
    """.format(outbam=outbam, inbams=' '.join(inbams))
    return (options, spec)

jobname = "hwhw"
for i in range(1,21):
    outs2 = []
    for j in range(1,11):
        outs1 = []
        for k in range(1,101):
            outfile=jobname + "hw_" + str(i)+"_"+str(j)+"_"+str(k) 
            outs1.append(outfile)
            target(outfile)          << hello_world(filename = outfile)
        # Merge the 100 outputs from the inner loop
        outfile = jobname +"merge1_"+str(i)+"_"+str(j) 
        outs2.append(outfile)
        target(outfile)              << concat_files(inbams = outs1, outbam=outfile)
    # Merge the 10 outputs from the middle loop
    outfile = jobname+"merge2_"+str(i) 
    target(outfile)                  << concat_files(inbams = outs2, outbam=outfile)


Rethink the template specifications

The template function is a horrible way to specify templates, but it is slightly easier than using plain functions and keeping track of options and specs in them.

It should be possible to do something a bit smarter for specifying templates, e.g. a small language of some sort for specifying options and specs.

It would also be good to make sure that templates have doc-strings, to the extent that this can be guaranteed.

Redo grid options so they work with different backends

We need to change the way options are passed to the grid backend to ease the migration to supporting Slurm as well. Using PBS options won't work there, and we shouldn't have to specify the same thing as both PBS and SBATCH options.

For templates that use environment variables I am not completely sure how to deal with this, but renaming the backend variables is a possibility.
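
One possible direction, sketched rather than designed: keep backend-neutral option names in the templates and let each backend translate them into its own header lines. The flag lists below are purely illustrative:

# Illustrative sketch: translate backend-neutral options into
# scheduler-specific header lines so templates never mention PBS or SBATCH.
PBS_HEADER = [
    lambda o: '#PBS -l nodes=%(nodes)s:ppn=%(cores)s' % o,
    lambda o: '#PBS -l mem=%(memory)s' % o,
    lambda o: '#PBS -l walltime=%(walltime)s' % o,
]

SLURM_HEADER = [
    lambda o: '#SBATCH --nodes=%(nodes)s --cpus-per-task=%(cores)s' % o,
    lambda o: '#SBATCH --mem=%(memory)s' % o,
    lambda o: '#SBATCH --time=%(walltime)s' % o,
]

def script_header(options, backend_header):
    return '\n'.join(make_line(options) for make_line in backend_header)

options = {'nodes': 1, 'cores': 16, 'memory': '4g', 'walltime': '2:00:00'}
print(script_header(options, SLURM_HEADER))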

Workflow visualisation

It would be nice to have a script that displays the workflow and which jobs need to be run for each target.
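
A sketch of one way to do it: emit a Graphviz dot file from the dependency structure. Here dependencies is assumed to map each target name to the names it depends on, and needs_run is the set of targets whose jobs still need to be run:

def workflow_to_dot(dependencies, needs_run):
    """Render the workflow as a Graphviz dot graph, colouring targets
    that still need to be run."""
    lines = ['digraph workflow {']
    for name, deps in dependencies.items():
        colour = 'red' if name in needs_run else 'black'
        lines.append('  "%s" [color=%s];' % (name, colour))
        for dep in deps:
            lines.append('  "%s" -> "%s";' % (dep, name))
    lines.append('}')
    return '\n'.join(lines)

print(workflow_to_dot({'align': ['fetch'], 'fetch': []}, needs_run={'align'}))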

Include sub-workflows

This was implemented before the spec language was Python and should be implemented again.

Keep track of running jobs

When a target is already running, its output files may or may not exist yet, but in either case dependent jobs that are submitted should wait for the running job to complete.

memorize function

If a thunk takes a little while to run, we should remember its output from previous calls.

Ideally we should also be running the function on a compute node, but that is another issue; this is at least a step in the direction of solving it.
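
A minimal sketch of what such a helper could look like, caching results on disk between invocations of the workflow script. This is a from-scratch illustration, not gwf's actual @memorize:

import os
import pickle

def memorize(cache_file):
    """Cache a function's return value in cache_file, keyed by its
    arguments, so repeated workflow runs don't redo slow work."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            cache = {}
            if os.path.exists(cache_file):
                with open(cache_file, 'rb') as fh:
                    cache = pickle.load(fh)
            key = (args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = func(*args, **kwargs)
                with open(cache_file, 'wb') as fh:
                    pickle.dump(cache, fh)
            return cache[key]
        return wrapper
    return decorator

@memorize('chunk_count.cache')
def number_of_chunks(datafile):
    # Slow scan of the data that decides how many targets to create later.
    with open(datafile) as fh:
        return sum(1 for _ in fh) // 1000 + 1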

Need to change path to import template file

I followed the documentation and split my workflow into workflow.py and templates.py, and then put an import in workflow.py:

from templates import *

However, this doesn't work.

Traceback (most recent call last):
  File "/home/das/.local/bin/gwf", line 5, in <module>
    pkg_resources.run_script('gwf==0.6.1', 'gwf')
  File "/com/extra/python/2.7/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/com/extra/python/2.7/lib/python2.7/site-packages/distribute-0.6.28-py2.7.egg/pkg_resources.py", line 1239, in run_script
    execfile(script_filename, namespace, namespace)
  File "/home/das/.local/lib/python2.7/site-packages/gwf-0.6.1-py2.7.egg/EGG-INFO/scripts/gwf", line 57, in <module>
    execfile(args.workflow_file)
  File "workflow.py", line 5, in <module>
    from templates import *
ImportError: No module named templates

To fix this, I manually added the cwd to the path at the top of workflow.py:

import os, sys
sys.path.append(os.getcwd())

Is this a bug? If not, I think it should be added to the documentation.

Poor error handling

There's currently no proper error handling. If you are lucky, an assertion fails and the script is terminated. To make this useful, a lot more error checking and error handling is needed.

Document @memorize

There is a bit of documentation in the news posts on the home page, but documentation should be added to the main manual.

Add dummy targets

Have an option to make targets that are used for computing dependencies and submitting jobs, but are never actually submitted to the grid themselves.

Document @function_template

There is a bit of documentation in the news posts on the home page, but documentation should be added to the main manual.

Overriding cores argument doesn't make sense

Disclaimer: I don't really know if this is the intended behaviour.

I created a template for aligning sequences with clustalo:

clustalo = template(input=['{sequence_file}', '{hmm_file}'], output=['{alignment_file}', '{distmat_file}'], cores=16) << """
clustalo --in {sequence_file} --outfmt fa --wrap=80 --out {alignment_file} --hmm-in {hmm_file} --threads 16 --distmat-out {distmat_file}
"""

As you see, the template sets the cores argument to 16 and I have --threads 16 in the command. But the "user" of my template is able to override this argument.

target('AlignSequences') << clustalo(sequence_file='a_domains_all.fa',
                                     hmm_file='a_domain.hmm',
                                     alignment_file='a_domain_aligned.fa',
                                     distmat_file='a_domain_distmat.txt',
                                     cores=8)

This is misleading since clustalo still starts 16 threads (this may also hurt performance if the job is restricted to 8 cores by SLURM, but I'm not sure if this is the case).

cores could be added to the formatting of the command string, such that one can write:

clustalo = template(input=['{sequence_file}', '{hmm_file}'], output=['{alignment_file}', '{distmat_file}'], cores=16) << """
clustalo --in {sequence_file} --outfmt fa --wrap=80 --out {alignment_file} --hmm-in {hmm_file} --threads {cores} --distmat-out {distmat_file}
"""

Cheers,
Dan

Possibility to set queue

Would like to be able to specify the queue in a template:

allpath = template(..., queue="fat1", memory="380g") << ''' ... '''

target(..., queue="fat1",...)

Make the coupling between the dependency graph and the workflow tasks looser

The nodes in the graph are tightly coupled to the tasks in the workflow, but most, if not all, of the graph code doesn't really need to know what a node contains, so this coupling should be removed. The graph should only worry about the graph structure, and any information it needs from the workflow should be handled through some polymorphic interface.

MapReduce

Make a map-reduce-like framework to cover this typical design pattern.

Speed up gwf -s

When there are a lot of jobs running, getting their status is a bit slow. Can we optimise how we get the status so we can speed it up on some back-ends at least?
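
On the Slurm backend, one obvious candidate is asking squeue once for all of the user's jobs instead of once per job. A sketch, assuming the jobs run under the current user; this is not the code gwf currently runs:

import getpass
import subprocess

def slurm_job_states():
    """Return {job_id: state} for all of the current user's jobs,
    using a single squeue call."""
    out = subprocess.check_output(
        ['squeue', '-h', '-u', getpass.getuser(), '-o', '%i;%T'])
    states = {}
    for line in out.decode().splitlines():
        job_id, state = line.split(';')
        states[job_id] = state
    return states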

Better status output

When running "gwf -s", the jobs are sorted according to their status only. Would it be better to sort them topologically, so the jobs are grouped according to the order in which they will need to be run?
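
For reference, the grouping could be as simple as a depth-first topological ordering. A sketch, where dependencies is assumed to map each target name to the names it depends on:

def topological_order(dependencies):
    """Order targets so every target appears after the ones it depends on."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in dependencies.get(name, []):
            visit(dep)
        order.append(name)

    for name in dependencies:
        visit(name)
    return order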
