freesurfer-osg-workflow's People

Contributors

rynge, vascosa

freesurfer-osg-workflow's Issues

Option to not tar output directory

When the job finishes, I get an output file like subject_output.tar.gz inside my workdir. Is there a way to make the workflow not tar up all the files and instead leave them expanded under the "subject" directory?

I think I just need to adjust scripts/autorecon3.sh, but I am not sure how best to handle this.
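One hedged approach, assuming the tar-up is the final step of scripts/autorecon3.sh: guard it behind a variable. TAR_OUTPUT and maybe_tar_output below are hypothetical names, not existing workflow options.

```shell
# Hypothetical sketch: make the final tar step in scripts/autorecon3.sh optional.
# TAR_OUTPUT is an assumed variable name, not an existing workflow option.
maybe_tar_output() {
    subject="$1"
    if [ "${TAR_OUTPUT:-true}" = "true" ]; then
        tar czf "${subject}_output.tar.gz" "$subject"
    fi
    # with TAR_OUTPUT=false, the expanded "$subject" directory is left in place
}
```

Note that Pegasus only stages back the files declared as workflow outputs, so workflow-generator.py would also need to register the expanded files (or a directory) in place of the single tarball.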

DAGMan freesurfer-0.dag finished with exit code 1

I just ran my first freesurfer-osg-workflow on osgconnect, but I can't seem to find the generated output. I believe the job failed, but I can't tell which log to check to see what went wrong.

On osgconnect, here is the path for the workflow I've submitted.

/local-scratch/hayashis/workflows/5e03c800cc6723b534bd2d7b/5e03c800cc67234a3ebd2d7e

Could someone help me troubleshoot the problem?
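In general, pegasus-analyzer run against the submit directory summarizes which jobs failed and points at their .out/.err files, and DAGMan's own log is freesurfer-0.dag.dagman.out inside the workflow directory. As a hedged helper for scanning that log (the exact "Node ... failed" line format is an assumption about DAGMan's output):

```shell
# Hedged sketch: list DAG nodes that failed, by grepping DAGMan's log.
# The "Node <name> job proc (...) failed" line format is an assumption.
find_failed_nodes() {
    grep -o 'Node [A-Za-z0-9_]* job proc ([0-9.]*) failed' "$1" \
        | awk '{print $2}' | sort -u
}
# typical use on the submit host:
#   pegasus-analyzer work
#   find_failed_nodes freesurfer-0.dag.dagman.out
```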

Add capability to specify freesurfer version

I believe the current version of freesurfer-osg-workflow runs freesurfer 6.0.1(?). Is there any way to specify the version of freesurfer? We run 6.0.0 and 6.0.1, and sometimes the dev (nightly build) version. Also, the freesurfer 7.0 RC is out now, so I might want to start testing with it as well.
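There is no version flag today, but the images are addressed by tag under /cvmfs, so a version option could plausibly just parameterize the container path. A minimal sketch; FREESURFER_VERSION is a hypothetical variable, and which tags actually exist under osgvo-freesurfer should be verified with ls.

```shell
# Hypothetical sketch: select the container image by version tag.
FREESURFER_VERSION="${FREESURFER_VERSION:-6.0.1}"
IMAGE="/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-freesurfer:${FREESURFER_VERSION}"
echo "$IMAGE"
```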

Integrity check: output-t1.nii.gz: Expected checksum does not match the calculated checksum

I've run another test job and it failed with this error message.

##################### Checking file integrity for input files #####################
Integrity check: output-t1.nii.gz: Expected checksum (315e3a7365017a2dfd9472e629a53d48b2777767b3a9ec7a830e9bb609a45685) does not match the calculated checksum (1819fdd61883a2da3e5c249649ef91034df97ce6891a66fb5a6676797581386e) (timing: 0.446)

I don't see output-t1.nii.gz delivered to the submit host (osgconnect), though. Here is a bit more log output:

FAILED ~5 hours ago
 Submit Directory   : work
 Total jobs         :     14 (100.00%)
 # jobs succeeded   :      4 (28.57%)
 # jobs failed      :      1 (7.14%)
 # jobs held        :      1 (7.14%)
 # jobs unsubmitted :      9 (64.29%)

*******************************Held jobs' details*******************************

===========================autorecon1_sh_output_00001===========================

submit file            : autorecon1_sh_output_00001.sub
last_job_instance_id   : 7
reason                 :  Error from slot1_1@[email protected]: STARTER at 192.168.3.174 failed to send file(s) to <192.170.227.166:9618>: error reading from /sge-batch/3958692.1.lnxfarm/glide_LzUphU/execute/dir_31678/output_recon1_output.tar.xz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.12.238.129:35960>

******************************Failed jobs' details******************************

===========================autorecon1_sh_output_00001===========================

 last state: POST_SCRIPT_FAILED
       site: condorpool
submit file: 00/00/autorecon1_sh_output_00001.sub
output file: 00/00/autorecon1_sh_output_00001.out.002
 error file: 00/00/autorecon1_sh_output_00001.err.002

-------------------------------Task #1 - Summary--------------------------------

site        : condorpool
hostname    : -
executable  : /public/hayashis/workdir/5ee37996529ab4fcd686772f/5ef206c9bf709388263a46aa/work/00/00/autorecon1_sh_output_00001.sh
arguments   : -
exitcode    : -1
working dir : /public/hayashis/workdir/5ee37996529ab4fcd686772f/5ef206c9bf709388263a46aa/work

-----------Job stderr file - 00/00/autorecon1_sh_output_00001.err.002-----------

/bin/singularity exec  --bind /cvmfs --bind /hadoop --bind /mnt/hadoop --contain --bind /batch/lnxfarm274/3958692.1.lnxfarm/glide_LzUphU/execute/dir_31678:/srv --no-home --ipc --pid /cvmfs/singularity.opensciencegrid.org/.images/d2/9c76edc34defb5998d5f6c0fa9704c55485d03141e6e3b112076ee777c9bea /srv/.osgvo-user-job-wrapper.sh /srv/condor_exec.exe
2020-06-23 15:05:28: PegasusLite: version 4.9.3dev
2020-06-23 15:05:29: Executing on host lnxfarm274.colorado.edu OSG_SITE_NAME=UColorado_HEP GLIDEIN_Site=Colorado GLIDEIN_ResourceName=UColorado_HEP

########################[Pegasus Lite] Setting up workdir ########################
2020-06-23 15:05:29: Not creating a new work directory as it is already set to /srv

##############[Pegasus Lite] Figuring out the worker package to use ##############
2020-06-23 15:05:29: The job contained a Pegasus worker package

##################### Setting the xbit for executables staged #####################

##################### Checking file integrity for input files #####################
Integrity check: output-t1.nii.gz: Expected checksum (315e3a7365017a2dfd9472e629a53d48b2777767b3a9ec7a830e9bb609a45685) does not match the calculated checksum (1819fdd61883a2da3e5c249649ef91034df97ce6891a66fb5a6676797581386e) (timing: 0.446)

2020-06-23 15:05:30: Last command exited with 1
PegasusLite: exitcode 1
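The mismatch means the file staged to the worker hashed differently from the checksum Pegasus recorded at planning time, which usually indicates the input changed after the workflow was planned. One way to see which side is off is to hash the local copy and compare it with the two values in the log:

```shell
# Compare the local input's hash against the "expected" and "calculated"
# checksums in the error message above.
if [ -f output-t1.nii.gz ]; then
    sha256sum output-t1.nii.gz
fi
```

If the printed hash matches the "calculated" value, the file changed after planning; re-running pegasus-plan should refresh the recorded checksum.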

workflow-generator.py: error: argument --inputs-def is required

Hi,

I am trying to run the test script from the /freesurfer-osg-workflow directory using the following command:

./submit.sh --input-def example-run.yml

however I got the following error message:

workflow-generator.py: error: argument --inputs-def is required

Do you have any idea what is going on?

Thanks

pegasus-status shows Failure shortly after job start

If I submit a workflow (with pegasus-plan --submit) and immediately run pegasus-status, it reports Failure.

STAT  IN_STATE  JOB                                                        
Idle     00:16  freesurfer-0 ( /home/hayashis/git/app-freesurfer-osg/work )
Summary: 1 Condor job total (I:1)

STATE  
Failure
Summary: 1 DAG total (Failure:1)

If I sleep for about 20 seconds and then run pegasus-status, it reports a Running status.

STAT  IN_STATE  JOB                                                        
Run      00:20  freesurfer-0 ( /home/hayashis/git/app-freesurfer-osg/work )
Idle     00:19   ┗━create_dir_freesurfer_0_local                           
Summary: 2 Condor jobs total (I:1 R:1)

STATE  
Running
Summary: 1 DAG total (Running:1)

I've added a sleep 20 to my startup script for now, but it would be nice if pegasus-status reported something other than "Failure" right after I submit the job.
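The likely race is that DAGMan has not started reporting yet right after submission. As a submit-side workaround, here is a hedged sketch (wait_for_dag is a hypothetical helper, not part of Pegasus) that polls with a bounded retry instead of a fixed sleep 20:

```shell
# Hypothetical helper: poll pegasus-status until the DAG reports something
# other than the transient startup "Failure", up to a bounded number of tries.
wait_for_dag() {
    workdir="$1"; tries="${2:-12}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if pegasus-status "$workdir" 2>/dev/null | grep -Eq 'Running|Success'; then
            return 0
        fi
        i=$((i + 1))
        sleep 5
    done
    return 1
}
```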

Translating pegasus status

I am updating our freesurfer App to use the new pegasus workflow version.

For it to function properly on our system, I will need to query and translate the pegasus status output to one of the following exit codes.

#return code 0 = running
#return code 1 = finished successfully
#return code 2 = failed
#return code 3 = unknown status

I am currently working on the following script

https://github.com/brainlife/app-freesurfer-osg/blob/master/status.sh

My questions are:

  1. What are the possible statuses that pegasus could generate?
  2. Instead of grep/tail-ing the stdout from pegasus-status (brittle) like I am doing, is there a way to make pegasus-status output JSON/XML or any other machine-readable format?
  3. When I run pegasus-remove, the state becomes "Failure". Is this by design? Could the status become "removed" or "stopped" instead?

Thanks!
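As a hedged sketch of the translation in the commented list above (the state names are an assumption based on the output visible in these issues; anything unrecognized falls through to "unknown"):

```shell
# Map a Pegasus DAG state string to the exit codes listed above.
# The set of possible state names is an assumption, not an exhaustive list.
map_state() {
    case "$1" in
        Running) echo 0 ;;  # running
        Success) echo 1 ;;  # finished successfully
        Failure) echo 2 ;;  # failed
        *)       echo 3 ;;  # unknown status
    esac
}
# pulling the state out of the "Summary: 1 DAG total (STATE:1)" line:
#   state=$(pegasus-status work | sed -n 's/.*DAG total (\([A-Za-z]*\):.*/\1/p')
```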

STARTER at 192.168.4.2 failed to send file(s) to <192.170.227.166:9618>

I was able to run the test job successfully, and obtained what seems to be a valid freesurfer output.

However, I ran another test job using the same t1 input, and this time it failed with this error message.

$ pegasus-analyzer work

************************************Summary*************************************

 Submit Directory   : work
 Total jobs         :     14 (100.00%)
 # jobs succeeded   :      4 (28.57%)
 # jobs failed      :      1 (7.14%)
 # jobs held        :      1 (7.14%)
 # jobs unsubmitted :      9 (64.29%)

*******************************Held jobs' details*******************************

==========================autorecon1_sh_subject_00001===========================

submit file            : autorecon1_sh_subject_00001.sub
last_job_instance_id   : 7
reason                 :  Error from slot1_6@[email protected]: STARTER at 192.168.4.2 failed to send file(s) to <192.170.227.166:9618>: error reading from /var/lib/condor/execute/dir_1807/subject_recon1_output.tar.xz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.170.236.165:60541>

******************************Failed jobs' details******************************

==========================autorecon1_sh_subject_00001===========================

 last state: POST_SCRIPT_FAILED
       site: condorpool
submit file: 00/00/autorecon1_sh_subject_00001.sub
output file: 00/00/autorecon1_sh_subject_00001.out.002
 error file: 00/00/autorecon1_sh_subject_00001.err.002

-------------------------------Task #1 - Summary--------------------------------

site        : condorpool
hostname    : condor-worker-7c7d97844f-ht4ml
executable  : /srv/autorecon1_sh
arguments   :   subject   subject-t1.nii.gz   4   -notal-check   -cw256  
exitcode    : 1
working dir : /srv

----------------Task #1 - autorecon1.sh - subject_00001 - stdout----------------

Will use SUBJECTS_DIR=/srv/tmp.1PLxkG0bOW
Subject Stamp: freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1-f53a55a
Current Stamp: freesurfer-Linux-centos6_x86_64-stable-pub-v6.0.1-f53a55a
INFO: SUBJECTS_DIR is /srv/tmp.1PLxkG0bOW
Actual FREESURFER_HOME /opt/freesurfer-6.0.1
Linux condor-worker-7c7d97844f-ht4ml 5.3.2-1.el7.elrepo.x86_64 #1 SMP Tue Oct 1 08:18:21 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
'/opt/freesurfer-6.0.1/bin/recon-all' -> '/srv/tmp.1PLxkG0bOW/subject/scripts/recon-all.local-copy'
-cw256 option is now persistent (remove with -clean-cw256)
/srv/tmp.1PLxkG0bOW/subject

 mri_convert /srv/subject-t1.nii.gz /srv/tmp.1PLxkG0bOW/subject/mri/orig/001.mgz 

mri_convert.bin /srv/subject-t1.nii.gz /srv/tmp.1PLxkG0bOW/subject/mri/orig/001.mgz 
$Id: mri_convert.c,v 1.226 2016/02/26 16:15:24 mreuter Exp $
reading from /srv/subject-t1.nii.gz...
TR=6.40, TE=0.00, TI=0.00, flip angle=0.00
i_ras = (1, 0, 0)
j_ras = (0, 1, 0)
k_ras = (0, 0, 1)
writing to /srv/tmp.1PLxkG0bOW/subject/mri/orig/001.mgz...
#--------------------------------------------

How should I handle this error?

Stream freesurfer log back to the submit host

Would it be possible to stream the output from the recon-all command back to the submit host while the job is being executed? I can live without this, but I think users will want to see how the job is progressing.

I am particularly interested in the following markers in the stdout


#--------------------------------------------
#@# ASeg Stats Mon Apr 20 21:52:51 UTC 2020

...

#-----------------------------------------
#@# WMParc Mon Apr 20 21:53:29 UTC 2020

...

#--------------------------------------------
#@# BA_exvivo Labels lh Mon Apr 20 21:59:04 UTC 2020

...

#--------------------------------------------
#@# BA_exvivo Labels rh Mon Apr 20 22:01:08 UTC 2020

These markers indicate where recon-all is in the overall processing.

If something similar could be written to the log, I could then relay it to the users. Would this be difficult to do?
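Even without true streaming, the markers are easy to extract once a log is reachable (for example via condor_tail on the running job, or a periodically transferred copy of recon-all.log). A minimal sketch, assuming the log file is available locally:

```shell
# Extract recon-all's "#@# <stage> <timestamp>" progress markers from a log.
progress_markers() {
    grep '^#@#' "$1"
}
```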

How to specify path for t2?

I am trying to figure out how to set the -hippocampal-subfields-T1T2 option via autorecon-options in run.yml. This option requires a path to the T2 file if specified.

Currently, I have the following script to customize the command line option based on user input.

cmd="-i $t1 -subject output -all -parallel -openmp $OMP_NUM_THREADS"
if [ -f "$t2" ]; then
    cmd="$cmd -T2 $t2 -T2pial"
    #https://surfer.nmr.mgh.harvard.edu/fswiki/HippocampalSubfields
    #https://surfer.nmr.mgh.harvard.edu/fswiki/HippocampalSubfieldsAndNucleiOfAmygdala
    if [ "$hippocampal" = "true" ]; then
        cmd="$cmd -hippocampal-subfields-T1T2 $t2 t1t2"
    fi
else
    if [ "$hippocampal" = "true" ]; then
        cmd="$cmd -hippocampal-subfields-T1"
    fi
fi

Since I can't just use a local path for the T2 in autorecon-options (right?), I am not sure how to go about specifying this option. Is it possible to set it?

Unable to submit the workflow using pegasus-run

I just tried running this again, and now I am seeing a different error message.

submitting with this config
+ cat run.yml
output:
    input: ../5ed1a215529ab4221883209e/5e6a9956874067bc9ea3d445/t1.nii.gz
+ ./workflow-generator.py --inputs-def run.yml
{'output': {'input': '../5ed1a215529ab4221883209e/5e6a9956874067bc9ea3d445/t1.nii.gz'}}
+ export PYTHONPATH=:/usr/lib/python2.6/site-packages
+ PYTHONPATH=:/usr/lib/python2.6/site-packages
+ pegasus-plan --conf pegasus.conf --dir /public/hayashis/workdir/5ed1a215529ab491cf83209d/5ed1a215529ab4764b8320a0 --relative-dir work --sites condorpool --output-site local --dax freesurfer-osg.xml --cluster horizontal --submit
2020.05.30 00:00:40.676 GMT:    
2020.05.30 00:00:40.682 GMT:   ----------------------------------------------------------------------- 
2020.05.30 00:00:40.688 GMT:   File for submitting this DAG to HTCondor           : freesurfer-0.dag.condor.sub 
2020.05.30 00:00:40.694 GMT:   Log of DAGMan debugging messages                 : freesurfer-0.dag.dagman.out 
2020.05.30 00:00:40.701 GMT:   Log of HTCondor library output                     : freesurfer-0.dag.lib.out 
2020.05.30 00:00:40.707 GMT:   Log of HTCondor library error messages             : freesurfer-0.dag.lib.err 
2020.05.30 00:00:40.714 GMT:   Log of the life of condor_dagman itself          : freesurfer-0.dag.dagman.log 
2020.05.30 00:00:40.720 GMT:    
2020.05.30 00:00:40.727 GMT:   -no_submit given, not submitting DAG to HTCondor.  You can do this with: 
2020.05.30 00:00:40.738 GMT:   ----------------------------------------------------------------------- 
2020.05.30 00:00:42.582 GMT:   Your database is compatible with Pegasus version: 4.9.3dev 
2020.05.30 00:00:42.691 GMT:   Submitting to condor freesurfer-0.dag.condor.sub 
2020.05.30 00:01:03.444 GMT:   Submitting job(s) 
2020.05.30 00:01:03.451 GMT:   ERROR: store_cred failed! 
2020.05.30 00:01:03.509 GMT: [ERROR]  ERROR: Running condor_submit /usr/local/bin/condor_submit freesurfer-0.dag.condor.sub failed with exit code 1 at /usr/bin/pegasus-run line 327. 
2020.05.30 00:01:03.515 GMT: [FATAL ERROR]  
 [1] java.lang.RuntimeException: Unable to submit the workflow using pegasus-run at edu.isi.pegasus.planner.client.CPlanner.executeCommand(CPlanner.java:697) 

Freesurfer 7.1.0

I'd like to request that a freesurfer 7.1.0 container be made available:

ls: cannot access /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-freesurfer:7.1.0/: No such file or directory
Error: unable to access /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-freesurfer:7.1.0

Also, freesurfer 7.0.0 should be removed, as it was recalled by the developers:

https://surfer.nmr.mgh.harvard.edu/fswiki/ReleaseNotes
