Comments (9)

rimorob avatar rimorob commented on June 18, 2024

I do have a clarifying question: are the packages and the IP address supposed to refer to the compute node the job is running on? In other words, could this mean that the compute node, rather than the master AMI, is lacking batchtools?

from future.batchtools.

rimorob avatar rimorob commented on June 18, 2024

Update: I checked. batchtools was missing and is now installed, but that didn't fix the problem. However, the listed IP doesn't match either the master node's public or private address or the compute node's public or private address. I don't understand what's going on here.

HenrikBengtsson avatar HenrikBengtsson commented on June 18, 2024

Hi. Some quick comments:

  1. I've updated your top comment to surround your code blocks with triple backticks (```) above and below. Use that for code blocks; it's easier to read. The single backtick (`) is only for in-line code, e.g. `some code here` but not here. See https://guides.github.com/features/mastering-markdown/

  2. When you troubleshoot, especially when you're just getting started, rule out as many packages as possible. In your case, you don't need to involve foreach and doFuture. Instead, use a bare-bones future. Use the first example on https://github.com/HenrikBengtsson/future.batchtools. I don't think the problems are with foreach + doFuture, but it's always easier if you simplify your example as far as possible.

  3. Yes, you need to have future.batchtools and friends installed on compute nodes too.

  4. I don't understand what you mean by IP numbers.

  5. If this is the first time you ever used future.batchtools and batchtools, it might be easier if you start by getting a simple batchtools batchMap() example going. This way you don't have to worry at all about future and future.batchtools. (https://cran.r-project.org/package=batchtools)
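
For point 3, one quick way to see which packages a given machine is missing might look like the sketch below. `missing_pkgs` is a hypothetical helper name, not part of future or batchtools, and the package list is only what has been mentioned in this thread.

```r
# Hypothetical helper (not part of future or batchtools): returns the
# packages from `pkgs` whose namespace cannot be loaded on the machine
# running this code.
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE,
               USE.NAMES = FALSE)
  pkgs[!ok]
}

# Packages discussed in this thread; adjust the list as needed.
needed <- c("future", "future.batchtools", "batchtools")
print(missing_pkgs(needed))
```

Running the same script on the master and on a compute node and comparing the output should show whether the two installations differ. An empty result means every listed package could be loaded.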

rimorob avatar rimorob commented on June 18, 2024

Sorry, to clarify: the inside of the foreach loop intentionally didn't perform any computation, just a system call. This was meant to be a test of the piping. This is why I was trying to understand whether and why any of the batch* packages were even necessary on the nodes just to run a 'hostname' call. I suppose let's start at the beginning. I don't understand the traceback. Where in the process is it happening? At what stage? I will look at batchMap in parallel as well, but given that the inside of the foreach loop is completely gutted, I doubt that this should be hard to troubleshoot even in its current form. I agree that there's a lot to the future* set of packages for a new user, even one with decades of HPC experience, as in my case.

rimorob avatar rimorob commented on June 18, 2024

Oops, closed the issue by accident. Thank you for the comments. I also have relatively little GitHub experience, especially with the web UI.

rimorob avatar rimorob commented on June 18, 2024

Regarding IP numbers: the traceback refers to a machine by IP. I don't see any such machine in my cluster, either by private or by public IP. That's what I was referring to. It's possible that the default SGE cluster configuration is somehow mangled, though the SGE compute node duly spins up, so it's unclear whether this is a real issue at all.

rimorob avatar rimorob commented on June 18, 2024

Ok, this simpler use case - using only the batchtools package - also doesn't work, probably because I can't find adequate documentation with a single example of usage anywhere at all. The jobs crash because the job folder in the tmp directory doesn't seem to get created prior to submission. First the code then the error message:

library(batchtools)

reg = makeRegistry(file.dir = NA, seed = 42)

reg$cluster.functions = makeClusterFunctionsSGE(
    template="sge.tmpl")

piApprox = function(n) {
    nums = matrix(runif(2 * n), ncol = 2)
    d = sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

piApprox(1000)

ids = batchMap(fun = piApprox, n = rep(1e3, 3))
names(getJobTable())
submitJobs(resources = list(walltime = 60, memory = 1024, ncpus = 3, chunks.as.array.jobs = TRUE))

and the error message from a job that ends up hanging with an Eqw status:

ubuntu@ip-172-31-97-248:~$ qstat -j 39 |grep error
error reason          1:      04/02/2020 21:40:28 [1000:6762]: error: can't open output file "/tmp/RtmpymOXuF/registry271e3e24ad73/logs/jobd266f15145f80da477cb855810ee5790.log": No such file or directory

Everything up to "logs" exists.

HenrikBengtsson avatar HenrikBengtsson commented on June 18, 2024

When you use:

> reg <- makeRegistry(file.dir = NA, seed = 42)

your registry ends up in your local temp folder:

> reg
Job Registry
  Backend  : Interactive
  File dir : /tmp/alice/RtmpkEU3OV/registry6f7933696c60
  Work dir : /home/alice/projects/SegalM_2017-FISH/article
  Jobs     : 0
  Seed     : 42
  Writeable: TRUE
> 

The compute nodes have their own, independent local /tmp/, so a registry created in /tmp/ on the submitting machine is not visible to them. Avoid this by not setting file.dir = NA, e.g.

reg <- makeRegistry(seed = 42)

With the default file.dir, the registry is created under your working directory, so make sure that directory is accessible by all machines.
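
As a rough pre-flight check, something like the sketch below could flag a registry path that sits in R's session temp directory. `in_session_tmp` is a hypothetical helper, not a batchtools function, and it assumes the path already exists so that it can be normalized reliably.

```r
# Hypothetical helper (not part of batchtools): TRUE if `path` is R's
# per-session temp directory or lies inside it. That directory usually
# lives on node-local disk, so other machines cannot see it.
# Assumes `path` exists; non-existent paths may not be normalized.
in_session_tmp <- function(path) {
  tmp <- normalizePath(tempdir(), winslash = "/", mustWork = FALSE)
  p <- normalizePath(path, winslash = "/", mustWork = FALSE)
  identical(p, tmp) || startsWith(p, paste0(tmp, "/"))
}

# Compare the current working directory with the session temp directory.
print(in_session_tmp(getwd()))
print(in_session_tmp(tempdir()))
```

With batchtools loaded, the same check could be applied to reg$file.dir before calling submitJobs(). Passing it does not prove that the directory is shared across nodes; it only catches the temp-folder case described above.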

rimorob avatar rimorob commented on June 18, 2024

Btw, that tip resolved the issue, thank you.
