Comments (6)

rhc54 avatar rhc54 commented on July 29, 2024

I'm a little puzzled here. The Python client includes a resource manager plugin - e.g., Slurm. That plugin has the ability to request an allocation. The intent was that the Python client would not need to be started from within an allocation, but would instead execute its initial operations (fetching and building things) and then request an allocation when one was needed for actually running the tests.

The only issue we have encountered with that method is that the allocation request can take some time to be granted. Our internal solution was to simply create a high-priority queue for MTT operations and submit the request to it.

Is this not adequate for ECP systems? If not, note that the Python client already has a C/R capability in that you can have one ini file that downloads and builds things, and another ini file that flags the download/build components with ASIS to indicate that MTT is to use the existing installations if present. Does that not also solve the problem?

from mtt.

hppritcha avatar hppritcha commented on July 29, 2024

I don't see how this would work with the current IU database reporter. Anyway, I've written the code and it appears to serve my purposes.

rhc54 avatar rhc54 commented on July 29, 2024

You are welcome to use your code - however, the methods I described work just fine with the current IU reporter. We use it every day precisely that way. The builds are reported correctly even if previously built.

ribab avatar ribab commented on July 29, 2024

Hi @hppritcha, I am trying to understand your use case and how it differs from ours. Does your process request the cluster allocation from a compute node? How does this work?

You said:

"the requesting process is typically put on some compute node"

Does this mean the process that requested the allocation runs on a compute node? Don't you need an allocation before running anything on a compute node?

hppritcha avatar hppritcha commented on July 29, 2024

@ribab one invokes the allocation command from a front-end node, the one you get placed on when you ssh to the system. For example, on the ANL theta cluster, here's how you'd get an allocation:

ssh theta
XXX@thetalogin6:qsub -n 8 --jobname ompi -q debug-flat-quad -t 60 -I

Once the allocation is granted, the user is placed on one of the internal mom nodes on theta. One then uses aprun or mpirun to launch the application.

A similar thing occurs on SLURM-configured systems like NERSC cori. One can try using the salloc --no-shell option to remain on one of the cori login nodes, but we've found that option to be unreliable for running many tests like we do in MTT.

The only way I can see to use the MTT ALPS plugin on theta would be for the allocate command to include the name of a script which would somehow continue the MTT run on the backend mom node. The front-end process running the ALPS plugin up to that point would be disconnected from whatever was going on in the backend.
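
As a rough sketch of that idea (not the actual ALPS plugin code), the allocate command could be assembled with a continuation script as its last argument. The helper name `build_batch_allocation` and the script name `continue_mtt.sh` are hypothetical, and this assumes the scheduler's qsub accepts a script path as a trailing argument when `-I` is omitted; the other flags are the ones from the theta example above.

```python
import shlex


def build_batch_allocation(script, nodes=8, minutes=60,
                           queue="debug-flat-quad", jobname="ompi"):
    """Build the argv for a non-interactive allocation request.

    Hypothetical helper: the flags mirror the interactive theta example,
    with -I dropped and a script path appended. The script is assumed to
    continue the MTT run on the backend node once the job starts.
    """
    return [
        "qsub",
        "-n", str(nodes),
        "--jobname", jobname,
        "-q", queue,
        "-t", str(minutes),
        script,
    ]


# Only builds and prints the command; nothing is submitted here.
cmd = build_batch_allocation("./continue_mtt.sh")
print(shlex.join(cmd))
```

The front-end process would submit this and then lose contact with the run, which is the disconnect described above.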

Given the way the MTT SLURM and ALPS plugins are written, I suspect they were developed on systems configured similarly to some of Cray's internal systems. There, SLURM is configured so that when one does an salloc, one remains on the front-end nodes, not ssh'd into a compute or mom node on the backend. In this case, the plugins as is, with their allocate/deallocate commands, should work fine. PBS was similarly configured on those systems.
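
To make the distinction concrete, here is a minimal sketch of how a client could tell whether it has already been started inside an allocation, and so skip its own allocate/deallocate step. The function name `inside_allocation` is hypothetical and not part of MTT; it assumes the scheduler exports its usual job ID variable (`SLURM_JOB_ID`, `COBALT_JOBID`, or `PBS_JOBID`) into the job environment.

```python
import os

# Job ID variables commonly exported by Slurm, Cobalt, and PBS respectively.
# Assumption: the site has not disabled or renamed them.
ALLOCATION_ENV_VARS = ("SLURM_JOB_ID", "COBALT_JOBID", "PBS_JOBID")


def inside_allocation(environ=None):
    """Return True if any known scheduler job ID variable is set and non-empty."""
    env = os.environ if environ is None else environ
    return any(env.get(name) for name in ALLOCATION_ENV_VARS)


if inside_allocation():
    print("already inside an allocation; skip the allocate step")
else:
    print("no allocation detected; the plugin would need to request one")
```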

hppritcha avatar hppritcha commented on July 29, 2024

closed via #916
