
Comments (7)

matthchr commented on August 17, 2024

The monitor looks for tasks in the specific state you requested. It doesn't infer that a task in "Completed" must have been "Running" at some earlier point (and in fact there are ways to get to "Completed" without ever going through "Running").

This means the monitor actually has to observe each task while it is in "Running". Depending on how long the tasks run and how many there are, that can be hard. For example, with 200 tasks that each run for 10s, it's entirely possible (even likely) that the monitor doesn't see all 200 tasks every 10s, since it tries to query efficiently, so it could miss the running state for one or more tasks. If it misses "Running" for even one task, it will time out waiting for that task to reach the running state.
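To make the race concrete, here is a minimal simulation in plain Python. It does not use the Batch SDK; the `Task` class, the state names, and the poll times are all hypothetical stand-ins chosen for illustration. It shows a poller missing a task whose whole "running" window falls between two polls.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    start: float  # time the task enters "running" (hypothetical)
    end: float    # time the task enters "completed" (hypothetical)


def state_at(task, t):
    """Return the simulated state of a task at time t."""
    if t < task.start:
        return "active"
    if t < task.end:
        return "running"
    return "completed"


def tasks_never_seen_running(tasks, poll_times):
    """Names of tasks that were not in "running" at any poll time."""
    seen = set()
    for t in poll_times:
        for task in tasks:
            if state_at(task, t) == "running":
                seen.add(task.name)
    return [task.name for task in tasks if task.name not in seen]


tasks = [Task("a", 0, 10), Task("b", 12, 22), Task("c", 31, 39)]
# Task "c" runs from 31 to 39, entirely between the polls at 30 and 40.
print(tasks_never_seen_running(tasks, poll_times=[5, 20, 30, 40]))  # ['c']
```

A poller waiting for every task to be seen in "running" would never finish here, even though all three tasks ran and completed.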

Is there a particular reason you need to wait for "Running"? Generally speaking it's bad practice to do that, as "Running" is a transient state. A task could go to Running and then to Active and then back to Running and then to Completed. Usually it's best to wait for a steady state such as Completed to avoid races and/or your information being out of date.
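As a rough sketch of waiting on a steady state instead, the loop below polls until every task reports "completed" or a timeout expires. It is not the TaskStateMonitor implementation: `get_states` is a hypothetical callable that returns a mapping of task name to state, and you would back it with whatever listing call your client provides.

```python
import time


def wait_for_completed(get_states, timeout_s, poll_interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until all tasks are "completed"; return False on timeout.

    get_states is a hypothetical callable returning {task_name: state}.
    sleep and clock are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        states = get_states()
        if all(state == "completed" for state in states.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Usage with canned snapshots standing in for real service responses.
snapshots = iter([
    {"a": "running", "b": "active"},
    {"a": "completed", "b": "running"},
    {"a": "completed", "b": "completed"},
])
done = wait_for_completed(lambda: next(snapshots), timeout_s=60,
                          sleep=lambda _: None)
print(done)  # True
```

Because "completed" is terminal, it does not matter which intermediate states a poll happens to catch; a late poll still sees the final state.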

from azure-batch-samples.

vejuhust commented on August 17, 2024

@matthchr Thanks! I monitor "Running" because I need to know whether all my tasks are up and running, and to figure out the status of the pool indirectly. Do you have a better solution?


matthchr commented on August 17, 2024

If you want to know the status of the pool, could you just list the VMs in the pool and check their state?

I've made a note that we should improve the behavior of the monitor for short-lived task states such as running, but for now your best bet is going to be to avoid using TaskStateMonitor to monitor for short-lived task states (i.e. Running).

You could also look into using the new GetTaskCounts API, which gives you the count of tasks in a particular state but not their names.
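To illustrate what per-state counts can and cannot tell you, here is a small sketch in plain Python. It does not call GetTaskCounts; the list of states is hypothetical input, and the three state names are a simplified assumption.

```python
from collections import Counter

STATES = ("active", "running", "completed")  # simplified, assumed state names


def summarize(states):
    """Count tasks per state; task names are deliberately not kept."""
    counts = Counter(states)
    return {state: counts.get(state, 0) for state in STATES}


def all_tasks_started(counts):
    """True when no task is still waiting to be scheduled."""
    return counts["active"] == 0


counts = summarize(["running", "completed", "running", "active"])
print(counts)                     # {'active': 1, 'running': 2, 'completed': 1}
print(all_tasks_started(counts))  # False
```

A count of zero active tasks answers "has everything been picked up?" without needing to catch each task in "Running", though a task that is rescheduled can move back to active later.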


vejuhust commented on August 17, 2024

@matthchr Yeah, monitoring the pool directly may be a good solution. I used to watch the pool by myself.

Once in a while, I noticed that some nodes in the pool became 'Unusable' while resizing, and it seems that such nodes stay in that state forever: they neither reboot nor reimage. Automatic scaling couldn't resolve it, and I had to add extra healthy nodes manually. Do you have any solutions for this situation?


matthchr commented on August 17, 2024

There are a variety of reasons why a pool node might go to unusable.
If you have application packages, failing to download them (due to storage errors, etc.) will eventually move the node to unusable. Other things, such as infrastructure blips, can also move a node to unusable. Lastly, nodes can go to unusable if you have a pool in a VNet or are running a custom image and there are issues with your VNet or custom image (e.g. the VNet/custom image is blocking ports that Batch needs to communicate). We have autorecovery mechanisms in place to recover nodes that go to unusable, but they can sometimes take a while (15-45m or possibly more), and in some cases, such as the custom image/VNet cases, we can't change the configuration of your VNet, so the nodes would be stuck forever. Generally, though, those sorts of "configuration" problems will happen to all the nodes in the pool, not just some.

Generally speaking, nodes going to unusable is bad/unexpected. If it's happening often for your workload/pools, it might be a good idea to go through the Azure Portal and raise a ticket for your Batch account with the details of the pool it's happening on. That way somebody can investigate what is going wrong and fix it.
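One way to tell "still recovering" apart from "probably stuck" is to flag nodes that have been unusable for longer than the recovery window mentioned above. The sketch below is plain Python over hypothetical node records; the `id`, `state`, and `state_since` fields and the 45-minute threshold are illustrative assumptions, not an SDK schema.

```python
from datetime import datetime, timedelta, timezone


def stuck_unusable(nodes, now, max_age=timedelta(minutes=45)):
    """IDs of nodes that have been "unusable" for longer than max_age."""
    return [
        node["id"]
        for node in nodes
        if node["state"] == "unusable" and now - node["state_since"] > max_age
    ]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
nodes = [  # hypothetical records, e.g. built from a node listing
    {"id": "node-1", "state": "idle", "state_since": now - timedelta(hours=3)},
    {"id": "node-2", "state": "unusable", "state_since": now - timedelta(minutes=10)},
    {"id": "node-3", "state": "unusable", "state_since": now - timedelta(hours=2)},
]
print(stuck_unusable(nodes, now))  # ['node-3']
```

Nodes on such a list are reasonable candidates to include, with the pool details, in a support ticket.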


vejuhust commented on August 17, 2024

@matthchr Thanks for your explanation!

In my case, it wasn't a VNet or custom image issue: I used Cloud Service Batch nodes (OS Family: 5, Size: Medium). The node spent a long time in 'Creating' before it was declared 'Unusable'.

I encountered a similar issue with an internal Cosmos node before, and I did think about raising a ticket for the Batch node, but: 1) my free 'Visual Studio Enterprise' subscription has no technical support, and 2) keeping the broken nodes around for investigation costs money... so many computing hours 😢


matthchr commented on August 17, 2024

I am closing this because I think all of the above questions have been answered.

