Comments (15)
from bosh-agent.
It might also be worth noting here that the Windows VM is consistently at ~25% CPU utilization, even when nothing is going on. Not clear to me whether this is due to the concourse worker job or the bosh agent.
@flavorjones Hrm, very interesting and definitely unexpected (cc: @davidjahn). Thanks for reporting! I've scheduled a bug investigation in our backlog here.
FWIW, naively killing the bosh director on my cf deployment on GCP, using a stemcell v1200.1 release candidate, wasn't enough to reproduce this issue for me. That's pretty surprising -- I can't imagine any running process aside from the agent would be affected by the director's accessibility. I think I know the answer to this question, but just to be sure, would you happen to know whether the agent was performing any director-requested work at the time the director stopped?
Ohhh this is interesting... not sure if the problem is agent induced alone or is some interaction with concourse... we will see if we can reproduce it on GCP.
One thing to note: if the agent dies, its jobs (the services it creates) will keep running.
Is that red line the agent's CPU usage, and if not do we know which process it correlates to?
In investigating this bug, we discovered a completely unrelated bug where the agent will not restart after termination. Thanks!
We're still trying to reproduce the behavior you're seeing. In the stemcell you're using (where the aforementioned unrelated bug is not present), the agent does try to restart when the director connection is lost. We don't see our VMs using 100% CPU, though. Could you tell us a bit more about the instance types in this deployment?
Also, logs from /var/vcap/bosh/log would be super helpful.
OK, just landed and will try to reproduce and get y'all some logs and maybe some screenshots from Task Manager.
OK, so we do see about 25% CPU usage with the default of restarting the agent every 5 seconds on failure. This is a bit excessive, so we're going to go with an exponential backoff for restarting the agent. We'll back off up to 5 minutes, and then try to start the agent every 5 minutes thereafter. Note that the Linux agent is also chatty on startup when NATS is unreachable, but it probably doesn't have such an expensive bootstrap process.
It's worth noting that with a backoff of 5 minutes, when you bring your director back online and the resurrector is enabled, the resurrector may recreate the VM before the agent has a chance to restart itself.
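For readers following along, the backoff schedule described above can be sketched roughly as below. This is only an illustration, not the agent's or the service wrapper's actual implementation: the function name `restart_delays` and the doubling factor are assumptions, and only the 5-second starting interval and the 5-minute ceiling come from this thread.

```python
from itertools import islice
from typing import Iterator


def restart_delays(
    initial: float = 5.0,   # starting interval in seconds (from the thread)
    cap: float = 300.0,     # 5-minute ceiling in seconds (from the thread)
    factor: float = 2.0,    # ASSUMPTION: growth factor is not stated in the thread
) -> Iterator[float]:
    """Yield the wait before each successive restart attempt, in seconds.

    The wait grows geometrically until it reaches `cap`, then stays there.
    """
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First eight waits under the assumed doubling factor.
first_eight = list(islice(restart_delays(), 8))
print(first_eight)
```

Under these assumptions the waits level off at 300 seconds after a handful of failed attempts, which is also why a resurrector-triggered recreate can win the race once the director comes back.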
(Hopefully) Fixed in dc9a5b4.
Looks good. I updated service_wrapper.xml with the changes from dc9a5b4 and went through the stop/uninstall/install/start cycle. Here's what CPU util looked like while the director was stopped:
Thanks all!
@flavorjones Great to hear! We'll ping back here and close this issue once this agent change makes its way into a 2012R2 stemcell release.
@crawsible did this make it through?
@cppforlife Yes it did, thanks for the reminder.
For the record, I believe this was patched in 1200.3 (though it may have been 1200.2).