
bosh-agent's Issues

`get_task` action fails if the task to be retrieved failed

When the original task fails, the get_task action wraps the original task's error and returns it, thus yielding an error itself: get_task.go

Consumers like bosh-init cannot distinguish whether get_task or the original task failed:
agent_client.go. In this case get_task will be retried, not the original task.

Consumers should be able to distinguish between the two, in order to decide whether to retry the original task or to retry get_task.
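
One way to make the failure modes distinguishable would be a typed error for the original task's failure; a minimal sketch, not the current bosh-agent API:

package action // hypothetical placement

import "fmt"

// OriginalTaskError marks a failure of the retrieved task itself, as
// opposed to a failure of the get_task action machinery.
type OriginalTaskError struct{ Err error }

func (e OriginalTaskError) Error() string {
	return fmt.Sprintf("original task failed: %s", e.Err)
}

A consumer could then check for OriginalTaskError and retry the original task, while any other error would mean get_task itself failed and can safely be retried.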

Disk percentage metrics are 5% off

It seems that the disk metrics reported by BOSH and the values observed on the VMs don't match. The most prominent difference is in the free disk percentage.

After some investigation, the problem was narrowed down to BOSH not taking into account the reserved disk space on the volume.

-m reserved-blocks-percentage
Set the percentage of the filesystem which may only be allocated by privileged processes. Reserving some number of filesystem blocks for use by privileged processes is done to avoid filesystem fragmentation, and to allow system daemons, such as syslogd(8), to continue to function correctly after non-privileged processes are prevented from writing to the filesystem. Normally, the default percentage of reserved blocks is 5%.
http://linux.die.net/man/8/tune2fs

You can further check that by running df -h on a machine. You will see that the sum of Used and Avail falls short of Size. That's because Avail shows only what is available to a non-root user (i.e. it does not include the reserved disk space). The reported percentage, however, takes reserved disk space into account and reports the correct value for a non-root user.

The problem is that BOSH reports the disk that is available to the root user and BOSH releases use the vcap user to run jobs. This means that when BOSH reports 95% usage, the job has actually consumed 100% of the disk and is crashing, something the operator might not be aware of.

Fixing this would require a change to sigar_stats_collector.

One way to solve this would be to correct the Total metric, taking reserved space into account.

func (s *sigarStatsCollector) GetDiskStats(mountedPath string) (stats boshstats.DiskStats, err error) {
    fsUsage, err := s.statsSigar.GetFileSystemUsage(mountedPath)
    if err != nil {
        err = bosherr.WrapError(err, "Getting Sigar File System Usage")
        return
    }

    // Free is the disk available to root (includes reserved)
    // Avail is the disk available to non-root
    reserved := fsUsage.Free - fsUsage.Avail
    stats.DiskUsage.Total = fsUsage.Total - reserved
    stats.DiskUsage.Used = fsUsage.Used
    stats.InodeUsage.Total = fsUsage.Files
    stats.InodeUsage.Used = fsUsage.Files - fsUsage.FreeFiles

    return
}

Another way would be to report metrics similar to the df -h command. This would require a fix to the reported percentage, which should be calculated as follows.

stats.DiskUsage.Percent = fsUsage.Used / (fsUsage.Total - reserved)

This requires a change to the DiskUsage struct, among other things, but the idea should be clear.
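
For reference, a minimal sketch of the df-style calculation, assuming the sigar usage fields are integers and Percent becomes a float64 field on DiskUsage:

reserved := fsUsage.Free - fsUsage.Avail
usable := fsUsage.Total - reserved
if usable > 0 {
    // Convert to float64 so integer division doesn't truncate to 0.
    stats.DiskUsage.Percent = float64(fsUsage.Used) / float64(usable)
}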

Question: Does the agent support adding non-IP entries to the DNS resolver (resolv.conf or interface)?

We have a requirement to add "options single-request" to the DNS resolver. I think user_data.json would have to be set like below, but I don't want the agent to add an entry like "nameserver options single-request":

{"vm":{"name":"vm-24492677-7e7464fe3b"},"agent_id":"24492677-c4cb-43d5-8e4b-a17e7464fe3b","mbus":"nats://nats:[email protected]:4222","networks":{"default":{"type":"dynamic","dns":["8.8.8.8","10.100.100.99","10.0.99.99", "options single-request"],"default":["dns","gateway"]}},"blobstore":{"provider":"dav","options":{"endpoint":"http://110.10.10.9:25250","user":"agent","password":"agent"}},"server":{"name":"vm-24492677-7e7464fe3b"},"disks":{"persistent":{}}}

I know bosh-agent overwrites the DNS resolver file during agent start. Is there any other condition under which the agent overwrites the DNS resolver? Thanks.
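
For context, the intended result is a resolv.conf along these lines, where options is its own directive rather than a nameserver entry:

nameserver 8.8.8.8
nameserver 10.100.100.99
nameserver 10.0.99.99
options single-request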

Timed out sending 'list_disk' because bosh-agent uses cached outdated settings.json

Issue Logs:

  Started updating instance database_z1 > database_z1/5101a816-47e5-48a4-966f-3afc6be350b8 (0) (canary). Failed: Timed out sending 'list_disk' to
cfc2345b-c38e-443c-8b29-db8d73ed1791 after 45 seconds (00:11:09)

Error 450002: Timed out sending 'list_disk' to cfc2345b-c38e-443c-8b29-db8d73ed1791 after 45 seconds

Reproduce steps on Azure:

  1. Resize the persistent data disk of database_z1 (VM size: F1)
  2. Update the deployment
  3. It will fail because F1 only allows attaching 2 data disks (LUN 0: ephemeral disk; LUN 1: current data disk)
  4. Change the VM size of database_z1 to F2
  5. Update the deployment
  6. It will fail with the above error.

Root cause:
In step 5,

  1. bosh called the Azure CPI to attach the new disk with the new size
  2. bosh sent list_disk to the bosh-agent inside database_z1
  3. bosh sent mount_disk to the bosh-agent inside database_z1
  4. bosh-agent fetched new settings from bosh-registry and wrote them to /var/vcap/bosh/settings.json
  5. bosh sent migrate_disk to the bosh-agent inside database_z1
  6. bosh called the Azure CPI to detach the old disk
  7. bosh sent list_disk to the bosh-agent inside database_z1
  8. bosh-agent did not fetch new settings from bosh-registry but read the settings cached in /var/vcap/bosh/settings.json. Those settings still list two data disks (excluding the ephemeral disk), so the action failed with a timeout.
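
The failure follows from the agent's settings fallback: when it does not (or cannot) refetch from the registry, it reads the cached file written in step 4, which still lists both disks. A minimal sketch of that pattern, with hypothetical names (not the actual bosh-agent code):

package settings

import "os"

// loadSettings prefers fresh registry settings and falls back to the cached
// file; in step 8 the cache from step 4 still lists the old persistent disk.
func loadSettings(fetchFromRegistry func() ([]byte, error)) ([]byte, error) {
	if fresh, err := fetchFromRegistry(); err == nil {
		return fresh, nil
	}
	// Registry fetch failed or was skipped: use the possibly stale cache.
	return os.ReadFile("/var/vcap/bosh/settings.json")
}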

A `no interface configured with that name` error happens when a virtual net interface is passed to Validate

When the VM is created by the CPI, we get 3 interfaces in /sys/class/net (ls /sys/class/net: lo, eth0, eth1). The agent got the settings listed below and then set up networks:

{
  "networks": {
    "default": {
      "type": "manual",
      "ip": "10.112.166.136",
      "netmask": "255.255.255.192",
      "gateway": "",
      "resolved": false,
      "use_dhcp": false,
      "default": null,
      "dns": [
        "8.8.8.8"
      ],
      "mac": "",
      "preconfigured": false,
      "alias": "eth0:0"
    },
    "dynamic": {
      "type": "dynamic",
      "ip": "169.50.68.75",
      "netmask": "255.255.255.224",
      "gateway": "169.50.68.65",
      "resolved": false,
      "use_dhcp": false,
      "default": [
        "gateway",
        "dns"
      ],
      "dns": [
        "8.8.8.8",
        "10.0.80.11",
        "10.0.80.12"
      ],
      "mac": "06:f2:b7:01:a7:ca",
      "preconfigured": false,
      "alias": "eth1"
    },
    "dynamic_1": {
      "type": "dynamic",
      "ip": "10.112.39.113",
      "netmask": "255.255.255.128",
      "gateway": "",
      "resolved": false,
      "use_dhcp": false,
      "default": null,
      "dns": [
        "8.8.8.8",
        "10.0.80.11",
        "10.0.80.12"
      ],
      "mac": "06:b7:d5:35:6b:7a",
      "preconfigured": false,
      "alias": "eth0",
      "routes": [
        {
          "Destination": "10.0.0.0",
          "Gateway": "10.112.39.1",
          "NetMask": "255.0.0.0"
        },
        {
          "Destination": "161.26.0.0",
          "Gateway": "10.112.39.1",
          "NetMask": "255.255.0.0"
        }
      ]
    }
  }
}

Two dynamic networks are provided by the IaaS. The manual network is a virtual network interface based on the original net interface (dynamic_1). These configurations can be written to /etc/network/interfaces and applied.

But when net.interfaceAddressesValidator.Validate(staticAddresses) is invoked in code#1, a no interface configured with that name error occurs for eth0:0, because the systemInterfaceAddresses list doesn't include the virtual interface.

2017-10-12_12:39:11.89281 [main] 2017/10/12 12:39:11 ERROR - App setup Running bootstrap: Setting up networking: Validating static network configuration: Validating network interface 'eth0:0' IP addresses, no interface configured with that name: <nil cause>

The only temporary solution I can think of is to exclude virtual IPs from validation.
Could you please give us some advice, or add a mechanism for dealing with VIF-related problems in bosh-agent?
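
One possible direction, sketched with a hypothetical helper: strip the : alias suffix before matching against system interfaces, so eth0:0 is validated against its parent device eth0:

import "strings"

// physicalInterfaceName maps a virtual interface name like "eth0:0" to its
// parent device "eth0", which does appear in /sys/class/net (hypothetical).
func physicalInterfaceName(name string) string {
	if i := strings.Index(name, ":"); i != -1 {
		return name[:i]
	}
	return name
}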

Thank you, guys!

Ephemeral disk is re-partitioned due to memory size change after Xen server upgrade

We observed that the memory size change after a Xen server upgrade (e.g. from 6.x to 7.x) is usually more than 20M, which caused the ephemeral disk to be re-partitioned and formatted after the upgrade (all VMs were rebooted). It seems 100M would be a safe margin for such a case; is it possible to extend the threshold to 100M in the agent code below? Thanks.

https://github.com/cloudfoundry/bosh-agent/blob/master/platform/disk/parted_partitioner.go#L90
https://github.com/cloudfoundry/bosh-agent/blob/master/platform/disk/sfdisk_partitioner.go#L150
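
For illustration, the kind of size-tolerance check involved, with the threshold widened as proposed (a sketch with hypothetical names, not the actual partitioner code):

// deltaWithinTolerance reports whether an existing partition is close enough
// in size to the desired one to be reused instead of re-partitioned.
const partitionSizeToleranceBytes uint64 = 100 * 1024 * 1024 // 100M instead of ~20M

func deltaWithinTolerance(existingBytes, desiredBytes uint64) bool {
	delta := existingBytes - desiredBytes
	if desiredBytes > existingBytes {
		delta = desiredBytes - existingBytes
	}
	return delta <= partitionSizeToleranceBytes
}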

/cc @maximilien @cppforlife

DelayedAuditLogger adds extra bootstrap time to bosh-agent

There are lots of retries waiting for DelayedAuditLogger to be set up, which adds an extra 2 or 3 minutes of bootstrap time to bosh-agent. Is it possible to disable AuditLogger by default and enable it when needed? My stemcell is 3421.6 and the bosh-agent binary version is 0.0.38.

2017-06-15_03:13:28.16985 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #110
2017-06-15_03:13:28.16998 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.27032 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #111
2017-06-15_03:13:28.27043 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.37079 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #112
2017-06-15_03:13:28.37086 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.47118 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #113
2017-06-15_03:13:28.47129 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.57163 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #114
2017-06-15_03:13:28.57172 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.67210 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #115
2017-06-15_03:13:28.67217 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.77253 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #116
2017-06-15_03:13:28.77274 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.87299 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #117
2017-06-15_03:13:28.87323 [DelayedAuditLogger] 2017/06/15 03:13:28 ERROR - Unix syslog delivery error
2017-06-15_03:13:28.97351 [unlimitedRetryStrategy] 2017/06/15 03:13:28 DEBUG - Making attempt #118

@maximilien @mattcui

Agent should fail when it does not recognize the type of a partition instead of formatting it

The bosh-agent should be as defensive as possible, especially when it performs destructive actions. In the current code in linux_formatter.go there is a method to check the type of a partition, pasted here:

func (f linuxFormatter) partitionHasGivenType(partitionPath string, fsType FileSystemType) bool {
    stdout, _, _, err := f.runner.RunCommand("blkid", "-p", partitionPath)
    if err != nil {
        return false
    }

    return strings.Contains(stdout, fmt.Sprintf(` TYPE="%s"`, fsType))
}

If the blkid call fails (for whatever reason), the method returns false, which in the caller will result in the agent potentially partitioning/formatting the disk, destroying all data.

Instead, the agent should be defensive and propagate the failure, thus avoiding destroying data on disk.
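
A sketch of a more defensive variant, propagating the blkid failure so the caller can abort instead of reformatting the disk (error message is illustrative):

func (f linuxFormatter) partitionHasGivenType(partitionPath string, fsType FileSystemType) (bool, error) {
	stdout, _, _, err := f.runner.RunCommand("blkid", "-p", partitionPath)
	if err != nil {
		// Propagate instead of returning false: a failed check must not be
		// treated the same as "partition has a different type".
		return false, bosherr.WrapError(err, "Checking partition type with blkid")
	}
	return strings.Contains(stdout, fmt.Sprintf(` TYPE="%s"`, fsType)), nil
}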

Releases that have !binary values in release.MF can't be compiled with stemcell 3363.12

When running bosh-init deploy using stemcell 3363.12, it reports:

Started deploying
  Creating VM for instance 'bosh/0' from stemcell '1530051'... Finished (00:08:27)
  Waiting for the agent on VM '29693995' to be ready... Finished (00:00:13)
  Creating disk... Finished (00:01:18)
  Attaching disk '21542745' to VM '29693995'... Finished (00:02:04)
  Rendering job templates... Finished (00:00:12)
  Compiling package 'powerdns/256336d00b1689138490c385c03ad3a8f54b4a9e'... Finished (00:00:18)
  Compiling package 's3cli/1c5a91f02feff8a0e3a506ac51c4a3140e86f049'... Finished (00:00:18)
  Compiling package 'post-deploy/OGE3ODNlYWUzMTg5YzliZmJlZGQ3NDNhMWFmYzc0NjI5NjI0YTJkNw=='... Failed (00:01:28)
Failed deploying (00:14:38)

Stopping registry... Finished (00:00:00)
Cleaning up rendered CPI jobs... Finished (00:00:00)

Command 'deploy' failed:
  Deploying:
    Building state for instance 'bosh/0':
      Compiling job package dependencies for instance 'bosh/0':
        Compiling job package dependencies:
          Remotely compiling package 'post-deploy' with the agent:
            Sending 'compile_package' to the agent:
              Sending 'get_task' to the agent:
                Agent responded with error: Action Failed get_task: Task 2d7690a5-0489-4463-7555-cf494b7fcdf6 result: Extracting method arguments from payload: Unmarshalling action argument: Unable to parse digest string. Digest and algorithm key can only contain alpha-numeric characters.

So I think the binary strings in the release are not supported by the latest bosh-agent.

We have a lot of release tgz files that have !binary values for packages/jobs, like this:

  version: !binary |-
    NmI2MmVmY2Y0NjY3MmRhNTNhNGUxOWIzZjRlMzYwNGFiZDk5YThhNA==
  fingerprint: !binary |-
    NmI2MmVmY2Y0NjY3MmRhNTNhNGUxOWIzZjRlMzYwNGFiZDk5YThhNA==
  sha1: !binary |-
    YzgyODc0ODA3MDk1ODJmYjI1MjVmNDNjMThiNDA3N2JlMWFhZTI1MA==

It seems this is no longer supported. Is there a way to convert them to non-binary values, like this:

  version: 63889d018ded88d660b01994d7b00c74ce6fe8d5
  fingerprint: 63889d018ded88d660b01994d7b00c74ce6fe8d5
  sha1: 5a41368cc3a1c57694ba913cd1612aeafad71334
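
The !binary values are just base64-encoded ASCII, so they can be decoded back to plain strings; a minimal Go sketch:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// The fingerprint value from the example above decodes to a plain
	// 40-character hex string.
	raw, err := base64.StdEncoding.DecodeString(
		"NmI2MmVmY2Y0NjY3MmRhNTNhNGUxOWIzZjRlMzYwNGFiZDk5YThhNA==")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // 6b62efcf46672da53a4e19b3f4e3604abd99a8a4
}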

[Stemcell 3262.12] Permissions for /tmp are 700 instead of 770

We've seen permissions for /tmp being wrong on many VMs that have been updated to use stemcell 3262.12. Instead of 770, they are 700.

Here is an example from the agent logs of a Director provisioned with bosh-init on a 3262.12 stemcell.

It seems the monit jobs are started first:

2016-09-15_11:50:35.74294 [HTTPS Dispatcher] 2016/09/15 11:50:35 INFO - POST /agent
2016-09-15_11:50:35.74295 [MBus Handler] 2016/09/15 11:50:35 INFO - Received request with action start
2016-09-15_11:50:35.74295 [MBus Handler] 2016/09/15 11:50:35 DEBUG - Payload

Then the agent is restarted:

2016-09-15_11:55:51.34013 [main] 2016/09/15 11:55:51 DEBUG - Starting agent

Then some chmodding happens on /tmp and on the directory which is later bind-mounted to /tmp:

2016-09-15_11:55:59.15746 [File System] 2016/09/15 11:55:59 DEBUG - Symlinking oldPath /var/vcap/data/sys with newPath /var/vcap/sys
2016-09-15_11:55:59.15746 [File System] 2016/09/15 11:55:59 DEBUG - Making dir /var/vcap/data/tmp with perm 493
2016-09-15_11:55:59.15749 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Running command: chown root:vcap /tmp
2016-09-15_11:55:59.15828 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stdout: 
2016-09-15_11:55:59.15828 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stderr: 
2016-09-15_11:55:59.15829 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Successful: true (0)
2016-09-15_11:55:59.15832 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Running command: chmod 0770 /tmp
2016-09-15_11:55:59.15886 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stdout: 
2016-09-15_11:55:59.15887 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stderr: 
2016-09-15_11:55:59.15888 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Successful: true (0)
2016-09-15_11:55:59.15891 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Running command: chmod 0700 /var/tmp
2016-09-15_11:55:59.15946 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stdout: 
2016-09-15_11:55:59.15947 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stderr: 
2016-09-15_11:55:59.15947 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Successful: true (0)
2016-09-15_11:55:59.15950 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Running command: mkdir -p /var/vcap/data/root_tmp
2016-09-15_11:55:59.16018 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stdout: 
2016-09-15_11:55:59.16018 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stderr: 
2016-09-15_11:55:59.16019 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Successful: true (0)
2016-09-15_11:55:59.16022 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Running command: chmod 0700 /var/vcap/data/root_tmp
2016-09-15_11:55:59.16075 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stdout: 
2016-09-15_11:55:59.16075 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Stderr: 
2016-09-15_11:55:59.16076 [Cmd Runner] 2016/09/15 11:55:59 DEBUG - Successful: true (0)

At that point in time, the directory /var/vcap/data/root_tmp is not yet mounted to /tmp:

2016-09-15_11:55:59.16076 [File System] 2016/09/15 11:55:59 DEBUG - Reading file /proc/mounts
2016-09-15_11:55:59.16082 [File System] 2016/09/15 11:55:59 DEBUG - Read content
2016-09-15_11:55:59.16083 ********************
2016-09-15_11:55:59.16083 sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
2016-09-15_11:55:59.16083 proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
2016-09-15_11:55:59.16083 udev /dev devtmpfs rw,relatime,size=3818716k,nr_inodes=954679,mode=755 0 0
2016-09-15_11:55:59.16083 devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
2016-09-15_11:55:59.16084 tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=765856k,mode=755 0 0
2016-09-15_11:55:59.16084 /dev/xvda1 / ext4 rw,relatime,data=ordered 0 0
2016-09-15_11:55:59.16084 none /sys/fs/cgroup tmpfs rw,relatime,size=4k,mode=755 0 0
2016-09-15_11:55:59.16084 none /sys/fs/fuse/connections fusectl rw,relatime 0 0
2016-09-15_11:55:59.16084 none /sys/kernel/debug debugfs rw,relatime 0 0
2016-09-15_11:55:59.16085 none /sys/kernel/security securityfs rw,relatime 0 0
2016-09-15_11:55:59.16085 none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
2016-09-15_11:55:59.16085 none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
2016-09-15_11:55:59.16085 none /run/user tmpfs rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
2016-09-15_11:55:59.16085 none /sys/fs/pstore pstore rw,relatime 0 0
2016-09-15_11:55:59.16086 rpc_pipefs /run/rpc_pipefs rpc_pipefs rw,relatime 0 0
2016-09-15_11:55:59.16086 /dev/xvdb2 /var/vcap/data ext4 rw,relatime,data=ordered 0 0
2016-09-15_11:55:59.16086 /dev/xvdb2 /var/log ext4 rw,relatime,data=ordered 0 0
2016-09-15_11:55:59.16086 tmpfs /var/vcap/data/sys/run tmpfs rw,relatime,size=1024k 0 0
2016-09-15_11:55:59.16087 /dev/xvdb2 /tmp ext4 rw,relatime,data=ordered 0 0
2016-09-15_11:55:59.16087 /dev/xvdb2 /var/tmp ext4 rw,relatime,data=ordered 0 0
2016-09-15_11:55:59.16087 /dev/xvdf1 /var/vcap/store ext4 rw,relatime,data=ordered 0 0
2016-09-15_11:55:59.16087 

I can't really find in the agent logs when that happened. On the VM itself, however, it seems the bind-mount did happen at some point:

# mount
/dev/xvda1 on / type ext4 (rw)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw)
/dev/xvdb2 on /var/vcap/data type ext4 (rw)
/var/vcap/data/root_log on /var/log type none (rw,noexec,nosuid,nodev,bind)
tmpfs on /var/vcap/data/sys/run type tmpfs (rw,size=1m)
/var/vcap/data/root_tmp on /tmp type none (rw,noexec,nosuid,nodev,bind)
/var/vcap/data/root_tmp on /var/tmp type none (rw,noexec,nosuid,nodev,bind)
/dev/xvdf1 on /var/vcap/store type ext4 (rw)

And in /tmp we have a postgres .lock file which breaks Director updates:

# ls -la /tmp/
total 12
drwx------  2 root vcap 4096 Sep 16 14:17 .
drwxr-xr-x 23 root root 4096 Sep 15 11:55 ..
srwxrwxrwx  1 vcap vcap    0 Sep 15 11:50 .s.PGSQL.5432
-rw-------  1 vcap vcap   56 Sep 15 11:50 .s.PGSQL.5432.lock

# cat /tmp/.s.PGSQL.5432.lock
16322
/var/vcap/store/postgres-9.4
1473940236
5432
/tmp

Note that the files were created before the above events in the agent log happened.

For comparison, here is some output for a stemcell where it actually works

Creation of /tmp and some chmodding

2016-09-16_13:44:02.44369 [File System] 2016/09/16 13:44:02 DEBUG - Making dir /var/vcap/data/tmp with perm 493
2016-09-16_13:44:02.44491 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chown root:vcap /tmp
2016-09-16_13:44:02.44580 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.44581 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.44581 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.44583 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chmod 0770 /tmp
2016-09-16_13:44:02.44657 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.44658 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.44658 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.44658 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chmod 0700 /var/tmp
2016-09-16_13:44:02.44723 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.44724 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.44724 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.44724 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: mkdir -p /var/vcap/data/root_tmp
2016-09-16_13:44:02.44953 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.44953 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.44954 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.44956 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chmod 0700 /var/vcap/data/root_tmp
2016-09-16_13:44:02.45021 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45021 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45022 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)

some chmodding and bind-mounting

2016-09-16_13:44:02.45043 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: mount /var/vcap/data/root_tmp /tmp -o nodev -o noexec -o nosuid --bind
2016-09-16_13:44:02.45175 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45176 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45176 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.45177 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chown root:vcap /tmp
2016-09-16_13:44:02.45268 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45269 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45269 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.45271 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chmod 0770 /tmp
2016-09-16_13:44:02.45339 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45339 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45340 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)

Then some more chmodding and actual bind-mounting

2016-09-16_13:44:02.45350 [File System] 2016/09/16 13:44:02 DEBUG - Reading file /proc/mounts
2016-09-16_13:44:02.45353 [File System] 2016/09/16 13:44:02 DEBUG - Read content
2016-09-16_13:44:02.45354 ********************
2016-09-16_13:44:02.45354 sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
2016-09-16_13:44:02.45354 proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
2016-09-16_13:44:02.45354 udev /dev devtmpfs rw,relatime,size=4077376k,nr_inodes=1019344,mode=755 0 0
2016-09-16_13:44:02.45354 devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
2016-09-16_13:44:02.45355 tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=817588k,mode=755 0 0
2016-09-16_13:44:02.45355 /dev/vda1 / ext4 rw,relatime,data=ordered 0 0
2016-09-16_13:44:02.45355 none /var/lib/ureadahead/debugfs debugfs rw,relatime 0 0
2016-09-16_13:44:02.45355 none /sys/fs/cgroup tmpfs rw,relatime,size=4k,mode=755 0 0
2016-09-16_13:44:02.45356 none /sys/fs/fuse/connections fusectl rw,relatime 0 0
2016-09-16_13:44:02.45356 none /sys/kernel/debug debugfs rw,relatime 0 0
2016-09-16_13:44:02.45356 none /sys/kernel/security securityfs rw,relatime 0 0
2016-09-16_13:44:02.45356 none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
2016-09-16_13:44:02.45356 none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
2016-09-16_13:44:02.45357 none /run/user tmpfs rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
2016-09-16_13:44:02.45357 none /sys/fs/pstore pstore rw,relatime 0 0
2016-09-16_13:44:02.45357 rpc_pipefs /run/rpc_pipefs rpc_pipefs rw,relatime 0 0
2016-09-16_13:44:02.45357 /dev/vda3 /var/vcap/data ext4 rw,relatime,data=ordered 0 0
2016-09-16_13:44:02.45358 /dev/vda3 /var/log ext4 rw,relatime,data=ordered 0 0
2016-09-16_13:44:02.45358 tmpfs /var/vcap/data/sys/run tmpfs rw,relatime,size=1024k 0 0
2016-09-16_13:44:02.45358 /dev/vda3 /tmp ext4 rw,relatime,data=ordered 0 0
2016-09-16_13:44:02.45358
2016-09-16_13:44:02.45358 ********************
2016-09-16_13:44:02.45359 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: mount /var/vcap/data/root_tmp /var/tmp -o nodev -o noexec -o nosuid --bind
2016-09-16_13:44:02.45478 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45479 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45479 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.45479 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chown root:vcap /var/tmp
2016-09-16_13:44:02.45568 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45569 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45569 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)
2016-09-16_13:44:02.45569 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Running command: chmod 0770 /var/tmp
2016-09-16_13:44:02.45633 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stdout:
2016-09-16_13:44:02.45633 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Stderr:
2016-09-16_13:44:02.45634 [Cmd Runner] 2016/09/16 13:44:02 DEBUG - Successful: true (0)

Monit start seems to happen only after that:

2016-09-16_13:44:17.80942 [Action Dispatcher] 2016/09/16 13:44:17 INFO - Running sync action start

Any ideas what happened there?

Can't use File source without registry.

You can set the agent to read settings from a file using the following config options:

{
  ...
  "Infrastructure": {
    "Settings": {
      "Sources": [
        {
          "Type": "File",
          "MetaDataPath": "/home/vcap/metadata",
          "UserDataPath": "/home/vcap/userdata",
          "SettingsPath": "/home/vcap/settings"
        }
      ]
    }
  }
}

Without the UseRegistry option you'll get this error: File source is not supported without registry.
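
For reference, the combination the agent does accept today pairs the File source with the registry flag, as in the agent.json shown in a later issue:

{
  "Infrastructure": {
    "Settings": {
      "Sources": [
        {
          "Type": "File",
          "SettingsPath": "/var/vcap/bosh/user_data.json"
        }
      ],
      "UseRegistry": true
    }
  }
}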

Could you tell me whether there is any actual reason to restrict reading settings from a file when the registry is not used?

Can we close port 6868 due to the POODLE vulnerability?

We got a security warning that port 6868 (used by the BOSH Agent on the director VM) is affected by the POODLE vulnerability. We want to confirm with the BOSH community whether this port is still needed after the director is deployed. To address the issue, we are considering an iptables rule to block inbound access to this port if it's unused; or is there a better way you could advise? Thanks.

CC @maximilien

failed to run godep get to generate Go dependencies

When running godep get ./..., it complains:

package github.com/cloudfoundry/bosh-micro-cli/deployer/registry

    imports github.com/cloudfoundry/bosh-micro-cli/deployer/registry

    imports github.com/cloudfoundry/bosh-micro-cli/deployer/registry: cannot find package "github.com/cloudfoundry/bosh-micro-cli/deployer/registry" in any of:

    /usr/local/go/src/pkg/github.com/cloudfoundry/bosh-micro-cli/deployer/registry (from $GOROOT)
    /Users/dongdong/Work/Projects/bosh-agent/src/github.com/cloudfoundry/bosh-micro-cli/deployer/registry (from $GOPATH)
godep: exit status 1 

Based on our investigation, the only dependency is in the integration test for the fake registry, here:

bmregistry "github.com/cloudfoundry/bosh-micro-cli/deployer/registry"

The actual referenced code does exist here:
https://github.com/cloudfoundry/bosh-agent/tree/master/Godeps/_workspace/src/github.com/cloudfoundry/bosh-micro-cli/deployer/registry
but it is not synchronized with the bosh-micro-cli repo, which broke godep.

-IBM pair Edward + Tom

no state command

I tried to figure out the state of a microBOSH deployment with the bosh micro agent state command, but received a message that there is no such method.

$ bosh micro agent state
/Users/lexsys/.rvm/gems/ruby-2.0.0-p481/gems/agent_client-1.2768.0/lib/agent_client/base.rb:21:in `method_missing': {"message"=>"unknown message state"} (Bosh::Agent::HandlerError)
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/gems/bosh_cli_plugin_micro-1.2768.0/lib/bosh/cli/commands/micro.rb:293:in `agent'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/gems/bosh_cli-1.2768.0/lib/cli/command_handler.rb:57:in `run'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/gems/bosh_cli-1.2768.0/lib/cli/runner.rb:56:in `run'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/gems/bosh_cli-1.2768.0/lib/cli/runner.rb:16:in `run'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/gems/bosh_cli-1.2768.0/bin/bosh:7:in `<top (required)>'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/bin/bosh:23:in `load'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/bin/bosh:23:in `<main>'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/bin/ruby_executable_hooks:15:in `eval'
  from /Users/lexsys/.rvm/gems/ruby-2.0.0-p481/bin/ruby_executable_hooks:15:in `<main>'

Possible to enlarge the size of /tmp (/dev/loop0)?

Currently, the size of /tmp is 128M, hardcoded in the agent code:

    if !systemTmpDirIsMounted {
        // If it's not mounted on /tmp, blow it away
        _, _, _, err = p.cmdRunner.RunCommand("truncate", "-s", "128M", boshRootTmpPath)
        if err != nil {
            return bosherr.WrapError(err, "Truncating root tmp dir")
        }

        _, _, _, err = p.cmdRunner.RunCommand("chmod", "0700", boshRootTmpPath)
        if err != nil {
            return bosherr.WrapError(err, "Chmoding root tmp dir")
        }

        _, _, _, err = p.cmdRunner.RunCommand("mke2fs", "-t", "ext4", "-m", "1", "-F", boshRootTmpPath)
        if err != nil {
            return bosherr.WrapError(err, "Creating root tmp dir filesystem")
        }

        err = p.diskManager.GetMounter().Mount(boshRootTmpPath, systemTmpDir, "-t", "ext4", "-o", "loop")
        if err != nil {
            return bosherr.WrapError(err, "Mounting root tmp dir over /tmp")
        }
    }

Some users have asked us to enlarge this size in the stemcell, since they want to install software that consumes a lot of space in /tmp. Can we consider parameterizing it, so the size of /tmp can be specified in the deployment YAML? Thanks.
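
A sketch of what parameterizing this might look like, using a hypothetical agent.json option (this field does not exist today):

// Hypothetical: read the size from agent.json instead of hardcoding 128M,
// e.g. {"Platform":{"Linux":{"RootTmpDiskSizeMB":1024}}}.
tmpSizeMB := p.options.RootTmpDiskSizeMB
if tmpSizeMB == 0 {
	tmpSizeMB = 128 // keep the current default
}
_, _, _, err = p.cmdRunner.RunCommand("truncate", "-s", fmt.Sprintf("%dM", tmpSizeMB), boshRootTmpPath)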

A `Unix syslog delivery error` happens in GetPublicKey

When I used BOSH to deploy a simple Redis release, a Unix syslog delivery error occurred.

  • stemcell: bosh-softlayer-xen-ubuntu-trusty-go_agent v3445.2.1
  • bosh: v262.3

And the error details in /var/vcap/bosh/log/current:

2017-08-21_03:01:56.29799 [DelayedAuditLogger] 2017/08/21 03:01:56 ERROR - Unix syslog delivery error
2017-08-21_03:01:56.30244 panic: runtime error: invalid memory address or nil pointer dereference
2017-08-21_03:01:56.30246 [signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x573f60]
2017-08-21_03:01:56.30246
2017-08-21_03:01:56.30247 goroutine 18 [running]:
2017-08-21_03:01:56.30248 panic(0x8e2000, 0xc420014040)
2017-08-21_03:01:56.30248       /usr/local/go/src/runtime/panic.go:500 +0x1a1
2017-08-21_03:01:56.30248 github.com/cloudfoundry/bosh-agent/infrastructure.(*MultiSourceMetadataService).GetPublicKey(0xc4201ae960, 0xc4201a8600, 0xbe7be0, 0xc42019e3c0, 0xbed240)
2017-08-21_03:01:56.30249       /tmp/build/9674af12/gopath/src/github.com/cloudfoundry/bosh-agent/infrastructure/multi_source_metadata_service.go:17 +0x30
2017-08-21_03:01:56.30249 github.com/cloudfoundry/bosh-agent/infrastructure.ComplexSettingsSource.PublicSSHKeyForUsername(0xbeca60, 0xc4201ae960, 0xbe21a0, 0xc4201a2540, 0x97baaf, 0x15, 0xbed8e0, 0xc4200dda60, 0x96d569, 0x4, ...)
2017-08-21_03:01:56.30251       /tmp/build/9674af12/gopath/src/github.com/cloudfoundry/bosh-agent/infrastructure/complex_settings_source.go:31 +0x31
2017-08-21_03:01:56.30251 github.com/cloudfoundry/bosh-agent/infrastructure.(*ComplexSettingsSource).PublicSSHKeyForUsername(0xc4201aa540, 0x96d569, 0x4, 0xbe8fe0, 0xc4201aa3c0, 0x0, 0x0)
2017-08-21_03:01:56.30252       <autogenerated>:10 +0x86
2017-08-21_03:01:56.30252 github.com/cloudfoundry/bosh-agent/settings.(*settingsService).PublicSSHKeyForUsername(0xc4201b8180, 0x96d569, 0x4, 0x1d, 0xc41fff16fb, 0xc4201c6470, 0x4106ee)

The agent.json is below:

{
  "Platform": {
    "Linux": {
      "CreatePartitionIfNoEphemeralDisk": true,
      "ScrubEphemeralDisk": true
    }
  },
  "Infrastructure": {
    "Settings": {
      "Sources": [
        {
          "Type": "File",
          "SettingsPath": "/var/vcap/bosh/user_data.json"
        }
      ],
      "UseRegistry": true
    }
  }
}

And the traceback code in file_metadata_service.go is below:

func (ms fileMetadataService) GetPublicKey() (string, error) {
	var p PublicKeyContent

	contents, err := ms.fs.ReadFile(ms.settingsFilePath)
	if err != nil {
		return "", bosherr.WrapError(err, "Reading metadata file")
	}

	err = json.Unmarshal([]byte(contents), &p)
	if err != nil {
		return "", bosherr.WrapError(err, "Unmarshalling metadata")
	}

	return p.PublicKey, nil
}

And I find that user_data.json does not exist at that path. Is that right?
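
Judging from the trace, the panic happens in MultiSourceMetadataService.GetPublicKey before fileMetadataService is ever reached, which points at a nil selected service rather than at the missing file itself. A defensive sketch, assuming a hypothetical structure for the multi-source service:

// Hypothetical guard: return a clean error instead of dereferencing a nil
// selected metadata service when no source (e.g. user_data.json) is usable.
func (ms *MultiSourceMetadataService) GetPublicKey() (string, error) {
	if ms.selectedService == nil {
		return "", bosherr.Error("No usable metadata service")
	}
	return ms.selectedService.GetPublicKey()
}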

Thank you for your help.

Migrating data during disk size change is slow

Changing the disk size when the persistent volume contains a lot of data is very slow (e.g. migrating 600GB of data takes over 1 hour in our deployment).

To speed up the process, bosh-agent already uses two tar processes connected via a pipe to migrate data from one disk to the other during disk size changes:

 (tar -C /var/vcap/store -cf - .) | (tar -C /var/vcap/store_migration_target -xpf -)

The default tar parameters use ~10KB blocks (a blocking factor of 20 × 512-byte records). This causes a high number of IOPS that (likely) limits throughput.

strace output of one of the tar processes:

read(0, "*****"..., 10240) = 10240
write(4, "*****"..., 10240) = 10240
read(0, "*****"..., 10240) = 10240
write(4, "*****"..., 10240) = 10240
read(0, "*****"..., 10240) = 10240
write(4, "*****"..., 10240) = 10240

output of iostat 10:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.23    0.00    4.08   17.44    0.00   78.25

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
scd0              0.00         0.00         0.00          0          0
sda               1.80        13.60         9.20        136         92
sdb               0.00         0.00         0.00          0          0
sdc            1174.30    136634.40         0.00    1366344          0
sdd             353.30         4.40    179683.20         44    1796832

We propose using the tar -b flag for larger blocks and, if the pipe between the processes becomes the bottleneck, adding a buffer process between the two tars.

Using a 64K block size (tar's -b counts 512-byte records, so -b 128 = 64K, the default pipe capacity on Linux; this alone should significantly lower the number of IOPS):

(tar -C /var/vcap/store  -b 128 -cf - .) | (tar -C /var/vcap/store_migration_target -b 128 -xpf -)

Using a 512K block size (with a buffer in between to avoid blocking):

(tar -C /var/vcap/store -b 1024 -cf - .) | buffer | (tar -C /var/vcap/store_migration_target -b 1024 -xpf -)

Failed loading settings via fetcher: Getting settings from url: Get http://127.0.0.1:6901/instances/vm-12fbe1e9-1d54-46c6-baff-b1f3ccb4d0ab/settings: dial tcp 127.0.0.1:6901: getsockopt: connection refused

Hi all,
I encountered an error indicating that the persistent disk attached to the BOSH director didn't exist. I don't know what caused this; I hope somebody who knows the reason can help me. I have been blocked by this issue for a week. Thank you very much.
I have confirmed that the disk was created and attached to the VM. I think some failure contacting the registry to get the persistent disk info may have resulted in the error. But what caused this to occur, and is there a workaround for this issue?

Versions:
BOSH: 260
CPI: bosh-openstack-cpi v27
stemcell: openstack-kvm-3312.12

Detailed errors:
2017-03-08_07:09:22.26482 [MBus Handler] 2017/03/08 07:09:22 INFO - Responding
2017-03-08_07:09:22.26482 [MBus Handler] 2017/03/08 07:09:22 DEBUG - Payload
2017-03-08_07:09:22.26483 ********************
2017-03-08_07:09:22.26483 {"value":{"agent_task_id":"3c6c9e54-6b88-4792-462e-843146f175f6","state":"running"}}
2017-03-08_07:09:22.26483 ********************
2017-03-08_07:09:22.49887 [clientRetryable] 2017/03/08 07:09:22 DEBUG - [requestID=8871b890-8163-45fe-71c2-5f8134e96f40] Request succeeded (attempts=1), response: Response{ StatusCode: 200, Status: '200 OK' }
2017-03-08_07:09:22.50084 [settingsService] 2017/03/08 07:09:22 ERROR - Failed loading settings via fetcher: Getting settings from url: Get http://127.0.0.1:6901/instances/vm-12fbe1e9-1d54-46c6-baff-b1f3ccb4d0ab/settings: dial tcp 127.0.0.1:6901: getsockopt: connection refused
2017-03-08_07:09:22.50116 [File System] 2017/03/08 07:09:22 DEBUG - Reading file /var/vcap/bosh/settings.json
2017-03-08_07:09:22.50117 [File System] 2017/03/08 07:09:22 DEBUG - Read content
2017-03-08_07:09:22.50117 ********************
2017-03-08_07:09:22.50118 {"agent_id":"034c2844-3d8c-4d1a-797f-2371413fa00f","blobstore":{"provider":"local","options":{"blobstore_path":"/var/vcap/micro_bosh/data/cache"}},"disks":{"system":"/dev/sda","ephemeral":null,"persistent":{},"raw_ephemeral":null},"env":{"bosh":{"password":"","keep_root_password":false,"remove_dev_tools":false,"authorized_keys":null},"persistent_disk_fs":""},"networks":{"cf1":{"type":"manual","ip":"192.168.0.111","netmask":"255.255.255.0","gateway":"192.168.0.1","resolved":false,"use_dhcp":true,"default":["dns","gateway"],"dns":["192.168.0.5"],"mac":"fa:16:3e:e0:db:ad","preconfigured":false}},"ntp":["0.ntp"],"mbus":"https://vcap:[email protected]:6868","vm":{"name":"vm-12fbe1e9-1d54-46c6-baff-b1f3ccb4d0ab"}}
2017-03-08_07:09:22.50118 ********************
2017-03-08_07:09:22.50118 [settingsService] 2017/03/08 07:09:22 DEBUG - Successfully read settings from file
2017-03-08_07:09:22.50119 [Cmd Runner] 2017/03/08 07:09:22 DEBUG - Running command 'route -n'
2017-03-08_07:09:22.50157 [Cmd Runner] 2017/03/08 07:09:22 DEBUG - Stdout: Kernel IP routing table
2017-03-08_07:09:22.50158 Destination Gateway Genmask Flags Metric Ref Use Iface
2017-03-08_07:09:22.50158 0.0.0.0 192.168.0.1 0.0.0.0 UG 0 0 0 eth0
2017-03-08_07:09:22.50159 169.254.169.254 192.168.0.1 255.255.255.255 UGH 0 0 0 eth0
2017-03-08_07:09:22.50159 192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
2017-03-08_07:09:22.50159 [Cmd Runner] 2017/03/08 07:09:22 DEBUG - Stderr:
2017-03-08_07:09:22.50159 [Cmd Runner] 2017/03/08 07:09:22 DEBUG - Successful: true (0)
2017-03-08_07:09:22.50160 [Task Service] 2017/03/08 07:09:22 ERROR - Failed processing task #3c6c9e54-6b88-4792-462e-843146f175f6 got: Persistent disk with volume id '605a5d60-975d-4fae-a31d-d8203a8dbdc1' could not be found

Want to change the password of root and vcap separately

The setUserPasswords method changes the passwords of both root and vcap, but we want to change only the vcap password, or to set different passwords for vcap and root. Would it be possible to separate this method into two methods, setUserPasswords and setRootPasswords? Thanks. @maximilien @jianqiu

func (boot bootstrap) setUserPasswords(env boshsettings.Env) error {
    password := env.GetPassword()
    if password == "" {
        return nil
    }

    err := boot.platform.SetUserPassword(boshsettings.RootUsername, password)
    if err != nil {
        return bosherr.WrapError(err, "Setting root password")
    }

    err = boot.platform.SetUserPassword(boshsettings.VCAPUsername, password)
    if err != nil {
        return bosherr.WrapError(err, "Setting vcap password")
    }

    return nil
}
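
A sketch of the proposed split (method names illustrative; a separate root password would need a new, hypothetical env field):

// Hypothetical: one method per user, so callers can set them independently.
func (boot bootstrap) setVcapPassword(env boshsettings.Env) error {
	password := env.GetPassword()
	if password == "" {
		return nil
	}
	return boot.platform.SetUserPassword(boshsettings.VCAPUsername, password)
}

func (boot bootstrap) setRootPassword(env boshsettings.Env) error {
	password := env.GetPassword() // or a dedicated root password field
	if password == "" {
		return nil
	}
	return boot.platform.SetUserPassword(boshsettings.RootUsername, password)
}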

/tmp is not cleaned up during VM reboot

Due to the design change in this commit, the root_tmp dir is created with "mkdir -p", so if the root_tmp dir already exists it is not recreated; as a result, /tmp is not cleaned up during a VM reboot. Because of this change, our DEA (CF v235) can't work properly after a VM reboot, since the /tmp/warden dir is not removed. We know this has already been fixed on the CF release side, but I wonder whether it is the right behavior not to clean up /tmp after a VM reboot. Thanks.

The release info is not correct after "bosh upload release"

At first, the uploaded release info is correct (the same as specified in release.MF). After the director has been used for some days (and the BOSH DB grows larger), bosh upload release shows wrong release info. For example:

release info specified in release.MF:

name: cf-services-235012
version: ibm-v235.12
root@gubin-preyf-boshcli:~/releases/cf-services/ibm-v235.12# bosh upload release cf-services-release.tgz
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on 'bosh'

Verifying manifest...
Extract manifest                                             OK
Manifest exists                                              OK
Release name/version                                         OK

Checking jobs format                                         OK
Read job 'mysql_node_external' (1 of 4), version 9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b OK
Job 'mysql_node_external' checksum                           OK
Extract job 'mysql_node_external'                            OK
Read job 'mysql_node_external' manifest                      OK
Check template 'mysql_node_external_ctl' for 'mysql_node_external' OK
Check template 'mysql_node_external.yml.erb' for 'mysql_node_external' OK
Check template 'syslog_forwarder.conf.erb' for 'mysql_node_external' OK
Monit file for 'mysql_node_external'                         OK
Read job 'mysql_node' (2 of 4), version 6990d459b9db09f97ef7be23296f827866120745 OK
Job 'mysql_node' checksum                                    OK
Extract job 'mysql_node'                                     OK
Read job 'mysql_node' manifest                               OK
Check template 'mysql_node_ctl' for 'mysql_node'             OK
Check template 'mysql_worker_ctl' for 'mysql_node'           OK
Check template 'mysql_migration_util.erb' for 'mysql_node'   OK
Check template 'my.bootstrap.erb' for 'mysql_node'           OK
Check template 'my.shutdown.erb' for 'mysql_node'            OK
Check template 'my.cnf.erb' for 'mysql_node'                 OK
Check template 'mysql_ctl.erb' for 'mysql_node'              OK
Check template 'my55.bootstrap.erb' for 'mysql_node'         OK
Check template 'my55.shutdown.erb' for 'mysql_node'          OK
Check template 'my55.cnf.erb' for 'mysql_node'               OK
Check template 'mysql55_ctl.erb' for 'mysql_node'            OK
Check template 'create_mysql_tmp_dir.erb' for 'mysql_node'   OK
Check template 'mysql_node.yml.erb' for 'mysql_node'         OK
Check template 'mysql_worker.yml.erb' for 'mysql_node'       OK
Check template 'mysql_init.erb' for 'mysql_node'             OK
Check template 'mysql_backup.yml.erb' for 'mysql_node'       OK
Check template 'mysql_backup.cron.erb' for 'mysql_node'      OK
Check template 'mysql_backup.erb' for 'mysql_node'           OK
Check template 'syslog_forwarder.conf.erb' for 'mysql_node'  OK
Check template 'warden_ctl' for 'mysql_node'                 OK
Check template 'warden.yml' for 'mysql_node'                 OK
Check template 'warden_service_ctl' for 'mysql_node'         OK
Check template 'warden_mysql_init.erb' for 'mysql_node'      OK
Monit file for 'mysql_node'                                  OK
Read job 'rds_mysql_gateway' (3 of 4), version 5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284 OK
Job 'rds_mysql_gateway' checksum                             OK
Extract job 'rds_mysql_gateway'                              OK
Read job 'rds_mysql_gateway' manifest                        OK
Check template 'mysql_gateway_ctl' for 'rds_mysql_gateway'   OK
Check template 'mysql_gateway.yml.erb' for 'rds_mysql_gateway' OK
Check template 'syslog_forwarder.conf.erb' for 'rds_mysql_gateway' OK
Monit file for 'rds_mysql_gateway'                           OK
Read job 'mysql_gateway' (4 of 4), version 407fc635dbfc67cc3b1b1a27205a31424f16d5e7 OK
Job 'mysql_gateway' checksum                                 OK
Extract job 'mysql_gateway'                                  OK
Read job 'mysql_gateway' manifest                            OK
Check template 'mysql_gateway_ctl' for 'mysql_gateway'       OK
Check template 'mysql_gateway.yml.erb' for 'mysql_gateway'   OK
Check template 'syslog_forwarder.conf.erb' for 'mysql_gateway' OK
Monit file for 'mysql_gateway'                               OK

Release info
------------
Name:    cf-services-release
Version: 0+dev.17

Packages
  - mysql (74653bb6634a11dbd7310fb90e549e21497a6226)
  - libyaml (457456673cad30a6b3277daceefb310989a6f8db)
  - sqlite (80722951ac13d3323ff9d95d3faf3e955707a2d1)
  - mysql_node (98d7592aefb18410fc769a3950d5200d76add17b)
  - common (8ecced6383310492b543d2a5a3041410c7b33622)
  - rootfs_lucid64 (9b3f611b46e076b94b37645c98f9100e7bcef5dd)
  - ruby (2e75afc9b6a3646b31ad81b77e678af783f87b15)
  - mysqlclient (115688e79fe39a60e67866c29648e10d236cb95e)
  - ruby_next (2e75afc9b6a3646b31ad81b77e678af783f87b15)
  - mysql_gateway (eb525e5e56fc7763356a72eac64760b888dcc410)
  - syslog_aggregator (6c74ed236eaf0011ebeb97f2145ef932a9ea4d3a)
  - mysql55 (ab1c2a08c90b0ebcfbe7805686186b6c74cd8b05)

Jobs
  - mysql_node_external (9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b)
  - mysql_node (6990d459b9db09f97ef7be23296f827866120745)
  - rds_mysql_gateway (5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284)
  - mysql_gateway (407fc635dbfc67cc3b1b1a27205a31424f16d5e7)

License
  - license (51ef01ca8e29f17b39ffb6eb4d50d44267601c2c)

Checking if can repack release for faster upload...
mysql (74653bb6634a11dbd7310fb90e549e21497a6226) SKIP
libyaml (457456673cad30a6b3277daceefb310989a6f8db) SKIP
sqlite (80722951ac13d3323ff9d95d3faf3e955707a2d1) SKIP
mysql_node (98d7592aefb18410fc769a3950d5200d76add17b) SKIP
common (8ecced6383310492b543d2a5a3041410c7b33622) SKIP
rootfs_lucid64 (9b3f611b46e076b94b37645c98f9100e7bcef5dd) SKIP
ruby (2e75afc9b6a3646b31ad81b77e678af783f87b15) SKIP
mysqlclient (115688e79fe39a60e67866c29648e10d236cb95e) SKIP
ruby_next (2e75afc9b6a3646b31ad81b77e678af783f87b15) SKIP
mysql_gateway (eb525e5e56fc7763356a72eac64760b888dcc410) SKIP
syslog_aggregator (6c74ed236eaf0011ebeb97f2145ef932a9ea4d3a) SKIP
mysql55 (ab1c2a08c90b0ebcfbe7805686186b6c74cd8b05) SKIP
mysql_node_external (9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b) UPLOAD
mysql_node (6990d459b9db09f97ef7be23296f827866120745) UPLOAD
rds_mysql_gateway (5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284) UPLOAD
mysql_gateway (407fc635dbfc67cc3b1b1a27205a31424f16d5e7) UPLOAD
Release repacked (new size is 35.2K)

Uploading release
release-repac:  96% |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo    |  34.0KB 490.3KB/s ETA:  00:00:00
Director task 9486
  Started extracting release > Extracting release. Done (00:00:00)

  Started verifying manifest > Verifying manifest. Done (00:00:00)

  Started resolving package dependencies > Resolving package dependencies. Done (00:00:00)

  Started processing 12 existing packages > Processing 12 existing packages. Done (00:00:00)

  Started processing 4 existing jobs > Processing 4 existing jobs. Done (00:00:00)

  Started release has been created > cf-services-release/0+dev.17. Done (00:00:00)

Task 9486 done

Started     2016-08-15 07:12:41 UTC
Finished    2016-08-15 07:12:41 UTC
Duration    00:00:00
release-repac:  96% |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo    |  34.0KB  29.2KB/s Time: 00:00:01

Release uploaded

Please notice "Started release has been created > cf-services-release/0+dev.17. Done". The release name and version are not the same as specified in release.MF.

root@gubin-preyf-boshcli:~/releases/cf-services/ibm-v235.12# bosh releases
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on 'bosh'

+----------------------------------------+---------------+-------------+
| Name                                   | Versions      | Commit Hash |
+----------------------------------------+---------------+-------------+
| admin-ui                               | 1.5.0.131*    | 76483577+   |
| cf                                     | 1010-ibm-v235 | ed43e9d3+   |
| cf-235012                              | ibm-v235.12*  | ed43e9d3+   |
| cf-services-235007                     | ibm-v235.7*   | 746d1f0c+   |
| cf-services-contrib-235007             | ibm-v235.7*   | ac8b8c03+   |
| cf-services-release                    | 0+dev.17      | 746d1f0c+   |
| habr                                   | v1.14*        | eaf22bde    |
| ipsec-orchestrator                     | 3*            | c72a258b+   |
| loginserver                            | 2.10-184*     | b8910c70+   |
| marmot-logstash-forwarder-bosh-release | 0+dev.50*     | dc317458+   |
| mccp                                   | 13*           | 4f8f0b73+   |
| mod-vms                                | v0.0-51*      | 9cf70ee8+   |
| mrr_cons                               | 2.4.3-51*     | eb3a553f+   |
| mrr_prod                               | 2.4.3-51*     | eb3a553f+   |
| routerapi                              | 6*            | 00000000+   |
| security-release                       | v0.1-33       | 0666970d+   |
|                                        | v0.1-35*      | ed541d7b+   |
| service_proxy_release                  | v0.1-2        | 9f6a56a6+   |
|                                        | v0.1-3*       | 9f6a56a6+   |
| unbound                                | v0.2-2*       | e64e0a2e+   |
+----------------------------------------+---------------+-------------+
(*) Currently deployed
(+) Uncommitted changes

Releases total: 18

Then I created a clean new BOSH director and uploaded the same release. This time the uploaded release info is correct:

root@gubin-preyf-boshcli:~/releases/cf-services/ibm-v235.12# bosh upload release cf-services-release.tgz
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on 'bosh'

Verifying manifest...
Extract manifest                                             OK
Manifest exists                                              OK
Release name/version                                         OK

File exists and readable                                     OK
Read package 'mysql' (1 of 12)                               OK
Package 'mysql' checksum                                     OK
Read package 'libyaml' (2 of 12)                             OK
Package 'libyaml' checksum                                   OK
Read package 'sqlite' (3 of 12)                              OK
Package 'sqlite' checksum                                    OK
Read package 'mysql_node' (4 of 12)                          OK
Package 'mysql_node' checksum                                OK
Read package 'common' (5 of 12)                              OK
Package 'common' checksum                                    OK
Read package 'rootfs_lucid64' (6 of 12)                      OK
Package 'rootfs_lucid64' checksum                            OK
Read package 'ruby' (7 of 12)                                OK
Package 'ruby' checksum                                      OK
Read package 'mysqlclient' (8 of 12)                         OK
Package 'mysqlclient' checksum                               OK
Read package 'ruby_next' (9 of 12)                           OK
Package 'ruby_next' checksum                                 OK
Read package 'mysql_gateway' (10 of 12)                      OK
Package 'mysql_gateway' checksum                             OK
Read package 'syslog_aggregator' (11 of 12)                  OK
Package 'syslog_aggregator' checksum                         OK
Read package 'mysql55' (12 of 12)                            OK
Package 'mysql55' checksum                                   OK
Package dependencies                                         OK
Checking jobs format                                         OK
Read job 'mysql_node_external' (1 of 4), version 9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b OK
Job 'mysql_node_external' checksum                           OK
Extract job 'mysql_node_external'                            OK
Read job 'mysql_node_external' manifest                      OK
Check template 'mysql_node_external_ctl' for 'mysql_node_external' OK
Check template 'mysql_node_external.yml.erb' for 'mysql_node_external' OK
Check template 'syslog_forwarder.conf.erb' for 'mysql_node_external' OK
Job 'mysql_node_external' needs 'common' package             OK
Job 'mysql_node_external' needs 'mysql_node' package         OK
Job 'mysql_node_external' needs 'mysqlclient' package        OK
Job 'mysql_node_external' needs 'ruby' package               OK
Job 'mysql_node_external' needs 'ruby_next' package          OK
Job 'mysql_node_external' needs 'syslog_aggregator' package  OK
Monit file for 'mysql_node_external'                         OK
Read job 'mysql_node' (2 of 4), version 6990d459b9db09f97ef7be23296f827866120745 OK
Job 'mysql_node' checksum                                    OK
Extract job 'mysql_node'                                     OK
Read job 'mysql_node' manifest                               OK
Check template 'mysql_node_ctl' for 'mysql_node'             OK
Check template 'mysql_worker_ctl' for 'mysql_node'           OK
Check template 'mysql_migration_util.erb' for 'mysql_node'   OK
Check template 'my.bootstrap.erb' for 'mysql_node'           OK
Check template 'my.shutdown.erb' for 'mysql_node'            OK
Check template 'my.cnf.erb' for 'mysql_node'                 OK
Check template 'mysql_ctl.erb' for 'mysql_node'              OK
Check template 'my55.bootstrap.erb' for 'mysql_node'         OK
Check template 'my55.shutdown.erb' for 'mysql_node'          OK
Check template 'my55.cnf.erb' for 'mysql_node'               OK
Check template 'mysql55_ctl.erb' for 'mysql_node'            OK
Check template 'create_mysql_tmp_dir.erb' for 'mysql_node'   OK
Check template 'mysql_node.yml.erb' for 'mysql_node'         OK
Check template 'mysql_worker.yml.erb' for 'mysql_node'       OK
Check template 'mysql_init.erb' for 'mysql_node'             OK
Check template 'mysql_backup.yml.erb' for 'mysql_node'       OK
Check template 'mysql_backup.cron.erb' for 'mysql_node'      OK
Check template 'mysql_backup.erb' for 'mysql_node'           OK
Check template 'syslog_forwarder.conf.erb' for 'mysql_node'  OK
Check template 'warden_ctl' for 'mysql_node'                 OK
Check template 'warden.yml' for 'mysql_node'                 OK
Check template 'warden_service_ctl' for 'mysql_node'         OK
Check template 'warden_mysql_init.erb' for 'mysql_node'      OK
Job 'mysql_node' needs 'common' package                      OK
Job 'mysql_node' needs 'mysql_node' package                  OK
Job 'mysql_node' needs 'mysqlclient' package                 OK
Job 'mysql_node' needs 'mysql' package                       OK
Job 'mysql_node' needs 'mysql55' package                     OK
Job 'mysql_node' needs 'ruby' package                        OK
Job 'mysql_node' needs 'ruby_next' package                   OK
Job 'mysql_node' needs 'sqlite' package                      OK
Job 'mysql_node' needs 'syslog_aggregator' package           OK
Job 'mysql_node' needs 'rootfs_lucid64' package              OK
Monit file for 'mysql_node'                                  OK
Read job 'rds_mysql_gateway' (3 of 4), version 5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284 OK
Job 'rds_mysql_gateway' checksum                             OK
Extract job 'rds_mysql_gateway'                              OK
Read job 'rds_mysql_gateway' manifest                        OK
Check template 'mysql_gateway_ctl' for 'rds_mysql_gateway'   OK
Check template 'mysql_gateway.yml.erb' for 'rds_mysql_gateway' OK
Check template 'syslog_forwarder.conf.erb' for 'rds_mysql_gateway' OK
Job 'rds_mysql_gateway' needs 'common' package               OK
Job 'rds_mysql_gateway' needs 'mysql_gateway' package        OK
Job 'rds_mysql_gateway' needs 'mysqlclient' package          OK
Job 'rds_mysql_gateway' needs 'ruby' package                 OK
Job 'rds_mysql_gateway' needs 'sqlite' package               OK
Job 'rds_mysql_gateway' needs 'syslog_aggregator' package    OK
Monit file for 'rds_mysql_gateway'                           OK
Read job 'mysql_gateway' (4 of 4), version 407fc635dbfc67cc3b1b1a27205a31424f16d5e7 OK
Job 'mysql_gateway' checksum                                 OK
Extract job 'mysql_gateway'                                  OK
Read job 'mysql_gateway' manifest                            OK
Check template 'mysql_gateway_ctl' for 'mysql_gateway'       OK
Check template 'mysql_gateway.yml.erb' for 'mysql_gateway'   OK
Check template 'syslog_forwarder.conf.erb' for 'mysql_gateway' OK
Job 'mysql_gateway' needs 'common' package                   OK
Job 'mysql_gateway' needs 'mysql_gateway' package            OK
Job 'mysql_gateway' needs 'mysqlclient' package              OK
Job 'mysql_gateway' needs 'ruby' package                     OK
Job 'mysql_gateway' needs 'sqlite' package                   OK
Job 'mysql_gateway' needs 'syslog_aggregator' package        OK
Monit file for 'mysql_gateway'                               OK

Release info
------------
Name:    cf-services-release
Version: 0+dev.17

Packages
  - mysql (74653bb6634a11dbd7310fb90e549e21497a6226)
  - libyaml (457456673cad30a6b3277daceefb310989a6f8db)
  - sqlite (80722951ac13d3323ff9d95d3faf3e955707a2d1)
  - mysql_node (98d7592aefb18410fc769a3950d5200d76add17b)
  - common (8ecced6383310492b543d2a5a3041410c7b33622)
  - rootfs_lucid64 (9b3f611b46e076b94b37645c98f9100e7bcef5dd)
  - ruby (2e75afc9b6a3646b31ad81b77e678af783f87b15)
  - mysqlclient (115688e79fe39a60e67866c29648e10d236cb95e)
  - ruby_next (2e75afc9b6a3646b31ad81b77e678af783f87b15)
  - mysql_gateway (eb525e5e56fc7763356a72eac64760b888dcc410)
  - syslog_aggregator (6c74ed236eaf0011ebeb97f2145ef932a9ea4d3a)
  - mysql55 (ab1c2a08c90b0ebcfbe7805686186b6c74cd8b05)

Jobs
  - mysql_node_external (9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b)
  - mysql_node (6990d459b9db09f97ef7be23296f827866120745)
  - rds_mysql_gateway (5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284)
  - mysql_gateway (407fc635dbfc67cc3b1b1a27205a31424f16d5e7)

License
  - license (51ef01ca8e29f17b39ffb6eb4d50d44267601c2c)

Checking if can repack release for faster upload...
mysql (74653bb6634a11dbd7310fb90e549e21497a6226) UPLOAD
libyaml (457456673cad30a6b3277daceefb310989a6f8db) UPLOAD
sqlite (80722951ac13d3323ff9d95d3faf3e955707a2d1) UPLOAD
mysql_node (98d7592aefb18410fc769a3950d5200d76add17b) UPLOAD
common (8ecced6383310492b543d2a5a3041410c7b33622) UPLOAD
rootfs_lucid64 (9b3f611b46e076b94b37645c98f9100e7bcef5dd) UPLOAD
ruby (2e75afc9b6a3646b31ad81b77e678af783f87b15) UPLOAD
mysqlclient (115688e79fe39a60e67866c29648e10d236cb95e) UPLOAD
ruby_next (2e75afc9b6a3646b31ad81b77e678af783f87b15) UPLOAD
mysql_gateway (eb525e5e56fc7763356a72eac64760b888dcc410) UPLOAD
syslog_aggregator (6c74ed236eaf0011ebeb97f2145ef932a9ea4d3a) UPLOAD
mysql55 (ab1c2a08c90b0ebcfbe7805686186b6c74cd8b05) UPLOAD
mysql_node_external (9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b) UPLOAD
mysql_node (6990d459b9db09f97ef7be23296f827866120745) UPLOAD
rds_mysql_gateway (5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284) UPLOAD
mysql_gateway (407fc635dbfc67cc3b1b1a27205a31424f16d5e7) UPLOAD
Uploading the whole release

Uploading release
cf-services-r:  96% |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo    | 308.4MB  22.4MB/s ETA:  00:00:00
Director task 2
  Started extracting release > Extracting release. Done (00:00:03)

  Started verifying manifest > Verifying manifest. Done (00:00:00)

  Started resolving package dependencies > Resolving package dependencies. Done (00:00:00)

  Started creating new packages
  Started creating new packages > mysql/74653bb6634a11dbd7310fb90e549e21497a6226. Done (00:00:01)
  Started creating new packages > libyaml/457456673cad30a6b3277daceefb310989a6f8db. Done (00:00:00)
  Started creating new packages > sqlite/80722951ac13d3323ff9d95d3faf3e955707a2d1. Done (00:00:00)
  Started creating new packages > mysql_node/98d7592aefb18410fc769a3950d5200d76add17b. Done (00:00:00)
  Started creating new packages > common/8ecced6383310492b543d2a5a3041410c7b33622. Done (00:00:00)
  Started creating new packages > rootfs_lucid64/9b3f611b46e076b94b37645c98f9100e7bcef5dd. Done (00:00:03)
  Started creating new packages > ruby/2e75afc9b6a3646b31ad81b77e678af783f87b15. Done (00:00:00)
  Started creating new packages > mysqlclient/115688e79fe39a60e67866c29648e10d236cb95e. Done (00:00:00)
  Started creating new packages > ruby_next/2e75afc9b6a3646b31ad81b77e678af783f87b15. Done (00:00:00)
  Started creating new packages > mysql_gateway/eb525e5e56fc7763356a72eac64760b888dcc410. Done (00:00:01)
  Started creating new packages > syslog_aggregator/6c74ed236eaf0011ebeb97f2145ef932a9ea4d3a. Done (00:00:00)
  Started creating new packages > mysql55/ab1c2a08c90b0ebcfbe7805686186b6c74cd8b05. Done (00:00:01)
     Done creating new packages (00:00:06)

  Started creating new jobs
  Started creating new jobs > mysql_node_external/9b14b2fdfb38f47e5c25d86c3e164ebaf1b77e3b. Done (00:00:00)
  Started creating new jobs > mysql_node/6990d459b9db09f97ef7be23296f827866120745. Done (00:00:00)
  Started creating new jobs > rds_mysql_gateway/5f7a3e38d1675d31f6cc0dcaf85bd7c940b8a284. Done (00:00:00)
  Started creating new jobs > mysql_gateway/407fc635dbfc67cc3b1b1a27205a31424f16d5e7. Done (00:00:00)
     Done creating new jobs (00:00:00)

  Started release has been created > cf-services-235012/ibm-v235.12. Done (00:00:00)

Task 2 done

Started     2016-08-15 06:52:12 UTC
Finished    2016-08-15 06:52:21 UTC
Duration    00:00:09
cf-services-r:  96% |oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo    | 309.1MB  13.3MB/s Time: 00:00:23

Release uploaded

Please notice the line "Started release has been created > cf-services-235012/ibm-v235.12. Done". This time the release name and version are correct.

root@gubin-preyf-boshcli:~/releases/cf-services/ibm-v235.12# bosh releases
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on 'bosh'

+--------------------+-------------+-------------+
| Name               | Versions    | Commit Hash |
+--------------------+-------------+-------------+
| cf-235012          | ibm-v235.12 | ed43e9d3+   |
| cf-services-235012 | ibm-v235.12 | 746d1f0c+   |
+--------------------+-------------+-------------+
(+) Uncommitted changes

Releases total: 2

Not sure if this issue is related to the bosh db growing larger.

/cc @maximilien @mattcui @jianqiu

bosh agent rewrites partition mistakenly during VM reboot

We found that bosh agent rewrites the partition table every time the VM is rebooted, even if the disk is already partitioned and formatted. In the code that determines whether partitioning is needed, there is a problem with how the device size is obtained at the line size, err := p.GetDeviceSizeInBytes(partitionPath)

Running this line returns the following error:

2016-08-31_10:01:11.18516 [Cmd Runner] 2016/08/31 10:01:11 DEBUG - Running command 'sfdisk -s /dev/mapper/3600a09803830357a392448666a38304f1'
2016-08-31_10:01:11.18813 [Cmd Runner] 2016/08/31 10:01:11 DEBUG - Stdout:
2016-08-31_10:01:11.18814 [Cmd Runner] 2016/08/31 10:01:11 DEBUG - Stderr: /dev/mapper/3600a09803830357a392448666a38304f1: No such file or directory
2016-08-31_10:01:11.18815
2016-08-31_10:01:11.18815 sfdisk: cannot open /dev/mapper/3600a09803830357a392448666a38304f1 for reading

As a result, the variable size is never assigned and keeps its initial value of 0. The simplest way to fix the issue would be to change partitionPath to devicePath in the line size, err := p.GetDeviceSizeInBytes(partitionPath), but I am not sure whether that is a good way to fix it. It would be great if the BOSH team could comment/advise.
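A minimal sketch of that suggestion, assuming a narrow interface exposing GetDeviceSizeInBytes (the surrounding logic is condensed here; the real code lives in platform/linux_platform.go):

// devicePathSizer mirrors the one method needed from the platform code.
type devicePathSizer interface {
	GetDeviceSizeInBytes(path string) (uint64, error)
}

// needsRepartitioning sizes the whole device instead of the partition
// link, which may not exist yet right after a reboot; sizing the missing
// partitionPath is what leaves size at 0 and triggers the rewrite.
func needsRepartitioning(p devicePathSizer, devicePath string, desiredBytes uint64) (bool, error) {
	size, err := p.GetDeviceSizeInBytes(devicePath) // was: partitionPath
	if err != nil {
		return false, err
	}
	return size != desiredBytes, nil
}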

This issue is already under discussion with @cppforlife, @dpb587-pivotal and @maximilien; I raised it here for better tracking.

bosh agent resets /tmp dir access to 0770 on restart

Hello Team,
we are facing issues with insufficient access rights on /tmp, which is set to 0770 by default (and /var/tmp to 0700).

We figured out that the following line calls a function that sets the access to /tmp to 0770:
https://github.com/cloudfoundry/bosh-agent/blob/master/platform/linux_platform.go#L663

func (p linux) changeTmpDirPermissions(path string) error {
	_, _, _, err := p.cmdRunner.RunCommand("chown", "root:vcap", path)
	if err != nil {
		return bosherr.WrapErrorf(err, "chown %s", path)
	}

	_, _, _, err = p.cmdRunner.RunCommand("chmod", "0770", path)
	if err != nil {
		return bosherr.WrapErrorf(err, "chmod %s", path)
	}

	return nil
}

Many system-related tasks require 1777 access to /tmp though; for example, we rely on it for our DB backup jobs and for running ssh-agent.

Could you please clarify the rationale behind the decision to set /tmp to 0770, and in which cases the permissions are automatically reset besides restarts?

We just faced an incident in which hundreds of our VMs had the access rights on /tmp reset to 770, and we are in the dark about what caused it (in this case certainly not a restart).

Thanks!

bosh-blobstore-dav omits 'Content-Length' header and is missing error handling

After compiling packages in a deploy, the agent uses the blobstore client to upload the compiled package to the blob store. When the target blobstore is dav backed by an older version of nginx, nginx rejects the request with a 411 'Length Required' response code.

The client never checks the response code (Go's http client does not error on a non-200 response), so it exits with return code 0, making the agent believe the upload was successful. Eventually the director gets the message that the compilation was successful and updates the bosh db with the blobstore ID even though the blob isn't there; things fall apart rapidly from then on.

Here's a snippet of an strace when running the client:

connect(5, {sa_family=AF_INET, sin_port=htons(21081), sin_addr=inet_addr("192.168.50.4")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_wait(6, {{EPOLLOUT, {u32=2852779432, u64=139924297348520}}}, 128, 0) = 1
connect(5, {sa_family=AF_INET, sin_port=htons(21081), sin_addr=inet_addr("192.168.50.4")}, 16) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(51230), sin_addr=inet_addr("10.244.0.146")}, [16]) = 0
getpeername(5, {sa_family=AF_INET, sin_port=htons(21081), sin_addr=inet_addr("192.168.50.4")}, [16]) = 0
setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
read(5, 0xc21006d000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
read(4, "\246", 1)                      = 1
read(4, "\333\225Se\222[\371\214*\263\337\273\3302\341\322\16P\201\10\21\23\266;\0001'\276\217,\360\255"..., 32768) = 32768
write(5, "PUT /20/903af504-0b16-45a3-57f0-"..., 4096) = 4096
write(5, "a\301\356\204\250k\300\373\261\1G9\2\203C.\370\203\n:\360t\30\334E\334g\25K,\314\17"..., 28898) = 8688
write(5, "\235-\221\217t\353\177Zir{\25B\240\205 \5\277T\253\f\315!L\26{/\202\244\0077#"..., 20210) = 15928
write(5, "\333a\215\331U\27\304Ra\333\250\227\311\365\255b\\\242\311n\227\326\6\375)\32\267\5\243\300\"\303"..., 4282) = 4282
read(4, "\250\331v\203\354\32\244vC\202\354\364\211\234&t\vd\277=\256\344A2@P\256x\255\224\220s"..., 32768) = 32768
write(5, "\r\n8000\r\n\250\331v\203\354\32\244vC\202\354\364\211\234&t\vd\277=\256\344A2"..., 4096) = 4096
write(5, "\33\262\301\16\260\342~\26\366\355\351U\301\322\303`}\273`\204O\315\363+h\363b<\221'\264\301"..., 28680) = 18824
write(5, "\354\362\312\240\311h^\357\344\21\323\247\220\273\254\340|*@\346\310\"C\351$\352\236\242\3730\244j"..., 9856) = 9856
read(4, "|\257\216\275Y\253k\211G;\4\n\367\200nll\361\360-\334\273\220\7\356J/\35E\305L\275"..., 32768) = 32768
write(5, "\r\n8000\r\n|\257\216\275Y\253k\211G;\4\n\367\200nll\361\360-\334\273\220\7"..., 4096) = 4096
write(5, "\226\304\336\3'\242\217fgW\22\343\375\t5\242Y\362f`#\304m\303\321\20\325\31\1\f\233\260"..., 28680) = 17376
write(5, "\262w\317\265\225\364O\212M3Z\300\222\23\335\201X\373U\3663\331\177\375\257\236S-\257y\356\203"..., 11304) = 11304
read(4, "h\24NXu\r\355\234\23x\303\250{\356\201\376,-\333\327x\220Y*\260\2466\27\262\367\222\276"..., 32768) = 4095
write(5, "\r\nfff\r\nh\24NXu\r\355\234\23x\303\250{\356\201\376,-\333\327x\220Y*\260"..., 4096) = 4096
read(4, "", 32768)                      = 0
close(4)                                = 0
write(5, "n\201\373{\344{\r\n0\r\n\r\n", 13) = 13
read(5, "HTTP/1.1 411 Length Required\r\nSe"..., 4096) = 323
exit_group(0)                           = ?
+++ exited with 0 +++

The blobstore clients shouldn't be returning a 0 return code when the blobs were not successfully uploaded.
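A minimal sketch of the missing check (checkPutResponse is an illustrative name, not the client's real API); Go's http.Client only errors on transport failures, so a 411 comes back as a normal response and must be inspected explicitly:

import (
	"fmt"
	"net/http"
)

// checkPutResponse turns any non-2xx DAV response (such as nginx's
// 411 Length Required) into an error instead of a silent success.
func checkPutResponse(resp *http.Response) error {
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("blobstore PUT failed: %s", resp.Status)
	}
	return nil
}

Separately, setting Request.ContentLength on the PUT, rather than streaming with chunked Transfer-Encoding (visible in the strace above as the \r\n8000\r\n chunk markers), should make older nginx versions accept the upload.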

bosh agent cert is expired

Hi all,
the agent.cert expired on 2016-12-01. Can someone help update it?
root@1ab957b6-289c-405b-565b-ee6edc242a8a:/var/vcap/bosh# openssl x509 -in /var/vcap/bosh/agent.cert -noout -text
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 0 (0x0)
Signature Algorithm: sha1WithRSAEncryption
Issuer: C=US, O=Pivotal, CN=localhost
Validity
Not Before: Dec 1 22:11:32 2013 GMT
Not After : Dec 1 22:11:32 2016 GMT

Subject: C=US, O=Pivotal, CN=localhost
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
Public-Key: (2048 bit)

https://github.com/cloudfoundry/bosh-agent/blob/master/httpsdispatcher/agent.cert

Drain/stop scripts should be run on OS shutdown

In case the OS is requested to shut down (e.g. by the underlying IaaS), it would be appropriate for the drain scripts and monit stop to be called, to at least try to limit the impact of the VM shutting down.

While this is unlikely to be useful on normal public cloud deployments, it is a scenario that happens in practice on prem. It may also be useful for e.g. AWS spot instances.

VM swap size should be configurable

Currently, bosh hard-codes the size of a VM swap partition to be equal to the size of RAM or half the ephemeral disk:

https://github.com/cloudfoundry/bosh-agent/blob/255.x/platform/linux_platform.go#L997-L1014

For large amounts of memory with relatively small ephemeral disks this is a problem.
In my case I have 32G RAM with a 100G ephemeral disk. With all other overhead this cuts my usable ephemeral disk down to about 65G, which is too low for my use cases. Also: who needs 32G of swap?
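For reference, the hard-coded behavior boils down to roughly this (a simplified sketch of the linked code, not the exact implementation):

// calculateSwapBytes reflects the current rule: swap equals RAM when the
// ephemeral disk is large enough, otherwise half the ephemeral disk.
func calculateSwapBytes(memBytes, ephemeralDiskBytes uint64) uint64 {
	if ephemeralDiskBytes > memBytes*2 {
		return memBytes
	}
	return ephemeralDiskBytes / 2
}

With 32G RAM and a 100G ephemeral disk this yields a 32G swap partition, which matches the numbers above.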

So my suggestion is to introduce a parameter for resource pools that allows CPIs to set the swap size, just like memory. It would also be nice to be able to set the kernel's swappiness, as for high-performance workloads it is generally undesirable to swap at all.

Unreachable NATS leads to unreasonable amount of HTTP metadata service requests

When the BOSH NATS isn't available for some reason (Director update, network problems, etc.), the agent exits and restarts every few seconds, because heartbeats cannot be sent to the Health Monitor (HM).
Here is an example from the agent logs that we see every few seconds:

2017-02-14_13:15:41.94560 [main] 2017/02/14 13:15:41 DEBUG - Starting agent
2017-02-14_13:15:41.94565 [File System] 2017/02/14 13:15:41 DEBUG - Reading file /var/vcap/bosh/agent.json
2017-02-14_13:15:41.94566 [File System] 2017/02/14 13:15:41 DEBUG - Read content
2017-02-14_13:15:41.94566 ********************
--
2017-02-14_13:15:55.06415 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Stdout:
2017-02-14_13:15:55.06415 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Stderr: SIOCDARP(dontpub): Network is unreachable
2017-02-14_13:15:55.06416 [Cmd Runner] 2017/02/14 13:15:55 DEBUG - Successful: false (255)
2017-02-14_13:15:55.06416 [NATS Handler] 2017/02/14 13:15:55 ERROR - Cleaning ip-mac address cache for: 192.168.1.11
2017-02-14_13:15:55.06617 [main] 2017/02/14 13:15:55 ERROR - App run Running agent: Message Bus Handler: Starting nats handler: Connecting: dial tcp 192.168.1.11:4222: getsockopt: connection refused
2017-02-14_13:15:55.06618 [main] 2017/02/14 13:15:55 ERROR - Agent exited with error: Running agent: Message Bus Handler: Starting nats handler: Connecting: dial tcp 192.168.1.11:4222: getsockopt: connection refused

Each agent startup makes 4 (or 5?) calls to the metadata service, which adds up to a significant load for large CF installations using the HTTP metadata service: multiply that by the number of VMs and by the Director downtime during a bosh-init update.

I'm open for suggestions how to approach this issue. Possible workarounds are:

  • using config-drive instead of HTTP metadata service (disk reads are probably less of a problem than HTTP access). However, there might be other reasons why people chose HTTP metadata service over config-drive.
  • Installing NATS separately from the Director to reduce downtime. Possibly clustered? Not sure what the state is here?
  • Some kind of exponential backoff in the agent instead of just exiting when NATSHandler cannot connect. This would avoid the frequent metadata access caused by NATS being unavailable (see the sketch after this list)
  • something else?

I'd prefer the exponential backoff solution; what do you think?
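To illustrate, a sketch of what the backoff could look like (connectWithBackoff is a hypothetical helper, not the agent's actual code):

import (
	"log"
	"time"
)

// connectWithBackoff keeps retrying connect with exponentially growing
// delays (capped at maxDelay), so a Director outage causes a handful of
// metadata reads instead of a steady stream of them.
func connectWithBackoff(connect func() error, maxDelay time.Duration) {
	delay := time.Second
	for {
		err := connect()
		if err == nil {
			return
		}
		log.Printf("NATS connect failed: %s; retrying in %s", err, delay)
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}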

bosh agent reports persistent disk usage with no persistent disk

Sometimes, likely depending on the jobs configured on the VM, bosh agent reports persistent disk usage in the heartbeats even though no persistent disk is mounted. The reported persistent disk statistics are then actually the stats of the root disk (because the /var/vcap/store dir is still present on the root disk).

e.g.

+------------------------------------+---------+---------------+------------+-----------------------+------+------+------+--------------+------------+------------+------------+------------+
| Job/index                          | State   | Resource Pool | IPs        |         Load          | CPU  | CPU  | CPU  | Memory Usage | Swap Usage | System     | Ephemeral  | Persistent |
|                                    |         |               |            | (avg01, avg05, avg15) | User | Sys  | Wait |              |            | Disk Usage | Disk Usage | Disk Usage |
+------------------------------------+---------+---------------+------------+-----------------------+------+------+------+--------------+------------+------------+------------+------------+
| api_worker_z1/0                    | running | small_z1      | 10.0.10.44 | 0.00, 0.03, 0.05      | 0.0% | 0.0% | 0.2% | 22% (213.3M) | 1% (14.6M) | 46%        | 4%         | 46%        |
| api_z1/0                           | running | large_z1      | 10.0.10.42 | 0.00, 0.01, 0.05      | 0.2% | 0.2% | 0.1% | 13% (525.3M) | 0% (0B)    | 46%        | 4%         | 46%        |
| clock_global/0                     | running | medium_z1     | 10.0.10.43 | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.1% | 20% (202.0M) | 0% (0B)    | 46%        | 4%         | n/a        |
| consul_z1/0                        | running | medium_z1     | 10.0.10.37 | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.0% | 6% (62.5M)   | 0% (0B)    | 46%        | 1%         | 0%         |
| doppler_z1/0                       | running | medium_z1     | 10.0.10.47 | 0.00, 0.02, 0.05      | 0.2% | 0.0% | 0.6% | 7% (64.7M)   | 0% (0B)    | 46%        | 1%         | n/a        |
| elasticsearch_master_z1/0          | running | small_z1      | 10.0.10.30 | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.2% | 48% (471.2M) | 0% (0B)    | 46%        | 3%         | 0%         |
| etcd_z1/0                          | running | medium_z1     | 10.0.10.20 | 0.07, 0.16, 0.16      | 0.1% | 0.1% | 6.0% | 9% (87.6M)   | 0% (0B)    | 46%        | 1%         | 0%         |
| hm9000_z1/0                        | running | medium_z1     | 10.0.10.45 | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.2% | 11% (110.7M) | 0% (0B)    | 46%        | 1%         | n/a        |
| loggregator_trafficcontroller_z1/0 | running | small_z1      | 10.0.10.48 | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.0% | 6% (61.1M)   | 0% (0B)    | 46%        | 0%         | n/a        |
| nats_z1/0                          | running | medium_z1     | 10.0.10.11 | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.0% | 8% (82.7M)   | 0% (0B)    | 46%        | 2%         | n/a        |
| nats_z2/0                          | running | medium_z2     | 10.0.11.11 | 0.00, 0.02, 0.05      | 0.1% | 0.1% | 0.5% | 9% (86.9M)   | 0% (0B)    | 46%        | 2%         | n/a        |
| postgres/0                         | running | medium_z1     | 10.0.10.38 | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.2% | 6% (58.3M)   | 0% (0B)    | 46%        | 1%         | 0%         |
| router_z1/0                        | running | router_z1     | 10.0.10.15 | 0.00, 0.01, 0.05      | 0.4% | 0.2% | 0.2% | 7% (72.6M)   | 0% (0B)    | 46%        | 1%         | 46%        |
| runner_z1/0                        | running | runner_z1     | 10.0.10.46 | 0.00, 0.01, 0.05      | 0.1% | 0.1% | 0.1% | 2% (278.7M)  | 0% (0B)    | 46%        | 3%         | n/a        |
| runner_z2/0                        | running | runner_z2     | 10.0.11.41 | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.1% | 2% (272.5M)  | 0% (0B)    | 46%        | 3%         | n/a        |
| uaa_z1/0                           | running | medium_z1     | 10.0.10.41 | 0.00, 0.01, 0.05      | 1.1% | 0.0% | 0.2% | 37% (370.8M) | 0% (0B)    | 46%        | 5%         | 46%        |
+------------------------------------+---------+---------------+------------+-----------------------+------+------+------+--------------+------------+------------+------------+------------+

All the VMs showing 46% persistent disk usage have no persistent disk and should report n/a instead. Looking at the bosh agent logs on one of these VMs, the agent initially doesn't report disk statistics; after the jobs are deployed, it does. This is the last log mentioning persistent disk pools before the agent starts reporting persistent disk usage:

{
    "arguments": [
           {
            ...
            "persistent_disk": 0,
            "persistent_disk_pool": {
                "cloud_properties": {},
                "disk_size": 0,
                "name": "2b14b466-cf9b-447f-ba8b-e8d6c2c014d4"
            },
            ...
            }
}

Curiously, there are similar persistent_disk_pool messages in the logs on the machines which correctly report no persistent disk. I was not able to further determine why the issue happens on some VMs and not on others; the only difference seems to be the jobs configured. We are using standard cf-release for AWS here, BOSH release v207, stemcell 3074.

LinuxOptions.BindMountPersistentDisk used for tmpfs mounts

The agent settings for the linux platform include the following:

	// When set to true persistent disk will be mounted as a bind-mount
	BindMountPersistentDisk bool

When that option is enabled, the agent uses the linuxBindMounter for more than just persistent disks. In my case, it attempts to create and mount a tmpfs at /var/vcap/data/sys/run with the --bind option; this doesn't work:

2016-11-17_19:56:25.34094 [main] 2016/11/17 19:56:25 ERROR - App setup Running bootstrap: Setting up data dir: Mounting tmpfs to /var/vcap/data/sys/run: Shelling out to mount: Running command: 'mount tmpfs /var/vcap/data/sys/run -t tmpfs -o size=1m --bind', stdout: '', stderr: 'mount: special device tmpfs does not exist
2016-11-17_19:56:25.34095 ': exit status 32
2016-11-17_19:56:25.34098 [main] 2016/11/17 19:56:25 ERROR - Agent exited with error: Running bootstrap: Setting up data dir: Mounting tmpfs to /var/vcap/data/sys/run: Shelling out to mount: Running command: 'mount tmpfs /var/vcap/data/sys/run -t tmpfs -o size=1m --bind', stdout: '', stderr: 'mount: special device tmpfs does not exist
2016-11-17_19:56:25.34099 ': exit status 32

Mounting without the bind option works:

/:/var/vcap/bosh# mount tmpfs /var/vcap/data/sys/run -t tmpfs -o size=1m       
/:/var/vcap/bosh# ls -l /var/vcap/data/sys/
total 4
drwxr-x--- 2 root vcap 4096 Nov 17 19:53 log
drwxrwxrwt 2 root root   40 Nov 17 21:52 run
/:/var/vcap/bosh#                                                       

bosh-agent ignores use_dhcp setting when attach multiple networks

I'm using bosh-openstack-kvm-ubuntu-trusty-go_agent v3181 on OpenStack.
When I attach multiple networks to a VM with the use_dhcp: false option, DHCP is not disabled.

The following is the bosh-agent log; you can see two networks attached, each with DHCP disabled (fig1).

But bosh-init uses DHCP networking mode because it does not extract the network settings (fig2).

P.S. It works correctly when only one network is attached.

Stemcell 3262.14 breaks persistent disk formatting on Openstack

It seems that stemcell 3262.14 introduced a newer version of bosh-agent that fails to mount/format persistent disks on Openstack.
I was trying to deploy the latest bosh release and stemcell with bosh-init, but it fails when formatting the partition.
I tried various combinations: bosh 257.9 and 257.14 both work with stemcell versions 3262.12 or earlier; starting with 3262.14 it fails.

output of bosh-init 0.0.96:

Started deploying
  Creating VM for instance 'bosh/0' from stemcell 'c3c0b4d4-a0d9-40ee-8324-fc034e035a64'... Finished (00:01:41)
  Waiting for the agent on VM '8895e425-78d9-4cb7-8b70-7190a4d69ca8' to be ready... Finished (00:00:18)
  Creating disk... Finished (00:00:07)
  Attaching disk '7c6dadc7-3231-4f60-a327-1245c895f2ab' to VM '8895e425-78d9-4cb7-8b70-7190a4d69ca8'... Failed (00:00:19)
Failed deploying (00:02:26)

Stopping registry... Finished (00:00:00)
Cleaning up rendered CPI jobs... Finished (00:00:00)

Command 'deploy' failed:
  Deploying:
    Creating instance 'bosh/0':
      Updating instance disks:
        Updating disks:
          Deploying disk:
            Mounting disk:
              Sending 'get_task' to the agent:
                Agent responded with error: Action Failed get_task: Task d5818063-217f-41e3-5110-01a5539d86d2 result: Mounting persistent disk: Formatting partition with : Checking filesystem format of partition: Running command: 'blkid -p /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1', stdout: '', stderr: 'error: /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1: No such file or directory

how it looks on the VM:

root@0c5397b0-7010-4b03-45f5-87f33a0a5eae:/dev/disk/by-id# ll
total 0
drwxr-xr-x 2 root root  80 Sep 18 08:33 ./
drwxr-xr-x 5 root root 100 Sep 18 08:33 ../
lrwxrwxrwx 1 root root   9 Sep 18 08:33 virtio-7c6dadc7-3231-4f60-a -> ../../vdc
lrwxrwxrwx 1 root root  10 Sep 18 08:33 virtio-7c6dadc7-3231-4f60-a-part1 -> ../../vdc1

relevant part of bosh-agent log:

2016-09-18_08:33:14.04760 [File System] 2016/09/18 08:33:14 DEBUG - Glob '/dev/disk/by-id/*7c6dadc7-3231-4f60-a'
2016-09-18_08:33:14.04771 [File System] 2016/09/18 08:33:14 DEBUG - Checking if file exists /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a
2016-09-18_08:33:14.04775 [virtioDevicePathResolver] 2016/09/18 08:33:14 DEBUG - Resolved disk {ID:7c6dadc7-3231-4f60-a327-1245c895f2ab DeviceID: VolumeID:/dev/sdc Lun: HostDeviceID: Path:/dev/sdc FileSystemType:} by ID as '/dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a'
2016-09-18_08:33:14.04777 [File System] 2016/09/18 08:33:14 DEBUG - Reading file /proc/mounts
2016-09-18_08:33:14.04792 [File System] 2016/09/18 08:33:14 DEBUG - Read content
2016-09-18_08:33:14.04793 ********************
2016-09-18_08:33:14.04793 sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
2016-09-18_08:33:14.04793 proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
2016-09-18_08:33:14.04793 udev /dev devtmpfs rw,relatime,size=4077380k,nr_inodes=1019345,mode=755 0 0
2016-09-18_08:33:14.04794 devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
2016-09-18_08:33:14.04794 tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=817588k,mode=755 0 0
2016-09-18_08:33:14.04794 /dev/vda1 / ext4 rw,relatime,data=ordered 0 0
2016-09-18_08:33:14.04794 none /var/lib/ureadahead/debugfs debugfs rw,relatime 0 0
2016-09-18_08:33:14.04795 none /sys/fs/cgroup tmpfs rw,relatime,size=4k,mode=755 0 0
2016-09-18_08:33:14.04795 none /sys/fs/fuse/connections fusectl rw,relatime 0 0
2016-09-18_08:33:14.04795 none /sys/kernel/debug debugfs rw,relatime 0 0
2016-09-18_08:33:14.04795 none /sys/kernel/security securityfs rw,relatime 0 0
2016-09-18_08:33:14.04795 none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
2016-09-18_08:33:14.04796 none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
2016-09-18_08:33:14.04796 none /run/user tmpfs rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
2016-09-18_08:33:14.04796 none /sys/fs/pstore pstore rw,relatime 0 0
2016-09-18_08:33:14.04796 rpc_pipefs /run/rpc_pipefs rpc_pipefs rw,relatime 0 0
2016-09-18_08:33:14.04797 /dev/vda3 /var/vcap/data ext4 rw,relatime,data=ordered 0 0
2016-09-18_08:33:14.04797 /dev/vda3 /var/log ext4 rw,relatime,data=ordered 0 0
2016-09-18_08:33:14.04797 tmpfs /var/vcap/data/sys/run tmpfs rw,relatime,size=1024k 0 0
2016-09-18_08:33:14.04798 /dev/vda3 /tmp ext4 rw,relatime,data=ordered 0 0
2016-09-18_08:33:14.04798 /dev/vda3 /var/tmp ext4 rw,relatime,data=ordered 0 0
2016-09-18_08:33:14.04798
2016-09-18_08:33:14.04798 ********************
2016-09-18_08:33:14.04801 [linuxPlatform] 2016/09/18 08:33:14 INFO - realPath = /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a, devicePath = , isMountPoint = %!s(bool=false)
2016-09-18_08:33:14.04804 [File System] 2016/09/18 08:33:14 DEBUG - Making dir /var/vcap/store with perm 0700
2016-09-18_08:33:14.04820 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Running command: lsblk --nodeps -nb -o SIZE /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a
2016-09-18_08:33:14.05305 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stdout: 68719476736
2016-09-18_08:33:14.05306 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stderr:
2016-09-18_08:33:14.05307 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Successful: true (0)
2016-09-18_08:33:14.05307 [linuxPlatform] 2016/09/18 08:33:14 DEBUG - Persistent disk size to be partitioned is: 68719476736, and error is: <nil>
2016-09-18_08:33:14.05307 [linuxPlatform] 2016/09/18 08:33:14 DEBUG - fdisk partitioner was chosen
2016-09-18_08:33:14.05310 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Running command: sfdisk -d /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a
2016-09-18_08:33:14.06026 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stdout:
2016-09-18_08:33:14.06027 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stderr:
2016-09-18_08:33:14.06027 sfdisk: ERROR: sector 0 does not have an msdos signature
2016-09-18_08:33:14.06027  /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a: unrecognized partition table type
2016-09-18_08:33:14.06028 No partitions found
2016-09-18_08:33:14.06028 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Successful: true (0)
2016-09-18_08:33:14.06032 [attemptRetryStrategy] 2016/09/18 08:33:14 DEBUG - Making attempt #0
2016-09-18_08:33:14.06035 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Running command: sfdisk -uM /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a
2016-09-18_08:33:14.07894 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stdout:
2016-09-18_08:33:14.07896 Disk /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a: 133152 cylinders, 16 heads, 63 sectors/track
2016-09-18_08:33:14.07897 Old situation:
2016-09-18_08:33:14.07897 New situation:
2016-09-18_08:33:14.07897 Units = mebibytes of 1048576 bytes, blocks of 1024 bytes, counting from 0
2016-09-18_08:33:14.07897
2016-09-18_08:33:14.07898    Device Boot Start   End    MiB    #blocks   Id  System
2016-09-18_08:33:14.07898 /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a-part1         0+ 65535- 65536-  67108607+  83  Linux
2016-09-18_08:33:14.07898 /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a-part2         0      -      0          0    0  Empty
2016-09-18_08:33:14.07898 /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a-part3         0      -      0          0    0  Empty
2016-09-18_08:33:14.07899 /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a-part4         0      -      0          0    0  Empty
2016-09-18_08:33:14.07899 Successfully wrote the new partition table
2016-09-18_08:33:14.07899
2016-09-18_08:33:14.07899 Re-reading the partition table ...
2016-09-18_08:33:14.07899
2016-09-18_08:33:14.07900 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stderr: Checking that no-one is using this disk right now ...
2016-09-18_08:33:14.07900 OK
2016-09-18_08:33:14.07900
2016-09-18_08:33:14.07900 sfdisk: ERROR: sector 0 does not have an msdos signature
2016-09-18_08:33:14.07900  /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a: unrecognized partition table type
2016-09-18_08:33:14.07901 No partitions found
2016-09-18_08:33:14.07901 Warning: no primary partition is marked bootable (active)
2016-09-18_08:33:14.07902 This does not matter for LILO, but the DOS MBR will not boot this disk.
2016-09-18_08:33:14.07902 If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
2016-09-18_08:33:14.07902 to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
2016-09-18_08:33:14.07902 (See fdisk(8).)
2016-09-18_08:33:14.07902 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Successful: true (0)
2016-09-18_08:33:14.07903 [SfdiskPartitioner] 2016/09/18 08:33:14 INFO - Succeeded in partitioning /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a with ,,L
2016-09-18_08:33:14.07903 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Running command: blkid -p /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1
2016-09-18_08:33:14.08014 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stdout:
2016-09-18_08:33:14.08015 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Stderr: error: /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1: No such file or directory
2016-09-18_08:33:14.08016 [Cmd Runner] 2016/09/18 08:33:14 DEBUG - Successful: false (2)
2016-09-18_08:33:14.08019 [Task Service] 2016/09/18 08:33:14 ERROR - Failed processing task #d5818063-217f-41e3-5110-01a5539d86d2 got: Mounting persistent disk: Formatting partition with : Checking filesystem format of partition: Running command: 'blkid -p /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1', stdout: '', stderr: 'error: /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1: No such file or directory
2016-09-18_08:33:14.08019 ': exit status 2
2016-09-18_08:33:14.72117 [HTTPS Dispatcher] 2016/09/18 08:33:14 INFO - POST /agent
2016-09-18_08:33:14.72121 [MBus Handler] 2016/09/18 08:33:14 INFO - Received request with action get_task
2016-09-18_08:33:14.72122 [MBus Handler] 2016/09/18 08:33:14 DEBUG - Payload
2016-09-18_08:33:14.72122 ********************
2016-09-18_08:33:14.72122 {"method":"get_task","arguments":["d5818063-217f-41e3-5110-01a5539d86d2"],"reply_to":"dd316ff5-3564-4277-7201-f2f26400c236"}
2016-09-18_08:33:14.72122 ********************
2016-09-18_08:33:14.72123 [Action Dispatcher] 2016/09/18 08:33:14 INFO - Running sync action get_task
2016-09-18_08:33:14.72139 [Action Dispatcher] 2016/09/18 08:33:14 ERROR - Action Failed get_task: Task d5818063-217f-41e3-5110-01a5539d86d2 result: Mounting persistent disk: Formatting partition with : Checking filesystem format of partition: Running command: 'blkid -p /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1', stdout: '', stderr: 'error: /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1: No such file or directory
2016-09-18_08:33:14.72140 ': exit status 2
2016-09-18_08:33:14.72150 [MBus Handler] 2016/09/18 08:33:14 INFO - Responding
2016-09-18_08:33:14.72152 [MBus Handler] 2016/09/18 08:33:14 DEBUG - Payload
2016-09-18_08:33:14.72152 ********************
2016-09-18_08:33:14.72153 {"exception":{"message":"Action Failed get_task: Task d5818063-217f-41e3-5110-01a5539d86d2 result: Mounting persistent disk: Formatting partition with : Checking filesystem format of partition: Running command: 'blkid -p /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1', stdout: '', stderr: 'error: /dev/disk/by-id/virtio-7c6dadc7-3231-4f60-a1: No such file or directory\n': exit status 2"}}

Is this the problem?
https://github.com/cloudfoundry/bosh-agent/blob/master/platform/linux_platform.go#L936-L939

Nothing has changed on the part of our Openstack infrastructure. To confirm that, I checked various older bosh deployments; the persistent disk partitions there are always under /dev/disk/by-id/virtio-*-part1, not /dev/disk/by-id/virtio-*1 as bosh-agent expects.
There is also never anything to be found under /dev/mapper.

I'm curious why it works on stemcell 3262.12 and not anymore on 3262.14. I did not figure out which commit causes this problem; from looking at the code mentioned above, it should never have worked?
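For what it's worth, the naming difference could be handled with something like this (firstPartitionPath is a hypothetical helper, not the agent's code); udev exposes partitions of /dev/disk/by-id links with a -part<N> suffix, while plain block devices just get the number appended:

import "strings"

// firstPartitionPath returns where the first partition of a device shows
// up: /dev/vdc becomes /dev/vdc1, but /dev/disk/by-id/virtio-xyz becomes
// /dev/disk/by-id/virtio-xyz-part1.
func firstPartitionPath(devicePath string) string {
	if strings.HasPrefix(devicePath, "/dev/disk/by-id/") {
		return devicePath + "-part1"
	}
	return devicePath + "1"
}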

bosh agent doesn't clean /var/vcap/data/root_tmp when upgrade from old stemcell

This problem should only happen on Softlayer, as we do an os-reload to upgrade the stemcell. We are opening this issue to see if the community can help solve it. Thanks.

We are trying to upgrade the stemcell from 3169 to 3262.2. During the upgrade, the VM's bosh agent failed to start due to:

2016-07-21_06:55:19.43133 [main] 2016/07/21 06:55:19 ERROR - App setup Running bootstrap: Setting up tmp dir: Creating root tmp dir: Running command: 'mkdir -p /var/vcap/data/root_tmp', stdout: '', stderr: 'mkdir: cannot create directory '/var/vcap/data/root_tmp': File exists

This is because in the old stemcell 3169, /var/vcap/data/root_tmp exists on the ephemeral disk as a file:

-rwx------ 1 root root 134217728 Jul 21 08:38 /var/vcap/data/root_tmp*

During the stemcell upgrade, we do an os-reload in Softlayer and keep using the existing ephemeral disk, so the file /var/vcap/data/root_tmp is still there.

But in the new stemcell 3262.2, bosh agent creates a directory with the same name in the same place. As the ephemeral disk is not recreated, bosh agent treats it as an existing disk and doesn't clean it. As a result, bosh agent fails to create the directory /var/vcap/data/root_tmp and fails to start.

If possible, would you please fix it like this: before running mkdir -p /var/vcap/data/root_tmp, check whether a file with the same name and path already exists; if yes, remove it. Thanks.
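A minimal sketch of that check (ensureRootTmpDir is a hypothetical helper using plain os calls rather than the agent's filesystem abstraction; the 0700 mode is illustrative):

import "os"

// ensureRootTmpDir removes a leftover plain file at the root_tmp path,
// as created by older stemcells, before creating the directory.
func ensureRootTmpDir(path string) error {
	if info, err := os.Stat(path); err == nil && !info.IsDir() {
		if err := os.Remove(path); err != nil {
			return err
		}
	}
	return os.MkdirAll(path, 0700)
}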

/cc @maximilien @jianqiu @mattcui

Sticky bit is not set for system tmp dir.

Hi,
Looking around permissions for /tmp, we have

$ bosh ssh shell 0
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-72-generic x86_64)
...
...
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~$ ls -ld /tmp
drwxrwx--- 2 root vcap 4096 Jun 26 10:30 /tmp

This output means any user who belongs to the group 'vcap' may clobber any files owned by root.
For example

shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~$ sudo -i
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# touch /tmp/foobar
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# ls -l /tmp/foobar
-rw-r--r-- 1 root root 0 Jun 26 10:32 /tmp/foobar
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# chmod 000 /tmp/foobar
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# su - vcap
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~$ rm -f /tmp/foobar

But that user should not be allowed to do that.

Expected behaviour would be (IMHO):

shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~$ logout
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# chmod 1770 /tmp
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# touch /tmp/foobar                                                                                                                        
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# chmod 000 /tmp/foobar                                                                                                                    
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~# su - vcap
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~$ rm -f /tmp/foobar                                                                                                                        
rm: cannot remove ‘/tmp/foobar’: Operation not permitted
shell/c88e48de-6200-4bce-8887-6a9ea812a9c2:~$ 

Regards,

Bruno

Drop `delaycompress` from the logrotate configuration

Since we use copytruncate in the logrotate configuration to maximize compatibility, it doesn't make much sense to also use delaycompress. The latter is meant for the case "when some program cannot be told to close its logfile and thus might continue writing to the previous log file for some time", which doesn't apply when logs are rotated with copytruncate.

The downside of using delaycompress is that a 1:1 copy of the current log file needs to be kept around, temporarily using twice the disk space and causing more IO write load. Without delaycompress we would compress immediately, using less disk space and generating less IO write load.

[windows2016][1093] golang web app job not running

I'm working on a very simple windows bosh release that runs a Golang web app https://github.com/cloudfoundry-community/sample-go-windows-boshrelease

I've upgraded it to use the latest GCP 1093 build.1 stemcell mentioned on the mailing list, but when the deployment finally finishes (each step takes 10 minutes), the instance is not responding and I cannot access the webapp on :3000; yet there is nothing bad in the job logs.

What steps should I take to debug? Or what is missing in my job? Are the syslog settings required?

$ tail -n 200 simple-go-web-app/simple-go-web-app/*
==> simple-go-web-app/simple-go-web-app/job-service-wrapper.err.log <==

==> simple-go-web-app/simple-go-web-app/job-service-wrapper.out.log <==
[martini] listening on :3000 (development)

==> simple-go-web-app/simple-go-web-app/job-service-wrapper.wrapper.log <==
2017-07-17 01:34:23,541 DEBUG - Starting ServiceWrapper in the CLI mode
2017-07-17 01:34:24,745 INFO  - Installing the service with id 'simple-go-web-app'
2017-07-17 01:34:24,834 DEBUG - Completed. Exit code is 0
2017-07-17 01:34:42,954 INFO  - Starting ServiceWrapper in the service mode
2017-07-17 01:34:43,203 INFO  - Starting C:\var\vcap\bosh\bin\pipe.exe  C:\var\vcap\packages\simple-go-web-app\simple-go-web-app.exe
2017-07-17 01:34:43,212 INFO  - Starting C:\var\vcap\bosh\bin\pipe.exe  C:\var\vcap\packages\simple-go-web-app\simple-go-web-app.exe
2017-07-17 01:34:43,360 INFO  - Started process 1948
2017-07-17 01:34:43,398 DEBUG - Forwarding logs of the process System.Diagnostics.Process (pipe) to winsw.SizeBasedRollingLogAppender

==> simple-go-web-app/simple-go-web-app/pipe.log <==
2017/07/17 01:34:43 pipe: configuration: &{ServiceName:simple-go-web-app LogDir:\var\vcap\sys\log/simple-go-web-app/simple-go-web-app NotifyHTTP:http://localhost:2825 SyslogHost: SyslogPort: SyslogTransport: MachineIP:10.0.0.10}
2017/07/17 01:34:43 syslog: configuration missing or incomplete
2017/07/17 01:34:43 pipe: starting

Update: I also deployed the say-hello job from https://github.com/cloudfoundry-incubator/sample-windows-bosh-release with this manifest https://gist.github.com/drnic/14914246063db649750c98ccc02c36a2 and got the same issue:

Instance                                    Process State       AZ  IPs
hello/d14bc6e0-c838-42ae-b573-abba877262f9  unresponsive agent  z1  10.0.0.6

As above, there are no errors or symptoms of misbehavior in the bosh2 logs.

Update: The serial logs for an example GCE Windows VM are at https://gist.github.com/drnic/64dc4a171da54b687703637ae1ff0718; note that they end with:

2017/07/19 00:30:32  Activating Windows(R), ServerDatacenter edition (21c56779-b449-4d20-adfc-eece0e1ad74b) ...
2017/07/19 00:30:32  Error: 0xC004F074 The Software Licensing Service reported that the computer could not be activated. No Key Management Service (KMS) could be contacted. Please see the Application Event Log for additional information.

Allow to configure logrotate frequency

Currently logrotate is hardcoded to run every hour. For some jobs that generate lots of logs, we would like to be able to increase the frequency at which logrotate runs.

WindowsNetManager.setupNetworking contains hard-coded sleep to 5s

That doesn't seem right; is this really necessary?
https://github.com/cloudfoundry/bosh-agent/blob/master/platform/net/windows_net_manager.go#L70

This also slows down the test suite, by the way, as every test using setupNetworking now takes 5+ seconds:

• [SLOW TEST:5.004 seconds]
WindowsNetManager SetupNetworking Setting NIC settings sets the IP address and netmask on all interfaces, and the gateway on the default gateway interface
~/go/src/github.com/cloudfoundry/bosh-agent/platform/net/windows_net_manager_test.go:71
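One common way out, assuming the delay itself is needed on Windows, is to inject the sleep function so production code passes time.Sleep while tests pass a no-op (a sketch; the field and constructor below are illustrative, not the agent's actual types):

import "time"

// WindowsNetManager with an injectable sleep function.
type WindowsNetManager struct {
	sleep func(time.Duration)
}

func NewWindowsNetManager() WindowsNetManager {
	return WindowsNetManager{sleep: time.Sleep} // tests can stub this out
}

func (m WindowsNetManager) setupNetworking() error {
	// ... apply NIC settings ...
	m.sleep(5 * time.Second) // settle time after reconfiguring interfaces
	return nil
}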

rsyslog can't write logs into /var/log due to the wrong permission

We are using the stemcell based on 3263.7 published on bosh.io. We found that the bosh agent created the directory /var/vcap/data/root_log with permission 755 (although the log showed "perm 0775") and then bind-mounted it to /var/log. The permission of /var/log in the file system is 775 by default; after the mount, it changes to 755, which breaks logging services like rsyslog. The agent should make sure the permission of /var/log is 775.

2016-10-27_06:11:47.30523 [File System] 2016/10/27 06:11:47 DEBUG - Making dir /var/vcap/data/root_log with perm 0775
...
...
2016-10-27_06:11:47.31594 [Cmd Runner] 2016/10/27 06:11:47 DEBUG - Running command: mount /var/vcap/data/root_log /var/log -o nodev -o noexec -o nosuid --bind

@maximilien @gu-bin

/var/log has weird permissions

/var/log has recently been given 0770 permissions; this breaks some packages that expect to be able to read/write specific files in that directory.

We went through the related stories but couldn't find any rationale for the 0775 -> 0770 change. Unless we're missing something, we think this should be put back to the old value (0775) to avoid breaking other components.

windows agent cpu spikes when the director stops

Howdy,

Given I have bosh-deployed concourse
And that deployment includes a windows 2012R2 VM
And that windows VM is running the bosh-google-kvm-windows2012R2-go_agent v1200.0 stemcell
And I stop the director VM
Then I see the windows agent spike to 100% CPU

Here's a chart from GCP/Stackdriver:

[screenshot: GCP/Stackdriver CPU chart, 2017-08-10 21-07-49]

I'll try to reproduce later and provide more detailed logs from the agent.

Commit 1c68a2efd72c3d7504ff8006f7a55e01dee82d5b will lead to the failure of stemcell building

We happened to see an error that blocks bosh-agent from launching during stemcell building, which leads to the failure of the stemcell build. After reverting two commits on the bosh-agent branch, we can successfully build a stemcell.

It seems the change in commit 1c68a2e, around 1c68a2e#diff-07cac04980e3140b8f2168929ed868cbL47, introduces a problem such as incorrectly unmarshalling the public key, which may lead to the launch failure of bosh-agent.

Create this issue for further investigation.

IBM pair, Tom and Edward

cc: @cppforlife

cloudstack compatible ssh-key metadata URI

While working on the CloudStack stemcell support for bosh, I met this issue:
cloudfoundry-community/bosh-cloudstack-cpi-release#14

The metadata location for public ssh keys is hardcoded in bosh-agent; it uses http:///latest/meta-data/public-keys/0/openssh-key
https://github.com/cloudfoundry/bosh-agent/blob/1dca3244702c18bf2c36483c529d4e7b3fb92b2e/infrastructure/http_metadata_service.go

In cloudstack, the URL to use is http:///latest/meta-data/public-keys

Could we provide a way in agent.json to specify the correct ssh key URI? Or fall back to trying /latest/meta-data/public-keys?
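A sketch of the fallback idea (fetchOpenSSHKey is a hypothetical helper; the two paths are the known locations, and the host part stays whatever the metadata service config provides):

import (
	"fmt"
	"io"
	"net/http"
)

// fetchOpenSSHKey tries the EC2-style key path first and falls back to
// the CloudStack-style path when the first one is missing.
func fetchOpenSSHKey(client *http.Client, metadataHost string) (string, error) {
	paths := []string{
		"/latest/meta-data/public-keys/0/openssh-key", // EC2 style
		"/latest/meta-data/public-keys",               // CloudStack style
	}
	for _, p := range paths {
		resp, err := client.Get(metadataHost + p)
		if err != nil {
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr == nil && resp.StatusCode == http.StatusOK {
			return string(body), nil
		}
	}
	return "", fmt.Errorf("no openssh key found on %s", metadataHost)
}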

/var/vcap/store should not have 5% root reservation

Currently all filesystems are created with the default 5% disk space reservation for privileged processes. While this makes sense for root filesystems, it's hard to justify for volumes that are normally only written to by non-root users (vcap), especially since it's not uncommon (at least in our experience) to have very large volumes and, therefore, quite large amounts of unusable disk space.

I think it should be safe to use -m 0 on /var/vcap/store and /var/vcap/data, but if that's deemed too risky (why?), I would at least suggest lowering the reservation to a smaller value.
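For ext filesystems this is a one-liner at format time; a sketch (the tune2fs invocation is standard e2fsprogs, wiring it into the agent's formatting path is the hypothetical part):

import "os/exec"

// clearRootReservation drops the reserved-blocks percentage to 0 on a
// data volume, freeing the 5% that mkfs reserves by default.
func clearRootReservation(partitionPath string) error {
	return exec.Command("tune2fs", "-m", "0", partitionPath).Run()
}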
