cloudfoundry / bosh-google-cpi-release
BOSH Google CPI
License: Apache License 2.0
Just following the docs, and this output is concerning. Should I be worried?
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ bosh target 10.0.0.6
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Target set to 'micro-google'
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export vip=$(terraform output ip)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export tcp_vip=$(terraform output tcp_ip)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export zone=$(terraform output zone)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export zone_compilation=$(terraform output zone_compilation)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export region=$(terraform output region)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export region_compilation=$(terraform output region_compilation)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export private_subnet=$(terraform output private_subnet)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export compilation_subnet=$(terraform output compilation_subnet)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export network=$(terraform output network)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export director=$(bosh status --uuid)
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
We have recently been moving some of our development infrastructure over to GCP and were following your guide to deploy bosh and concourse.
Using the terraform script for concourse, we provision a GCP HTTP proxy + global forwarding rule for ingress and load balancing.
We have two problems with this setup: it does not support HTTPS, and fly hijack does not work because tcp:2222 is disallowed by the firewall rules.
We got HTTPS and hijack working with a setup that grants the binary CAP_NET_BIND_SERVICE (so it can bind privileged ports when run as any user). We initially tried to get this setup working by "chaining" GCP-managed load-balancing services (mixing TCP and SSL-terminating HTTPS load balancing), but that didn't look possible.
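For reference, granting the capability is a one-liner; the binary path below is hypothetical (wherever the web/ATC binary lives on your VM):
$ sudo setcap 'cap_net_bind_service=+ep' /var/vcap/packages/atc/bin/atc
$ getcap /var/vcap/packages/atc/bin/atc   # verify the capability stuck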
What do you think?
Trying to ssh into VMs using bosh ssh results in an error:
root@bosh-bastion:~# bosh ssh
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on deployment 'cf' on 'micro-google'
Target deployment is 'cf'
Setting up ssh artifacts
Director task 36
Error 450001: Action Failed ssh: Getting host public key: Unable to read host public key file: /etc/ssh/ssh_host_rsa_key.pub: Opening file /etc/ssh/ssh_host_rsa_key.pub: open /etc/ssh/ssh_host_rsa_key.pub: no such file or directory
Task 36 error
Failed to set up SSH: see task 36 log for details
I'm using:
After unmounting the VM's root disk and mounting it on my bastion VM, I was able to check the VM's logs:
Sep 9 07:51:39 localhost sshd[4313]: fatal: No supported key exchange algorithms [preauth]
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_rsa_key
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_ecdsa_key
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_ed25519_key
And indeed, ssh host keys are not there:
root@bosh-bastion:~/test/etc/ssh# ls -la
total 256
drwxr-xr-x 2 root root 4096 Aug 31 18:13 .
drwxr-xr-x 85 root root 4096 Sep 9 07:52 ..
-rw-r--r-- 1 root root 242091 May 5 14:16 moduli
-rw-r--r-- 1 root root 1690 May 5 14:16 ssh_config
-rw------- 1 root root 2912 Jun 23 21:29 sshd_config
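A minimal recovery sketch, assuming the root disk is still mounted under ~/test as above and that generating fresh host keys is acceptable:
# ssh-keygen -A generates any missing default host keys (rsa/dsa/ecdsa/ed25519);
# with -f it treats the given directory as the root prefix
$ sudo ssh-keygen -A -f /root/test
# then re-attach the disk to the original VM and restart sshd there
$ sudo service ssh restart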
This link does not take me to an API, and there is no such API called 'Project API' that I can search for.
As part of a VM's cloud properties, it would be nice if I could do two things:
What do y'all think?
On March 7, 2017, VM tags will no longer propagate to VM labels with empty values. This may affect how the CPI labels resources.
To support cross-regional deployments, it would be nice if I could pass a list of target_pools or backend_services, and the CPI would attach whichever of them make sense based on the region and/or AZ of each VM.
For example:
instance_groups:
- name: cloud_controller
azs: ["west1a", "west1b", "central1a"]
instances: 3
cloud_properties: # via vm_extensions
target_pools: ["west", "central"] # with central likely being a backup pool in the load balancer
- name: router
azs: ["west1a", "west1b", "central1a", "central1b"]
instances: 4
cloud_properties: # via vm_extensions
backend_services: ["westsvc", "centralsvc"] # this time we'd probably go global distribution between svcs
When the CPI boots the 3 cloud controllers, it should only attach the "west" pool to the VMs actually in the west region, and likewise for central. The same goes for the backend services.
We ran into this error in our CI and I'm not sure if it is relevant, but am raising an issue just in case. Please feel free to close this issue if you think this isn't an error worth investigating.
10:37:11 | Creating missing vms: doppler/4455bd1a-38b2-47b0-a008-0282e86960df (1) (00:05:19)
L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM: Updating instance metadata with SetMetadata call: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
, metadata value: &compute.Metadata{Fingerprint:"-goaeN2SGek=", Items:[]*compute.MetadataItems{(*compute.MetadataItems)(nil), (*compute.MetadataItems)(nil), (*compute.MetadataItems)(0xc420424780), (*compute.MetadataItems)(0xc4204247b0)}, Kind:"", ForceSendFields:[]string(nil)}: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
' in 'create_vm' CPI method
10:37:11 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM: Updating instance metadata with SetMetadata call: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
, metadata value: &compute.Metadata{Fingerprint:"-goaeN2SGek=", Items:[]*compute.MetadataItems{(*compute.MetadataItems)(nil), (*compute.MetadataItems)(nil), (*compute.MetadataItems)(0xc420424780), (*compute.MetadataItems)(0xc4204247b0)}, Kind:"", ForceSendFields:[]string(nil)}: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
' in 'create_vm' CPI method
Started Wed Mar 1 10:25:38 UTC 2017
Finished Wed Mar 1 10:37:11 UTC 2017
Duration 00:11:33
Task 1104 error
Updating deployment:
Expected task '1104' to succeed but was state is 'error'
Exit code 1
Looks like the payload that the CPI receives back from the GCP API cannot be unmarshaled properly into an int because the value was marshaled using floating-point syntax.
This Gist has the output from the bosh deploy step.
I'm running the terraform apply step and I'm hitting:
google_compute_address.cf-tcp: Error creating address: googleapi: Error 403: Quota 'STATIC_ADDRESSES' exceeded. Limit: 1.0, quotaExceeded
I was able to do this without any problems last week but needed to destroy everything and set it up inside a new project. So I'm not sure if the quota decreased for some reason in the past 4 days or if something changed because we're on the free trial. Any ideas?
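For anyone else hitting this, the current regional quota can be inspected with gcloud (region name is just an example; output will look roughly like the lines below). Free-trial projects generally have lower limits, and increases have to be requested through the console:
$ gcloud compute regions describe us-central1 | grep -B1 -A1 STATIC_ADDRESSES
- limit: 1.0
  metric: STATIC_ADDRESSES
  usage: 1.0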
I created a manifest using instance_group: nginx in the subnet cloud_properties section of my cloud config, but I don't see new instances being attached to that group. Are you sure that's wired up correctly internally?
Additionally, it seems weird that ELBs can be attached using vm_types (or vm_extensions, which is even better), but on GCP I have to attach a load balancer to every VM in the network. It's very common to declare only one or two large networks, and I don't want all of those VMs to have load balancers. Could we make Google's instance_group a cloud_property of a BOSH job (now confusingly called a BOSH instance group)?
Here's what BOSH printed out when I did the deploy, so it clearly understood the change:
> bosh deploy --recreate
Detecting deployment changes
----------------------------
networks:
- name: private
subnets:
- range: 10.128.0.0/20
cloud_properties:
instance_group: nginx
Then it recreated the VM:
Started updating job web_server > web_server/0 (18857c48-665f-4600-bd00-c9e6909d6c87) (canary). Done (00:04:51)
But that new web server isn't part of an instance group. I used CPI version 21, director 257.3, and stemcell 3262.2.
Here's the full network section of my cloud_config:
networks:
- name: private
type: manual
subnets:
- dns:
- 169.254.169.254
range: 10.128.0.0/20
gateway: 10.128.0.1
azs:
- us-central1-a
cloud_properties:
network_name: cf
instance_group: nginx
tags:
- cf-bosh
reserved:
- 10.128.0.2
- 10.128.0.10-10.128.15.254
static:
- 10.128.0.8
Let me know if I just did something wrong. Thanks!
Error 100: Unable to render instance groups for deployment. Errors are:
The docs in this repo for deploying Cloud Foundry to GCP are outdated. They may still work, but the document in cf-deployment describes an easier way to deploy to GCP, and it's supported by the Infrastructure and RelInt teams. Would it make sense to replace the doc in this repo with a link to the one from cf-deployment?
definitely can retry dial failures. probably should retry a lot more.
+++
CPI error 'Bosh::Clouds::CloudError' with message 'Creating disk: Failed to find Google Instance 'vm-xxx': Get https://www.googleapis.com/compute/v1/projects/xxx/aggregated/instances?alt=json&filter=name+eq+.%2Avm-xxx: oauth2: cannot fetch token: Post https://accounts.google.com/o/oauth2/token: dial tcp 173.194.74.84:443: i/o timeout' in 'create_disk' CPI method
+++
For dynamic network configuration using bosh-init (or bosh create-env), we require cloud_properties and dns at the root, e.g.
networks:
- name: default
type: dynamic
cloud_properties:
network_name: ((network))
subnetwork_name: ((subnetwork))
dns: 8.8.8.8
However, if we forget these fields (for example, if they are incorrectly nested under subnets, which is ignored by bosh-init/bosh create-env), the CPI will not fail, but will instead create the VM somewhere else (some default network?).
If the default project account has been deleted, using service_scopes in your bosh director manifest results in an obscure error:
CPI 'create_vm' method responded with error: CmdError{"type":"Bosh::Clouds::VMCreationFailed","message":"VM failed to create: Google Operation 'operation-1484172723550-545d8e3f8f730-654d03c4-17847194' finished with an error: The resource '[email protected]' of type 'serviceAccount' was not found.\n","ok_to_retry":true}
This error message is unclear, particularly because the credentials which also need to be specified in the manifest may be associated with another account altogether.
The default service_account for the bosh-google-cpi-release is set to "default" if it is not proactively set by the bosh manifest, so this will happen anytime you use service_scopes instead of a service_account, which is exactly what the bosh-bootloader does in its bosh-init manifest today.
cc: @ljfranklin @JesseTAlford @evanfarrar
Terraform requires permissions above and beyond the ones listed here in the "Grant the new service account editor access to your project" section when creating a firewall rule:
googleapi: Error 403: Required 'compute.firewalls.create'
PR forthcoming...
$ bosh restart db/0 --force
Restart db/*? (type 'yes' to continue): yes
Performing 'restart db/*'...
Director task 256
Started preparing deployment > Preparing deployment. Done (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started updating instance db > db/<instance_guid> (0) (canary). Failed: CPI error 'Bosh::Clouds::CloudError' with message 'Extracting method arguments from payload: Unmarshalling action argument: json: cannot unmarshal number into Go value of type string' in 'snapshot_disk' CPI method (00:00:02)
Error 100: CPI error 'Bosh::Clouds::CloudError' with message 'Extracting method arguments from payload: Unmarshalling action argument: json: cannot unmarshal number into Go value of type string' in 'snapshot_disk' CPI method
Task 256 error
BOSH manifest for Concourse:
---
name: concourse
director_uuid: <director uuid>
releases:
- name: concourse
version: 2.5.0
url: https://bosh.io/d/github.com/concourse/concourse?v=2.5.0
sha1: 0d1f436aad50bb09ac2c809cd6cb6df3e38a7767
- name: garden-runc
version: 1.0.3
url: https://bosh.io/d/github.com/cloudfoundry/garden-runc-release?v=1.0.3
sha1: 0c04b944d50ec778f5b34304fd4bc8fc0ed83b2b
tls_key: &tls_key |
<tls key>
tls_cert: &tls_cert |
<tls cert>
instance_groups:
- name: web
instances: 1
vm_type: web
azs:
- z1
- z2
stemcell: trusty
networks:
- name: public
default:
- dns
- gateway
- name: vip
static_ips:
- <static ip>
jobs:
- name: atc
release: concourse
properties:
external_url: <external url>
publicly_viewable: true
basic_auth_username: buildpacks
basic_auth_password: "<auth password>"
github_auth:
client_id: <client id>
client_secret: <client secret>
postgresql_database: atc
tls_cert: *tls_cert
tls_key: *tls_key
tls_bind_port: 443
- name: tsa
release: concourse
properties: {}
- name: db
instances: 1
vm_type: database
azs:
- z1
stemcell: trusty
persistent_disk_type: database
networks:
- name: public
jobs:
- name: postgresql
release: concourse
properties:
databases:
- name: atc
role: <role>
password: <password>
- name: worker
instances: 6
vm_type: worker
azs:
- z1
stemcell: trusty
networks:
- name: public
jobs:
- name: groundcrew
release: concourse
properties: {}
- name: baggageclaim
release: concourse
properties: {}
- name: garden
release: garden-runc
properties:
garden:
listen_network: tcp
listen_address: 0.0.0.0:7777
network_mtu: 1432
update:
canaries: 1
max_in_flight: 3
serial: false
canary_watch_time: 1000-120000
update_watch_time: 1000-120000
stemcells:
- alias: trusty
os: ubuntu-trusty
version: latest
BOSH cloud config:
azs:
- name: z1
cloud_properties:
zone: us-east1-c
- name: z2
cloud_properties:
zone: us-east1-d
vm_types:
- name: web
cloud_properties:
machine_type: n1-standard-2
root_disk_size_gb: 20
root_disk_type: pd-ssd
- name: database
cloud_properties:
machine_type: n1-standard-4
root_disk_size_gb: 100
root_disk_type: pd-ssd
- name: worker
cloud_properties:
machine_type: n1-standard-4
root_disk_size_gb: 300
root_disk_type: pd-ssd
- name: bosh-lite-worker
cloud_properties:
machine_type: n1-standard-16
root_disk_size_gb: 300
root_disk_type: pd-ssd
tags: [bosh-lite-public]
compilation:
workers: 3
network: public
reuse_compilation_vms: true
az: z1
cloud_properties:
machine_type: n1-standard-4
root_disk_size_gb: 100
root_disk_type: pd-ssd
preemptible: true
networks:
- name: public
type: manual
subnets:
- az: z1
range: <ip range>
gateway: <gateway>
cloud_properties:
network_name: concourse
subnetwork_name: <subnetwork name>
ephemeral_external_ip: true
tags:
- concourse-public
- concourse-internal
- az: z2
range: <ip range>
gateway: <gateway>
cloud_properties:
network_name: concourse
subnetwork_name: <subnetwork name>
ephemeral_external_ip: true
tags:
- concourse-public
- concourse-internal
- name: vip
type: vip
disk_types:
- name: database
disk_size: 100000 #mb
This is with BOSH Google CPI version 25.6.1. We were able to deploy the db VM as part of the Concourse deployment using this CPI version. However, during a re-deploy of the Concourse deployment, this error showed up and the DB VM did not recreate and ended up stopped. Any attempt to bosh recreate, bosh start, or bosh restart the DB VM results in this error message.
I'm trying to use the S3 blobstore with the google CPI, just like I can with other CPIs, and I think the blobstore options aren't being passed through. I'm attempting to pass blobstore options that look like this:
properties:
blobstore:
provider: s3
host: storage.googleapis.com
s3_port: 443
use_ssl: true
bucket_name: dstevenson-bosh
access_key_id: GOOGB4BAQE4743O4YSAS
secret_access_key: not-telling-you
s3_force_path_style: true
s3_multipart_threshold: 1099511627776
s3_signature_version: '2'
But I get the following error, probably from the code inside cpi.json.erb in the CPI:
Running command: 'ruby /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/erb-renderer206693177/erb-render.rb /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/erb-renderer206693177/erb-context.json /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/bosh-init-release576250374/extracted_jobs/google_cpi/templates/config/cpi.json.erb /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/rendered-jobs472416459/config/cpi.json', stdout: '', stderr: '/home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/erb-renderer206693177/erb-render.rb:189:in `rescue in render': Error filling in template '/home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/bosh-init-release576250374/extracted_jobs/google_cpi/templates/config/cpi.json.erb' for google_cpi/0 (line 52: #<TemplateEvaluationContext::UnknownProperty: Can't find property 'agent.blobstore.address', or 'blobstore.address'>) (RuntimeError)
Should we just be passing blobstore options directly through to the agent? Supporting google cloud storage for BOSH would be very useful for us.
I want to write firewalls that make sense, like diego_brain needs port 2222 allowed to it. While I could do this with a lot of VM extensions, I might end up with a VM extension that I have to attach to every job, giving it a tag equal to its job name. So... could the CPI automatically tag VMs with their job name?
@cppforlife says that BOSH is about to start passing bosh.env.group_name to the CPI, which will equal the instance_group name (like diego_cell). We could use that! Ideally, we'd also tag a VM with its deployment name, and maybe also a combination of the two together.
This snippet in the Cloud Foundry doc attempts to add multiple roles in a single command. This does not work and ends up only adding the last role.
gcloud projects add-iam-policy-binding ${project_id} \
--member serviceAccount:cf-component@${project_id}.iam.gserviceaccount.com \
--role "roles/editor" \
--role "roles/logging.logWriter" \
--role "roles/logging.configWriter"
This prevents the google-fluentd agent from writing logs out of the box.
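A working alternative is to bind each role in a separate invocation, since add-iam-policy-binding only honors a single --role flag per call:
for role in roles/editor roles/logging.logWriter roles/logging.configWriter; do
  gcloud projects add-iam-policy-binding ${project_id} \
    --member serviceAccount:cf-component@${project_id}.iam.gserviceaccount.com \
    --role "${role}"
done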
Document that unless disks are marked for deletion, they will remain present and will be billed for; compilation VMs would be a good example.
Using 25.2.0 to attempt to launch PCF ERT 1.8 throws the following set of errors:
Failed creating missing vms > nfs_server/0 (dae2dea2-4ed9-4394-b999-54300e493b78): Setting metadata for vm 'vm-3f75ebd3-2588-4028-7ead-36aedd7298ed': Failed to set labels for Google Instance 'vm-3f75ebd3-2588-4028-7ead-36aedd7298ed': googleapi: Error 400: Invalid value for field 'labels.job': 'nfs_server'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:07)
Failed creating missing vms > diego_brain/0 (759e6685-73ce-46f5-93ff-b7356da3e711): Setting metadata for vm 'vm-c15c442d-1190-4ab7-43e9-fd3c3909fdb3': Failed to set labels for Google Instance 'vm-c15c442d-1190-4ab7-43e9-fd3c3909fdb3': googleapi: Error 400: Invalid value for field 'labels.job': 'diego_brain'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:07)
Failed creating missing vms > mysql_proxy/0 (e5ed277a-0757-4486-a150-a5ffddcd0597): Setting metadata for vm 'vm-c7f4a1c9-ea78-4dc2-7056-2a28748876b8': Failed to set labels for Google Instance 'vm-c7f4a1c9-ea78-4dc2-7056-2a28748876b8': googleapi: Error 400: Invalid value for field 'labels.job': 'mysql_proxy'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:07)
Failed creating missing vms > cloud_controller_worker/0 (1d8865ec-bf9e-4a52-a8dc-de8533f53ce9): Setting metadata for vm 'vm-7a669772-2081-4f10-6959-5df06091a3c5': Failed to set labels for Google Instance 'vm-7a669772-2081-4f10-6959-5df06091a3c5': googleapi: Error 400: Invalid value for field 'labels.job': 'cloud_controller_worker'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:08)
Failed creating missing vms > diego_cell/0 (d420a7ef-37f3-4b21-80bc-ddcde0800365): Setting metadata for vm 'vm-748bd91b-4483-43a8-5345-f6adff8486c7': Failed to set labels for Google Instance 'vm-748bd91b-4483-43a8-5345-f6adff8486c7': googleapi: Error 400: Invalid value for field 'labels.job': 'diego_cell'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > consul_server/0 (a47500f9-8538-495c-b0e3-7039856c2f53): Setting metadata for vm 'vm-f2aa1e83-723a-4966-5fdd-6b2ccbe5f5b8': Failed to set labels for Google Instance 'vm-f2aa1e83-723a-4966-5fdd-6b2ccbe5f5b8': googleapi: Error 400: Invalid value for field 'labels.job': 'consul_server'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > clock_global/0 (0ccf6c29-1bc0-493a-8025-add93ef2d443): Setting metadata for vm 'vm-e1b9a38f-b692-43c7-7447-8e0b46eb8ba5': Failed to set labels for Google Instance 'vm-e1b9a38f-b692-43c7-7447-8e0b46eb8ba5': googleapi: Error 400: Invalid value for field 'labels.job': 'clock_global'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > diego_database/0 (e2d9afa0-50b3-4096-a4b8-25332bc1287e): Setting metadata for vm 'vm-66cca01f-bcce-4507-5eb8-df05cd804808': Failed to set labels for Google Instance 'vm-66cca01f-bcce-4507-5eb8-df05cd804808': googleapi: Error 400: Invalid value for field 'labels.job': 'diego_database'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > etcd_server/0 (03417707-57bc-4b80-954c-6dc9dc7d6691): Setting metadata for vm 'vm-98936bba-c4c6-4f43-5648-01600440b112': Failed to set labels for Google Instance 'vm-98936bba-c4c6-4f43-5648-01600440b112': googleapi: Error 400: Invalid value for field 'labels.job': 'etcd_server'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > cloud_controller/0 (998c903f-8a86-4f7c-a7b4-755d049b3d6f): Setting metadata for vm 'vm-27e82b3d-5c7c-4aec-470d-ff6e07cc8f36': Failed to set labels for Google Instance 'vm-27e82b3d-5c7c-4aec-470d-ff6e07cc8f36': googleapi: Error 400: Invalid value for field 'labels.job': 'cloud_controller'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:16)
Looks like the Console will allow labels that contain a -; when applying the label, perhaps the CPI could sanitise invalid chars like _ and replace them with -?
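For illustration, the sanitisation being proposed could be as simple as lowercasing, mapping underscores to hyphens, and truncating to the 63-char label limit (a shell sketch of what the CPI would do internally):
$ echo "cloud_controller_worker" | tr '_' '-' | tr '[:upper:]' '[:lower:]' | cut -c1-63
cloud-controller-worker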
VM names should contain identifiable information, not just a UUID.
The CPI seems to set bosh_settings custom metadata; I don't think we need the user_data one.
Integration and BATS tests run in the same subnet and contend for IP addresses. This can lead to flakes. Each test type should have its own subnet.
When deleting a stemcell (e.g. during bosh delete-env), if the image has already been removed then the teardown fails with the following error:
Deleting deployment:
Deleting stemcell from cloud:
CPI 'delete_stemcell' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Deleting stemcell 'stemcell-ab5f5573-62d5-4f72-6620-3ab8f1aed3cf': Google Image 'stemcell-ab5f5573-62d5-4f72-6620-3ab8f1aed3cf' does not exists: \u003cnil cause\u003e","ok_to_retry":false}
Exit code 1
I would expect this to succeed if the image is gone, because that's exactly the operation we're trying to achieve.
For context, this is a problem for us because we accumulate lots of images: we spin up lots of bosh environments in our account, and each deployment results in a new image (stemcell-<guid>). To avoid hitting the limit we asynchronously delete the images, but then our teardown of the bosh director fails with the above error.
cc @cppforlife
Deploying:
Creating instance 'bosh/0':
Creating VM:
Creating vm with stemcell cid 'stemcell-6577b498-26b2-4f43-50a0-372c03558aee':
CPI 'create_vm' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating vm: Failed to find Google Image 'stemcell-6577b498-26b2-4f43-50a0-372c03558aee': Get https://www.googleapis.com/compute/v1/projects/cf-sandbox-lsantos/global/images/stemcell-6577b498-26b2-4f43-50a0-372c03558aee?alt=json: stream error: stream ID 1; PROTOCOL_ERROR","ok_to_retry":false}
Ran into this error twice in the last two days. Rerunning bosh create-env (bosh-init deploy) makes the error go away, but we should probably retry.
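Until the CPI retries internally, a crude outer retry loop works around it (manifest name is a placeholder):
for i in 1 2 3; do
  bosh create-env bosh.yml && break
  echo "create-env failed (attempt $i), retrying..."
done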
When trying to use a global external IP I received the following error:
Error 100: VM failed to create: googleapi: Error 400: Invalid value for field 'resource.networkInterfaces[0].accessConfigs[0]': ''. Specified external IP address not found., invalid
There are two problems here:
After deleting a persistent disk from the GCP UI, I was unable to return to an operational state through bosh cck or bosh deploy.
I believe this is related to a missing has_disk here: https://github.com/cloudfoundry-incubator/bosh-google-cpi-release/tree/master/src/bosh-google-cpi/action.
When we try to deploy with bosh-init while on the GCP jumpbox that is created via the Concourse terraform templates, we get the following error at the end:
Command 'deploy' failed:
Deploying:
Creating instance 'bosh/0':
Waiting until instance is ready:
Starting SSH tunnel:
Failed to connect to remote server:
ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
We are using stemcell bosh-google-kvm-ubuntu-trusty-go_agent version 3262.14 and Google CPI version 25.2.1. We have verified that the private key is in the jumpbox's ~/.ssh directory and the public key is in the SSH keys section of the project metadata.
Another user seemed to report the exact same issue with using stemcell 3262.14 on the Cloud Foundry Slack: https://cloudfoundry.slack.com/archives/bosh-gce-cpi/p1474256273000136
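One way to narrow this down is to take bosh-init out of the loop and attempt the handshake directly; -vvv shows which keys the client offers and why the server rejects them (placeholders as elsewhere in these docs):
$ ssh -vvv -i ~/.ssh/<private key> <ssh user>@<director ip>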
I followed the steps here and used Terraform to deploy the infrastructure, but the whole startup script failed. The following was in /var/log/startupscript.log:
Running startup script /var/run/google.startup.script
/usr/share/google/run-scripts: /tmp/tmp.SA6GU6Llif: /bin/bash^M: bad interpreter: No such file or directory
Finished running startup script /var/run/google.startup.script
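The ^M indicates the script picked up Windows carriage returns somewhere along the way. A minimal fix before uploading (file name is illustrative):
$ sed -i 's/\r$//' startup-script.sh   # or: dos2unix startup-script.sh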
We're coming up on a situation where we want a few VMs in a network to have public IPs, but not the rest of them. We may also need to tag them differently, depending on our desired firewall rules.
Could we make the following CPI network params also overridable on a VM type? Presumably the VM type would take precedence over the network settings.
Thank you!
To allow tagging of all VMs with a specific set of tags, to enable setting security groups: global tags would be added on top of tags specified via the env arg in create_vm. wdyt about google.default_tags: [tag1, tag2]?
This prevents them from being registered with a load balancer.
root@74c4feff-fbe2-4454-86af-530ba4413127:~# ip route list table local
local 10.1.0.4 dev eth0 proto kernel scope host src 10.1.0.4
broadcast 10.1.0.4 dev eth0 proto kernel scope link src 10.1.0.4
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
root@74c4feff-fbe2-4454-86af-530ba4413127:~# sudo service google-address-manager status
google-address-manager stop/waiting
root@74c4feff-fbe2-4454-86af-530ba4413127:~# sudo service google-address-manager start
google-address-manager start/running, process 7046
root@74c4feff-fbe2-4454-86af-530ba4413127:~# sudo service google-address-manager status
google-address-manager start/running, process 7046
root@74c4feff-fbe2-4454-86af-530ba4413127:~# ip route list table local
local 10.1.0.4 dev eth0 proto kernel scope host src 10.1.0.4
broadcast 10.1.0.4 dev eth0 proto kernel scope link src 10.1.0.4
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
local 130.211.186.50 dev eth0 proto 66 scope host
root@74c4feff-fbe2-4454-86af-530ba4413127:~#
Note the addition of the 130.211.186.50 route after starting it; that's our forwarding rule.
Stemcell version: bosh-google-kvm-ubuntu-trusty-go_agent 3262.15* ubuntu-trusty stemcell-44d856f5-5749-41a3-7e30-252d9668c622
BOSH/CPI versions:
- name: bosh
url: https://bosh.io/d/github.com/cloudfoundry/bosh?v=257.3
sha1: e4442afcc64123e11f2b33cc2be799a0b59207d0
- name: bosh-google-cpi
url: https://bosh.io/d/github.com/cloudfoundry-incubator/bosh-google-cpi-release?v=24.4.0
sha1: 2c13a452f76e27a101b287b61cc24851541aac18
- name: uaa
url: https://bosh.io/d/github.com/cloudfoundry/uaa-release?v=13
sha1: 5229c6e8793c4061950b4e0738fc66612bc016ba
After starting the agent the VM became healthy and traffic started flowing again.
Context: I'm running Concourse with a Windows VM worker. Twice now, the Windows VM has been deleted (for reasons that I don't understand, but that's another story) and the BOSH resurrector has failed to be able to restart the VM.
The CPI command from the bosh logs is:
D, [2017-02-14 02:30:26 #23203] [task:4390] DEBUG -- DirectorJobRunner: External CPI sending request: {"method":"create_vm","arguments":["30ffe17c-0843-44e3-9f2e-c2d118b4f6f0","https://www.googleapis.com/compute/v1/projects/cf-greenhouse-mustang/global/images/packer-1484833635",{"zone":"us-east1-b","machine_type":"n1-standard-1","root_disk_size_gb":10,"root_disk_type":"pd-ssd"},{"private":{"ip":"10.0.16.7","netmask":"255.255.240.0","cloud_properties":{"ephemeral_external_ip":true,"network_name":"bbl-env-manitoba-2017-02-08t00-43z-network","subnetwork_name":"bbl-env-manitoba-2017-02-08t00-43z-subnet","tags":["bbl-env-manitoba-2017-02-08t00-43z-internal"]},"default":["dns","gateway"],"gateway":"10.0.16.1"}},[],{"bosh":{"group":"bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse-worker-windows","groups":["bosh-bbl-env-manitoba-2017-02-08t00-43z","concourse","worker-windows","bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse","concourse-worker-windows","bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse-worker-windows"]}}],"context":{"director_uuid":"<redacted>"}} with command: /var/vcap/jobs/google_cpi/bin/cpi
The important part is "root_disk_size_gb":10, which is not at all correct, and as a result receives this error:
Requested disk size cannot be smaller than the image size (50 GB), invalid
The fuller message is (with newlines unescaped):
D, [2017-02-14 02:30:36 #23203] [task:4390] DEBUG -- DirectorJobRunner: External CPI got response: {"result":null,"error":{"type":"Bosh::Clouds::VMCreationFailed","message":"VM failed to create: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.diskSizeGb': '10'. Requested disk size cannot be smaller than the image size (50 GB), invalid","ok_to_retry":true},"log":"[File System] 2017/02/14 02:30:26 DEBUG - Reading file /var/vcap/jobs/google_cpi/config/cpi.json
[File System] 2017/02/14 02:30:26 DEBUG - Read content
********************
{\"cloud\":{\"plugin\":\"google\",\"properties\":{\"google\":{\"project\":\"flavorjones-oss-concourse\",\"json_key\":\"{\
\\\"type\\\": \\\"service_account\\\",\
\\\"project_id\\\": \\\"flavorjones-oss-concourse\\\",\
\\\"private_key_id\\\": \\\"<redacted>\\\",\
\\\"private_key\\\": \\\"-----BEGIN PRIVATE KEY-----\\\
<redacted>\\\
-----END PRIVATE KEY-----\\\
\\\",\
\\\"client_email\\\": \\\"bbl-service-account@flavorjones-oss-concourse.iam.gserviceaccount.com\\\",\
\\\"client_id\\\": \\\"<redacted>\\\",\
\\\"auth_uri\\\": \\\"https://accounts.google.com/o/oauth2/auth\\\",\
\\\"token_uri\\\": \\\"https://accounts.google.com/o/oauth2/token\\\",\
\\\"auth_provider_x509_cert_url\\\": \\\"https://www.googleapis.com/oauth2/v1/certs\\\",\
\\\"client_x509_cert_url\\\": \\\"https://www.googleapis.com/robot/v1/metadata/x509/bbl-service-account%40flavorjones-oss-concourse.iam.gserviceaccount.com\\\"\
}\
\",\"default_root_disk_size_gb\":0,\"default_root_disk_type\":\"\"},\"registry\":{\"use_gce_metadata\":true},\"agent\":{\"ntp\":[\"169.254.169.254\"],\"blobstore\":{\"provider\":\"dav\",\"options\":{\"endpoint\":\"http://10.0.0.6:25250\",\"user\":\"<redacted>\",\"password\":\"<redacted>\"}},\"mbus\":\"nats://<redacted>:<redacted>@10.0.0.6:4222\"}}}}
********************
[json] 2017/02/14 02:30:26 DEBUG - Request bytes
********************
{\"method\":\"create_vm\",\"arguments\":[\"30ffe17c-0843-44e3-9f2e-c2d118b4f6f0\",\"https://www.googleapis.com/compute/v1/projects/cf-greenhouse-mustang/global/images/packer-1484833635\",{\"zone\":\"us-east1-b\",\"machine_type\":\"n1-standard-1\",\"root_disk_size_gb\":10,\"root_disk_type\":\"pd-ssd\"},{\"private\":{\"ip\":\"10.0.16.7\",\"netmask\":\"255.255.240.0\",\"cloud_properties\":{\"ephemeral_external_ip\":true,\"network_name\":\"bbl-env-manitoba-2017-02-08t00-43z-network\",\"subnetwork_name\":\"bbl-env-manitoba-2017-02-08t00-43z-subnet\",\"tags\":[\"bbl-env-manitoba-2017-02-08t00-43z-internal\"]},\"default\":[\"dns\",\"gateway\"],\"gateway\":\"10.0.16.1\"}},[],{\"bosh\":{\"group\":\"bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse-worker-windows\",\"groups\":[\"bosh-bbl-env-manitoba-2017-02-08t00-43z\",\"concourse\",\"worker-windows\",\"bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse\",\"concourse-worker-windows\",\"bosh-bbl-env-manitoba-2017-02-"}, err: , exit_status: pid 23257 exit 0
This is strange because the manifest sets a disk size of 50GB:
- name: worker-windows
instances: 1
vm_type: m3.medium
vm_extensions:
- 50GB_ephemeral_disk
stemcell: windows
azs: [z1]
networks: [{name: private}]
jobs:
- name: concourse_windows
release: concourse-windows-worker
properties:
concourse_windows:
tsa_host: ci.nokogiri.org
tsa_public_key: ((tsa-host-public-key))
tsa_worker_private_key: ((windows-worker-private-key))
When I re-run bosh deploy, the VM does get recreated correctly with a 50GB disk.
Any ideas what's going on? What other information can I provide?
I have been working through https://github.com/cloudfoundry-incubator/bosh-google-cpi-release/tree/master/docs/concourse and when it's creating the web VM it fails with VM failed to create: Backend Service "concourse" does not exist: <nil cause>
Done creating missing vms > db/0 (a9cdb68d-9f11-4bea-af69-750c9f390e11) (00:01:01)
Done creating missing vms > worker/0 (f3ca4d71-6503-4545-b446-aa47e6395426) (00:01:01)
Failed creating missing vms > web/0 (f06432bb-1275-44ff-ad86-b55d0eda9697): VM failed to create: Backend Service "concourse" does not exist: (00:15:01)
Failed creating missing vms (00:15:01)
Codewise, this relates to https://github.com/cloudfoundry-incubator/bosh-google-cpi-release/blob/e84f61b28ec0a98cff7f1628f7d930f598f939fc/src/bosh-google-cpi/google/backendservice_service/google_backendservice_service.go#L122
It's clearly expecting a backend service to be configured, but having followed the documented setup from the README.md, if I subsequently run gcloud compute backend-services list it returns Listed 0 items. Any ideas what might be missing here?
We used the concourse terraform to create a concourse instance. It deployed perfectly!
However, when we were watching jobs through the browser, the jobs were not updating. It wasn't until we reloaded a page that we realized a job had completed five minutes earlier.
After investigating, we found that the HTTP(S) load balancer created does not support WebSocket and SSE events correctly. According to the docs, these are not supported unless you use a TCP load balancer or set up forwarding rules.
We'd like to help fix this, as we are going to hack around the terraform configurations to make it work, which nullifies the usefulness of terraform for rolling back nicely.
Hey GCP CPI developers -
I noticed some inconsistencies when using the Google CPI next to the AWS, OpenStack, and vSphere CPIs. There are a few properties that aren't named the same with the Google CPI; here's what they are:
- The agent: properties section seems to expect blobstore and ntp settings. Usually these settings are at the root level and are pulled in from there by the other CPIs.
- The agent section has a similar inconsistency in it: usually the blobstore and ntp settings are not inside the agent section.
- The CPI expects blobstore_options like this: {provider: "local", options: {blobstore_path: "/var/vcap/micro_bosh/data/cache"}}. Other CPIs have the blobstore path more at the root level, like this: {provider: "local", path: "/var/vcap/micro_bosh/data/cache"}
If y'all help make those changes, it'll be much easier for existing BOSH users of another IaaS to learn GCP. I've already had 5 or 6 people run into this, including me. Thanks!!!
In the cloudfoundry example, the application security groups are wide open. The recommended groups disallow internal networks. This is very important because, as currently configured, 1) containers can see host metadata from the google metadata server, and 2) containers can see the entire bosh subnet, including internal components that should not be viewable to host applications.
Deploying a BOSH director in a new Google project, following the instructions here. Love the documentation; very easy to follow and set up!
We got all the way to deploying bosh when bosh-init failed, because the manifest.yml file did not have the network, subnetwork, zone, and ssh_key_path environment variables to interpolate into the erb file.
We ended up manually setting those env vars and getting the manifest into a good state, but I imagine there might be a better, automatic way of finding those values through the gcloud CLI.
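For example, something along these lines might work, assuming a single network/subnetwork and a configured default zone (names and paths are guesses, not taken from the docs):
export network=$(gcloud compute networks list --format='value(name)' --limit=1)
export subnetwork=$(gcloud compute networks subnets list --format='value(name)' --limit=1)
export zone=$(gcloud config get-value compute/zone)
export ssh_key_path=$HOME/.ssh/bosh   # wherever the docs had you write the key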
I've seen a couple users (myself included) hit the following error:
creating stemcell (bosh-google-kvm-ubuntu-trusty-go_agent 3312):
CPI 'create_stemcell' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating stemcell: Creating Google Image from URL: Failed to create Google Image: Post https://www.googleapis.com/compute/v1/projects/cf-bosh-concourse/global/images?alt=json: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: i/o timeout","ok_to_retry":false}
This error means the CPI can't locate your GCP creds. The lookup process for creds is:
1. instance_groups.bosh.properties.google.json_key in the Director manifest
2. Credentials from a previous gcloud auth login call
3. The GCE metadata service at 169.254.169.254
So if the Operator has not specified json_key in their manifest, has not run gcloud auth login at some point, and is running the deployment from outside a GCP instance, then they will get the above error message. It would be nice to display a better error message, but in the meantime maybe search engines will route people here if they hit this error :)
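A rough way to check each source in turn (the second command assumes the SDK lookup goes through application-default credentials; the third only works from inside a GCP instance):
$ echo $GOOGLE_APPLICATION_CREDENTIALS                 # explicit key file, if any
$ gcloud auth application-default print-access-token   # locally cached SDK credentials
$ curl -H 'Metadata-Flavor: Google' 'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token'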
Stemcell 3361.1 deployment fails with BOSH Golang CLI:
bosh create-env bosh-gce.yml -l <(lpass show --note deployments)
...
Deploying:
Creating instance 'bosh/0':
Waiting until instance is ready:
Starting SSH tunnel:
Starting SSH tunnel:
Failed to connect to remote server:
ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
Exit code 1
The deployment succeeds with light stemcell 3312.18.
Successful manifest is here.
The only difference between the successful manifest and the failing manifest is the stemcell URL and SHA.
The BOSH Golang CLI is version 2.0.2-a0c78a5-2017-02-16T01:43:54Z.
Is it possible to use Local SSD for the ephemeral disk mounted at /var/vcap/data? I could only find references to pd-ssd in the examples, which I assume is shorthand for persistent SSD:
disk_pools:
- name: disks
disk_size: 32_768
cloud_properties:
type: pd-ssd
Our present Concourse CI workers are provisioned as AWS C3 instance types, which provide local SSD and drastically better performance for our workloads. We'd like to achieve the same on GCP if possible.
I have come across a use case that requires GPUs on some of the BOSH-deployed VMs in GCP. I'm proposing that the manifest would look something like this:
resource_pools:
- name: common
network: private
stemcell:
name: bosh-google-kvm-ubuntu-trusty-go_agent
version: latest
cloud_properties:
zone: us-east1-d
region: us-east1
machine_type: n1-standard-2
root_disk_size_gb: 20
root_disk_type: pd-standard
accelerator:
type: nvidia-tesla-k80
count: 1
Behavior will not change if the user does not fill out the accelerator properties. During VM creation, we'll pass in something like --accelerator type=nvidia-tesla-k80,count=1.
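For reference, the equivalent standalone gcloud invocation would look something like the sketch below; note that GPU instances cannot live-migrate, so the CPI would also need to set the maintenance policy (instance name is illustrative):
gcloud compute instances create gpu-test-vm \
  --zone us-east1-d \
  --machine-type n1-standard-2 \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --maintenance-policy TERMINATE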
If it's OK with the community, my team can submit a PR for this enhancement. In the meantime, please let me know if there are any questions or concerns.
Thanks,
Victor
Garden defaults to a 1500 MTU. GCP supports a max MTU of 1460. This means a 1461-byte packet will be dropped, degrading network performance.
The manifests for Concourse and Cloud Foundry need to set the garden.network_mtu property.
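A quick way to confirm the ceiling from a VM: with Linux ping, 1432 bytes of ICMP payload plus 28 bytes of headers is exactly 1460, so the first probe below should succeed and the second should fail (target IP is a placeholder):
$ ping -M do -s 1432 <internal ip>   # fits within 1460; -M do forbids fragmentation
$ ping -M do -s 1433 <internal ip>   # exceeds 1460 and should be dropped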
Big thanks to Pivotal folks for finding/helping.