cloudfoundry / bosh-google-cpi-release
BOSH Google CPI
License: Apache License 2.0
Just following the docs, and this output is concerning. Should I be worried?
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ bosh target 10.0.0.6
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Target set to 'micro-google'
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export vip=$(terraform output ip)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export tcp_vip=$(terraform output tcp_ip)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export zone=$(terraform output zone)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export zone_compilation=$(terraform output zone_compilation)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export region=$(terraform output region)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export region_compilation=$(terraform output region_compilation)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export private_subnet=$(terraform output private_subnet)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export compilation_subnet=$(terraform output compilation_subnet)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export network=$(terraform output network)
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$
dashaun_carter@bosh-bastion:/share/docs/cloudfoundry$ export director=$(bosh status --uuid)
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
We have recently been moving some of our development infrastructure over to GCP and were following your guide to deploy bosh and concourse.
Using the terraform script for concourse, we provision a GCP HTTP proxy + global forwarding rule for ingress and load balancing.
We have two problems with this setup: it does not support HTTPS, and fly hijack does not work because tcp:2222 is disallowed by the firewall rules.
We got HTTPS and hijack working with a setup that grants the binary CAP_NET_BIND_SERVICE (so it can bind privileged ports when run as any user). We initially tried to get this setup working by "chaining" GCP-managed load-balancing services (mixing TCP and SSL-terminating HTTPS load balancing), but that didn't look possible.
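For reference, granting the capability is a one-liner; the binary path below is hypothetical (wherever the web/ATC binary lives on your VM):
$ sudo setcap 'cap_net_bind_service=+ep' /var/vcap/packages/atc/bin/atc
$ getcap /var/vcap/packages/atc/bin/atc   # verify the capability stuck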
What do you think?
Trying to ssh into VMs using bosh ssh results in an error:
root@bosh-bastion:~# bosh ssh
RSA 1024 bit CA certificates are loaded due to old openssl compatibility
Acting as user 'admin' on deployment 'cf' on 'micro-google'
Target deployment is 'cf'
Setting up ssh artifacts
Director task 36
Error 450001: Action Failed ssh: Getting host public key: Unable to read host public key file: /etc/ssh/ssh_host_rsa_key.pub: Opening file /etc/ssh/ssh_host_rsa_key.pub: open /etc/ssh/ssh_host_rsa_key.pub: no such file or directory
Task 36 error
Failed to set up SSH: see task 36 log for details
I'm using:
After unmounting the VM's root disk and mounting it on my bastion VM, I was able to check the VM's logs:
Sep 9 07:51:39 localhost sshd[4313]: fatal: No supported key exchange algorithms [preauth]
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_rsa_key
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_ecdsa_key
Sep 9 07:52:04 localhost sshd[4320]: error: Could not load host key: /etc/ssh/ssh_host_ed25519_key
And indeed, ssh host keys are not there:
root@bosh-bastion:~/test/etc/ssh# ls -la
total 256
drwxr-xr-x 2 root root 4096 Aug 31 18:13 .
drwxr-xr-x 85 root root 4096 Sep 9 07:52 ..
-rw-r--r-- 1 root root 242091 May 5 14:16 moduli
-rw-r--r-- 1 root root 1690 May 5 14:16 ssh_config
-rw------- 1 root root 2912 Jun 23 21:29 sshd_config
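A minimal recovery sketch, assuming the root disk is still mounted under ~/test as above and that generating fresh host keys is acceptable:
# ssh-keygen -A generates any missing default host keys (rsa/dsa/ecdsa/ed25519);
# with -f it treats the given directory as the root prefix
$ sudo ssh-keygen -A -f /root/test
# then re-attach the disk to the original VM and restart sshd there
$ sudo service ssh restart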
This link does not take me to an API, and there is no such API called 'Project API' that I can search for.
As part of a VM's cloud properties, it would be nice if I could do two things:
What do y'all think?
On March 7, 2017, VM tags will no longer propagate to VM labels with empty values. This may affect how the CPI labels resources.
To support cross-regional deployments, it would be nice if I could pass a list of target_pools or backend_services, and the CPI would attach whichever of them make sense based on the region and/or AZ of each VM.
For example:
instance_groups:
- name: cloud_controller
azs: ["west1a", "west1b", "central1a"]
instances: 3
cloud_properties: # via vm_extensions
target_pools: ["west", "central"] # with central likely being a backup pool in the load balancer
- name: router
azs: ["west1a", "west1b", "central1a", "central1b"]
instances: 4
cloud_properties: # via vm_extensions
backend_services: ["westsvc", "centralsvc"] # this time we'd probably go global distribution between svcs
When the CPI boots the 3 cloud controllers, it should only attach the "west" pool to the VMs actually in the west region, and likewise for central. The same goes for the backend services.
We ran into this error in our CI and I'm not sure if it is relevant, but am raising an issue just in case. Please feel free to close this issue if you think this isn't an error worth investigating.
10:37:11 | Creating missing vms: doppler/4455bd1a-38b2-47b0-a008-0282e86960df (1) (00:05:19)
L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM: Updating instance metadata with SetMetadata call: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
, metadata value: &compute.Metadata{Fingerprint:"-goaeN2SGek=", Items:[]*compute.MetadataItems{(*compute.MetadataItems)(nil), (*compute.MetadataItems)(nil), (*compute.MetadataItems)(0xc420424780), (*compute.MetadataItems)(0xc4204247b0)}, Kind:"", ForceSendFields:[]string(nil)}: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
' in 'create_vm' CPI method
10:37:11 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Creating VM: Updating instance metadata with SetMetadata call: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
, metadata value: &compute.Metadata{Fingerprint:"-goaeN2SGek=", Items:[]*compute.MetadataItems{(*compute.MetadataItems)(nil), (*compute.MetadataItems)(nil), (*compute.MetadataItems)(0xc420424780), (*compute.MetadataItems)(0xc4204247b0)}, Kind:"", ForceSendFields:[]string(nil)}: Google Operation 'operation-1488364472245-549a8dbaa1b09-e8fb3190-02744b2a' finished with an error: Supplied fingerprint does not match current metadata fingerprint.
' in 'create_vm' CPI method
Started Wed Mar 1 10:25:38 UTC 2017
Finished Wed Mar 1 10:37:11 UTC 2017
Duration 00:11:33
Task 1104 error
Updating deployment:
Expected task '1104' to succeed but was state is 'error'
Exit code 1
Looks like the payload that the CPI receives back from the GCP API cannot be unmarshaled properly into an int because the value was marshaled using floating-point syntax.
This Gist has the output from the bosh deploy step.
I'm running the terraform apply step and I'm hitting:
google_compute_address.cf-tcp: Error creating address: googleapi: Error 403: Quota 'STATIC_ADDRESSES' exceeded. Limit: 1.0, quotaExceeded
I was able to do this without any problems last week but needed to destroy everything and set it up inside a new project. So I'm not sure if the quota decreased for some reason in the past 4 days or if something changed because we're on the free trial. Any ideas?
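For anyone else hitting this, the current regional quota can be inspected with gcloud (region name is just an example; output will look roughly like the lines below). Free-trial projects generally have lower limits, and increases have to be requested through the console:
$ gcloud compute regions describe us-central1 | grep -B1 -A1 STATIC_ADDRESSES
- limit: 1.0
  metric: STATIC_ADDRESSES
  usage: 1.0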
I created a manifest using instance_group: nginx in the subnet cloud_properties section of my cloud config, but I don't see new instances being attached to that group. Are you sure that's wired up correctly internally?
Additionally, it seems weird that ELBs can be attached using vm_types (or vm_extensions, which is even better), but on GCP I have to attach a load balancer to every VM in the network. It's very common to declare only one or two large networks, and I don't want all of those VMs to have load balancers. Could we make Google's instance_group a cloud_property of a BOSH job (now confusingly called a BOSH instance group)?
Here's what BOSH printed out when I did the deploy, so it clearly understood the change:
> bosh deploy --recreate
Detecting deployment changes
----------------------------
networks:
- name: private
subnets:
- range: 10.128.0.0/20
cloud_properties:
instance_group: nginx
Then it recreated the VM:
Started updating job web_server > web_server/0 (18857c48-665f-4600-bd00-c9e6909d6c87) (canary). Done (00:04:51)
But that new web server isn't part of an instance group. I used CPI version 21, director 257.3, and stemcell 3262.2.
Here's the full network section of my cloud_config:
networks:
- name: private
type: manual
subnets:
- dns:
- 169.254.169.254
range: 10.128.0.0/20
gateway: 10.128.0.1
azs:
- us-central1-a
cloud_properties:
network_name: cf
instance_group: nginx
tags:
- cf-bosh
reserved:
- 10.128.0.2
- 10.128.0.10-10.128.15.254
static:
- 10.128.0.8
Let me know if I just did something wrong. Thanks!
Error 100: Unable to render instance groups for deployment. Errors are:
The docs in this repo for deploying Cloud Foundry to GCP are outdated. They may still work, but the document in cf-deployment describes an easier way to deploy to GCP, and it's supported by the Infrastructure and RelInt teams. Would it make sense to replace the doc in this repo with a link to the one from cf-deployment?
definitely can retry dial failures. probably should retry a lot more.
+++
CPI error 'Bosh::Clouds::CloudError' with message 'Creating disk: Failed to find Google Instance 'vm-xxx': Get https://www.googleapis.com/compute/v1/projects/xxx/aggregated/instances?alt=json&filter=name+eq+.%2Avm-xxx: oauth2: cannot fetch token: Post https://accounts.google.com/o/oauth2/token: dial tcp 173.194.74.84:443: i/o timeout' in 'create_disk' CPI method
+++
For dynamic network configuration using bosh-init (or bosh create-env), we require cloud_properties and dns at the root, e.g.
networks:
- name: default
type: dynamic
cloud_properties:
network_name: ((network))
subnetwork_name: ((subnetwork))
dns: 8.8.8.8
However, if we forget these fields (for example, if they are incorrectly nested under subnets, which is ignored by bosh-init/bosh create-env), the CPI will not fail, but will instead create the VM somewhere else (some default network?).
If the default project account has been deleted, using service_scopes in your bosh director manifest results in an obscure error:
CPI 'create_vm' method responded with error: CmdError{"type":"Bosh::Clouds::VMCreationFailed","message":"VM failed to create: Google Operation 'operation-1484172723550-545d8e3f8f730-654d03c4-17847194' finished with an error: The resource '[email protected]' of type 'serviceAccount' was not found.\n","ok_to_retry":true}
This error message is unclear, particularly because the credentials which also need to be specified in the manifest may be associated with another account altogether.
The default service_account for the bosh-google-cpi-release is set to "default" if it is not proactively set by the bosh manifest, so this will happen anytime you use service_scopes instead of a service_account, which is exactly what the bosh-bootloader does in its bosh-init manifest today.
cc: @ljfranklin @JesseTAlford @evanfarrar
Terraform requires permissions above and beyond the ones listed here in the "Grant the new service account editor access to your project" section when creating a firewall rule:
googleapi: Error 403: Required 'compute.firewalls.create'
PR forthcoming...
$ bosh restart db/0 --force
Restart db/*? (type 'yes' to continue): yes
Performing 'restart db/*'...
Director task 256
Started preparing deployment > Preparing deployment. Done (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started updating instance db > db/<instance_guid> (0) (canary). Failed: CPI error 'Bosh::Clouds::CloudError' with message 'Extracting method arguments from payload: Unmarshalling action argument: json: cannot unmarshal number into Go value of type string' in 'snapshot_disk' CPI method (00:00:02)
Error 100: CPI error 'Bosh::Clouds::CloudError' with message 'Extracting method arguments from payload: Unmarshalling action argument: json: cannot unmarshal number into Go value of type string' in 'snapshot_disk' CPI method
Task 256 error
BOSH manifest for Concourse:
---
name: concourse
director_uuid: <director uuid>
releases:
- name: concourse
version: 2.5.0
url: https://bosh.io/d/github.com/concourse/concourse?v=2.5.0
sha1: 0d1f436aad50bb09ac2c809cd6cb6df3e38a7767
- name: garden-runc
version: 1.0.3
url: https://bosh.io/d/github.com/cloudfoundry/garden-runc-release?v=1.0.3
sha1: 0c04b944d50ec778f5b34304fd4bc8fc0ed83b2b
tls_key: &tls_key |
<tls key>
tls_cert: &tls_cert |
<tls cert>
instance_groups:
- name: web
instances: 1
vm_type: web
azs:
- z1
- z2
stemcell: trusty
networks:
- name: public
default:
- dns
- gateway
- name: vip
static_ips:
- <static ip>
jobs:
- name: atc
release: concourse
properties:
external_url: <external url>
publicly_viewable: true
basic_auth_username: buildpacks
basic_auth_password: "<auth password>"
github_auth:
client_id: <client id>
client_secret: <client secret>
postgresql_database: atc
tls_cert: *tls_cert
tls_key: *tls_key
tls_bind_port: 443
- name: tsa
release: concourse
properties: {}
- name: db
instances: 1
vm_type: database
azs:
- z1
stemcell: trusty
persistent_disk_type: database
networks:
- name: public
jobs:
- name: postgresql
release: concourse
properties:
databases:
- name: atc
role: <role>
password: <password>
- name: worker
instances: 6
vm_type: worker
azs:
- z1
stemcell: trusty
networks:
- name: public
jobs:
- name: groundcrew
release: concourse
properties: {}
- name: baggageclaim
release: concourse
properties: {}
- name: garden
release: garden-runc
properties:
garden:
listen_network: tcp
listen_address: 0.0.0.0:7777
network_mtu: 1432
update:
canaries: 1
max_in_flight: 3
serial: false
canary_watch_time: 1000-120000
update_watch_time: 1000-120000
stemcells:
- alias: trusty
os: ubuntu-trusty
version: latest
BOSH cloud config:
azs:
- name: z1
cloud_properties:
zone: us-east1-c
- name: z2
cloud_properties:
zone: us-east1-d
vm_types:
- name: web
cloud_properties:
machine_type: n1-standard-2
root_disk_size_gb: 20
root_disk_type: pd-ssd
- name: database
cloud_properties:
machine_type: n1-standard-4
root_disk_size_gb: 100
root_disk_type: pd-ssd
- name: worker
cloud_properties:
machine_type: n1-standard-4
root_disk_size_gb: 300
root_disk_type: pd-ssd
- name: bosh-lite-worker
cloud_properties:
machine_type: n1-standard-16
root_disk_size_gb: 300
root_disk_type: pd-ssd
tags: [bosh-lite-public]
compilation:
workers: 3
network: public
reuse_compilation_vms: true
az: z1
cloud_properties:
machine_type: n1-standard-4
root_disk_size_gb: 100
root_disk_type: pd-ssd
preemptible: true
networks:
- name: public
type: manual
subnets:
- az: z1
range: <ip range>
gateway: <gateway>
cloud_properties:
network_name: concourse
subnetwork_name: <subnetwork name>
ephemeral_external_ip: true
tags:
- concourse-public
- concourse-internal
- az: z2
range: <ip range>
gateway: <gateway>
cloud_properties:
network_name: concourse
subnetwork_name: <subnetwork name>
ephemeral_external_ip: true
tags:
- concourse-public
- concourse-internal
- name: vip
type: vip
disk_types:
- name: database
disk_size: 100000 #mb
This is with BOSH Google CPI version 25.6.1. We were able to deploy the db VM as part of the Concourse deployment using this CPI version. However, during a re-deploy of the Concourse deployment, this error showed up and the DB VM did not recreate and ended up stopped. Any attempt to bosh recreate, bosh start, or bosh restart the DB VM results in this error message.
I'm trying to use the S3 blobstore with the google CPI, just like I can with other CPIs, and I think the blobstore options aren't being passed through. I'm attempting to pass blobstore options that look like this:
properties:
blobstore:
provider: s3
host: storage.googleapis.com
s3_port: 443
use_ssl: true
bucket_name: dstevenson-bosh
access_key_id: GOOGB4BAQE4743O4YSAS
secret_access_key: not-telling-you
s3_force_path_style: true
s3_multipart_threshold: 1099511627776
s3_signature_version: '2'
But I get the following error, probably from the code inside cpi.json.erb in the CPI:
Running command: 'ruby /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/erb-renderer206693177/erb-render.rb /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/erb-renderer206693177/erb-context.json /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/bosh-init-release576250374/extracted_jobs/google_cpi/templates/config/cpi.json.erb /home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/rendered-jobs472416459/config/cpi.json', stdout: '', stderr: '/home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/erb-renderer206693177/erb-render.rb:189:in `rescue in render': Error filling in template '/home/tempest-web/.bosh_init/installations/52ec4231-444e-4e0e-5383-e34254e7bfd4/tmp/bosh-init-release576250374/extracted_jobs/google_cpi/templates/config/cpi.json.erb' for google_cpi/0 (line 52: #<TemplateEvaluationContext::UnknownProperty: Can't find property 'agent.blobstore.address', or 'blobstore.address'>) (RuntimeError)
Should we just be passing blobstore options directly through to the agent? Supporting google cloud storage for BOSH would be very useful for us.
I want to write firewalls that make sense, like diego_brain needs port 2222 allowed to it. While I could do this with a lot of VM extensions, I might end up with a VM extension that I have to attach to every job, giving it a tag equal to its job name. So... could the CPI automatically tag VMs with their job name?
@cppforlife says that BOSH is about to start passing bosh.env.group_name to the CPI, which will equal the instance_group name (like diego_cell). We could use that! Ideally, we'd also tag a VM with its deployment name, and maybe also a combination of the two together.
This snippet in the Cloud Foundry doc attempts to add multiple roles in a single command. This does not work and ends up only adding the last role.
gcloud projects add-iam-policy-binding ${project_id} \
--member serviceAccount:cf-component@${project_id}.iam.gserviceaccount.com \
--role "roles/editor" \
--role "roles/logging.logWriter" \
--role "roles/logging.configWriter"
This prevents the google-fluentd agent from writing logs out of the box.
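A working alternative is to bind each role in a separate invocation, since add-iam-policy-binding only honors a single --role flag per call:
for role in roles/editor roles/logging.logWriter roles/logging.configWriter; do
  gcloud projects add-iam-policy-binding ${project_id} \
    --member serviceAccount:cf-component@${project_id}.iam.gserviceaccount.com \
    --role "${role}"
done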
Document that unless disks are marked for deletion, they will remain present and will be billed for; compilation VMs would be a good example.
Using 25.2.0 to attempt to launch PCF ERT 1.8 throws the following set of errors:
Failed creating missing vms > nfs_server/0 (dae2dea2-4ed9-4394-b999-54300e493b78): Setting metadata for vm 'vm-3f75ebd3-2588-4028-7ead-36aedd7298ed': Failed to set labels for Google Instance 'vm-3f75ebd3-2588-4028-7ead-36aedd7298ed': googleapi: Error 400: Invalid value for field 'labels.job': 'nfs_server'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:07)
Failed creating missing vms > diego_brain/0 (759e6685-73ce-46f5-93ff-b7356da3e711): Setting metadata for vm 'vm-c15c442d-1190-4ab7-43e9-fd3c3909fdb3': Failed to set labels for Google Instance 'vm-c15c442d-1190-4ab7-43e9-fd3c3909fdb3': googleapi: Error 400: Invalid value for field 'labels.job': 'diego_brain'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:07)
Failed creating missing vms > mysql_proxy/0 (e5ed277a-0757-4486-a150-a5ffddcd0597): Setting metadata for vm 'vm-c7f4a1c9-ea78-4dc2-7056-2a28748876b8': Failed to set labels for Google Instance 'vm-c7f4a1c9-ea78-4dc2-7056-2a28748876b8': googleapi: Error 400: Invalid value for field 'labels.job': 'mysql_proxy'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:07)
Failed creating missing vms > cloud_controller_worker/0 (1d8865ec-bf9e-4a52-a8dc-de8533f53ce9): Setting metadata for vm 'vm-7a669772-2081-4f10-6959-5df06091a3c5': Failed to set labels for Google Instance 'vm-7a669772-2081-4f10-6959-5df06091a3c5': googleapi: Error 400: Invalid value for field 'labels.job': 'cloud_controller_worker'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:08)
Failed creating missing vms > diego_cell/0 (d420a7ef-37f3-4b21-80bc-ddcde0800365): Setting metadata for vm 'vm-748bd91b-4483-43a8-5345-f6adff8486c7': Failed to set labels for Google Instance 'vm-748bd91b-4483-43a8-5345-f6adff8486c7': googleapi: Error 400: Invalid value for field 'labels.job': 'diego_cell'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > consul_server/0 (a47500f9-8538-495c-b0e3-7039856c2f53): Setting metadata for vm 'vm-f2aa1e83-723a-4966-5fdd-6b2ccbe5f5b8': Failed to set labels for Google Instance 'vm-f2aa1e83-723a-4966-5fdd-6b2ccbe5f5b8': googleapi: Error 400: Invalid value for field 'labels.job': 'consul_server'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > clock_global/0 (0ccf6c29-1bc0-493a-8025-add93ef2d443): Setting metadata for vm 'vm-e1b9a38f-b692-43c7-7447-8e0b46eb8ba5': Failed to set labels for Google Instance 'vm-e1b9a38f-b692-43c7-7447-8e0b46eb8ba5': googleapi: Error 400: Invalid value for field 'labels.job': 'clock_global'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > diego_database/0 (e2d9afa0-50b3-4096-a4b8-25332bc1287e): Setting metadata for vm 'vm-66cca01f-bcce-4507-5eb8-df05cd804808': Failed to set labels for Google Instance 'vm-66cca01f-bcce-4507-5eb8-df05cd804808': googleapi: Error 400: Invalid value for field 'labels.job': 'diego_database'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > etcd_server/0 (03417707-57bc-4b80-954c-6dc9dc7d6691): Setting metadata for vm 'vm-98936bba-c4c6-4f43-5648-01600440b112': Failed to set labels for Google Instance 'vm-98936bba-c4c6-4f43-5648-01600440b112': googleapi: Error 400: Invalid value for field 'labels.job': 'etcd_server'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:15)
Failed creating missing vms > cloud_controller/0 (998c903f-8a86-4f7c-a7b4-755d049b3d6f): Setting metadata for vm 'vm-27e82b3d-5c7c-4aec-470d-ff6e07cc8f36': Failed to set labels for Google Instance 'vm-27e82b3d-5c7c-4aec-470d-ff6e07cc8f36': googleapi: Error 400: Invalid value for field 'labels.job': 'cloud_controller'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)?', invalid (00:03:16)
Looks like the Console will allow labels that contain a -; when applying the label, perhaps the CPI could sanitise invalid chars like _ and replace them with -?
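For illustration, the sanitisation being proposed could be as simple as lowercasing, mapping underscores to hyphens, and truncating to the 63-char label limit (a shell sketch of what the CPI would do internally):
$ echo "cloud_controller_worker" | tr '_' '-' | tr '[:upper:]' '[:lower:]' | cut -c1-63
cloud-controller-worker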
VM names should contain identifiable information, not just a UUID.
The CPI seems to set bosh_settings custom metadata; I don't think we need the user_data one.
Integration and BATS tests run in the same subnet and contend for IP addresses. This can lead to flakes. Each test type should have its own subnet.
When deleting a stemcell (e.g. during bosh delete-env), if the image has already been removed then the teardown fails with the following error:
Deleting deployment:
Deleting stemcell from cloud:
CPI 'delete_stemcell' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Deleting stemcell 'stemcell-ab5f5573-62d5-4f72-6620-3ab8f1aed3cf': Google Image 'stemcell-ab5f5573-62d5-4f72-6620-3ab8f1aed3cf' does not exists: \u003cnil cause\u003e","ok_to_retry":false}
Exit code 1
I would expect this to succeed if the image is gone, because that's exactly the operation we're trying to achieve.
For context, this is a problem for us because we accumulate lots of images: we spin up lots of bosh environments in our account, and each deployment results in a new image (stemcell-<guid>). To avoid hitting the limit we asynchronously delete the images, but then our teardown of the bosh director fails with the above error.
cc @cppforlife
Deploying:
Creating instance 'bosh/0':
Creating VM:
Creating vm with stemcell cid 'stemcell-6577b498-26b2-4f43-50a0-372c03558aee':
CPI 'create_vm' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating vm: Failed to find Google Image 'stemcell-6577b498-26b2-4f43-50a0-372c03558aee': Get https://www.googleapis.com/compute/v1/projects/cf-sandbox-lsantos/global/images/stemcell-6577b498-26b2-4f43-50a0-372c03558aee?alt=json: stream error: stream ID 1; PROTOCOL_ERROR","ok_to_retry":false}
Ran into this error twice in the last two days. Rerunning bosh create-env (bosh-init deploy) makes the error go away, but we should probably retry.
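Until the CPI retries internally, a crude outer retry loop works around it (manifest name is a placeholder):
for i in 1 2 3; do
  bosh create-env bosh.yml && break
  echo "create-env failed (attempt $i), retrying..."
done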
When trying to use a global external IP I received the following error:
Error 100: VM failed to create: googleapi: Error 400: Invalid value for field 'resource.networkInterfaces[0].accessConfigs[0]': ''. Specified external IP address not found., invalid
There are two problems here:
After deleting a persistent disk from the GCP UI, I was unable to return to an operational state through bosh cck or bosh deploy.
I believe this is related to a missing has_disk here: https://github.com/cloudfoundry-incubator/bosh-google-cpi-release/tree/master/src/bosh-google-cpi/action.
When we try to deploy with bosh-init while on the GCP jumpbox that is created via the Concourse terraform templates, we get the following error at the end:
Command 'deploy' failed:
Deploying:
Creating instance 'bosh/0':
Waiting until instance is ready:
Starting SSH tunnel:
Failed to connect to remote server:
ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
We are using stemcell bosh-google-kvm-ubuntu-trusty-go_agent version 3262.14 and Google CPI version 25.2.1. We have verified that the private key is in the jumpbox's ~/.ssh directory and the public key is in the SSH keys section of the project metadata.
Another user seemed to report the exact same issue with using stemcell 3262.14 on the Cloud Foundry Slack: https://cloudfoundry.slack.com/archives/bosh-gce-cpi/p1474256273000136
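One way to narrow this down is to take bosh-init out of the loop and attempt the handshake directly; -vvv shows which keys the client offers and why the server rejects them (placeholders as elsewhere in these docs):
$ ssh -vvv -i ~/.ssh/<private key> <ssh user>@<director ip>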
I followed the steps here and used Terraform to deploy the infrastructure, but the whole startup script failed. The following was in /var/log/startupscript.log:
Running startup script /var/run/google.startup.script
/usr/share/google/run-scripts: /tmp/tmp.SA6GU6Llif: /bin/bash^M: bad interpreter: No such file or directory
Finished running startup script /var/run/google.startup.script
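The ^M indicates the script picked up Windows carriage returns somewhere along the way. A minimal fix before uploading (file name is illustrative):
$ sed -i 's/\r$//' startup-script.sh   # or: dos2unix startup-script.sh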
We're coming up on a situation where we want a few VMs in a network to have public IPs, but not the rest of them. We may also need to tag them differently, depending on our desired firewall rules.
Could we make the following CPI network params also overridable on a VM type? Presumably the VM type would take precedence over the network settings.
Thank you!
To allow tagging of all VMs with a specific set of tags, to enable setting security groups: global tags would be added on top of tags specified via the env arg in create_vm. wdyt about google.default_tags: [tag1, tag2]?
This prevents them from being registered with a load balancer.
root@74c4feff-fbe2-4454-86af-530ba4413127:~# ip route list table local
local 10.1.0.4 dev eth0 proto kernel scope host src 10.1.0.4
broadcast 10.1.0.4 dev eth0 proto kernel scope link src 10.1.0.4
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
root@74c4feff-fbe2-4454-86af-530ba4413127:~# sudo service google-address-manager status
google-address-manager stop/waiting
root@74c4feff-fbe2-4454-86af-530ba4413127:~# sudo service google-address-manager start
google-address-manager start/running, process 7046
root@74c4feff-fbe2-4454-86af-530ba4413127:~# sudo service google-address-manager status
google-address-manager start/running, process 7046
root@74c4feff-fbe2-4454-86af-530ba4413127:~# ip route list table local
local 10.1.0.4 dev eth0 proto kernel scope host src 10.1.0.4
broadcast 10.1.0.4 dev eth0 proto kernel scope link src 10.1.0.4
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
local 130.211.186.50 dev eth0 proto 66 scope host
root@74c4feff-fbe2-4454-86af-530ba4413127:~#
Note the addition of the 130.211.186.50 route after starting it; that's our forwarding rule.
Stemcell version: bosh-google-kvm-ubuntu-trusty-go_agent 3262.15* ubuntu-trusty stemcell-44d856f5-5749-41a3-7e30-252d9668c622
BOSH/CPI versions:
- name: bosh
url: https://bosh.io/d/github.com/cloudfoundry/bosh?v=257.3
sha1: e4442afcc64123e11f2b33cc2be799a0b59207d0
- name: bosh-google-cpi
url: https://bosh.io/d/github.com/cloudfoundry-incubator/bosh-google-cpi-release?v=24.4.0
sha1: 2c13a452f76e27a101b287b61cc24851541aac18
- name: uaa
url: https://bosh.io/d/github.com/cloudfoundry/uaa-release?v=13
sha1: 5229c6e8793c4061950b4e0738fc66612bc016ba
After starting the agent the VM became healthy and traffic started flowing again.
Context: I'm running Concourse with a Windows VM worker. Twice now, the Windows VM has been deleted (for reasons that I don't understand, but that's another story) and the BOSH resurrector has failed to be able to restart the VM.
The CPI command from the bosh logs is:
D, [2017-02-14 02:30:26 #23203] [task:4390] DEBUG -- DirectorJobRunner: External CPI sending request: {"method":"create_vm","arguments":["30ffe17c-0843-44e3-9f2e-c2d118b4f6f0","https://www.googleapis.com/compute/v1/projects/cf-greenhouse-mustang/global/images/packer-1484833635",{"zone":"us-east1-b","machine_type":"n1-standard-1","root_disk_size_gb":10,"root_disk_type":"pd-ssd"},{"private":{"ip":"10.0.16.7","netmask":"255.255.240.0","cloud_properties":{"ephemeral_external_ip":true,"network_name":"bbl-env-manitoba-2017-02-08t00-43z-network","subnetwork_name":"bbl-env-manitoba-2017-02-08t00-43z-subnet","tags":["bbl-env-manitoba-2017-02-08t00-43z-internal"]},"default":["dns","gateway"],"gateway":"10.0.16.1"}},[],{"bosh":{"group":"bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse-worker-windows","groups":["bosh-bbl-env-manitoba-2017-02-08t00-43z","concourse","worker-windows","bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse","concourse-worker-windows","bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse-worker-windows"]}}],"context":{"director_uuid":"<redacted>"}} with command: /var/vcap/jobs/google_cpi/bin/cpi
The important part is "root_disk_size_gb":10, which is not at all correct, and as a result receives this error:
Requested disk size cannot be smaller than the image size (50 GB), invalid
The fuller message is (with newlines unescaped):
D, [2017-02-14 02:30:36 #23203] [task:4390] DEBUG -- DirectorJobRunner: External CPI got response: {"result":null,"error":{"type":"Bosh::Clouds::VMCreationFailed","message":"VM failed to create: googleapi: Error 400: Invalid value for field 'resource.disks[0].initializeParams.diskSizeGb': '10'. Requested disk size cannot be smaller than the image size (50 GB), invalid","ok_to_retry":true},"log":"[File System] 2017/02/14 02:30:26 DEBUG - Reading file /var/vcap/jobs/google_cpi/config/cpi.json
[File System] 2017/02/14 02:30:26 DEBUG - Read content
********************
{\"cloud\":{\"plugin\":\"google\",\"properties\":{\"google\":{\"project\":\"flavorjones-oss-concourse\",\"json_key\":\"{\
\\\"type\\\": \\\"service_account\\\",\
\\\"project_id\\\": \\\"flavorjones-oss-concourse\\\",\
\\\"private_key_id\\\": \\\"<redacted>\\\",\
\\\"private_key\\\": \\\"-----BEGIN PRIVATE KEY-----\\\
<redacted>\\\
-----END PRIVATE KEY-----\\\
\\\",\
\\\"client_email\\\": \\\"bbl-service-account@flavorjones-oss-concourse.iam.gserviceaccount.com\\\",\
\\\"client_id\\\": \\\"<redacted>\\\",\
\\\"auth_uri\\\": \\\"https://accounts.google.com/o/oauth2/auth\\\",\
\\\"token_uri\\\": \\\"https://accounts.google.com/o/oauth2/token\\\",\
\\\"auth_provider_x509_cert_url\\\": \\\"https://www.googleapis.com/oauth2/v1/certs\\\",\
\\\"client_x509_cert_url\\\": \\\"https://www.googleapis.com/robot/v1/metadata/x509/bbl-service-account%40flavorjones-oss-concourse.iam.gserviceaccount.com\\\"\
}\
\",\"default_root_disk_size_gb\":0,\"default_root_disk_type\":\"\"},\"registry\":{\"use_gce_metadata\":true},\"agent\":{\"ntp\":[\"169.254.169.254\"],\"blobstore\":{\"provider\":\"dav\",\"options\":{\"endpoint\":\"http://10.0.0.6:25250\",\"user\":\"<redacted>\",\"password\":\"<redacted>\"}},\"mbus\":\"nats://<redacted>:<redacted>@10.0.0.6:4222\"}}}}
********************
[json] 2017/02/14 02:30:26 DEBUG - Request bytes
********************
{\"method\":\"create_vm\",\"arguments\":[\"30ffe17c-0843-44e3-9f2e-c2d118b4f6f0\",\"https://www.googleapis.com/compute/v1/projects/cf-greenhouse-mustang/global/images/packer-1484833635\",{\"zone\":\"us-east1-b\",\"machine_type\":\"n1-standard-1\",\"root_disk_size_gb\":10,\"root_disk_type\":\"pd-ssd\"},{\"private\":{\"ip\":\"10.0.16.7\",\"netmask\":\"255.255.240.0\",\"cloud_properties\":{\"ephemeral_external_ip\":true,\"network_name\":\"bbl-env-manitoba-2017-02-08t00-43z-network\",\"subnetwork_name\":\"bbl-env-manitoba-2017-02-08t00-43z-subnet\",\"tags\":[\"bbl-env-manitoba-2017-02-08t00-43z-internal\"]},\"default\":[\"dns\",\"gateway\"],\"gateway\":\"10.0.16.1\"}},[],{\"bosh\":{\"group\":\"bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse-worker-windows\",\"groups\":[\"bosh-bbl-env-manitoba-2017-02-08t00-43z\",\"concourse\",\"worker-windows\",\"bosh-bbl-env-manitoba-2017-02-08t00-43z-concourse\",\"concourse-worker-windows\",\"bosh-bbl-env-manitoba-2017-02-"}, err: , exit_status: pid 23257 exit 0
This is strange because the manifest sets a disk size of 50GB:
- name: worker-windows
instances: 1
vm_type: m3.medium
vm_extensions:
- 50GB_ephemeral_disk
stemcell: windows
azs: [z1]
networks: [{name: private}]
jobs:
- name: concourse_windows
release: concourse-windows-worker
properties:
concourse_windows:
tsa_host: ci.nokogiri.org
tsa_public_key: ((tsa-host-public-key))
tsa_worker_private_key: ((windows-worker-private-key))
When I re-run bosh deploy, the VM does get recreated correctly with a 50GB disk.
Any ideas what's going on? What other information can I provide?
I have been working through https://github.com/cloudfoundry-incubator/bosh-google-cpi-release/tree/master/docs/concourse and when it's creating the web VM it fails with VM failed to create: Backend Service "concourse" does not exist: <nil cause>
Done creating missing vms > db/0 (a9cdb68d-9f11-4bea-af69-750c9f390e11) (00:01:01)
Done creating missing vms > worker/0 (f3ca4d71-6503-4545-b446-aa47e6395426) (00:01:01)
Failed creating missing vms > web/0 (f06432bb-1275-44ff-ad86-b55d0eda9697): VM failed to create: Backend Service "concourse" does not exist: (00:15:01)
Failed creating missing vms (00:15:01)
Codewise, this relates to https://github.com/cloudfoundry-incubator/bosh-google-cpi-release/blob/e84f61b28ec0a98cff7f1628f7d930f598f939fc/src/bosh-google-cpi/google/backendservice_service/google_backendservice_service.go#L122
It's clearly expecting a backend service to be configured, but having followed the documented setup from the README.md, if I subsequently run gcloud compute backend-services list it returns Listed 0 items. Any ideas what might be missing here?
We used the concourse terraform to create a concourse instance. It deployed perfectly!
However, when we were watching jobs through the browser, the jobs were not updating. It wasn't until we reloaded a page that we realized a job had completed five minutes earlier.
After investigating, we found that the HTTP(S) load balancer created does not support WebSocket and SSE events correctly. According to the docs, these are not supported unless you use a TCP load balancer or set up forwarding rules.
We'd like to help fix this, as we are going to hack around the terraform configurations to make it work, which nullifies the usefulness of terraform for rolling back nicely.
Hey GCP CPI developers -
I noticed some inconsistencies when using the Google CPI next to the AWS, OpenStack, and vSphere CPIs. There are a few properties that aren't named the same with the Google CPI; here's what they are:
- The agent: properties section seems to expect blobstore and ntp settings. Usually these settings are at the root level and are pulled in from there by the other CPIs.
- The agent section has a similar inconsistency in it: usually the blobstore and ntp settings are not inside the agent section.
- The CPI expects blobstore_options like this: {provider: "local", options: {blobstore_path: "/var/vcap/micro_bosh/data/cache"}}. Other CPIs have the blobstore path more at the root level, like this: {provider: "local", path: "/var/vcap/micro_bosh/data/cache"}
If y'all help make those changes, it'll be much easier for existing BOSH users of another IaaS to learn GCP. I've already had 5 or 6 people run into this, including me. Thanks!!!
In the cloudfoundry example, the application security groups are wide open. The recommended groups disallow internal networks. This is very important because, as currently configured, 1) containers can see host metadata from the google metadata server, and 2) containers can see the entire bosh subnet, including internal components that should not be viewable to host applications.
Deploying a BOSH director in a new Google project, following the instructions here. Love the documentation; very easy to follow and set up!
We got all the way to deploying bosh when bosh-init failed, because the manifest.yml file did not have the network, subnetwork, zone, and ssh_key_path environment variables to interpolate into the erb file.
We ended up manually setting those env vars and getting the manifest into a good state, but I imagine there might be a better, automatic way of finding those values through the gcloud CLI.
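For example, something along these lines might work, assuming a single network/subnetwork and a configured default zone (names and paths are guesses, not taken from the docs):
export network=$(gcloud compute networks list --format='value(name)' --limit=1)
export subnetwork=$(gcloud compute networks subnets list --format='value(name)' --limit=1)
export zone=$(gcloud config get-value compute/zone)
export ssh_key_path=$HOME/.ssh/bosh   # wherever the docs had you write the key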
I've seen a couple users (myself included) hit the following error:
creating stemcell (bosh-google-kvm-ubuntu-trusty-go_agent 3312):
CPI 'create_stemcell' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"Creating stemcell: Creating Google Image from URL: Failed to create Google Image: Post https://www.googleapis.com/compute/v1/projects/cf-bosh-concourse/global/images?alt=json: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token: dial tcp 169.254.169.254:80: i/o timeout","ok_to_retry":false}
This error means the CPI can't locate your GCP creds. The lookup process for creds is:
1. instance_groups.bosh.properties.google.json_key in the Director manifest
2. Credentials from a previous gcloud auth login call
3. The GCE metadata service at 169.254.169.254
So if the Operator has not specified json_key in their manifest, has not run gcloud auth login at some point, and is running the deployment from outside a GCP instance, then they will get the above error message. It would be nice to display a better error message, but in the meantime maybe search engines will route people here if they hit this error :)
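A rough way to check each source in turn (the second command assumes the SDK lookup goes through application-default credentials; the third only works from inside a GCP instance):
$ echo $GOOGLE_APPLICATION_CREDENTIALS                 # explicit key file, if any
$ gcloud auth application-default print-access-token   # locally cached SDK credentials
$ curl -H 'Metadata-Flavor: Google' 'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token'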
Stemcell 3361.1 deployment fails with BOSH Golang CLI:
bosh create-env bosh-gce.yml -l <(lpass show --note deployments)
...
Deploying:
Creating instance 'bosh/0':
Waiting until instance is ready:
Starting SSH tunnel:
Starting SSH tunnel:
Failed to connect to remote server:
ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
Exit code 1
The deployment succeeds with light stemcell 3312.18.
Successful manifest is here.
The only difference between the successful manifest and the failing manifest is the stemcell URL and SHA.
The BOSH Golang CLI is version 2.0.2-a0c78a5-2017-02-16T01:43:54Z.
Is it possible to use Local SSD for the ephemeral disk mounted at /var/vcap/data? I could only find references to pd-ssd in the examples, which I assume is shorthand for persistent SSD:
disk_pools:
- name: disks
disk_size: 32_768
cloud_properties:
type: pd-ssd
Our present Concourse CI workers are provisioned as AWS C3 instance types, which provide local SSD and drastically better performance for our workloads. We'd like to achieve the same on GCP if possible.
I have come across a use case that requires GPUs on some of the BOSH-deployed VMs in GCP. I'm proposing that the manifest would look something like this:
resource_pools:
- name: common
network: private
stemcell:
name: bosh-google-kvm-ubuntu-trusty-go_agent
version: latest
cloud_properties:
zone: us-east1-d
region: us-east1
machine_type: n1-standard-2
root_disk_size_gb: 20
root_disk_type: pd-standard
accelerator:
type: nvidia-tesla-k80
count: 1
Behavior will not change if the user does not fill out the accelerator properties. During VM creation, we'll pass in something like --accelerator type=nvidia-tesla-k80,count=1.
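For reference, the equivalent standalone gcloud invocation would look something like the sketch below; note that GPU instances cannot live-migrate, so the CPI would also need to set the maintenance policy (instance name is illustrative):
gcloud compute instances create gpu-test-vm \
  --zone us-east1-d \
  --machine-type n1-standard-2 \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --maintenance-policy TERMINATE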
If it's OK with the community, my team can submit a PR for this enhancement. In the meantime, please let me know if there are any questions or concerns.
Thanks,
Victor
Garden defaults to a 1500 MTU. GCP supports a max MTU of 1460. This means a 1461-byte packet will be dropped, degrading network performance.
The manifests for Concourse and Cloud Foundry need to set the garden.network_mtu property.
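A quick way to confirm the ceiling from a VM: with Linux ping, 1432 bytes of ICMP payload plus 28 bytes of headers is exactly 1460, so the first probe below should succeed and the second should fail (target IP is a placeholder):
$ ping -M do -s 1432 <internal ip>   # fits within 1460; -M do forbids fragmentation
$ ping -M do -s 1433 <internal ip>   # exceeds 1460 and should be dropped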
Big thanks to Pivotal folks for finding/helping.