
gardener / gardener-extension-provider-gcp


Gardener extension controller for the GCP cloud provider (https://cloud.google.com).

Home Page: https://gardener.cloud

License: Apache License 2.0

Shell 2.63% Dockerfile 0.13% Makefile 0.90% Go 94.58% Smarty 0.25% HCL 1.02% Python 0.48%

gardener-extension-provider-gcp's Introduction


Project Gardener implements the automated management and operation of Kubernetes clusters as a service. Its main principle is to leverage Kubernetes concepts for all of its tasks.

Recently, most of the vendor-specific logic has been developed in-tree. However, the project has grown to a size where it is very hard to extend, maintain, and test. With GEP-1 we have proposed how the architecture can be changed in a way to support external controllers that contain their very own vendor specifics. This way, we can keep Gardener core clean and independent.

This controller implements Gardener's extension contract for the GCP provider.

An example for a ControllerRegistration resource that can be used to register this controller to Gardener can be found here.

Please find more information regarding the extensibility concepts and a detailed proposal here.

Supported Kubernetes versions

This extension controller supports the following Kubernetes versions:

Version Support Conformance test results
Kubernetes 1.30 1.30.0+ N/A
Kubernetes 1.29 1.29.0+ Gardener v1.29 Conformance Tests
Kubernetes 1.28 1.28.0+ Gardener v1.28 Conformance Tests
Kubernetes 1.27 1.27.0+ Gardener v1.27 Conformance Tests
Kubernetes 1.26 1.26.0+ Gardener v1.26 Conformance Tests
Kubernetes 1.25 1.25.0+ Gardener v1.25 Conformance Tests

Please take a look here to see which versions are supported by Gardener in general.


How to start using or developing this extension controller locally

You can run the controller locally on your machine by executing make start.

Static code checks and tests can be executed by running make verify. We are using Go modules for Golang package dependency management and Ginkgo/Gomega for testing.

Feedback and Support

Feedback and contributions are always welcome. Please report bugs or suggestions as GitHub issues or join our Slack channel #gardener (please invite yourself to the Kubernetes workspace here).

Learn more!

Please find further resources about our project here:


gardener-extension-provider-gcp's Issues

Add support for Node Accelerators (GPU/TPU)

How to categorize this issue?
/component mcm
/kind enhancement
/priority normal
/platform gcp

What would you like to be added:
In order to allow the usage of GPU/TPU nodes on GCP, we need to set the guest_accelerators when machines are created.

In Terraform speak this would look something like this:

    guest_accelerator = [
      {
        type = "${var.gpu_type}"
        count = "${var.gpu_per_node}"
      }]

Of course, this needs to be translated into the corresponding Go SDK calls.
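
For illustration, a minimal sketch using the Compute Engine Go client (google.golang.org/api/compute/v1); the accelerator fields are part of the real API, while how the values are wired from the worker configuration is an assumption:

package gcp

import (
	compute "google.golang.org/api/compute/v1"
)

// withGuestAccelerators attaches GPU/TPU accelerators to an instance definition.
// gpuType and gpuPerNode are assumed to come from the worker pool provider
// config, mirroring var.gpu_type and var.gpu_per_node above.
func withGuestAccelerators(instance *compute.Instance, zone, gpuType string, gpuPerNode int64) {
	instance.GuestAccelerators = []*compute.AcceleratorConfig{
		{
			// Accelerator types are zonal resources, e.g.
			// "zones/<zone>/acceleratorTypes/nvidia-tesla-t4".
			AcceleratorType:  "zones/" + zone + "/acceleratorTypes/" + gpuType,
			AcceleratorCount: gpuPerNode,
		},
	}
	// GPU instances must not live-migrate on host maintenance.
	if instance.Scheduling == nil {
		instance.Scheduling = &compute.Scheduling{}
	}
	instance.Scheduling.OnHostMaintenance = "TERMINATE"
}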

Why is this needed:
The use case here is to allow ML workloads on GCP clusters.

Implement Quota friendly Cloud Router sharing for CloudNATs.

Problem
Currently it is possible to utilize GCP CloudNAT to allow external connectivity from VMs to the outside world without exposing a public IP per VM. The problem with the current approach is that there is a hard quota limit of 5 Cloud Routers per VPC, which means that only 5 shoots can share the same VPC (https://cloud.google.com/router/quotas).

Possible solution
To circumvent this issue, we can have two scenarios:

  • If the VPC does not exist yet, we simply create the Cloud Router for the VPC; if newer shoots would like to share the VPC, they can provide the created Cloud Router name during shoot creation.
  • If the VPC already exists, it will be required to provide the CloudRouter name along with the VPC; in that case the CloudNAT for the new shoot will share the CloudRouter specified in the name field.
  • If a VPC already exists and a CloudRouter name is not provided, then VMs are created the old way (with a public IP) until a deadline after which the use of a CloudRouter becomes mandatory.
    • In two months' time, this adaptation code will be removed.

Changes required

  • Expose the CloudRouter Name in the API.
  • Differentiate between the two cases in code (i.e., decide whether a CloudRouter needs to be created or re-used).

part of #3

/cc @DockToFuture @rfranzke @vlerenc

Specify volumeBindingMode:WaitForFirstConsumer in default storage class

/area storage
/kind enhancement
/priority normal
/platform gcp

What would you like to be added:
PVs shall be created in the zone that the pod will be scheduled to.

Why is this needed:
From https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode:

By default, the Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created. For storage backends that are topology-constrained and not globally accessible from all Nodes in the cluster, PersistentVolumes will be bound or provisioned without knowledge of the Pod's scheduling requirements. This may result in unschedulable Pods.

We use Immediate, but should rather use WaitForFirstConsumer, wouldn't you agree?
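
For illustration, a sketch (in Go, using the upstream k8s.io/api types) of what such a default storage class could look like with WaitForFirstConsumer; the provisioner name matches the GCE PD CSI driver referenced elsewhere in this repository, while the class name and annotation usage are assumptions:

package storage

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultStorageClass returns a StorageClass whose volume binding is delayed
// until a consuming Pod is scheduled, so the PV is provisioned in that Pod's zone.
func defaultStorageClass() *storagev1.StorageClass {
	bindingMode := storagev1.VolumeBindingWaitForFirstConsumer
	return &storagev1.StorageClass{
		ObjectMeta: metav1.ObjectMeta{
			Name: "default", // assumed class name
			Annotations: map[string]string{
				"storageclass.kubernetes.io/is-default-class": "true",
			},
		},
		Provisioner:       "pd.csi.storage.gke.io",
		VolumeBindingMode: &bindingMode,
	}
}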

Implement `Infrastructure` controller for GCP provider

Similar to how we have implemented the Infrastructure extension resource controller for the AWS provider, let's please now do it for GCP.

Based on the current implementation the InfrastructureConfig should look like this:

apiVersion: gcp.provider.extensions.gardener.cloud/v1alpha1
kind: InfrastructureConfig
networks:
# vpc:
#   name: some-existing-vpc-name
  internal: 10.250.112.0/22
  workers: 10.250.0.0/19

Based on the current implementation the InfrastructureStatus should look like this:

---
apiVersion: gcp.provider.extensions.gardener.cloud/v1alpha1
kind: InfrastructureStatus
networks:
  vpc:
    name: network-name
  subnets:
  - purpose: nodes
    name: my-subnet
  - purpose: internal
    name: other-subnet
serviceAccountEmail: serviceaccount@email

The current infrastructure creation/deletion implementation can be found here. Please try to change as little as possible (with every change the risk that we break something increases!) and just move the code over into the extensions infrastructure actuator.
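
For orientation, a very rough sketch of the shape such an infrastructure actuator takes; the type and method names here are illustrative placeholders, the real interface and resource types come from Gardener's extensions library:

package infrastructure

import "context"

// Infrastructure is a placeholder for the extensions Infrastructure resource
// (region, provider config, secret reference to the GCP service account, ...).
type Infrastructure struct {
	Region         string
	ProviderConfig []byte
}

// Actuator is the illustrative contract an infrastructure controller fulfills:
// Reconcile creates/updates VPC, subnets and service account for a shoot,
// Delete tears them down again.
type Actuator interface {
	Reconcile(ctx context.Context, infra *Infrastructure) error
	Delete(ctx context.Context, infra *Infrastructure) error
}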

cloudNAT settings can't be changed

What happened:
I tried changing the cloudNAT setting in the shoot of one of my team's clusters, as described in https://github.com/gardener/gardener-extension-provider-gcp/blob/master/docs/usage-as-end-user.md. When I tried to 'Save' my changes, the following error appeared: "admission webhook "validation.gcp.provider.extensions.gardener.cloud" denied the request: spec.provider.infrastructureConfig.networks: Invalid value: gcp.NetworkConfig{VPC:(*gcp.VPC)(nil), CloudNAT:(*gcp.CloudNAT)(0xc000499b20), Internal:(*string)(nil), Worker:"10.250.0.0/19", Workers:"", FlowLogs:(*gcp.FlowLogs)(nil)}: field is immutable"

What you expected to happen:
I expect the modification of the cloudNAT settings to be allowed.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version: 1.5.1
  • Kubernetes version (use kubectl version): 1.16.8
  • Cloud provider or hardware configuration:
  • Others:

Enable Private Google access for VPC subnets

What would you like to be added:
Private Google Access should be enabled by default for VPC subnets created by Gardener.

Why is this needed:
Private Google access allows VMs to reach Google APIs and services using an internal IP address instead of an external IP address. Also, this setting is recommended in the Google Cloud Platform Security Procedure.
https://wiki.wdf.sap.corp/wiki/display/itsec/Google+Cloud+Platform+Security+Procedure#GoogleCloudPlatformSecurityProcedure-1.40.02VPCNetworkconfiguration

Minimal Permissions for user credentials

From gardener-attic/gardener-extensions#133

We have narrowed down the access permissions for AWS shoot clusters (potential remainder tracked in #178), but not yet for Azure, GCP and OpenStack, which this ticket is now about. We expect less success on these infrastructures, as AWS's permission/policy options are very detailed. This may break the "shared account" idea on these infrastructures (Azure and GCP - OpenStack can be mitigated by programmatically creating tenants on the fly).

[GCP] Deletion of a shoot cluster deletes all firewall rules in the same VPC and svc account

If two or more shoot clusters share the same VPC and are created by the same service account, and one of these clusters gets deleted, all firewall rules for the VPC and service account get deleted as well.

Check:
https://github.com/gardener/gardener-extensions/blob/master/controllers/provider-gcp/pkg/internal/infrastructure/infrastructure.go#L35
and
https://github.com/gardener/gardener-extensions/blob/master/controllers/provider-gcp/pkg/internal/infrastructure/infrastructure.go#L70

We should add an additional filter including the shoot's name when listing firewall rules for deletion; currently we only use the following prefixes (a sketch follows the constants below):

const (
	KubernetesFirewallNamePrefix string = "k8s"
	shootPrefix                  string = "shoot--"
)
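
A minimal sketch of the proposed behaviour using the Compute API Go client and reusing the constants shown above; listing and deleting firewalls with these calls is real API usage, while the shoot-specific name check is an assumption about how the additional filter could look:

package infrastructure

import (
	"context"
	"strings"

	compute "google.golang.org/api/compute/v1"
)

// deleteShootFirewalls removes only those firewall rules that carry one of the
// known prefixes AND belong to the given shoot, instead of everything in the VPC.
func deleteShootFirewalls(ctx context.Context, svc *compute.Service, project, shootID string) error {
	return svc.Firewalls.List(project).Pages(ctx, func(page *compute.FirewallList) error {
		for _, fw := range page.Items {
			hasPrefix := strings.HasPrefix(fw.Name, KubernetesFirewallNamePrefix) ||
				strings.HasPrefix(fw.Name, shootPrefix)
			// Assumed shoot-specific check; the real implementation might
			// instead match on the network URL or a dedicated naming scheme.
			if !hasPrefix || !strings.Contains(fw.Name, shootID) {
				continue
			}
			if _, err := svc.Firewalls.Delete(project, fw.Name).Context(ctx).Do(); err != nil {
				return err
			}
		}
		return nil
	})
}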

Running `make install` returns an error

What happened:
make install returns the following error when run locally

(screenshot of the error output omitted)

What you expected to happen:
Running make install should not return an error

How to reproduce it (as minimally and precisely as possible):

  1. Clone the master branch
  2. Run make install

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

Default VolumeSnapshotClass is missing

What would you like to be added:

A default VolumeSnapshotClass.

Why is this needed:

A VolumeSnapshotClass describes the storage type for VolumeSnapshots, just like a StorageClass does for PersistentVolumes. It would be great if the cluster came with a default VolumeSnapshotClass, so volume snapshotting would work out of the box.

Update credentials during Worker deletion

From gardener-attic/gardener-extensions#523

Steps to reproduce:

  1. Create a Shoot with valid cloud provider credentials my-secret.
  2. Ensure that the Shoot is successfully created.
  3. Invalidate the my-secret credentials.
  4. Delete the Shoot.
  5. Update my-secret credentials with valid ones.
  6. Ensure that the Shoot deletion fails while waiting for the Worker to be deleted.

Currently we do not sync the cloudprovider credentials into the <Provider>MachineClass during Worker deletion. Hence, machine-controller-manager fails to delete the machines because it still uses the invalid credentials.

Extend internal GCP Shoot cluster capabilities

What would you like to be added:
The Gardener GCP provider already allows passing configuration for services of type LoadBalancer which are backed by internal load balancers. The user can specify in the infrastructure config, via .internal, a CIDR range within the VPC which should be used to pick IP addresses for internal load balancer services.
The extension will create a subnet in the VPC with the .internal CIDR.

I propose to extend this approach to allow users to specify an existing subnet, which can also be used to pick IP addresses for internal load balancer services.
In addition, it could also make sense to deploy Shoot clusters with only internal load balancer services, as there could be scenarios which require that for security/isolation reasons (those scenarios would of course require that the seed hosting the control plane can access these environments).

The InfrastructureConfig could look like this:

apiVersion: gcp.provider.extensions.gardener.cloud/v1beta1
kind: InfrastructureConfig
networks:
  vpc:
    name: myvpc
  ...
internal:
  cidr: 10.251.0.0/16       
  subnet: subnet-name-in-myvpc   
  internalOnly: true|false
  localAccess: true|false
...

Either .internal.cidr or .internal.subnet can be specified.
The .internal.internalOnly flag specifies that all load balancer services in the cluster need to be internal ones (including vpn-shoot). That can be enforced and/or validated via webhooks.
The .internal.localAccess flag could be used to limit access to the internal load balancers to within the VPC.

The following annotations need to be set on the services:

networking.gke.io/load-balancer-type: Internal
networking.gke.io/internal-load-balancer-allow-global-access: true      # enforced to false via webhook when `.internal.localAccess == true`
networking.gke.io/internal-load-balancer-subnet: subnet-name-in-myvpc   # subnet name enforced via webhook when `.internal.subnet` is set

The annotation networking.gke.io/internal-load-balancer-subnet is currently available as an alpha feature.
To enable it, the cloud provider config passed to the GCP cloud-controller-manager needs to contain alpha-features="ILBCustomSubnet".

Why is this needed:
There are scenarios where users need to create a VPC upfront with a subnet inside that is routable in other contexts, e.g. internal networks.

cc @DockToFuture

Need the ability to create nodes on an existing VPC/subnet

What would you like to be added:
Although it is possible to configure Gardener clusters to create nodes in a particular VPC, it still attempts to create a new subnet in the VPC. We would like to be able to specify both the name of an existing VPC and the name of an existing subnet within the VPC when creating a new cluster.
Why is this needed:
We have an existing VPC and subnet configured with PCI connectivity to the SAP corporate network. We need to configure all of our Kubernetes clusters to have nodes on the existing subnet.

GCP support additional volume

What would you like to be added:
In the worker object, there are two fields, Volume and DataVolumes.
(screenshot omitted)

But for GCP disks, only the Volume field is used.
(screenshot omitted)

I need two volumes:

  • Root volume

  • Local SSD volume

Can I submit a Pull Request to support it?

Support shared VPC in gardener

How to categorize this issue?
/area networking
/kind enhancement
/priority 3
/platform gcp

What would you like to be added: We tried to create a Gardener cluster using a GCP shared VPC; it fails to create the cluster because Gardener cannot detect the shared VPC in the service project.
The current VPC config in the Gardener infrastructure points to a VPC located in the local service project; it cannot detect VPCs shared by the host project.

Why is this needed:
Shared VPC lets organization administrators delegate administrative responsibilities, such as creating and managing instances, to Service Project Admins while maintaining centralized control over network resources like subnets, routes, and firewalls.
This is also the way Google recommends for organizations that want more secure network settings and efficient communication.

[GCP] Expose configuration for minimum ports per VM instance in CloudNAT

With gardener-attic/gardener-extensions#417 the shoot worker nodes get outbound network connectivity via CloudNAT instead of an external IP per node. By default, CloudNAT has a minimum of 64 ports per VM, which results in dropped packets when more than 64 connections are attempted from the same node to the same destination. To resolve this, the minimum-ports-per-VM configuration has to be exposed in the Infrastructure API; a reasonable default value seems to be 2048 ports (allowing around 30 nodes to share a single NAT IP).
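
A sketch of the Compute API objects involved; the RouterNat.MinPortsPerVm field is part of the real API, while the router/NAT names and how the value flows from the Infrastructure API are assumptions:

package infrastructure

import (
	compute "google.golang.org/api/compute/v1"
)

// cloudNATWithMinPorts builds a Router with a single NAT config whose
// minimum-ports-per-VM value comes from the (to be added) Infrastructure API
// field, defaulting to 2048 when unset.
func cloudNATWithMinPorts(routerName, natName string, minPortsPerVM int64) *compute.Router {
	if minPortsPerVM == 0 {
		minPortsPerVM = 2048
	}
	return &compute.Router{
		Name: routerName,
		Nats: []*compute.RouterNat{
			{
				Name:                          natName,
				NatIpAllocateOption:           "AUTO_ONLY",
				SourceSubnetworkIpRangesToNat: "ALL_SUBNETWORKS_ALL_IP_RANGES",
				MinPortsPerVm:                 minPortsPerVM,
			},
		},
	}
}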

Reference:

GCP: Control the default allow-external-access FW rules

We have Gardener shoot clusters running in GCP with firewall rules like the following.

Name: shoot-prod-allow-external-access
Direction: Ingress
Targets: Apply to all
IP ranges: 0.0.0.0/0
Protocols/ports: tcp:80,443
Action: Allow
Priority: 1000

The fact that port 80 is open to the internet is classified as a high security risk.
None of our services is actually listening on that port, and it would be good if we could close it.

Is there a way for you to allow us to have control over that configuration and open it only if needed?

[CSI] csi-driver failing to provision volumes when node ID is longer than 128

What happened:

A persistent volume claim was failing to be provisioned, because the node ID was too long.

k -n kube-system logs -p csi-driver-node-d4tnl -c csi-node-driver-registrar
I0505 13:45:10.412325       1 main.go:110] Version: v1.3.0-0-g6e9fff3e
I0505 13:45:10.412405       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0505 13:45:10.412427       1 connection.go:151] Connecting to unix:///csi/csi.sock
I0505 13:45:10.412865       1 main.go:127] Calling CSI driver to discover driver name
I0505 13:45:10.412881       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0505 13:45:10.412886       1 connection.go:181] GRPC request: {}
I0505 13:45:10.414462       1 connection.go:183] GRPC response: {"name":"pd.csi.storage.gke.io","vendor_version":"v0.7.0-gke.0"}
I0505 13:45:10.414820       1 connection.go:184] GRPC error: <nil>
I0505 13:45:10.414827       1 main.go:137] CSI driver name: "pd.csi.storage.gke.io"
I0505 13:45:10.414915       1 node_register.go:51] Starting Registration Server at: /registration/pd.csi.storage.gke.io-reg.sock
I0505 13:45:10.415103       1 node_register.go:60] Registration Server started at: /registration/pd.csi.storage.gke.io-reg.sock
I0505 13:45:10.870866       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0505 13:45:11.870960       1 main.go:77] Received GetInfo call: &InfoRequest{}
I0505 13:45:13.585060       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: error updating CSINode object with CSI driver node info: error updating CSINode: timed out waiting for the condition; caused by: CSINode.storage.k8s.io "shoot--12345678--123-56789ab-cpu-worker-z1-7c4f48599f-q6vbk" is invalid: spec.drivers[0].nodeID: Invalid value: "projects/012-34-56789abcdefghij-klmnopq/zones/us-central1-a/instances/shoot--12345678--123-56789ab-cpu-worker-z1-7c4f48599f-q6vbk": must be 128 characters or less,}
E0505 13:45:13.585115       1 main.go:89] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: error updating CSINode object with CSI driver node info: error projects/012-34-56789abcdefghij-klmnopq/zones/us-central1-a/instances/shoot--12345678--123-56789ab-cpu-worker-z1-7c4f48599f-q6vbk": must be 128 characters or less, restarting registration container.

What you expected to happen:
The CSI driver to work for all machines in the cluster. I am not sure, but maybe further restrictions on the name lengths have to be applied.
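
For reference, the node ID that overflows is simply the instance's fully qualified resource name; a small sketch of a length check that could be applied during validation (the 128-character limit is taken from the error above):

package validation

import "fmt"

// validateCSINodeID checks that the GCE instance resource name used as the CSI
// node ID ("projects/<project>/zones/<zone>/instances/<instance>") stays within
// the 128-character limit enforced by the CSINode API.
func validateCSINodeID(project, zone, instance string) error {
	nodeID := fmt.Sprintf("projects/%s/zones/%s/instances/%s", project, zone, instance)
	if len(nodeID) > 128 {
		return fmt.Errorf("CSI node ID %q has %d characters, exceeding the 128 character limit", nodeID, len(nodeID))
	}
	return nil
}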

How to reproduce it (as minimally and precisely as possible):
Create a shoot, project and worker pool with long names. Also, the GCP project name should be long.

Anything else we need to know?:
gardener/machine-controller-manager#461

Environment:

  • Gardener version: v1.3.1
  • Kubernetes version (use kubectl version): v1.18.2
  • Cloud provider or hardware configuration:
  • Others:

Ability to configure GCP VPC Flow Logs

We need the ability to configure the VPC flow logs on Google Cloud Platform via Gardener to meet security requirements.

Manual configuration of the VPC flow logs is not possible, as it gets overwritten by Gardener on reconciliation.
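
For illustration, the Compute API exposes flow log settings on the subnetwork; a sketch of the object Gardener would have to manage (the LogConfig fields are real API fields, the concrete values are assumptions about what the user-facing config might default to):

package infrastructure

import (
	compute "google.golang.org/api/compute/v1"
)

// nodesSubnetworkWithFlowLogs returns a subnetwork definition with VPC flow
// logs enabled; interval, sampling rate and metadata would come from the
// user-facing InfrastructureConfig.
func nodesSubnetworkWithFlowLogs(name, cidr, region string) *compute.Subnetwork {
	return &compute.Subnetwork{
		Name:        name,
		IpCidrRange: cidr,
		Region:      region,
		LogConfig: &compute.SubnetworkLogConfig{
			Enable:              true,
			AggregationInterval: "INTERVAL_5_MIN",
			FlowSampling:        0.5,
			Metadata:            "INCLUDE_ALL_METADATA",
		},
	}
}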

Expose Shoot Cluster on GCP via Cloud NAT

Currently we are using external IP addresses on GCP to reach the internet from our nodes.
Cloud NAT provides the possibility to reach the public internet via one NAT endpoint/gateway without exposing each node, and eases access to third-party components by whitelisting the Cloud NAT IP address instead of the regional pools which GCP uses to dynamically assign IPs to the nodes/instances.

Provider-specific webhooks in Garden cluster

From gardener-attic/gardener-extensions#407

With the new core.gardener.cloud/v1alpha1.Shoot API Gardener no longer understands the provider specifics, e.g., the infrastructure config, control plane config, worker config, etc.
This allows end-users to harm themselves and create invalid Shoot resources in the Garden cluster. Errors will only become apparent during reconciliation, after creation of the resource.

Also, it's not possible to default any of the provider specific sections. Hence, we could also think about mutating webhooks in the future.

As we are using controller-runtime, maintained by the Kubernetes SIGs, it should be relatively easy to implement these webhooks since the library already abstracts most of the plumbing.
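
For illustration, a rough sketch of a validating admission handler built on controller-runtime; the handler and registration calls are real controller-runtime API, while the webhook path and the validated field are placeholders:

package gcpwebhook

import (
	"context"
	"encoding/json"
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// shootValidator rejects Shoots whose GCP infrastructureConfig is obviously invalid.
type shootValidator struct{}

func (v *shootValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	// Placeholder decoding: only the fields needed for the example are unmarshalled.
	var shoot struct {
		Spec struct {
			Provider struct {
				InfrastructureConfig json.RawMessage `json:"infrastructureConfig"`
			} `json:"provider"`
		} `json:"spec"`
	}
	if err := json.Unmarshal(req.Object.Raw, &shoot); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	if len(shoot.Spec.Provider.InfrastructureConfig) == 0 {
		return admission.Denied("spec.provider.infrastructureConfig must be set")
	}
	return admission.Allowed("")
}

// Register hooks the validator into the manager's webhook server.
func Register(mgr manager.Manager) {
	mgr.GetWebhookServer().Register("/webhooks/validate-shoot-gcp", // placeholder path
		&webhook.Admission{Handler: &shootValidator{}})
}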

We should have a separate, dedicated binary incorporating the webhooks for each provider, and a separate Helm chart for the deployment in the Garden cluster.

Similarly, the networking and OS extensions could have such webhooks as well to check on the providerConfig for the networking and operating system config.

Part of gardener/gardener#308

Build failed

What happened:
I ran the following commands:

make install
make install-requirements
make generate

My environment is as follows:

OS: macOS
version: 10.14.3 

But it failed; the output is as follows:

> Generate
charts/gardener-extension-provider-gcp/doc.go:15: running "../../vendor/github.com/gardener/gardener/extensions/hack/generate-controller-registration.sh": exit status 1
I0421 15:54:28.241924    6545 main.go:123] parsing go packages in directory .
I0421 15:54:34.763045    6545 main.go:225] using package=github.com/gardener/gardener-extension-provider-gcp/pkg/apis/config/v1alpha1
I0421 15:54:34.768536    6545 main.go:161] written to ../../../../hack/api-reference/config.md
Generating deepcopy funcs
Generating defaulters
Generating conversions
Generating deepcopy funcs
Generating defaulters
Generating conversions
I0421 15:55:14.983831   12855 main.go:123] parsing go packages in directory .
I0421 15:55:20.724491   12855 main.go:225] using package=github.com/gardener/gardener-extension-provider-gcp/pkg/apis/gcp/v1alpha1
I0421 15:55:20.732766   12855 main.go:161] written to ../../../../hack/api-reference/api.md
make: *** [Makefile:117: generate] Error 1

Move from in-tree to out-of-tree MCM

How to categorize this issue?

/area dev-productivity
/kind enhancement
/priority normal
/platform gcp

What would you like to be added:
We would like to split the MCM code base into two different pods using the OOT approach in MCM - gardener/machine-controller-manager#460.

Why is this needed:
This would ease maintaining the provider-specific MCM code independently.

csi-node-driver-registrar preStopHook failing

How to categorize this issue?

/area storage
/kind bug
/priority normal
/platform gcp

What happened:

When rolling the csi-driver-node, I noticed one pod refusing to terminate and the preStopHook failing with the following error:

Events:
  Type     Reason             Age                   From     Message
  ----     ------             ----                  ----     -------
  Normal   Killing            3m8s                  kubelet  Stopping container csi-liveness-probe
  Normal   Killing            3m8s                  kubelet  Stopping container csi-node-driver-registrar
  Warning  FailedPreStopHook  3m8s                  kubelet  Exec lifecycle hook ([/bin/sh -c rm -rf /registration/pd.csi.storage.gke.io-reg.sock /csi/csi.sock]) for Container "csi-node-driver-registrar" in Pod "csi-driver-node-qfhnf_kube-system(fb77f204-f9c7-4d84-8970-77cdbfe8ad7b)" failed - error: command '/bin/sh -c rm -rf /registration/pd.csi.storage.gke.io-reg.sock /csi/csi.sock' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: exec: \"/bin/sh\": stat /bin/sh: no such file or directory: unknown\r\n"
  Warning  Unhealthy          2m21s (x5 over 3m1s)  kubelet  Liveness probe failed: Get http://10.242.0.92:9808/healthz: dial tcp 10.242.0.92:9808: connect: connection refused
  Warning  FailedKillPod      68s                   kubelet  error killing pod: failed to "KillContainer" for "csi-driver" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
  Normal   Killing            67s (x2 over 3m8s)    kubelet  Stopping container csi-driver

The pod was hanging for another ~5m.

What you expected to happen:

The csi-driver-node pod should terminate properly so that the rollout can continue.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

[GCP] Expose configuration for GCP Cloud NAT manual external IP address

As mentioned in the Google documentation https://cloud.google.com/nat/docs/using-nat#update_nat, there is a way to set a reserved static IP address for Cloud NAT. Currently Gardener doesn't have a configuration option for setting it to manual instead of automatic. The reason for requesting this configuration is that we have a use case which requires us to whitelist our clusters' public IP addresses in different places that don't allow access unless the IP address is officially whitelisted.
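
A sketch of what manual allocation looks like at the Compute API level; NatIpAllocateOption and NatIps are real RouterNat fields, while the NAT name and address names are placeholders:

package infrastructure

import (
	"fmt"

	compute "google.golang.org/api/compute/v1"
)

// manualCloudNAT configures a NAT gateway to use pre-reserved static addresses
// instead of auto-allocated ones, so the egress IPs can be whitelisted externally.
func manualCloudNAT(natName, project, region string, addressNames []string) *compute.RouterNat {
	natIPs := make([]string, 0, len(addressNames))
	for _, name := range addressNames {
		natIPs = append(natIPs, fmt.Sprintf(
			"projects/%s/regions/%s/addresses/%s", project, region, name))
	}
	return &compute.RouterNat{
		Name:                          natName,
		NatIpAllocateOption:           "MANUAL_ONLY",
		NatIps:                        natIPs,
		SourceSubnetworkIpRangesToNat: "ALL_SUBNETWORKS_ALL_IP_RANGES",
	}
}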

Reduce memory footprint for GCP admission component

How to categorize this issue?

/area robustness
/area cost
/priority critical
/platform gcp

What would you like to be added:
Since the validation of cloud provider secrets has been introduced (#112), the GCP admission component's memory footprint has increased by a considerable amount, mainly because of the added caches for Shoots, SecretBindings and Secrets.

The required memory depends on the K8s cluster, i.e. the number/size of the stored resources, but we have observed increases from ~20MB to ~500MB. Considering that such an admission component has multiple replicas and is only responsible for one provider, the runtime costs are too high.

Hence, we should try to:

  1. Disable caching for all objects and analyse the consequences, i.e. whether the cost per admission request is still acceptable.
  2. If 1.) is not feasible, use the API reader to read Secrets, because we expect Secrets to be the most frequently occurring resource kind in the Garden cluster (see the sketch below).
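
A sketch of option 2, reading a Secret through the manager's API reader (which bypasses the informer cache) instead of the cached client; these controller-runtime calls exist as shown, the surrounding function is illustrative:

package validator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// readSecretUncached fetches a Secret directly from the API server so that
// Secrets never end up in the admission component's informer cache.
func readSecretUncached(ctx context.Context, mgr manager.Manager, namespace, name string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	if err := mgr.GetAPIReader().Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, secret); err != nil {
		return nil, err
	}
	return secret, nil
}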

Cannot delete infrastructure when credentials data keys are missing in secret

From gardener-attic/gardener-extensions#577

If the account secret does not contain a service account JSON, the cluster certainly cannot be created.
But trying to delete such a cluster also fails for the same reason:

Waiting until shoot infrastructure has been destroyed
Last Error
task "Waiting until shoot infrastructure has been destroyed" failed: Failed to delete infrastructure: Error deleting infrastructure: secret shoot--berlin--rg-kyma/cloudprovider doesn't have a service account json

The same applies to the other providers; this is not specific to GCP.

Forwarding rules leak on provider-gcp

Steps to reproduce:

  1. Create a Shoot on GCP.
  2. Delete the Shoot created in step 1.
  3. Ensure via the Google console or with gcloud that the forwarding rules created in step 1 are not cleaned up.

Support adding custom labels on GCP Compute Engine VMs

What would you like to be added:
In pkg/controller/worker/machines.go, the machineClassSpec labels map only contains one key named "name"; can we add all pool labels to the machineClassSpec?

	machineClassSpec := map[string]interface{}{
				"region":             w.worker.Spec.Region,
				"zone":               zone,
				"canIpForward":       true,
				"deletionProtection": false,
				"description":        fmt.Sprintf("Machine of Shoot %s created by machine-controller-manager.", w.worker.Name),
				"disks":              disks,
				"labels": map[string]interface{}{
					"name": w.worker.Name,
				},
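
A minimal sketch of the requested change, assuming the worker pool configuration exposes a map of labels (the poolLabels parameter here is hypothetical):

package worker

// machineClassLabels merges the default "name" label with all labels configured
// on the worker pool, so they end up on the created GCE instances.
func machineClassLabels(workerName string, poolLabels map[string]string) map[string]interface{} {
	labels := map[string]interface{}{"name": workerName}
	for k, v := range poolLabels {
		labels[k] = v
	}
	return labels
}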

By the way, the AWS extension needs this feature too.

Why is this needed:
We need some custom labels to distinguish VMs, so that we can charge our tenants by querying the costs from BigQuery.

If it's possible, I can submit a PR to do this.

Adapt to terraform v0.12 language changes

How to categorize this issue?

/area open-source
/kind cleanup
/priority normal
/platform gcp

What would you like to be added:
provider-gcp needs an adaptation of the terraform configuration to v0.12. For provider-aws this is done with this PR - gardener/gardener-extension-provider-aws#111.

Why is this needed:
Currently the terraformer run only emits warnings, but in a future version of Terraform the warnings will turn into errors.

Unable to create internal LoadBalancer

How to categorize this issue?

/area networking
/kind bug
/priority normal
/platform gcp

What happened:
Unable to create internal LoadBalancer. Ref https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing

What you expected to happen:
To be able to create internal LoadBalancer.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Service with the annotation networking.gke.io/load-balancer-type: "Internal":
$ cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-app
spec:
  selector:
    matchLabels:
      app: hello
  replicas: 3
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: "gcr.io/google-samples/hello-app:2.0"
---
apiVersion: v1
kind: Service
metadata:
  name: ilb-service
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
  labels:
    app: hello
spec:
  type: LoadBalancer
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
EOF
  2. Ensure that the Service hangs in pending state
$ k describe svc ilb-service

Events:
  Type     Reason                  Age                 From                Message
  ----     ------                  ----                ----                -------
  Normal   EnsuringLoadBalancer    0s (x7 over 5m17s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  0s (x7 over 5m16s)  service-controller  Error syncing load balancer: failed to ensure load balancer: services "ilb-service" is forbidden: User "system:serviceaccount:kube-system:cloud-provider" cannot patch resource "services/status" in API group "" in the namespace "default"
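
Judging by the error message, the cloud-provider service account lacks permission to patch services/status; a hedged sketch of the kind of RBAC rule that would unblock it (the actual ClusterRole shipped by the extension may be named and structured differently):

package controlplane

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cloudProviderServiceStatusRole grants the cloud-controller-manager's service
// account the right to update Service status when reconciling load balancers.
func cloudProviderServiceStatusRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "cloud-provider-service-status"}, // placeholder name
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{""},
				Resources: []string{"services", "services/status"},
				Verbs:     []string{"get", "list", "watch", "patch", "update"},
			},
		},
	}
}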

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version: v1.9.0
  • Kubernetes version (use kubectl version): v1.18.5
  • Cloud provider or hardware configuration:
  • Others:

Forbid replacing secret with new account for existing Shoots

What would you like to be added:
Currently we don't have a validation that would prevent a user from replacing their cloudprovider secret with credentials for another account. Basically, we only have a warning in the dashboard - ref gardener/dashboard#422.

Steps to reproduce:

  1. Get an existing Shoot.
  2. Update its secret with credentials for another account.
  3. Ensure that on new reconciliation, new infra resources will be created in the new account. The old infra resources and machines in the old account will leak.
    For me the reconciliation failed at
    lastOperation:
      description: Waiting until the Kubernetes API server can connect to the Shoot
        workers
      lastUpdateTime: "2020-02-20T14:56:43Z"
      progress: 89
      state: Processing
      type: Reconcile

with reason:

$ k describe svc -n kube-system vpn-shoot
Events:
  Type     Reason                   Age                  From                Message
  ----     ------                   ----                 ----                -------
  Normal   EnsuringLoadBalancer     7m38s (x6 over 10m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed   7m37s (x6 over 10m)  service-controller  Error syncing load balancer: failed to ensure load balancer: could not find any suitable subnets for creating the ELB

Why is this needed:
Prevent users from harming themselves.

Configure mcm-settings from worker to machine-deployment.

How to categorize this issue?

/area usability
/kind enhancement
/priority normal
/platform gcp

What would you like to be added: Machine-controller-manager now allows configuring certain controller-settings, per machine-deployment. Currently, the following fields can be set:

Also, with the PR gardener/gardener#2563, these settings can be configured via the shoot resource as well.

We need to enhance the worker extensions to read these settings from the Worker object and set them accordingly on the MachineDeployment.

Similar PR on AWS worker-extension: gardener/gardener-extension-provider-aws#148
Dependencies:

  • Vendor the MCM 0.33.0
  • gardener/gardener#2563 should be merged.
  • g/g with the #2563 change should be vendored.

Why is this needed:
To allow fine-grained configuration of MCM via the Worker object.

GCP support enablegcraccess field

In gardener-extension-provider-aws, there is an EnableEcrAccess field to allow nodes to pull images from ECR.
I think GCP also needs to support this feature.
If the GCR registry is private, image pulls always fail.

Add validation to prevent worker.min from being set to 0.

What would you like to be added: We need to add validation to prevent worker.Min from being set to 0 when worker.Max is non-zero. This is to prevent the cluster-autoscaler from scaling the worker pool down to zero and then not being able to scale up again later.
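
A sketch of the requested validation using the apimachinery field-error helpers; the function name and where it would be wired into the existing worker validation are assumptions:

package validation

import (
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateWorkerPoolScaling rejects pools that can scale up but would start at
// zero nodes, since the cluster-autoscaler could then never scale them up again.
func validateWorkerPoolScaling(minimum, maximum int32, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if maximum > 0 && minimum == 0 {
		allErrs = append(allErrs, field.Forbidden(fldPath.Child("minimum"),
			"minimum must not be 0 when maximum is non-zero"))
	}
	return allErrs
}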

Why is this needed:
Please refer to gardener/gardener#2045 for more info.

Remove IPIP Tunnels from GCP

How to categorize this issue?
/area networking
/kind enhancement

What would you like to be added:

Today, the cloud-controller-manager is used to configure routes on GCP (configure-cloud-routes=true); however, these routes are not used since we are utilizing CNI tunnels (e.g., IPIP or VXLAN) to enable pod connectivity across nodes and zones.

To completely remove tunnels for GCP, we would need to disable IPIP tunnels in Calico (e.g., via a webhook similar to the Azure setup with backend=none). Furthermore, we need to deploy ip-masq-agent to prevent pod-generated traffic from being blocked by the GCP infrastructure.

Why is this needed:

  • Improve pod routing performance.
  • Avoid possible overlaps between tunnel and cluster CIDRs.
  • Eliminate MTU inconsistencies.

Remove external IPs from GCP worker nodes

With gardener-attic/gardener-extensions#379 and gardener-attic/gardener-extensions#398 Cloud NATs have been introduced for GCP shoots in order to remove the external IPs from the shoot worker nodes.

However, these changes were reverted with gardener-attic/gardener-extensions#405 because of instabilities.
More specifically, the main problem was that shoots which are deployed into the same VPC/network get one router and cloud NAT each. However, GCP has a hard quota limit here:

From https://cloud.google.com/router/quotas:

Cloud Routers per project Quotas Regardless of quota, each network is limited to five Cloud Routers per region. See limits below.
The following limits for Cloud Router apply to VPC networks. Unless otherwise stated, these limits cannot be increased.

Hence, the implementation with gardener-attic/gardener-extensions#379 and gardener-attic/gardener-extensions#398 does not work.

What can we do to circumvent the problem and to get rid of the external IPs for GCP worker nodes?

Provision node VMs with cloud-platform access scope

What would you like to be added:
Gardener creates node VMs with limited access scopes. In our case, this results in our application not being able to access a GCS bucket without having to add additional configuration. We have granted the service account the necessary IAM role on the GCS bucket. However, it is the limited access scope of the node VMs that is causing the issue. We want to follow the Google-recommended practice of setting the full cloud-platform access scope on VM instances and then limiting service account access using IAM roles. To do this, we need the node VMs to be created with the cloud-platform access scope by default.
Why is this needed:
Without this change to the Cloud API access scopes, we cannot take advantage of the simplified authentication flow that is available when accessing a GCS bucket from a GCE VM.
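
A sketch of the relevant part of the instance definition; ServiceAccounts and Scopes are real Compute API fields, while making cloud-platform the default is exactly what this issue requests and the wiring from the worker config is an assumption:

package worker

import (
	compute "google.golang.org/api/compute/v1"
)

// withCloudPlatformScope attaches the node service account to an instance with
// the broad cloud-platform scope; fine-grained access is then controlled purely
// via IAM roles granted to that service account.
func withCloudPlatformScope(instance *compute.Instance, serviceAccountEmail string) {
	instance.ServiceAccounts = []*compute.ServiceAccount{
		{
			Email:  serviceAccountEmail,
			Scopes: []string{"https://www.googleapis.com/auth/cloud-platform"},
		},
	}
}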

VolumeAttachments are orphaned on hibernation

How to categorize this issue?

/area storage
/kind bug
/priority normal
/platform gcp

What happened:
VolumeAttachments are orphaned after hibernation and wake up. Such orphan VolumeAttachments block PV deletion afterwards.

What you expected to happen:
PVs should not hang in Terminating because of orphaned VolumeAttachments.

How to reproduce it (as minimally and precisely as possible):

  1. Create a Shoot

  2. Create sample PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: ebs-claim
  3. Ensure that there is one VolumeAttachment as expected
$ k get volumeattachments.storage.k8s.io
NAME                                                                   ATTACHER                PV                                         NODE                                                     ATTACHED   AGE
csi-428c30ed273c01d110380beb8bfac4052e7b3567300c9835421df82a0a99ddf4   pd.csi.storage.gke.io   pv--09fdeb10-ab27-441c-9daf-413238e2491b   shoot--foooooo--gcp-local-cpu-worker-z1-786584bb-gm8kb   true.      90s
  4. Hibernate

  5. Wake up

  6. Ensure that the VolumeAttachment is not cleaned up after wake up

$ k get volumeattachments.storage.k8s.io
NAME                                                                   ATTACHER                PV                                         NODE                                                     ATTACHED   AGE
csi-3a2f98271fdce6a780cb93777e5c5042b64e0782855355fd1cd9a270231857d8   pd.csi.storage.gke.io   pv--09fdeb10-ab27-441c-9daf-413238e2491b   shoot--foooooo--gcp-local-cpu-worker-z1-786584bb-r4gls   true       89s
csi-428c30ed273c01d110380beb8bfac4052e7b3567300c9835421df82a0a99ddf4   pd.csi.storage.gke.io   pv--09fdeb10-ab27-441c-9daf-413238e2491b   shoot--foooooo--gcp-local-cpu-worker-z1-786584bb-gm8kb   true       9m25s
  7. Delete the sample resources from step 2

  8. Ensure that the PV hangs in Terminating

$ k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM               STORAGECLASS   REASON   AGE
pv--09fdeb10-ab27-441c-9daf-413238e2491b   4Gi        RWO            Delete           Terminating   default/ebs-claim   default                 23m

Anything else we need to know?:

Environment:

  • Gardener version (if relevant): v1.10
  • Extension version: v1.11.0
  • Kubernetes version (use kubectl version): v1.18.8
  • Cloud provider or hardware configuration:
  • Others:

Implement `Worker` controller for GCP provider

Similar to how we have implemented the Worker extension resource controller for the AWS provider, let's please now do it for GCP.

There is no special provider config required to be implemented; however, we should have a component configuration for the controller that looks as follows:

---
apiVersion: gcp.provider.extensions.config.gardener.cloud/v1alpha1
kind: ControllerConfiguration
machineImages:
- name: coreos
  version: 2023.5.0
  image: projects/coreos-cloud/global/images/coreos-stable-2023-5-0-v20190312

Volume snapshotting does not work

What happened:

I provisioned a Gardener GCP cluster with the CSI storage driver, deployed a sample pod and a sample PVC. Then, I created a VolumeSnapshotClass and a VolumeSnapshot to trigger snapshotting from the existing PVC. However, snapshotting didn't work. The VolumeSnapshot's readyToUse field does not become true, and I don't see any related disk or snapshot in the GCP project.

How to reproduce it (as minimally and precisely as possible):

Create a PVC and a Pod:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: source-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 6Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: source-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["/bin/sh", "-c"]
    args: ["touch /demo/data/sample-file.txt && sleep 3000"]
    volumeMounts:
    - name: source-data
      mountPath: /demo/data
  volumes:
  - name: source-data
    persistentVolumeClaim:
      claimName: source-pvc
      readOnly: false

Then, create a VolumeSnapshotClass and a VolumeSnapshot:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
  name: default-snapshot-class
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: snapshot-source-pvc
spec:
  source:
    persistentVolumeClaimName: source-pvc

Check the VolumeSnapshot status:

kubectl get volumesnapshot snapshot-source-pvc

Environment:

  • Gardener version (if relevant): v1.3.0
  • Extension version:
  • Kubernetes version (use kubectl version): 1.18.1
  • Cloud provider or hardware configuration: GCP
  • Others:

SeedNetworkPoliciesTest always fails

From gardener-attic/gardener-extensions#293

What happened:
The test defined in SeedNetworkPoliciesTest.yaml always fails.
Most of the time the following 3 specs fail:

2019-07-29 11:32:33	Test Suite Failed
2019-07-29 11:32:33	Ginkgo ran 1 suite in 3m20.280138435s
2x		2019-07-29 11:32:33	
2019-07-29 11:32:32	FAIL! -- 375 Passed | 3 Failed | 0 Pending | 126 Skipped
2019-07-29 11:32:32	Ran 378 of 504 Specs in 85.218 seconds
2019-07-29 11:32:32	
2019-07-29 11:32:32	> /go/src/github.com/gardener/gardener/test/integration/seeds/networkpolicies/aws/networkpolicy_aws_test.go:1194
2019-07-29 11:32:32	[Fail] Network Policy Testing egress for mirrored pods elasticsearch-logging [AfterEach] should block connection to "Garden Prometheus" prometheus-web.garden:80
2019-07-29 11:32:32	
2019-07-29 11:32:32	/go/src/github.com/gardener/gardener/test/integration/seeds/networkpolicies/aws/networkpolicy_aws_test.go:1062
2019-07-29 11:32:32	[Fail] Network Policy Testing components are selected by correct policies [AfterEach] gardener-resource-manager
2019-07-29 11:32:32	
2019-07-29 11:32:32	/go/src/github.com/gardener/gardener/test/integration/seeds/networkpolicies/aws/networkpolicy_aws_test.go:1194
2019-07-29 11:32:32	[Fail] Network Policy Testing egress for mirrored pods gardener-resource-manager [AfterEach] should block connection to "External host" 8.8.8.8:53

@mvladev can you please check?

Environment:
TestMachinery on all landscapes (dev, ..., live)

Implement `ControlPlane` controller for GCP provider

Similar to how we have implemented the ControlPlane extension resource controller for the AWS provider, let's please now do it for GCP.

Based on the current implementation the ControlPlaneConfig should look like this:

apiVersion: gcp.provider.extensions.gardener.cloud/v1alpha1
kind: ControlPlaneConfig
cloudControllerManager:
  zone: europe-west1
  featureGates:
    CustomResourceValidation: true

No ControlPlaneStatus needs to be implemented right now (not needed yet).
