
gardener / gardener-extension-runtime-gvisor


Gardener extension controller for the gVisor container runtime sandbox (https://gvisor.dev).

Home Page: https://gardener.cloud

License: Apache License 2.0

Shell 23.13% Dockerfile 1.35% Makefile 5.64% Go 65.14% Smarty 0.72% Python 4.02%

gardener-extension-runtime-gvisor's Introduction


Project Gardener implements the automated management and operation of Kubernetes clusters as a service. Its main principle is to leverage Kubernetes concepts for all of its tasks.

Recently, most of the vendor specific logic has been developed in-tree. However, the project has grown to a size where it is very hard to extend, maintain, and test. With GEP-1 we have proposed how the architecture can be changed in a way to support external controllers that contain their very own vendor specifics. This way, we can keep Gardener core clean and independent.


How to start using or developing this extension controller locally

You can run the controller locally on your machine by executing make start. Please make sure to have the kubeconfig to the cluster you want to connect to ready in the ./dev/kubeconfig file.

Static code checks and tests can be executed by running make verify. We are using Go modules for Golang package dependency management and Ginkgo/Gomega for testing.

Feedback and Support

Feedback and contributions are always welcome. Please report bugs or suggestions as GitHub issues or join our Slack channel #gardener (please invite yourself to the Kubernetes workspace here).

Learn more!

Please find further resources about our project here:

gardener-extension-runtime-gvisor's People

Contributors

acumino, aleksandarsavchev, andreasburger, andrerun, ary1992, bd3lage, ccwienk, danielfoehrkn, dependabot[bot], dimitar-kostadinov, dimityrmirchev, gardener-robot-ci-1, gardener-robot-ci-2, gardener-robot-ci-3, ialidzhikov, jordanjordanov, krgostev, lucabernstein, martinweindel, marwinski, mrbatschner, nimrodoron, nimrodoronsap, ppalucki, rfranzke, shafeeqes, timuthy, voelzmo, vpnachev, zkdev


gardener-extension-runtime-gvisor's Issues

Installation image is using wrong tag by default

What happened:
The installation container cannot be started due to a wrong default image tag:

  Warning  Failed                  27s (x3 over 70s)  kubelet            Failed to pull image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev": rpc error: code = NotFound desc = failed to pull and unpack image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev": failed to resolve reference "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev": eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev: not found

What you expected to happen:
A proper image tag to be used by default.

How to reproduce it (as minimally and precisely as possible):

  1. Create a shoot cluster with a worker pool with gVisor enabled:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: foo
  namespace: bar
spec:
  provider:
    workers:
    - cri:
        name: containerd
        containerRuntimes:
        - type: gvisor
      machine:
        image:
          name: gardenlinux
        type: c5.large
      maxSurge: 1
      maxUnavailable: 0
      maximum: 1
      minimum: 1
      name: gvisor
      volume:
        size: 50Gi
        type: gp2
  2. Once the shoot is created, make sure that the gvisor installation pods are failing to pull their images:
$ kubectl -n kube-system get pod -l app.kubernetes.io/name=containerd-gvisor
NAME                             READY   STATUS             RESTARTS   AGE
containerd-gvisor-gvisor-8tzkn   0/1     ImagePullBackOff   0          13m
$ kubectl -n kube-system describe pod -l app.kubernetes.io/name=containerd-gvisor
...
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Normal   Scheduled               15m                   default-scheduler  Successfully assigned kube-system/containerd-gvisor-gvisor-8tzkn to ip-10-250-31-65.eu-west-1.compute.internal
  Normal   Pulling                 13m (x4 over 15m)     kubelet            Pulling image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev"
  Warning  Failed                  13m (x4 over 15m)     kubelet            Failed to pull image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev": rpc error: code = NotFound desc = failed to pull and unpack image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev": failed   to resolve reference "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev": eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev: not found
  Warning  Failed                  13m (x4 over 15m)     kubelet            Error: ErrImagePull
  Warning  Failed                  5m47s (x39 over 15m)  kubelet            Error: ImagePullBackOff
  Normal   BackOff                 46s (x60 over 15m)    kubelet            Back-off pulling image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:0.0.0-dev"

Anything else we need to know?:
This bug can be easily mitigated by overwriting the default image via imageVectorOverwrite:

# imageVectorOverwrite: |
# Please find documentation in github.com/gardener/gardener/docs/deployment/image_vector.md

Environment:

  • Gardener version (if relevant): Not relevant
  • Extension version: v0.4.0 and master (older versions are probably also affected)
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

/assign @vpnachev @andrerun

Implement migration & restore operation

What would you like to be added:

The gVisor extension should implement the migrate and restore interface in Gardener.

Migration

  • Shallow deletion of the managed resources (installation & prerequisite); otherwise the RuntimeClass is gone and the installation DaemonSet is not available during node scale-out.
  • Saving the state to the ContainerRuntime CRD should be done by the generic actuator in Gardener.

Why is this needed:

In order to support control plane migrations.

Integration test: unsupported kubernetes version "v1.22.2"

/area testing
/kind bug
/priority 3

What happened:
With extension version v0.2.0, the integration test fails against a K8s v1.22 Shoot with the following reason:

• Failure in Spec Setup (BeforeEach) [300.835 seconds]
gVisor tests
/src/test/integration/container-runtime/container_runtime.go:40
  [BETA] [SERIAL] [SHOOT] should add, remove and upgrade worker pool with gVisor [BeforeEach]
  /src/vendor/github.com/gardener/gardener/test/framework/gingko_utils.go:26

  Unexpected error:
      <*retry.Error | 0xc00084c540>: {
          ctxError: <context.deadlineExceededError>{},
          err: <*fmt.wrapError | 0xc00084c400>{
              msg: "could not construct Shoot client: error discovering kubernetes version: unsupported kubernetes version \"v1.22.2\"",
              err: <*fmt.wrapError | 0xc00084c3e0>{
                  msg: "error discovering kubernetes version: unsupported kubernetes version \"v1.22.2\"",
                  err: <*errors.errorString | 0xc000639040>{
                      s: "unsupported kubernetes version \"v1.22.2\"",
                  },
              },
          },
      }
      retry failed with context deadline exceeded, last error: could not construct Shoot client: error discovering kubernetes version: unsupported kubernetes version "v1.22.2"
  occurred

  /src/vendor/github.com/gardener/gardener/test/framework/shootframework.go:119

It seems that this issue is fixed on the master branch thanks to #26, so it would be nice to have a new release of runtime-gvisor where the test does not complain about the Shoot K8s version.
In the long term, if possible, the test could be adapted to not rely on the list of supported K8s versions in Gardener. A similar hidden dependency was present for the networking extensions and was resolved, for example, with gardener/gardener-extension-networking-calico#111.

What you expected to happen:
No similar error when running the integration test.

How to reproduce it (as minimally and precisely as possible):
See above.

Environment:

  • Gardener version (if relevant):
  • Extension version: v0.2.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

CP migration with gvisor extension enabled fails

What happened:
When performing the control plane migration of shoots that have the gVisor extension enabled, the migration gets stuck because the ManagedResources (MRs) cannot be removed.

What you expected to happen:
CP migration to successfully finish.

How to reproduce it (as minimally and precisely as possible):
Create a shoot with gvisor runtime enabled and perform CP migration from one seed to another one.

Anything else we need to know?:
In order to proceed with the migration, after discussion with @plkokanov, I could set the keepObjects field in the MRs to true and then delete the MRs, as sketched below.

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

unify imageVectorOverwrite location in helm values

As far as I have seen, the runtime-gvisor extension is the only extension where the image vector can be overwritten at .image.imageVectorOverwrite in the chart's values file. For all other extensions, the imageVectorOverwrite node is at top-level in the values file.

To make this even more confusing, this is how it is described in the values file:

image:
  repository: eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor
  tag: latest
  pullPolicy: IfNotPresent
# imageVectorOverwrite: |
...

The line with the commented-out 'imageVectorOverwrite' is not indented and is thus at the same position as in the example values files of the other extensions. But there, when uncommented, 'imageVectorOverwrite' must not be indented, whereas for this extension it has to be.

It would be nice to have this unified among all extensions.

Ensure that the implementation of controllers supports control plane migration

How to categorize this issue?

/area control-plane-migration
/kind task
/kind test
/priority normal

What would you like to be added:

Why is this needed:

gvisor integration test is failing always in amd64 flavors

What happened:
[screenshot: 2022-11-30 12:47:57]

The integration test times out waiting for the Pod to be ready.

What you expected to happen:
The test to succeed.

How to reproduce it (as minimally and precisely as possible):
Run the integration test against Shoot with amd64 workers.

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version: v0.7.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

Installation image is using wrong tag by default

What happened:
The installation container cannot be started due to a wrong default image tag:

Normal  BackOff  4m20s (x661 over 153m)  kubelet  Back-off pulling image "eu.gcr.io/gardener-project/gardener/extensions/runtime-gvisor-installation:v0.5.1-dev"

What you expected to happen:
A proper image tag to be used by default.

How to reproduce it (as minimally and precisely as possible):

Create a shoot cluster with a worker pool with gVisor enabled, using the following controller-registration.yaml: https://raw.githubusercontent.com/gardener/gardener-extension-runtime-gvisor/v0.5.1/example/controller-registration.yaml

Anything else we need to know?:
As far as we found out, the issue lies in the gvisor-runtime image https://console.cloud.google.com/gcr/images/gardener-project/eu/gardener%2Fextensions%2Fruntime-gvisor@sha256:866ea708c0ce5e5c64b9dc1a290b89caa90151691253d328af4155cdf4e64f3f, which creates a DaemonSet with a "-dev" image tag. We had a look at the included binary and found the string "v0.5.1-dev". If we build the image locally (with v0.5.1 checked out), this string is "v0.5.1", so we assume that a wrong version is linked into the upstream image, probably due to an issue in the build pipeline. Unfortunately we can't get the log from the respective run: https://concourse.ci.gardener.cloud/teams/gardener/pipelines/gardener-extension-runtime-gvisor-release-v0.5/jobs/release-v0.5-release-job/builds/1

Environment:

  • Gardener version: 1.46
  • Extension version: 0.5.1
  • Kubernetes version (use kubectl version): 1.23
  • Cloud provider or hardware configuration:
  • Others:

Containerd fails to start after gVisor is enabled

What happened:
After gVisor was enabled for an Ubuntu worker pool, the nodes transition to the NotReady state and never become healthy again.

Node events:

Events:
  Type     Reason                   Age                    From             Message
  ----     ------                   ----                   ----             -------
  Normal   Starting                 3m37s                  kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      3m37s                  kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  3m37s (x2 over 3m37s)  kubelet          Node ip-10-222-9-112.eu-west-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    3m37s (x2 over 3m37s)  kubelet          Node ip-10-222-9-112.eu-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     3m37s (x2 over 3m37s)  kubelet          Node ip-10-222-9-112.eu-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  3m37s                  kubelet          Updated Node Allocatable limit across pods
  Normal   Starting                 3m25s                  kube-proxy       Starting kube-proxy.
  Normal   NodeReady                3m17s                  kubelet          Node ip-10-222-9-112.eu-west-1.compute.internal status is now: NodeReady
  Normal   Starting                 3m15s                  kube-proxy       Starting kube-proxy.
  Warning  DockerStart              3m1s (x3 over 3m2s)    systemd-monitor  Starting Docker Application Container Engine...
  Normal   NodeNotReady             2m24s                  kubelet          Node ip-10-222-9-112.eu-west-1.compute.internal status is now: NodeNotReady
  Warning  ContainerdStart          86s (x22 over 3m2s)    systemd-monitor  Starting containerd container runtime...
  Warning  ContainerGCFailed        37s (x3 over 2m37s)    kubelet          rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: no such file or directory"

Node conditions

Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  FrequentUnregisterNetDevice   False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   NoFrequentUnregisterNetDevice   node is functioning properly
  FrequentKubeletRestart        False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   NoFrequentContainerdRestart     containerd is functioning properly
  KernelDeadlock                False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   FilesystemIsNotReadOnly         Filesystem is not read-only
  CorruptDockerOverlay2         False   Wed, 04 Aug 2021 13:30:23 +0300   Wed, 04 Aug 2021 13:30:22 +0300   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  NetworkUnavailable            False   Wed, 04 Aug 2021 13:30:15 +0300   Wed, 04 Aug 2021 13:30:15 +0300   CalicoIsUp                      Calico is running on this node
  MemoryPressure                False   Wed, 04 Aug 2021 13:33:13 +0300   Wed, 04 Aug 2021 13:29:47 +0300   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Wed, 04 Aug 2021 13:33:13 +0300   Wed, 04 Aug 2021 13:29:47 +0300   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Wed, 04 Aug 2021 13:33:13 +0300   Wed, 04 Aug 2021 13:29:47 +0300   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         False   Wed, 04 Aug 2021 13:33:13 +0300   Wed, 04 Aug 2021 13:31:00 +0300   KubeletNotReady                 container runtime is down

SSH-ing onto the node, we found out that containerd is failing with:

Aug 04 10:31:36 ip-10-222-9-112.eu-west-1.compute.internal containerd[30781]: containerd: failed to load TOML from /etc/containerd/config.toml: invalid plugin key URI "cri" expect io.containerd.x.vx
Aug 04 10:31:36 ip-10-222-9-112.eu-west-1.compute.internal systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE
Aug 04 10:31:36 ip-10-222-9-112.eu-west-1.compute.internal systemd[1]: containerd.service: Failed with result 'exit-code'.
Aug 04 10:31:36 ip-10-222-9-112.eu-west-1.compute.internal systemd[1]: Failed to start containerd container runtime.

It looks like this extension has injected this configuration:

216a217,218
> [plugins.cri.containerd.runtimes.runsc]
>       runtime_type = "io.containerd.runsc.v1"

and it is most probably the reason it cannot start again.

Exchanging the above configuration with the one recommended in the documentation (ref):

216a217,218
> [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
>   runtime_type = "io.containerd.runsc.v1"

fixes the problem and the node is back in a healthy state.

What you expected to happen:
The gVisor-specific configuration should not break the containerd service on the nodes.

How to reproduce it (as minimally and precisely as possible):
Create a shoot with ubuntu nodes and gVisor enabled

Shoot manifest:
kind: Shoot
apiVersion: core.gardener.cloud/v1beta1
metadata:
  name: foo
spec:
  cloudProfileName: aws
  kubernetes:
    allowPrivilegedContainers: true
    kubeAPIServer:
      enableBasicAuthentication: false
    version: 1.18.16
  networking:
    type: calico
    pods: 10.223.128.0/17
    nodes: 10.222.0.0/16
    services: 10.223.0.0/17
  maintenance:
    autoUpdate:
      kubernetesVersion: false
      machineImageVersion: false
    timeWindow:
      begin: 130000+0000
      end: 140000+0000
  provider:
    type: aws
    controlPlaneConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: ControlPlaneConfig
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      networks:
        vpc:
          cidr: 10.222.0.0/16
        zones:
          - internal: 10.222.112.0/22
            name: eu-west-1a
            public: 10.222.96.0/22
            workers: 10.222.0.0/19
    workers:
      - cri:
          name: containerd
          containerRuntimes:
            - type: gvisor
        name: pool-1
        machine:
          type: c5.xlarge
          image:
            name: ubuntu
            version: 18.4.20210415
        maximum: 1
        minimum: 1
        maxSurge: 3
        maxUnavailable: 3
        volume:
          type: gp2
          size: 40Gi
          encrypted: true
        zones:
          - eu-west-1a
  region: eu-west-1

Anything else we need to know?:
The above mitigation is not tested on other operating systems where the current configuration is still working, so before changing it, it should be validated with them, too.

Environment:

  • Gardener version (if relevant): v1.28.1
  • Extension version: v0.1.0
  • Kubernetes version (use kubectl version): v1.18.16
  • Cloud provider or hardware configuration:
  • Others:
    • Containerd Version: containerd github.com/containerd/containerd 1.5.2-0ubuntu1~18.04.2
    • Ubuntu version: 18.4.20210415/ami-0943382e114f188e8
