vexxhost / magnum-cluster-api

Cluster API driver for OpenStack Magnum

License: Apache License 2.0

magnum-cluster-api's Issues

Better handling for default volume

The default volume is not properly selected, we should either rely on Cinder to select the default volume if none is specified or fail early.

Bubble errors up to Magnum API

At the moment, errors in the Cluster API do not bubble up into the Magnum API, for example:

  Warning  Failedcreatenetwork  11m (x19 over 34m)  openstack-controller  Failed to create network k8s-clusterapi-cluster-magnum-system-cluster-2-ikhba5qojt: Expected HTTP response code [201 202] when accessing [POST https://network.openstack.cloud.local/v2.0/networks], but got 409 instead
{"NeutronError": {"type": "OverQuota", "message": "Quota exceeded for resources: ['network'].", "detail": ""}}

We need to bubble those errors up to the Magnum API, or at least the quota check failures.
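
A minimal sketch of one possible approach, assuming the driver's existing pykube client and the magnum-system namespace used elsewhere on this page; the helper name and the event filtering are illustrative only:

import pykube

def collect_warning_events(k8s_api, capi_cluster_name):
    """Summarize Warning events related to a CAPI cluster (illustrative).

    The summary could then be written into the Magnum cluster's status_reason.
    """
    events = pykube.Event.objects(k8s_api, namespace="magnum-system").filter(
        field_selector={"type": "Warning"}
    )
    messages = [
        event.obj["message"]
        for event in events
        if capi_cluster_name in event.obj.get("involvedObject", {}).get("name", "")
    ]
    # Keep only the most recent few so the result fits in status_reason.
    return "; ".join(messages[-3:])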

Add support for `boot_volume_{size,type}`

There are a few labels that are used inside Magnum when doing boot from volume:

  • boot_volume_size
  • boot_volume_type

There is also a corresponding set of functions that allows you to get the volume type:

https://github.com/openstack/magnum/blob/16bdedcf2fe6986c995bd415f4e3c70dac914ada/magnum/common/cinder.py#L25-L37

and some defaults:

It looks like for now, we'll be able to support boot from volume here:

https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/main/docs/book/src/clusteropenstack/configuration.md#boot-from-volume

so if boot_volume_size and boot_volume_type are set (or their defaults are), then we point to that.
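
For reference, a rough sketch of how those labels could map onto the CAPO boot-from-volume stanza; the rootVolume field names follow the linked CAPO documentation and should be checked against the CAPO API version in use:

def generate_root_volume(labels):
    """Illustrative: build the rootVolume dict for an OpenStackMachineTemplate."""
    boot_volume_size = int(labels.get("boot_volume_size", 0))
    boot_volume_type = labels.get("boot_volume_type", "")
    if boot_volume_size <= 0:
        # No boot-from-volume requested; boot from the image instead.
        return {}
    root_volume = {"diskSize": boot_volume_size}
    if boot_volume_type:
        root_volume["volumeType"] = boot_volume_type
    return root_volume

For example, {"diskSize": 50, "volumeType": "ssd"} would end up under spec.template.spec.rootVolume of the OpenStackMachineTemplate.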

Integrate with Kolla Ansible

Hello.
I want to test this project with a multi-node deployment. Can we have some guidelines for it?
Thanks

Topology managed health checks

The health checks are currently built by creating MHC resources manually; this should be changed to use the topology so they can be managed by it.

Integration with Keystone auth

When following the instructions in #56 my cluster create fails with an error resembling the following:

{
  "status": "CREATE_FAILED",
  .
  .
  "status_reason": "(https://10.10.4.4:35357/v3/users/590bc654b1014a2fa568631ba399524a/application_credentials): The resource could not be found. (HTTP 404) (Request-ID: req-6204ca58-87f5-4868-bc2c-ba4159399fba)",
  .
  .
}

My question: is there a way around using application credentials to authenticate? We are using a somewhat older version of Keystone (Pike) and are probably not going to be able to upgrade very soon.

Management cluster is not upgraded

When trying to bump mcapi, it was supposed to upgrade the management cluster rather than initialize it again.

i.e.:

TASK [vexxhost.atmosphere.magnum : Initialize the management cluster] ******************************************************************************************************************************************************
fatal: [usegonslp020.vistex.local]: FAILED! => {"changed": false, "cmd": ["clusterctl", "init", "--config", "/etc/clusterctl.yaml", "--core", "cluster-api:v1.3.3", "--bootstrap", "kubeadm:v1.3.3", "--control-plane", "kubeadm:v1.3.3", "--infrastructure", "openstack:v0.7.1"], "delta": "0:00:03.355675", "end": "2023-04-25 13:23:42.272366", "msg": "non-zero return code", "rc": 1, "start": "2023-04-25 13:23:38.916691", "stderr": "Fetching providers\nError: installing provider \"infrastructure-openstack\" can lead to a non functioning management cluster: there is already an instance of the \"infrastructure-openstack\" provider installed in the \"capo-system\" namespace", "stderr_lines": ["Fetching providers", "Error: installing provider \"infrastructure-openstack\" can lead to a non functioning management cluster: there is already an instance of the \"infrastructure-openstack\" provider installed in the \"capo-system\" namespace"], "stdout": "", "stdout_lines": []}

IPIP rules in security groups not applying properly

IPIP encapsulated DNS requests from pods to coredns get blocked by iptables.
On closer inspection, the packets don't get picked up by the security group rules:

ID                                   | IP Protocol | Ethertype | IP Range  | Port Range  | Remote Security Group
6a9c3c58-bdd0-474e-8bf8-e8e248ad2cc6 | ipip        | IPv4      | 0.0.0.0/0 |             | ba33108a-ceb1-4ef0-98cd-455fa916166f
c031473b-9da4-414f-94e3-425aa34dc2b8 | ipip        | IPv4      | 0.0.0.0/0 |             | 76192aca-20eb-43e6-bd1c-331979e57157

Changing the IP protocol from ipip to 4 does the trick.

ID                                   | IP Protocol | Ethertype | IP Range  | Port Range  | Remote Security Group
6a9c3c58-bdd0-474e-8bf8-e8e248ad2cc6 | 4           | IPv4      | 0.0.0.0/0 |             | ba33108a-ceb1-4ef0-98cd-455fa916166f
c031473b-9da4-414f-94e3-425aa34dc2b8 | 4           | IPv4      | 0.0.0.0/0 |             | 76192aca-20eb-43e6-bd1c-331979e57157

What I don't understand is how we are hitting these issues, as the fix has been merged since March 2020.

projectcalico/calico#2700
projectcalico/calico#2111

Dynamic `ClusterClass` version

The ClusterClass has a static name; we should probably use something like pbr to generate a dynamic name for the ClusterClass.

Non-existing resources should not stop cluster deletion

Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server [None req-8f39cccc-1fc0-4aab-9062-b9fea14ece2e None None] Exception during message handling: pykube.exceptions.ObjectDoesNotExist: k8s-v1.25.3-74wufkfiei-cloud-config does not exist.
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.8/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.8/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.8/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/usr/local/lib/python3.8/dist-packages/osprofiler/profiler.py", line 159, in wrapper
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/opt/stack/magnum/magnum/conductor/handlers/cluster_conductor.py", line 191, in cluster_delete
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     cluster_driver.delete_cluster(context, cluster)
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/home/ubuntu/magnum-cluster-api/magnum_cluster_api/driver.py", line 194, in delete_cluster
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     resources.Cluster(context, self.k8s_api, cluster).delete()
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/home/ubuntu/magnum-cluster-api/magnum_cluster_api/resources.py", line 56, in delete
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     resource = self.get_object()
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/home/ubuntu/magnum-cluster-api/magnum_cluster_api/resources.py", line 1299, in get_object
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     utils.generate_cloud_controller_manager_config(
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/home/ubuntu/magnum-cluster-api/magnum_cluster_api/utils.py", line 56, in generate_cloud_controller_manager_config
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     data = pykube.Secret.objects(api, namespace="magnum-system").get_by_name(
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server   File "/home/ubuntu/.local/lib/python3.8/site-packages/pykube/query.py", line 116, in get_by_name
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server     raise ObjectDoesNotExist(f"{name} does not exist.")
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server pykube.exceptions.ObjectDoesNotExist: k8s-v1.25.3-74wufkfiei-cloud-config does not exist.
Nov 07 13:31:09 devstack magnum-conductor[2806024]: ERROR oslo_messaging.rpc.server 

Have a name converter between openstack and kubernetes resources

Context

We create several Kubernetes resources inside this m-capi driver, and their names sometimes include the names of related OpenStack resources.
For instance, we name the Kubernetes storage classes after the corresponding OpenStack volume type.
OpenStack allows the volume type name to include `+`, but Kubernetes doesn't allow this character in a resource name, like this:

f"storageclass-{vt.name}.yaml": yaml.dump(
.

This causes a Kubernetes resource creation error.

Suggested solution

Add a utility function that converts an OpenStack resource name into a Kubernetes-compatible name whenever it is used in a Kubernetes resource name.
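
A minimal sketch of such a utility, assuming the usual RFC 1123 rules (lowercase alphanumerics and "-", at most 63 characters); the function name is illustrative:

import re

def to_kubernetes_name(name, max_length=63):
    """Convert an OpenStack resource name into an RFC 1123 compliant name."""
    name = name.lower()
    # Replace anything that is not a lowercase alphanumeric or "-" with "-".
    name = re.sub(r"[^a-z0-9-]+", "-", name)
    # Names must start and end with an alphanumeric character.
    name = name.strip("-")
    return name[:max_length].rstrip("-")

For example, to_kubernetes_name("ssd+encrypted") returns "ssd-encrypted".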

Dynamically add all StorageClass

An OpenStack cloud can have many volume types. Instead of having a disconnect between the Kubernetes cluster and the OpenStack cloud, we should automatically poll for all the volume types and create a StorageClass for each of them.

At the same time, we will also make the default StorageClass the one that is the default in the cloud, which will keep things nice and simple for the user.
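
A rough sketch of how the manifests could be rendered, assuming the list of volume type names and the cloud default have already been fetched from Cinder; the helper name is illustrative and the provisioner is the upstream Cinder CSI one:

import yaml

def generate_storage_classes(volume_types, default_type):
    """Render one StorageClass per volume type; mark the cloud default as default."""
    manifests = {}
    for vt in volume_types:
        manifests[f"storageclass-{vt}.yaml"] = yaml.dump(
            {
                "apiVersion": "storage.k8s.io/v1",
                "kind": "StorageClass",
                "metadata": {
                    "name": vt,  # would need name sanitization, see the issue above
                    "annotations": {
                        "storageclass.kubernetes.io/is-default-class": str(
                            vt == default_type
                        ).lower(),
                    },
                },
                "provisioner": "cinder.csi.openstack.org",
                "parameters": {"type": vt},
            }
        )
    return manifests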

Implement auto healing features

In Magnum, when the auto_healing_enabled label is set to true, it enables the magnum-auto-healer. However, since we are using the Cluster API, we can rely on its built-in health checking feature:

https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html

We need to add a MachineHealthCheck resource if auto_healing_enabled is set to true. Also, unlike how Magnum defaults to auto_healing_enabled set to false (i.e. not enabled), we should always enable it since it would be largely beneficial for the user to have it enabled (better user experience).

We'll also have to factor in cluster label updates in case the user wants to enable/disable it dynamically, so we'd end up with the following (a rough sketch of the MachineHealthCheck resource follows the list):

  • Update create_cluster to add MachineHealthCheck if the label is either on cluster or cluster template (might need to move this function to a utils.py)
  • Update update_cluster to add/remove MachineHealthCheck depending on the value of the label
  • Test auto healing by shutting down a node and seeing if Cluster API autoheals it
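
A rough sketch of the MachineHealthCheck that create_cluster could apply, based on the health-checking documentation linked above; the selector labels and timeouts are placeholders:

def generate_machine_health_check(cluster_name, namespace="magnum-system"):
    """Illustrative MachineHealthCheck for a cluster with auto healing enabled."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "MachineHealthCheck",
        "metadata": {
            "name": f"{cluster_name}-unhealthy-nodes",
            "namespace": namespace,
        },
        "spec": {
            "clusterName": cluster_name,
            "selector": {
                "matchLabels": {"cluster.x-k8s.io/cluster-name": cluster_name},
            },
            "nodeStartupTimeout": "10m",
            "unhealthyConditions": [
                {"type": "Ready", "status": "Unknown", "timeout": "300s"},
                {"type": "Ready", "status": "False", "timeout": "300s"},
            ],
        },
    }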

Add Manila CSI

We should detect what services are running in the cloud (such as Manila or Cinder) and then install the appropriate CSI.

We already deploy the Cinder CSI unconditionally; instead, we should deploy it only if we detect the Cinder service, and the same goes for Manila. That way we finally get ReadWriteMany volumes! :)
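
A minimal sketch of the kind of catalog check that could drive this, assuming a keystoneauth session is available; the service types (volumev3, sharev2) are the commonly used ones and may need adjusting per cloud:

from keystoneauth1 import exceptions as ks_exceptions

def detect_csi_services(session):
    """Return which block/share services exist in the cloud (illustrative)."""
    services = {}
    for name, service_type in (("cinder", "volumev3"), ("manila", "sharev2")):
        try:
            session.get_endpoint(service_type=service_type)
            services[name] = True
        except ks_exceptions.EndpointNotFound:
            services[name] = False
    return services

For example, {"cinder": True, "manila": False} would mean deploying only the Cinder CSI.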

Autoscaling

Auto-scaling is a useful feature that used to exist inside Magnum; however, the Cluster API has built-in autoscaling. We should enable it when auto_scaling_enabled is set to true.

I have not done significant research into this, so I will update this later once we have an idea how to get it all to work.

Review `kube-bench` results

This is an issue to track which ones we need to take care of, or offer as an option for the user to configure:

  • kube-bench: [WARN] 1.1.9 Ensure that the Container Network Interface file permissions are set to 644 or more restrictive (Manual)
  • kube-bench: [WARN] 1.1.10 Ensure that the Container Network Interface file ownership is set to root:root (Manual)
  • #28
  • kube-bench: [WARN] 1.2.1 Ensure that the --anonymous-auth argument is set to false (Manual)
  • #29
  • kube-bench: [WARN] 1.2.10 Ensure that the admission control plugin EventRateLimit is set (Manual)
  • kube-bench: [WARN] 1.2.12 Ensure that the admission control plugin AlwaysPullImages is set (Manual)
  • kube-bench: [WARN] 1.2.13 Ensure that the admission control plugin SecurityContextDeny is set if PodSecurityPolicy is not used (Manual)
  • #30
  • #31
  • #32
  • #33
  • #34
  • kube-bench: [WARN] 1.2.23 Ensure that the --request-timeout argument is set as appropriate (Manual)
  • kube-bench: [WARN] 1.2.30 Ensure that the --encryption-provider-config argument is set as appropriate (Manual)
  • kube-bench: [WARN] 1.2.31 Ensure that encryption providers are appropriately configured (Manual)
  • kube-bench: [WARN] 1.2.32 Ensure that the API Server only makes use of Strong Cryptographic Ciphers (Manual)
  • kube-bench: [WARN] 1.3.1 Ensure that the --terminated-pod-gc-threshold argument is set as appropriate (Manual)
  • #35
  • #36
  • #37
  • kube-bench: [WARN] 4.2.9 Ensure that the --event-qps argument is set to 0 or a level which ensures appropriate event capture (Manual)
  • kube-bench: [WARN] 4.2.10 Ensure that the --tls-cert-file and --tls-private-key-file arguments are set as appropriate (Manual)
  • kube-bench: [WARN] 4.2.13 Ensure that the Kubelet only makes use of Strong Cryptographic Ciphers (Manual)

Magnum/CAPI provisioned Kubernetes cluster name length

Is there a way to customize the name used by the Magnum API (using the CAPI driver) built into Atmosphere? We are running into issues where operators are adding labels to artifacts that exceed the 63-character limit built into Kubernetes, and a big chunk of the label is the cluster name in these cases. It would be nice if we could define a name (such as prod-k8s-00 or something like that) that tells the user exactly what it is used for and doesn't include the additional Magnum (and/or CAPI/CAPO) syntax.

trivy k8s cluster --compliance k8s-cis --report summary 
2023-03-25T16:59:00.124-0400	FATAL	get k8s artifacts error: running node-collector job: Job.batch "node-collector-foo-test-4admbibsdj-default-worker-infra-rb4bh-f2h5t" is invalid: spec.template.labels: Invalid value: "node-collector-foo-test-4admbibsdj-default-worker-infra-rb4bh-f2h5t": must be no more than 63 characters

`kube-bench`: [FAIL] 1.2.6 Ensure that the --kubelet-certificate-authority argument is set as appropriate (Automated)

1.2.6 Follow the Kubernetes documentation and setup the TLS connection between
the apiserver and kubelets. Then, edit the API server pod specification file
/etc/kubernetes/manifests/kube-apiserver.yaml on the control plane node and set the
--kubelet-certificate-authority parameter to the path to the cert file for the certificate authority.
--kubelet-certificate-authority=<ca-string>

Remove all load balancers pre-delete

At the moment, the load balancers are not removed by CAPO.

We can work around this by doing something similar to what Magnum does when cleaning up its resources:

https://github.com/openstack/magnum/blob/0ee8abeed0ab90baee98a92cab7c684313bab906/magnum/drivers/heat/driver.py#L306-L311

FTR, pre_delete_cluster is called manually, so we can just add it at the top of our delete_cluster. The function that's already implemented seems tied to Heat, so these two parts of the code should help:

https://github.com/openstack/magnum/blob/0ee8abeed0ab90baee98a92cab7c684313bab906/magnum/common/octavia.py#L89-L101
https://github.com/openstack/magnum/blob/0ee8abeed0ab90baee98a92cab7c684313bab906/magnum/common/octavia.py#L131-L137

With that in place, we'll be able to wipe all of the resources. However, one thing to investigate is the cluster UUID in the load balancer description and what it is set to with CAPI, to make sure the regex works correctly.
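
A minimal sketch of what that could look like, assuming magnum.common.octavia.delete_loadbalancers keeps the (context, cluster) signature shown in the linked code; as noted above, the UUID-in-description matching still needs to be verified for CAPI-created load balancers:

from magnum.common import octavia

class Driver:  # stand-in for the real driver class
    def delete_cluster(self, context, cluster):
        # Delete Octavia load balancers associated with this cluster first,
        # since CAPO does not remove load balancers created by the in-cluster
        # cloud provider or ingress controllers.
        octavia.delete_loadbalancers(context, cluster)
        # ... then continue with the existing Cluster API resource deletion.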

Support `ingress_controller`

We'll need to add the ability to install an Ingress controller onto the cluster. Magnum supports both Octavia and Nginx, but we'll start with Nginx at least.

This will be handled with a ClusterResourceSet that gets applied.
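
A placeholder sketch of that ClusterResourceSet, assuming the rendered ingress-nginx manifests are stored in a ConfigMap; all names are illustrative:

def generate_ingress_crs(cluster_name, namespace="magnum-system"):
    """Illustrative ClusterResourceSet that applies the ingress-nginx manifests."""
    return {
        "apiVersion": "addons.cluster.x-k8s.io/v1beta1",
        "kind": "ClusterResourceSet",
        "metadata": {
            "name": f"{cluster_name}-ingress-nginx",
            "namespace": namespace,
        },
        "spec": {
            "clusterSelector": {
                "matchLabels": {"cluster.x-k8s.io/cluster-name": cluster_name},
            },
            "resources": [
                {"kind": "ConfigMap", "name": f"{cluster_name}-ingress-nginx"},
            ],
            "strategy": "ApplyOnce",
        },
    }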

Cluster deletion stuck in DELETE_IN_PROGRESS

Context

The cluster (3 masters and 1 worker) failed to create because of a lack of resources: 2 masters and 1 worker were created, and the creation of the 3rd master failed.
The cluster was then deleted, but it hangs in DELETE_IN_PROGRESS status.

$ kubectl get clusters
NAMESPACE       NAME                     PHASE      AGE    VERSION
magnum-system   k8s-v1-25-3-8mic9qdzdl   Deleting   160m   v1.25.3

$ kubectl describe openstackclusters -n magnum-system
...
Events:
  Type     Reason                            Age                From                  Message
  ----     ------                            ----               ----                  -------
  Normal   Successfuldisassociatefloatingip  23m                openstack-controller  Disassociated floating IP 172.24.4.74
  Normal   Successfuldeletefloatingip        23m                openstack-controller  Deleted floating IP 172.24.4.74
  Normal   Successfuldeleteloadbalancer      23m                openstack-controller  Deleted load balancer k8s-clusterapi-cluster-magnum-system-k8s-v1-25-3-8mic9qdzdl-kubeapi with id edc6de48-71e3-4da4-b98d-15ad258ba319
  Warning  Faileddeleteloadbalancer          23m (x5 over 23m)  openstack-controller  Failed to delete load balancer k8s-clusterapi-cluster-magnum-system-k8s-v1-25-3-8mic9qdzdl-kubeapi with id edc6de48-71e3-4da4-b98d-15ad258ba319: Expected HTTP response code [202 204] when accessing [DELETE http://38.108.68.181/load-balancer/v2.0/lbaas/loadbalancers/edc6de48-71e3-4da4-b98d-15ad258ba319?cascade=true], but got 409 instead
{"faultcode": "Client", "faultstring": "Invalid state PENDING_DELETE of loadbalancer resource edc6de48-71e3-4da4-b98d-15ad258ba319", "debuginfo": null}
  Warning  Faileddeletesecuritygroup  101s (x14 over 23m)  openstack-controller  Failed to delete security group k8s-cluster-magnum-system-k8s-v1-25-3-8mic9qdzdl-secgroup-controlplane with id 56f209e7-3ed4-4880-83b0-5a52284c9e8d: Expected HTTP response code [202 204] when accessing [DELETE http://38.108.68.181:9696/networking/v2.0/security-groups/56f209e7-3ed4-4880-83b0-5a52284c9e8d], but got 409 instead
{"NeutronError": {"type": "SecurityGroupInUse", "message": "Security Group 56f209e7-3ed4-4880-83b0-5a52284c9e8d in use.", "detail": ""}}
ubuntu@magnum-capi-driver:~$ source /opt/stack/openrc admin admin
WARNING: setting legacy OS_TENANT_NAME to support cli tools.

The load balancer was eventually deleted, but the security group is not deleted because it is in use by an undeleted port.

+--------------------------------------+----------------------------------------------------+-------------------+----------------------------------------------------------------------------------------------------+--------+
| ID                                   | Name                                               | MAC Address       | Fixed IP Addresses                                                                                 | Status |
+--------------------------------------+----------------------------------------------------+-------------------+----------------------------------------------------------------------------------------------------+--------+
| 94720b71-091a-42aa-b758-b0675feee028 | k8s-v1-25-3-8mic9qdzdl-control-plane-8xbtc-vwrjx-0 | fa:16:3e:e6:1e:99 | ip_address='10.6.0.95', subnet_id='33540296-1071-4227-8cb7-fab4d042ea5e'                           | DOWN   |
+--------------------------------------+----------------------------------------------------+-------------------+----------------------------------------------------------------------------------------------------+--------+

Only one port remains undeleted. I guess this is the port that was bound to the 3rd master node.

Workaround

Manually delete the dangling port so the security group can be deleted by CAPO.

Cluster upgrades

We've got to implement cluster upgrades, which will allow us to go from one Kubernetes release to another; we will be relying on the Cluster API cluster upgrade mechanism to get this done.

README.md

  • Update it to include instructions for creating both 1.24 and 1.25 images and cluster templates

upgrade_cluster driver function

  • Recreate all of the MachineTemplates with the new values (see steps 1-3)
  • Update the version value after recreating the MachineTemplate so it forces a rollout for control plane
  • Update the cluster.x-k8s.io/restartedAt annotation to force a rollout after updating MachineTemplate (see "How to schedule a machine rollout")
  • Validate that the cluster ends up in UPDATE_COMPLETE
  • Validate that the cluster is now fully upgraded.

The above can be tested by upgrading from 1.24 to 1.25; we assume that Cluster API can take care of the upgrade cleanly, and we're just focused on making sure the Magnum/Cluster API interaction works properly (a rough sketch of the version-bump step follows).
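
A rough sketch of that version-bump step, assuming pykube objects for the KubeadmControlPlane and MachineDeployments; it omits the MachineTemplate recreation, and the exact placement of the restartedAt annotation should be double-checked against the rollout documentation:

import datetime

def bump_kubernetes_version(kcp, machine_deployments, new_version):
    """Illustrative: roll the control plane and workers onto a new version."""
    restarted_at = datetime.datetime.utcnow().isoformat() + "Z"

    # Control plane: updating spec.version triggers a control-plane rollout.
    kcp.obj["spec"]["version"] = new_version
    kcp.update()

    # Workers: bump the version and force a rollout via the restartedAt
    # annotation, as described in "How to schedule a machine rollout".
    for md in machine_deployments:
        md.obj["spec"]["template"]["spec"]["version"] = new_version
        md.obj["spec"]["template"].setdefault("metadata", {}).setdefault(
            "annotations", {}
        )["cluster.x-k8s.io/restartedAt"] = restarted_at
        md.update()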

Object handling requires some improvement

Found errors like:

2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall [-] Fixed interval looping call 'magnum.service.periodic.ClusterUpdateJob.update_status' failed: KeyError: 'status'
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall Traceback (most recent call last):
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/magnum/service/periodic.py", line 73, in update_status
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall     cdriver.update_cluster_status(self.ctx, self.cluster)
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/magnum_cluster_api/driver.py", line 55, in update_cluster_status
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall     node_groups = [
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/magnum_cluster_api/driver.py", line 56, in <listcomp>
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall     self.update_nodegroup_status(context, cluster, node_group)
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/magnum_cluster_api/driver.py", line 206, in update_nodegroup_status
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall     generation = kcp.obj["status"].get("observedGeneration")
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall KeyError: 'status'
2023-02-07 22:50:52.737 1 ERROR oslo.service.loopingcall 

or

2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall [-] Fixed interval looping call 'magnum.service.periodic.ClusterUpdateJob.update_status' failed: AttributeError: 'NoneType' object has no attribute 'reload'
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall Traceback (most recent call last):
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/magnum/service/periodic.py", line 73, in update_status
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall     cdriver.update_cluster_status(self.ctx, self.cluster)
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall   File "/var/lib/openstack/lib/python3.10/site-packages/magnum_cluster_api/driver.py", line 69, in update_cluster_status
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall     capi_cluster.reload()
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall AttributeError: 'NoneType' object has no attribute 'reload'
2023-02-07 22:50:46.717 1 ERROR oslo.service.loopingcall 
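
Both tracebacks look like the KubeadmControlPlane or the CAPI Cluster object either does not exist yet or has no status yet; a minimal defensive sketch (names are illustrative):

def get_observed_generation(kcp):
    """Return observedGeneration, or None if the KCP has no status yet."""
    if kcp is None:
        return None
    return kcp.obj.get("status", {}).get("observedGeneration")

def reload_capi_cluster(capi_cluster):
    """Reload the CAPI Cluster object, tolerating the not-created-yet case."""
    if capi_cluster is None:
        # The Cluster object does not exist (yet, or anymore); let the caller
        # map this onto the appropriate Magnum status instead of raising.
        return None
    capi_cluster.reload()
    return capi_cluster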

Cluster status is not updated properly

I've been noticing that every action taken on a Magnum cluster takes up to 1 minute to be reflected in the Magnum API.

In some cases, like a sequence of rolling upgrades, the second time we upgrade a cluster (with UPDATE_COMPLETE status), it keeps performing the upgrade but doesn't update the Magnum status to UPDATE_IN_PROGRESS.

This might confuse the end user about which operation is being performed at the moment.

Possible missing Apache2 License?

It would be great to collaborate on this, and merge it with our helm approach.

Would you be willing to add an Apache 2 license so we could do some of that, please?

Support `container_infra_prefix`

At the moment, all of the images are being pulled directly from the internet. This can be a problem in air-gapped environments or places where internet access might not be reliable.

We've got to add container_infra_prefix so that images can be pulled from a local registry, and provide a clean script for loading said custom registry with all of those images.
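
A minimal sketch of how an image reference could be rewritten onto the local registry; whether only the final image component should be kept (as done here) is an open question:

def rewrite_image(image, container_infra_prefix):
    """e.g. rewrite_image("quay.io/calico/cni:v3.24.2", "registry.local:5000")
    returns "registry.local:5000/cni:v3.24.2"
    """
    if not container_infra_prefix:
        return image
    name_and_tag = image.rsplit("/", 1)[-1]
    return f"{container_infra_prefix.rstrip('/')}/{name_and_tag}"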

Add release infrastructure

We need to set up something like release-please, which will help us track releases and then push them to PyPI to get started.

Add support for `docker_volume_{size,type}`

At the moment, we cannot use this feature because CAPO does not support mounting multiple volumes:

kubernetes-sigs/cluster-api-provider-openstack#1286

Once that's in place, we can use the following labels to create volumes:

  • etcd_volume_size
  • etcd_volume_type
  • docker_volume_size
  • docker_volume_type

We can use the following functions to determine the volume type (if size is set):

https://github.com/openstack/magnum/blob/16bdedcf2fe6986c995bd415f4e3c70dac914ada/magnum/common/cinder.py#L25-L37

And we can get the sizes from labels/API objects.
