
bf2fc6cc711aee1a0c2a / kas-fleetshard

The kas-fleetshard-operator is responsible for provisioning and managing instances of kafka on a cluster. The kas-fleetshard-synchronizer synchronizes the state of a fleet shard with the kas-fleet-manager.

License: Apache License 2.0

Java 95.48% Makefile 0.12% 1C Enterprise 0.02% Python 1.37% Shell 3.00%

kas-fleetshard's Introduction

kas-fleetshard

Running

WARNING: the kas-fleetshard operator currently requires a Strimzi operator to be running on your Kubernetes/OpenShift cluster.

kubectl create namespace kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka

The first step is to build and install the project, which allows fabric8 to generate the ManagedKafka CRDs.

mvn install

After that, apply the generated CRDs to the Kubernetes/OpenShift cluster by running the following commands.

kubectl apply -f operator/target/kubernetes/managedkafkas.managedkafka.bf2.org-v1.yml
kubectl apply -f operator/target/kubernetes/managedkafkaagents.managedkafka.bf2.org-v1.yml

Finally, you can start the operator from your IDE by running the Main application (useful for step-by-step debugging), or from the command line with Quarkus in "dev" mode by running the following command. If you're running against vanilla Kubernetes, add -Dkafka=dev so that the operator doesn't assume OLM and related components are available.

# OpenShift
mvn -pl operator quarkus:dev

# OR
# Vanilla Kubernetes
mvn -pl operator quarkus:dev -Dkafka=dev

NOTE: Quarkus will start a debugger listener on port 5005, to which you can attach from your IDE.

Testing

Read Testing guide

OLM Bundle Generation

See the bundle module README

Releasing

Milestones

Each release requires an open milestone that includes the issues/pull requests that are part of the release. All issues in the release milestone must be closed. The name of the milestone must match the version number to be released.

Configuration

The release action flow requires that the following secrets are configured in the repository:

  • IMAGE_REPO_HOSTNAME - the host (optionally including a port number) of the image repository where images will be pushed
  • IMAGE_REPO_NAMESPACE - namespace/library/user where the image will be pushed
  • IMAGE_REPO_USERNAME - username for authentication to the server identified by IMAGE_REPO_HOSTNAME
  • IMAGE_REPO_PASSWORD - password for authentication to the server identified by IMAGE_REPO_HOSTNAME
  • RELEASE_TOKEN - GitHub token for the account performing the release (i.e. a bot account). The commits generated by the release process will be authored by this account. Currently configured to be the bf2-ci-bot

These credentials will be used to push the release image to the repository configured in the .github/workflows/release.yml workflow.

Performing the Release

Release Branch

Optional - only required when a release branch is needed for patch releases. Routine releases should target the main branch. Patch/release branches are only necessary when main includes commits that should not be included in the patch release.

Create a new branch from the tag being patched. For example, if the release being patched is version/tag 0.22.0, create a new branch 0.22.x from that tag.

git checkout -b 0.22.x 0.22.0
git push upstream 0.22.x

Follow the steps in the pull request section, using branch 0.22.x as the target of the PR for release 0.22.1. If you are already releasing from an existing release branch, skip the branch-creation step above and simply check out that branch and open the PR as described below.

Pull Request

Releases are performed by modifying the .github/project.yml file, setting current-version to the release version and next-version to the next SNAPSHOT. Open a pull request with the changed project.yml to initiate the pre-release workflows. The target of the pull request should be either main or a release branch (described above).

At this stage, the workflow checks the project milestone and verifies that no issues in the release milestone are still open. Additionally, the project's integration tests are run.

Once the pull request is approved and merged, the release action will execute. This action runs the Maven release plugin to tag the release commit, build the application artifacts, create the build image, and push the image to the repository identified by the secret IMAGE_REPO_HOSTNAME. If successful, the action pushes the new tag to the GitHub repository and generates release notes listing all of the closed issues included in the milestone. Finally, the milestone is closed.

Contributing

Run mvn clean process-sources (or almost any mvn command) to automatically format your code contribution prior to creating a pull request.

kas-fleetshard's People

Contributors

agullon, akoserwal, bf2-ci-bot, bf2robot, biswassri, dependabot[bot], devguyio, echernous, franvila, fvaleri, grdryn, k-wall, katheris, kornys, metacosm, mikeedgar, npecka, ppatierno, racheljpg, rareddy, robobario, sambarker, shawkins, showuon, shubhamrwt, stuartwdouglas, tinaselenge, tplevko, vbusch


kas-fleetshard's Issues

Audit logging for consumer group leader still a bit noisy

Looking at the audit logging for a consumer group leader, it is still a bit noisy, with regular HEARTBEAT, OFFSET_COMMIT and DESCRIBE_GROUPS entries.

HEARTBEAT and DESCRIBE_GROUPS requests should probably be demoted to debug.

I'm less certain about OFFSET_COMMIT - I think INFO is probably okay as this is logically a mutating operation. Maybe there's a case for reducing it to DEBUG for the strimzi-canary-group?

WDYT @grdryn @MikeEdgar

2021-10-13T21:34:49Z INFO  [data-plane-kafka-request-handler-7] [CustomAclAuthorizer:265] Principal = OAuthKafkaPrincipal(User:canary-c5jku5g8knc542lfuo8g, session:1131829585, token:eyJh**OhvQ) is Allowed Operation = Read from host = 10.129.4.37 via listener TLS-9093 on resource = Group:LITERAL:strimzi-canary-group for request = OFFSET_COMMIT with resourceRefCount = 1
2021-10-13T21:34:50Z INFO  [data-plane-kafka-request-handler-0] [CustomAclAuthorizer:265] Principal = OAuthKafkaPrincipal(User:canary-c5jku5g8knc542lfuo8g, session:1131829585, token:eyJh**OhvQ) is Allowed Operation = Read from host = 10.129.4.37 via listener TLS-9093 on resource = Group:LITERAL:strimzi-canary-group for request = HEARTBEAT with resourceRefCount = 1
...
2021-10-13T21:44:09Z INFO  [data-plane-kafka-request-handler-3] [CustomAclAuthorizer:265] super.user Principal = User:CN=foo-kafka-exporter,O=io.strimzi is Allowed Operation = Describe from host = 10.131.2.19 via listener REPLICATION-9091 on resource = Group:LITERAL:strimzi-canary-group for request = DESCRIBE_GROUPS with resourceRefCount = 1

Canary's quota exclusion mechanism broken

The original intent was that the canary's produce/consume activity was deliberately excluded from the quota mechanism. That is, if the customer happens to be using all of their bandwidth or all of their disk space, the canary should still be able to produce and consume messages unimpeded by the quota.

This was originally achieved by leaning on two features:

  • the quota component had a bypass feature that allowed anonymous traffic to skip the quota mechanism.
  • the canary used anonymous authentication.

These assumptions were inadvertently broken. This means that canary produce/consume traffic is potentially affected by end-user traffic.

Generated ./sync/target/kubernetes/kubernetes.yml uses deprecated serviceAccount name

The generated deployments within kubernetes.yml use the deprecated serviceAccount field rather than serviceAccountName. Whilst functional, this creates warnings from tooling further down the pipeline.

          volumeMounts:
            - mountPath: /config
              name: logging-config-volume
              readOnly: false
      serviceAccount: kas-fleetshard-sync   # should be serviceAccountName

@shawkins is this the operator SDK?

Multiple warnings during surefire/failsafe test

I see multiple warnings during surefire/failsafe tests.

  1. in agent-api
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
  2. in systemtest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/dkornel/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.13.3/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/dkornel/.m2/repository/org/jboss/slf4j/slf4j-jboss-logmanager/1.1.0.Final/slf4j-jboss-logmanager-1.1.0.Final.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Use an event to refine when the controller starts

See operator-framework/java-operator-sdk#417 (comment) - after #299 we can control when the controllers start. Even with the current blocking informer start, there is a corner case: a failure during the initial start. An exception during post-construct will cause Quarkus to simply retry the creation of the InformerManager (and the associated informers) when it is next requested. Since we have no explicit cleanup, this can leave dangling watches. We can hopefully mostly ignore this case, as the pod will fail its health checks while this is happening and will typically have to restart, which cleans everything up...

With the later SDK we can start the informers asynchronously from post-construct and signal controller startup when the start jobs are all finished.

Service account has no permission to work with routes on ocp cluster

The operator does not have the rights to work with routes on an OCP cluster.

14:55:52.999 [EventHandler-managedkafkacontroller] ERROR io.javaoperatorsdk.operator.processing.EventDispatcher - Error during event processing ExecutionScope{events=[CustomResourceEvent{action=MODIFIED, resource=[ name=mk-resource-recovery, kind=ManagedKafka, apiVersion=managedkafka.bf2.org/v1alpha1 ,resourceVersion=2270990, markedForDeletion: false ]}, { class=org.bf2.operator.events.KafkaEvent, relatedCustomResourceUid=7f3fcbae-55c1-4283-bc42-4f85c9f9b74a, eventSource=org.bf2.operator.events.KafkaEventSource@71febc62 }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=7f3fcbae-55c1-4283-bc42-4f85c9f9b74a, eventSource=org.bf2.operator.events.DeploymentEventSource@66aaf625 }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=7f3fcbae-55c1-4283-bc42-4f85c9f9b74a, eventSource=org.bf2.operator.events.DeploymentEventSource@66aaf625 }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=7f3fcbae-55c1-4283-bc42-4f85c9f9b74a, eventSource=org....
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.30.0.1/apis/route.openshift.io/v1/namespaces/mk-test-resources-recovery/routes/mk-resource-recovery-admin-server. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. routes.route.openshift.io "mk-resource-recovery-admin-server" is forbidden: User "system:serviceaccount:kas-fleetshard:kas-fleetshard-operator" cannot get resource "routes" in API group "route.openshift.io" in the namespace "mk-test-resources-recovery".
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:570) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:507) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:474) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:435) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:402) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:384) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:925) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:220) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:186) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:85) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at org.bf2.operator.operands.AdminServer.createOrUpdate(AdminServer.java:99) ~[classes/:?]
	at org.bf2.operator.operands.AdminServer_ClientProxy.createOrUpdate(AdminServer_ClientProxy.zig:510) ~[classes/:?]
	at org.bf2.operator.operands.KafkaInstance.createOrUpdate(KafkaInstance.java:26) ~[classes/:?]
	at org.bf2.operator.operands.KafkaInstance_ClientProxy.createOrUpdate(KafkaInstance_ClientProxy.zig:498) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController.createOrUpdateResource(ManagedKafkaController.java:78) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController.createOrUpdateResource(ManagedKafkaController.java:41) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController_ClientProxy.createOrUpdateResource(ManagedKafkaController_ClientProxy.zig:191) ~[classes/:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleCreateOrUpdate(EventDispatcher.java:100) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleDispatch(EventDispatcher.java:79) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleExecution(EventDispatcher.java:46) [io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.ExecutionConsumer.run(ExecutionConsumer.java:25) [io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Events for deployments without owners causing IOOBE

Running on CRC, I see messages like the following on startup:

2021-01-29 18:39:18,880 INFO  [org.bf2.ope.eve.DeploymentEventSource] (pool-8-thread-1) Add event received for Deployment kafka/strimzi-cluster-operator
2021-01-29 18:39:28,637 ERROR [io.fab.kub.cli.inf.cac.ProcessorListener] (pool-8-thread-1) Failed invoking Index 0 out of bounds for length 0 event handler: {}

It occurs for all deployments that lack an owner reference.

2021-01-29 18:42:05,039 ERROR [io.fab.kub.cli.inf.cac.ProcessorListener] (pool-8-thread-1) Failed invoking Index 0 out of bounds for length 0 event handler: {}
2021-01-29 18:42:05,039 INFO  [org.bf2.ope.eve.DeploymentEventSource] (pool-8-thread-1) Add event received for Deployment openshift-apiserver-operator/openshift-apiserver-operator
2021-01-29 18:42:05,553 ERROR [io.fab.kub.cli.inf.cac.ProcessorListener] (pool-8-thread-1) Failed invoking Index 0 out of bounds for length 0 event handler: {}
2021-01-29 18:42:05,553 INFO  [org.bf2.ope.eve.DeploymentEventSource] (pool-8-thread-1) Update event received for Deployment openshift-apiserver-operator/openshift-apiserver-operator
2021-01-29 18:42:05,931 ERROR [io.fab.kub.cli.inf.cac.ProcessorListener] (pool-8-thread-1) Failed invoking Index 0 out of bounds for length 0 event handler: {}
2021-01-29 18:42:05,931 INFO  [org.bf2.ope.eve.DeploymentEventSource] (pool-8-thread-1) Add event received for Deployment openshift-apiserver/apiserver

Digging in I see:

java.lang.IndexOutOfBoundsException:
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
	at java.base/java.util.Objects.checkIndex(Objects.java:373)
	at java.base/java.util.ArrayList.get(ArrayList.java:426)
	at org.bf2.operator.events.DeploymentEvent.<init>(DeploymentEvent.java:11)
	at org.bf2.operator.events.DeploymentEventSource.handleEvent(DeploymentEventSource.java:40)
	at org.bf2.operator.events.DeploymentEventSource.onAdd(DeploymentEventSource.java:19)
	at org.bf2.operator.events.DeploymentEventSource.onAdd(DeploymentEventSource.java:11)
	at org.bf2.operator.events.DeploymentEventSource_ClientProxy.onAdd(DeploymentEventSource_ClientProxy.zig:193)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorListener$AddNotification.handle(ProcessorListener.java:118)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorListener.run(ProcessorListener.java:57)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
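
The trace points at DeploymentEvent's constructor dereferencing the first entry of the owner-reference list. Below is a minimal sketch of the kind of guard that would avoid the IOOBE, assuming the event source only cares about deployments that carry an owner reference; the method name mirrors the trace, but the body is illustrative rather than the actual fix.

import io.fabric8.kubernetes.api.model.OwnerReference;
import io.fabric8.kubernetes.api.model.apps.Deployment;

import java.util.List;

public class DeploymentEventGuard {

    // Illustrative: only emit an event when the Deployment actually has an owner
    // reference to correlate with; unowned deployments (strimzi-cluster-operator,
    // the OpenShift operators, ...) are ignored instead of triggering
    // ownerReferences.get(0) on an empty list.
    public void handleEvent(Deployment deployment) {
        List<OwnerReference> owners = deployment.getMetadata().getOwnerReferences();
        if (owners == null || owners.isEmpty()) {
            return;
        }
        OwnerReference owner = owners.get(0);
        // ... build and dispatch the DeploymentEvent for owner.getUid() ...
    }
}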

Operator raises NPE when Strimzi operator doesn't update Kafka resource status

If the Strimzi operator is crash looping, or for any other reason doesn't take care of the created Kafka resource so that its Kafka.status is still null, the fleetshard operator raises the following NPE:

java.lang.NullPointerException
        at org.bf2.operator.operands.KafkaCluster.isInstalling(KafkaCluster.java:123)
        at org.bf2.operator.operands.KafkaCluster_ClientProxy.isInstalling(KafkaCluster_ClientProxy.zig:346)
        at org.bf2.operator.operands.KafkaInstance.isInstalling(KafkaInstance.java:40)
        at org.bf2.operator.operands.KafkaInstance_ClientProxy.isInstalling(KafkaInstance_ClientProxy.zig:188)
        at org.bf2.operator.controllers.ManagedKafkaController.updateManagedKafkaStatus(ManagedKafkaController.java:128)
        at org.bf2.operator.controllers.ManagedKafkaController.createOrUpdateResource(ManagedKafkaController.java:97)
        at org.bf2.operator.controllers.ManagedKafkaController.createOrUpdateResource(ManagedKafkaController.java:29)
        at org.bf2.operator.controllers.ManagedKafkaController_ClientProxy.createOrUpdateResource(ManagedKafkaController_ClientProxy.zig:191)
        at io.javaoperatorsdk.operator.processing.EventDispatcher.handleCreateOrUpdate(EventDispatcher.java:100)
        at io.javaoperatorsdk.operator.processing.EventDispatcher.handleDispatch(EventDispatcher.java:79)
        at io.javaoperatorsdk.operator.processing.EventDispatcher.handleExecution(EventDispatcher.java:46)
        at io.javaoperatorsdk.operator.processing.ExecutionConsumer.run(ExecutionConsumer.java:25)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
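
A sketch of the kind of null guard that avoids this follows, assuming isInstalling() inspects the Kafka resource's status conditions; the real KafkaCluster logic may differ, and treating a missing status as "still installing" is just one reasonable choice.

import io.strimzi.api.kafka.model.Kafka;

public class KafkaStatusGuard {

    // Illustrative null-safe variant: a Kafka with no status has not been
    // reconciled by Strimzi yet, so report it as still installing instead of
    // dereferencing the null status.
    public boolean isInstalling(Kafka kafka) {
        if (kafka == null || kafka.getStatus() == null || kafka.getStatus().getConditions() == null) {
            return true;
        }
        return kafka.getStatus().getConditions().stream()
                .noneMatch(c -> "Ready".equals(c.getType()) && "True".equals(c.getStatus()));
    }
}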

disable smoke tests

Even with several recent changes, the smoke tests still fail too often due to the length of time they take to run on GitHub. Since concerns have already been expressed about the amount we are spending on GitHub Actions, and we have not caught an issue with these tests in a while, it seems best to disable them.

Canary/AdminServer deployment patching fails - LabelSelectorRequirement(nil)}: field is immutable.

#252 broke the ability of the operator to patch canary/adminserver deployments. Kubernetes enforces that a Deployment's selector is immutable; #252 failed to respect that.

 io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://172.30.0.1/apis/apps/v1/namespaces/foo/deployments/foo-admin-server. Message: Deployment.apps "foo-admin-server" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app":"foo-admin-server", "app.kubernetes.io/component":"adminserver", "app.kubernetes.io/managed-by":"kas-fleetshard-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.selector, message=Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app":"foo-admin-server", "app.kubernetes.io/component":"adminserver", "app.kubernetes.io/managed-by":"kas-fleetshard-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable, reason=FieldValueInvalid, additionalProperties={})], group=apps, kind=Deployment, name=foo-admin-server, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Deployment.apps "foo-admin-server" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app":"foo-admin-server", "app.kubernetes.io/component":"adminserver", "app.kubernetes.io/managed-by":"kas-fleetshard-operator"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

Changing the selector labels isn't actually required for this change, so not changing them is probably the simplest way forward.

Exception in operator from operatorsdk

I don't know if it is a real problem, but I sometimes see this exception when executing tests in parallel against the operator.

It happens when the first test removed a ManagedKafka resource in namespace A and a second test tried to recreate a different ManagedKafka in namespace B.

12:15:08.915 [EventHandler-managedkafkacontroller] ERROR io.javaoperatorsdk.operator.processing.EventDispatcher - Error during event processing ExecutionScope{events=[{ class=org.bf2.operator.events.KafkaEvent, relatedCustomResourceUid=5ccb4a07-a9d7-416d-bbea-ad28028f1283, eventSource=org.bf2.operator.events.KafkaEventSource@21e1340b }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=5ccb4a07-a9d7-416d-bbea-ad28028f1283, eventSource=org.bf2.operator.events.DeploymentEventSource@2829a110 }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=5ccb4a07-a9d7-416d-bbea-ad28028f1283, eventSource=org.bf2.operator.events.DeploymentEventSource@2829a110 }], customResource uid: 5ccb4a07-a9d7-416d-bbea-ad28028f1283, version: 505910} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://172.30.0.1/apis/managedkafka.bf2.org/v1alpha1/namespaces/mk-test-create-check/managedkafkas/mk-create. Message: Operation cannot be fulfilled on managedkafkas.managedkafka.bf2.org "mk-create": StorageError: invalid object, Code: 4, Key: /kubernetes.io/managedkafka.bf2.org/managedkafkas/mk-test-create-check/mk-create, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 5ccb4a07-a9d7-416d-bbea-ad28028f1283, UID in object meta: . Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=managedkafka.bf2.org, kind=managedkafkas, name=mk-create, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on managedkafkas.managedkafka.bf2.org "mk-create": StorageError: invalid object, Code: 4, Key: /kubernetes.io/managedkafka.bf2.org/managedkafkas/mk-test-create-check/mk-create, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 5ccb4a07-a9d7-416d-bbea-ad28028f1283, UID in object meta: , metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:570) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:509) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:474) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:435) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleReplace(OperationSupport.java:286) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleReplace(OperationSupport.java:267) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleReplace(BaseOperation.java:876) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:75) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:80) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:30) ~[io.fabric8.kubernetes-client-5.0.0.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher$CustomResourceFacade.replaceWithLock(EventDispatcher.java:199) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.removeFinalizer(EventDispatcher.java:165) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleDelete(EventDispatcher.java:133) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleDispatch(EventDispatcher.java:77) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleExecution(EventDispatcher.java:46) [io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.ExecutionConsumer.run(ExecutionConsumer.java:25) [io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Fabric8 connection issue running system tests

Trying to run UpgradeST, I see an okhttp stream error. This occurs against CRC or an OSD cluster, regardless of whether retries are used. It seems to be related to sending the Strimzi install YAML.

ResourceEventSource sometimes throws NPE

Sometimes during startup, I'm seeing ResourceEventSource throwing an NPE. Debugging shows that the injected log is not populated.

2021-12-22 18:39:26,253 ERROR [io.fab.kub.cli.inf.cac.SharedProcessor] (OkHttp https://172.30.0.1/...)  Failed invoking org.bf2.operator.managers.InformerManager$1@1dad01fe event handler: null: java.lang.NullPointerException
	at org.bf2.operator.events.ResourceEventSource.onUpdate(ResourceEventSource.java:32)
	at org.bf2.operator.events.ResourceEventSource.onUpdate(ResourceEventSource.java:16)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorListener$UpdateNotification.handle(ProcessorListener.java:85)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorListener.add(ProcessorListener.java:47)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.lambda$distribute$0(SharedProcessor.java:79)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.lambda$distribute$1(SharedProcessor.java:101)
	at io.fabric8.kubernetes.client.utils.SerialExecutor.lambda$execute$0(SerialExecutor.java:40)
	at io.fabric8.kubernetes.client.utils.SerialExecutor.scheduleNext(SerialExecutor.java:52)
	at io.fabric8.kubernetes.client.utils.SerialExecutor.execute(SerialExecutor.java:46)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.distribute(SharedProcessor.java:98)
	at io.fabric8.kubernetes.client.informers.cache.SharedProcessor.distribute(SharedProcessor.java:79)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorStore.update(ProcessorStore.java:48)
	at io.fabric8.kubernetes.client.informers.cache.ProcessorStore.update(ProcessorStore.java:29)
	at io.fabric8.kubernetes.client.informers.cache.Reflector$ReflectorWatcher.eventReceived(Reflector.java:134)
	at io.fabric8.kubernetes.client.informers.cache.Reflector$ReflectorWatcher.eventReceived(Reflector.java:114)
	at io.fabric8.kubernetes.client.utils.WatcherToggle.eventReceived(WatcherToggle.java:49)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.eventReceived(AbstractWatchManager.java:178)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:234)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:93)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:322)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
	at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

System tests should check for clean logs

I missed #617 because the system test will pass due to other events. It would be great to have a check for clean logs of at least the fleetshard components with each test run - I'd have to look at fabric8's logwatch feature a little more to understand how involved this would be.
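
For reference, here is a rough sketch of how such a check could be wired up with fabric8's log watching; the pod/namespace handling and the ERROR-string heuristic are assumptions, not the project's actual test code.

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CleanLogCheck implements AutoCloseable {

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final LogWatch watch;

    // Stream the given pod's log for the duration of a test run.
    public CleanLogCheck(KubernetesClient client, String namespace, String podName) {
        this.watch = client.pods()
                .inNamespace(namespace)
                .withName(podName)
                .watchLog(buffer);
    }

    // Fail the test if the collected log contains ERROR entries.
    public void assertNoErrors() {
        String log = buffer.toString(StandardCharsets.UTF_8);
        if (log.contains(" ERROR ")) {
            throw new AssertionError("fleetshard log contains ERROR entries:\n" + log);
        }
    }

    @Override
    public void close() {
        watch.close();
    }
}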

Connection latencies recorded by canary surprisingly high

The connection latencies recorded by the canary are surprisingly high. For instance, even for a Kafka instance without load, the connection latency recorded by the canary is regularly >500 ms.

For example the connection latencies:

I1101 18:03:22.956942 1 connection_check.go:157] Connected to broker 0 in 572 ms
I1101 18:03:23.181817 1 connection_check.go:157] Connected to broker 2 in 225 ms
I1101 18:03:23.286645 1 connection_check.go:157] Connected to broker 1 in 105 ms
I1101 18:05:22.580393 1 connection_check.go:157] Connected to broker 0 in 196 ms
I1101 18:05:23.284139 1 connection_check.go:157] Connected to broker 2 in 704 ms
I1101 18:05:23.590685 1 connection_check.go:157] Connected to broker 1 in 306 ms
After analysis, I noticed that the kube metrics show that the canary's container CPU is being throttled. The CPU limit for the canary is too low, and this is affecting the canary's ability to measure latencies accurately. Note that at 18:12 on the CPU throttling graph, I relaxed the canary CPU limit from 10 to 100 millicores.

After this change, the connection latencies look like this:
I1101 18:13:22.598494 1 connection_check.go:157] Connected to broker 0 in 45 ms
I1101 18:13:22.641737 1 connection_check.go:157] Connected to broker 2 in 43 ms
I1101 18:13:22.684031 1 connection_check.go:157] Connected to broker 1 in 42 ms
I1101 18:15:22.594761 1 connection_check.go:157] Connected to broker 0 in 41 ms
I1101 18:15:22.637624 1 connection_check.go:157] Connected to broker 2 in 43 ms
I1101 18:15:22.673900 1 connection_check.go:157] Connected to broker 1 in 36 ms
I1101 18:17:22.589786 1 connection_check.go:157] Connected to broker 0 in 36 ms
I1101 18:17:22.621260 1 connection_check.go:157] Connected to broker 2 in 32 ms
I1101 18:17:22.659876 1 connection_check.go:157] Connected to broker 1 in 38 ms
I1101 18:19:22.598069 1 connection_check.go:157] Connected to broker 0 in 44 ms
I1101 18:19:22.637483 1 connection_check.go:157] Connected to broker 2 in 39 ms
I1101 18:19:22.693163 1 connection_check.go:157] Connected to broker 1 in 56 ms

ManagedKafka status empty when the ManagedKafka resource is in a different namespace than the operator

The scenario: the Strimzi operator runs in its own namespace, cluster-wide, watching all namespaces; the fleetshard operator also runs in its own namespace, cluster-wide, watching all namespaces.
If a ManagedKafka resource is created in the same namespace where the fleetshard operator is running, everything works fine: the Kafka cluster is created and the ManagedKafka.status is updated accordingly.
If a ManagedKafka resource is created in a different namespace from the one where the fleetshard operator is running, the Kafka cluster is created but the ManagedKafka.status is left empty.
It seems that events from the Kafka and Deployment resources are not received correctly by the controller.
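
One plausible cause, sketched below purely as an illustration: if the Deployment/Kafka watches were registered against the operator's own namespace rather than cluster-wide, events from other namespaces would never reach the controller. The class, method, and flag names here are hypothetical, not the project's actual code.

import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class DeploymentWatchRegistration {

    // Hypothetical helper contrasting a namespace-scoped watch with a
    // cluster-wide one; only the latter sees Deployments created for
    // ManagedKafka instances in other namespaces.
    public void register(KubernetesClient client, String operatorNamespace, boolean clusterWide) {
        Watcher<Deployment> watcher = new Watcher<Deployment>() {
            @Override
            public void eventReceived(Action action, Deployment deployment) {
                // forward to the controller's event handling
            }

            @Override
            public void onClose(WatcherException cause) {
                // re-establish the watch if it closed abnormally
            }
        };

        if (clusterWide) {
            client.apps().deployments().inAnyNamespace().watch(watcher);
        } else {
            client.apps().deployments().inNamespace(operatorNamespace).watch(watcher);
        }
    }
}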

Smoke tests routinely fail

We need some rationalization of what the smoke tests are doing as they routinely fail. Options include:

  1. Waiting longer - as long as a ManagedKafka is Ready=False, Installing - then it's fine to keep waiting (for a while). For this though we should move the code that detects errors and unschedulable pods out of perf and into system test - to fail fast if possible. As it already takes at least 12 minutes to complete, this is not ideal.
  2. Pare it down - if we are going to run with every commit, should we just do the sync test and possibly forgo keycloak
  3. Switch to a scheduled job - run nightly (with increased waits and possibly increased resources).

@kornys @ppatierno @rareddy @k-wall WDYT?

log level not reverted if config map removed.

My only concern is that after setting a log level for a specific category, if I delete the ConfigMap to return to default logging, the file is deleted from the volume but the logging levels are not reverted. The only way I found to revert them was restarting the fleetshard pod.

Originally posted by @ppatierno in #312 (comment)

Provide support mechanism to allow the querying of arbitrary kafka jmx beans

Kafka provides a very rich set of MBeans, more than would ever be practical to expose. Kafka ships with kafka.tools.JmxTool, which can be used to query JMX MBeans from the command line. We can enable use of that tool from within the container.

oc rsh <kafka pod>
./bin/kafka-run-class.sh kafka.tools.JmxTool   --object-name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Trying to connect to JMX url: service:jmx:rmi:///jndi/rmi://:9999/jmxrmi.
"time","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:Count","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:EventType","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:FifteenMinuteRate","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:FiveMinuteRate","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:MeanRate","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:OneMinuteRate","kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec:RateUnit"
1642190114369,6,messages,0.006520454346334469,0.01871793656671873,0.10481855913754813,0.07240185377185522,SECONDS
1642190116370,6,messages,0.006520454346334469,0.01871793656671873,0.10127856318362949,0.07240185377185522,SECONDS
1642190118369,6,messages,0.006484330038178644,0.018408556287811687,0.09797323303348683,0.0666129211716044,SECONDS
1642190120369,7,messages,0.006484330038178644,0.018408556287811687,0.1106865377832141,0.0666129211716044,SECONDS
1642190122368,7,messages,0.0075564362632013745,0.02140999885080774,0.10729415911862082,0.07727796314021336,SECONDS

Improve smoketest run times

Currently, for each smoke test there is a full tear-down and build-up of the environment. For smoke tests we should consider an approach that first installs Strimzi, the CRDs, and the fleetshard components, then runs the tests against that setup, where the tear-down only deletes the namespace created for the ManagedKafka resource and restarts the relevant pods.

@kornys would you be open to such a change?

ManagedKafka builder failed when .build() method is called

I have tried to use the builder for the ManagedKafka CR and I'm unable to build the CR.

Example

ManagedKafka mk = new ManagedKafkaBuilder()
                .withMetadata(
                        new ObjectMetaBuilder()
                                .withNamespace("test")
                                .withName("my-managed-kafka")
                                .build())
                .withSpec(
                        new ManagedKafkaSpecBuilder()
                                .withNewVersions()
                                .withKafka("2.6.0")
                                .endVersions()
                                .build())
                .build();

Exception

java.lang.IllegalArgumentException: org.bf2.operator.resources.v1alpha1.EditableManagedKafka CustomResource must provide an API version using @io.fabric8.kubernetes.model.annotation.Group and @io.fabric8.kubernetes.model.annotation.Version annotations
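
The message indicates that the CustomResource type being instantiated (the generated EditableManagedKafka subclass) does not carry the group/version annotations fabric8 uses to derive the apiVersion. For reference only, here is a minimal sketch of the annotations the resolver looks for; the spec/status type names stand in for the project's real classes.

import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.Version;

// Illustrative shape only: fabric8 reads @Group and @Version from the concrete
// CustomResource class to populate apiVersion when the object is built.
@Group("managedkafka.bf2.org")
@Version("v1alpha1")
public class ManagedKafka extends CustomResource<ManagedKafkaSpec, ManagedKafkaStatus> implements Namespaced {
}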

Operator cannot handle recreating managedkafka resource

I created a test scenario where:

  1. create managed kafka
  2. wait until deployed
  3. delete managed kafka
  4. immediately create same resource again

The ManagedKafka resource is created but no Kafka cluster appears.

Operator exception:

12:49:08.130 [EventHandler-managedkafkacontroller] ERROR io.javaoperatorsdk.operator.processing.EventDispatcher - Error during event processing ExecutionScope{events=[{ class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=02b9f758-10f4-4ea0-a87b-6f452e8d2729, eventSource=org.bf2.operator.events.DeploymentEventSource@6b6fc06c }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=02b9f758-10f4-4ea0-a87b-6f452e8d2729, eventSource=org.bf2.operator.events.DeploymentEventSource@6b6fc06c }, { class=io.javaoperatorsdk.operator.processing.event.internal.TimerEvent, relatedCustomResourceUid=02b9f758-10f4-4ea0-a87b-6f452e8d2729, eventSource=io.javaoperatorsdk.operator.processing.event.internal.TimerEventSource@4de7d819 }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=02b9f758-10f4-4ea0-a87b-6f452e8d2729, eventSource=org.bf2.operator.events.DeploymentEventSource@6b6fc06c }, { class=org.bf2.operator.events.DeploymentEvent, relatedCustomResourceUid=02...
java.lang.NullPointerException: null
	at org.bf2.operator.operands.KafkaCluster.isInstalling(KafkaCluster.java:123) ~[classes/:?]
	at org.bf2.operator.operands.KafkaCluster_ClientProxy.isInstalling(KafkaCluster_ClientProxy.zig:346) ~[classes/:?]
	at org.bf2.operator.operands.KafkaInstance.isInstalling(KafkaInstance.java:40) ~[classes/:?]
	at org.bf2.operator.operands.KafkaInstance_ClientProxy.isInstalling(KafkaInstance_ClientProxy.zig:188) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController.updateManagedKafkaStatus(ManagedKafkaController.java:128) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController.createOrUpdateResource(ManagedKafkaController.java:97) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController.createOrUpdateResource(ManagedKafkaController.java:29) ~[classes/:?]
	at org.bf2.operator.controllers.ManagedKafkaController_ClientProxy.createOrUpdateResource(ManagedKafkaController_ClientProxy.zig:191) ~[classes/:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleCreateOrUpdate(EventDispatcher.java:100) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleDispatch(EventDispatcher.java:79) ~[io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.EventDispatcher.handleExecution(EventDispatcher.java:46) [io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at io.javaoperatorsdk.operator.processing.ExecutionConsumer.run(ExecutionConsumer.java:25) [io.javaoperatorsdk.operator-framework-core-1.7.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

fleetshard-operator reports ERRORs the object has been modified as kubernetes rolls deployment

If the fleetshard-operator is updated when there's an existing population of ManagedKafkas, the replacement pod seems to fight with the older pod, and a swathe of "the object has been modified" errors are produced at ERROR level. These errors seem to stop once the older pod is gone.

2021-04-23 13:50:12,658 ERROR [io.jav.ope.pro.EventDispatcher] (EventHandler-managedkafkacontrollerwrapper) Error during event processing ExecutionScope{events=[{ Deployment kwall-mk-bin-packing-1rupk1f1swsreuz6whcvg08uqxo/neds-atomic-penguin-admin-server relatedCustomResourceUid=dda1d9f5-8d34-4b8c-b012-a5276bab35a2 ,resourceVersion=127684012 }, { Deployment kwall-mk-bin-packing-1rupk1f1swsreuz6whcvg08uqxo/neds-atomic-penguin-canary relatedCustomResourceUid=dda1d9f5-8d34-4b8c-b012-a5276bab35a2 ,resourceVersion=129116842 }, { Kafka kwall-mk-bin-packing-1rupk1f1swsreuz6whcvg08uqxo/neds-atomic-penguin relatedCustomResourceUid=dda1d9f5-8d34-4b8c-b012-a5276bab35a2 ,resourceVersion=127361629 }, { Kafka kwall-mk-bin-packing-1rupk1f1swsreuz6whcvg08uqxo/neds-atomic-penguin relatedCustomResourceUid=dda1d9f5-8d34-4b8c-b012-a5276bab35a2 ,resourceVersion=129917957 }, { Kafka kwall-mk-bin-packing-1rupk1f1swsreuz6whcvg08uqxo/neds-atomic-penguin relatedCustomResourceUid=dda1d9f5-8d34-4b8c-b012-a5276bab35a2 ,resourceVersion=129917961 }], customResource uid: dda1d9f5-8d34-4b8c-b012-a5276bab35a2, version: 129917969} failed.: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://172.30.0.1/apis/managedkafka.bf2.org/v1alpha1/namespaces/kwall-mk-bin-packing-1rupk1f1swsreuz6whcvg08uqxo/managedkafkas/neds-atomic-penguin/status. Message: Operation cannot be fulfilled on managedkafkas.managedkafka.bf2.org "neds-atomic-penguin": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=managedkafka.bf2.org, kind=managedkafkas, name=neds-atomic-penguin, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on managedkafkas.managedkafka.bf2.org "neds-atomic-penguin": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).

operator does not recreate resources when someone deletes them

I created a test scenario:

  1. Create a ManagedKafka resource
  2. Wait until it is ready
  3. Delete the pods, statefulsets, deployments, and services in the ManagedKafka instance namespace
  4. Wait until the resources are created by the operator again

The Strimzi operator creates the Kafka again, but the kas-fleetshard-operator does not recreate the deployments and services, so the status of the ManagedKafka CR remains Installing.

There is no error in the operator log.

Mess in logs of systemtests

After PR #420, the system test logs contain many useless lines while waiting, which only clutter the logs...

2021-07-19 07:30:00 INFO  [main] [TestUtils:121] =======================================================================
2021-07-19 07:30:00 INFO  [main] [TestUtils:122] -> Running test method: testCreateManagedKafka
2021-07-19 07:30:00 INFO  [main] [TestUtils:123] =======================================================================
2021-07-19 07:44:49 INFO  [main] [SyncApiClient:23] Create managed kafka mk-test-create
2021-07-19 07:44:49 INFO  [main] [SyncApiClient:25] Sending POST request to kas-fleetshard-sync-external-kas-fleetshard.apps.korny-cluster.enmasse.app-services-dev.net with port 80 and path /api/kafkas_mgmt/v1/agent-clusters/pepa/kafkas/
2021-07-19 07:44:50 INFO  [main] [TestUtils:65] Waiting for mk-test-create
2021-07-19 07:44:50 WARN  [main] [VersionUsageUtils:60] The client is using resource type 'managedkafkas' with unstable version 'v1alpha1'
2021-07-19 07:44:54 INFO  [main] [TestUtils:65] Waiting for mk-test-create
2021-07-19 07:44:56 INFO  [pool-2-thread-3] [ManagedKafkaResourceType:50] ManagedKafka mk-test-create in error state
2021-07-19 07:44:57 INFO  [pool-2-thread-3] [ManagedKafkaResourceType:50] ManagedKafka mk-test-create in error state
2021-07-19 07:44:58 INFO  [pool-2-thread-3] [ManagedKafkaResourceType:50] ManagedKafka mk-test-create in error state
... (the same "ManagedKafka mk-test-create in error state" line repeats roughly every second) ...
2021-07-19 07:45:56 INFO  [pool-2-thread-3] [ManagedKafkaResourceType:50] ManagedKafka mk-test-create in error state
2021-07-19 07:46:00 INFO  [main] [SmokeST:80] ManagedKafka mk-test-create created

Event handling of modifications

The current logic only looks at the owner references of the freshly modified object. There is a narrow chance that it is the owner references themselves that were modified, so we should also look at the old state.
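
A sketch of the idea follows; the method shape loosely follows fabric8's ResourceEventHandler, but this is illustrative rather than the project's actual event source, and the dispatch call is elided.

import io.fabric8.kubernetes.api.model.HasMetadata;
import io.fabric8.kubernetes.api.model.OwnerReference;

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ModificationEventRouting {

    // On an update, collect owner references from both the old and the new
    // state, so the event still reaches the right custom resource even when
    // the owner references themselves were what changed.
    public <T extends HasMetadata> void onUpdate(T oldResource, T newResource) {
        Set<String> ownerUids = new LinkedHashSet<>();
        addOwnerUids(oldResource.getMetadata().getOwnerReferences(), ownerUids);
        addOwnerUids(newResource.getMetadata().getOwnerReferences(), ownerUids);
        for (String uid : ownerUids) {
            // dispatch an event for each distinct owning resource uid
        }
    }

    private void addOwnerUids(List<OwnerReference> owners, Set<String> ownerUids) {
        if (owners == null) {
            return;
        }
        for (OwnerReference owner : owners) {
            ownerUids.add(owner.getUid());
        }
    }
}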

Possible performance regression

There appears to be a performance regression in the throughput results in the low latency profile with no batching.

Running the instance profiler against 0.15 on 4.8 and earlier produces numbers similar to:

throughputResults:
  LATENCY_NO_BATCHING:
    averageMaxProducerMBs: 117.40602992350006
    averageMaxConsumerMBs: 117.40384791365771
    medianMaxProducerMBs: 117.31501737563312
    medianMaxConsumerMBs: 117.28390907546762

Running latest on 4.8 or 4.9 is capable of only about half that:

  LATENCY_NO_BATCHING:
    averageMaxProducerMBs: 59.26162430892092
    averageMaxConsumerMBs: 59.26106964325792
    medianMaxProducerMBs: 59.08476802526846
    medianMaxConsumerMBs: 59.09795303385282

Similarly latencies seem worse, but batching throughput seems the same.

This was evident after looking at https://issues.redhat.com/browse/MGDSTRM-6933 - my initial assumption was that the lower throughput was due to fewer resources, but now it appears to be due to other changes as well.

Generated CSVs from PoC in Quarkus extension

I've been working on a PoC to generate CSVs from the code. The code can be found at quarkiverse/quarkus-operator-sdk#116 and relies on a SNAPSHOT build of Quarkus since we need the feature that got merged today from quarkusio/quarkus#20113.
CSVs are currently generated for each controller:

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: managedkafkaagentcontroller
spec:
  customresourcedefinitions:
    owned:
    - kind: ManagedKafkaAgent
      name: managedkafkaagents.managedkafka.bf2.org
      version: v1alpha1
  install:
    spec:
      clusterPermissions:
      - rules:
        - apiGroups:
          - managedkafka.bf2.org
          resources:
          - managedkafkas
          - managedkafkas/status
          - managedkafkaagents
          - managedkafkaagents/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - kafka.strimzi.io
          resources:
          - kafkas
          - kafkas/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - apps
          - extensions
          resources:
          - deployments
          - deployments/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - apps
          resources:
          - replicasets
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - ""
          resources:
          - services
          - configmaps
          - secrets
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - route.openshift.io
          resources:
          - routes
          - routes/custom-host
          - routes/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - admissionregistration.k8s.io
          resources:
          - validatingwebhookconfigurations
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - operators.coreos.com
          resources:
          - subscriptions
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - packages.operators.coreos.com
          resources:
          - packagemanifests
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - operators.coreos.com
          resources:
          - installplans
          verbs:
          - get
          - list
          - watch
          - patch
          - update
        - apiGroups:
          - ""
          resources:
          - pods
          - nodes
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - operator.openshift.io
          resources:
          - ingresscontrollers
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        serviceAccountName: kas-fleetshard-operator
      deployments:
      - name: kas-fleetshard-operator
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: kas-fleetshard-operator
              app.kubernetes.io/version: 999-SNAPSHOT
              app.kubernetes.io/name: kas-fleetshard-operator
          template:
            metadata:
              annotations:
                app.quarkus.io/commit-id: 7046be38820fd54c539f32de1627ddb6174e04d5
                app.quarkus.io/build-timestamp: 2021-09-15 - 10:41:54 +0000
                prometheus.io/scrape: "true"
                prometheus.io/path: /q/metrics
                prometheus.io/port: "8080"
                prometheus.io/scheme: http
              labels:
                app: kas-fleetshard-operator
                app.kubernetes.io/version: 999-SNAPSHOT
                app.kubernetes.io/name: kas-fleetshard-operator
              name: kas-fleetshard-operator
            spec:
              containers:
              - env:
                - name: QUARKUS_PROFILE
                  value: prod
                - name: KUBERNETES_NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
                image: claprun/kas-fleetshard-operator:999-SNAPSHOT
                imagePullPolicy: Always
                livenessProbe:
                  failureThreshold: 3
                  httpGet:
                    path: /q/health/live
                    port: 8080
                    scheme: HTTP
                  initialDelaySeconds: 0
                  periodSeconds: 30
                  successThreshold: 1
                  timeoutSeconds: 10
                name: kas-fleetshard-operator
                ports:
                - containerPort: 8080
                  name: http
                  protocol: TCP
                readinessProbe:
                  failureThreshold: 3
                  httpGet:
                    path: /q/health/ready
                    port: 8080
                    scheme: HTTP
                  initialDelaySeconds: 0
                  periodSeconds: 30
                  successThreshold: 1
                  timeoutSeconds: 10
                resources:
                  limits:
                    cpu: 1500m
                    memory: 1Gi
                  requests:
                    cpu: 500m
                    memory: 512Mi
                volumeMounts:
                - mountPath: /config
                  name: logging-config-volume
                  readOnly: false
                  subPath: ""
              serviceAccount: kas-fleetshard-operator
              serviceAccountName: kas-fleetshard-operator
              volumes:
              - configMap:
                  defaultMode: 384
                  name: operator-logging-config-override
                  optional: true
                name: logging-config-volume

and

apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: managedkafkacontroller
spec:
  customresourcedefinitions:
    owned:
    - kind: ManagedKafka
      name: managedkafkas.managedkafka.bf2.org
      version: v1alpha1
  install:
    spec:
      clusterPermissions:
      - rules:
        - apiGroups:
          - managedkafka.bf2.org
          resources:
          - managedkafkas
          - managedkafkas/status
          - managedkafkaagents
          - managedkafkaagents/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - kafka.strimzi.io
          resources:
          - kafkas
          - kafkas/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - apps
          - extensions
          resources:
          - deployments
          - deployments/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - apps
          resources:
          - replicasets
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - ""
          resources:
          - services
          - configmaps
          - secrets
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - route.openshift.io
          resources:
          - routes
          - routes/custom-host
          - routes/status
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        - apiGroups:
          - admissionregistration.k8s.io
          resources:
          - validatingwebhookconfigurations
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - operators.coreos.com
          resources:
          - subscriptions
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - packages.operators.coreos.com
          resources:
          - packagemanifests
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - operators.coreos.com
          resources:
          - installplans
          verbs:
          - get
          - list
          - watch
          - patch
          - update
        - apiGroups:
          - ""
          resources:
          - pods
          - nodes
          verbs:
          - get
          - list
          - watch
        - apiGroups:
          - operator.openshift.io
          resources:
          - ingresscontrollers
          verbs:
          - get
          - list
          - watch
          - create
          - delete
          - patch
          - update
        serviceAccountName: kas-fleetshard-operator
      deployments:
      - name: kas-fleetshard-operator
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: kas-fleetshard-operator
              app.kubernetes.io/version: 999-SNAPSHOT
              app.kubernetes.io/name: kas-fleetshard-operator
          template:
            metadata:
              annotations:
                app.quarkus.io/commit-id: 7046be38820fd54c539f32de1627ddb6174e04d5
                app.quarkus.io/build-timestamp: 2021-09-15 - 10:41:54 +0000
                prometheus.io/scrape: "true"
                prometheus.io/path: /q/metrics
                prometheus.io/port: "8080"
                prometheus.io/scheme: http
              labels:
                app: kas-fleetshard-operator
                app.kubernetes.io/version: 999-SNAPSHOT
                app.kubernetes.io/name: kas-fleetshard-operator
              name: kas-fleetshard-operator
            spec:
              containers:
              - env:
                - name: QUARKUS_PROFILE
                  value: prod
                - name: KUBERNETES_NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
                image: claprun/kas-fleetshard-operator:999-SNAPSHOT
                imagePullPolicy: Always
                livenessProbe:
                  failureThreshold: 3
                  httpGet:
                    path: /q/health/live
                    port: 8080
                    scheme: HTTP
                  initialDelaySeconds: 0
                  periodSeconds: 30
                  successThreshold: 1
                  timeoutSeconds: 10
                name: kas-fleetshard-operator
                ports:
                - containerPort: 8080
                  name: http
                  protocol: TCP
                readinessProbe:
                  failureThreshold: 3
                  httpGet:
                    path: /q/health/ready
                    port: 8080
                    scheme: HTTP
                  initialDelaySeconds: 0
                  periodSeconds: 30
                  successThreshold: 1
                  timeoutSeconds: 10
                resources:
                  limits:
                    cpu: 1500m
                    memory: 1Gi
                  requests:
                    cpu: 500m
                    memory: 512Mi
                volumeMounts:
                - mountPath: /config
                  name: logging-config-volume
                  readOnly: false
                  subPath: ""
              serviceAccount: kas-fleetshard-operator
              serviceAccountName: kas-fleetshard-operator
              volumes:
              - configMap:
                  defaultMode: 384
                  name: operator-logging-config-override
                  optional: true
                name: logging-config-volume

Let me know what you think.
[Edit: updated the generated CSV after fixing an issue in the generator that wasn't properly handling defined roles]
/cc @shawkins

fleetshard-sync pod error after successful kas-installer installation

After successfully running the kas-installer script, I got this error on the fleetshard-sync pod:

 --/ __ \/ / / / _ | / _ \/ //_/ / / / __/
 -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/
2022-02-23 11:52:40,837 WARN  [io.fab.kub.cli.int.VersionUsageUtils] (main)  The client is using resource type 'managedkafkas' with unstable version 'v1alpha1'
2022-02-23 11:52:41,451 WARN  [io.fab.kub.cli.int.VersionUsageUtils] (main)  The client is using resource type 'managedkafkaagents' with unstable version 'v1alpha1'
2022-02-23 11:52:42,140 INFO  [io.quarkus] (main)  kas-fleetshard-sync 0.19.0.managedsvc-redhat-00001 on JVM (powered by Quarkus 2.2.2.Final) started in 10.997s. Listening on: http://0.0.0.0:8080
2022-02-23 11:52:42,140 INFO  [io.quarkus] (main)  Profile prod activated.
2022-02-23 11:52:42,141 INFO  [io.quarkus] (main)  Installed features: [cdi, kubernetes, kubernetes-client, micrometer, oidc-client, oidc-client-filter, rest-client, resteasy, resteasy-jackson, scheduler, smallrye-context-propagation, smallrye-health]
2022-02-23 11:52:42,143 INFO  [org.bf2.syn.KasFleetShardSync] (main)  Managed Kafka sync
2022-02-23 11:52:43,054 INFO  [org.bf2.syn.ManagedKafkaAgentSync] (executor-thread-3)  ManagedKafkaAgent CR created
2022-02-23 11:52:43,148 ERROR [org.bf2.syn.ExecutorServiceProvider] (pool-11-thread-3)  Uncaught exception running task: javax.ws.rs.WebApplicationException: Unknown error, status code 400
	at org.jboss.resteasy.microprofile.client.DefaultResponseExceptionMapper.toThrowable(DefaultResponseExceptionMapper.java:21)
	at org.jboss.resteasy.microprofile.client.ExceptionMapping$HandlerException.mapException(ExceptionMapping.java:41)
	at org.jboss.resteasy.microprofile.client.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:153)
	at com.sun.proxy.$Proxy43.updateStatus(Unknown Source)
	at org.bf2.sync.controlplane.ControlPlaneRestClient_97de7d79203bfddd2e99cf0e008790c55d223eed_Synthetic_ClientProxy.updateStatus(ControlPlaneRestClient_97de7d79203bfddd2e99cf0e008790c55d223eed_Synthetic_ClientProxy.zig:154)
	at org.bf2.sync.controlplane.ControlPlane.lambda$updateAgentStatus$0(ControlPlane.java:114)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Kafka logging pattern includes slow source code line number introspection.

The Log4j 1 documentation warns about the %L pattern:

Generating caller location information is extremely slow and should be avoided unless execution speed is not an issue.

We use this pattern in our Kafka logging, and we rely on Kafka logging for the audit trail, which is very chatty.

log4j.appender.CONSOLE.layout.ConversionPattern=%d{yyyy-MM-dd'T'HH:mm:ss'Z'}{GMT} %-5p [%t] [%c{1}:%L] %m%n

I don't think we can justify keeping the %L pattern as our default.
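For a rough sense of the cost, here is a minimal, illustrative timing sketch (not a proper JMH benchmark): resolving caller location forces each log call to capture and walk a stack trace, which is what the first loop below simulates. The class name and loop structure are made up purely for illustration.

// Rough, illustrative timing of why resolving caller location (%L / %C / %M) is costly:
// each log call has to capture and walk a stack trace.
public class CallerLocationCost {

    public static void main(String[] args) {
        final int iterations = 100_000;

        long start = System.nanoTime();
        int framesSeen = 0;
        for (int i = 0; i < iterations; i++) {
            // Roughly what a pattern layout does to resolve the caller location.
            StackTraceElement[] frames = new Throwable().getStackTrace();
            framesSeen += frames.length;
        }
        long withLocation = System.nanoTime() - start;

        start = System.nanoTime();
        long formatted = 0;
        for (int i = 0; i < iterations; i++) {
            // Stand-in for formatting a message without any location lookup.
            formatted += ("message " + i).length();
        }
        long withoutLocation = System.nanoTime() - start;

        System.out.printf("with location lookup:    %d ms (frames seen: %d)%n",
                withLocation / 1_000_000, framesSeen);
        System.out.printf("without location lookup: %d ms (chars: %d)%n",
                withoutLocation / 1_000_000, formatted);
    }
}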

NPE raised when not all informers are created on startup

When the fleetshard operator starts, it raises the following exception:

[id=1_org.bf2.operator.controllers.ManagedKafkaAgentController_ScheduledInvoker_statusUpdateLoop_2aab5e1f9f4824ecbb2c90be2ebfef1020f773cc, interval=60000]: java.lang.NullPointerException
        at org.bf2.operator.InformerManager.isReady(InformerManager.java:126)
        at org.bf2.operator.InformerManager_ClientProxy.isReady(InformerManager_ClientProxy.zig:385)
        at org.bf2.operator.controllers.ManagedKafkaAgentController.statusUpdateLoop(ManagedKafkaAgentController.java:87)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_Subclass.statusUpdateLoop$$superaccessor2(ManagedKafkaAgentController_Subclass.zig:432)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_Subclass$$function$$2.apply(ManagedKafkaAgentController_Subclass$$function$$2.zig:29)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54)
        at io.quarkus.micrometer.runtime.MicrometerCountedInterceptor.countedMethod(MicrometerCountedInterceptor.java:76)
        at io.quarkus.micrometer.runtime.MicrometerCountedInterceptor_Bean.intercept(MicrometerCountedInterceptor_Bean.zig:303)
        at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:50)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor.processWithTimer(MicrometerTimedInterceptor.java:80)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor.time(MicrometerTimedInterceptor.java:57)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor.timedMethod(MicrometerTimedInterceptor.java:48)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor_Bean.intercept(MicrometerTimedInterceptor_Bean.zig:303)
        at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41)
        at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_Subclass.statusUpdateLoop(ManagedKafkaAgentController_Subclass.zig:390)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_ClientProxy.statusUpdateLoop(ManagedKafkaAgentController_ClientProxy.zig:186)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_ScheduledInvoker_statusUpdateLoop_2aab5e1f9f4824ecbb2c90be2ebfef1020f773cc.invokeBean(ManagedKafkaAgentController_ScheduledInvoker_statusUpdateLoop_2aab5e1f9f4824ecbb2c90be2ebfef1020f773cc.zig:45)
        at io.quarkus.arc.runtime.BeanInvoker.invoke(BeanInvoker.java:20)
        at io.quarkus.scheduler.runtime.SimpleScheduler$ScheduledTask$1.run(SimpleScheduler.java:227)
        at io.quarkus.runtime.CleanableExecutor$CleaningRunnable.run(CleanableExecutor.java:231)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2415)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1452)
        at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
        at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
        at java.base/java.lang.Thread.run(Thread.java:829)
        at org.jboss.threads.JBossThread.run(JBossThread.java:501)

The problem seems to be that the scheduled statusUpdateLoop runs too early (after 5 seconds), before the InformerManager has finished starting up, so one or more of its informer fields have not been created yet and are still null.

2021-05-10 17:13:13,609 WARN  [io.fab.kub.cli.int.VersionUsageUtils] (Quarkus Main Thread) The client is using resource type 'kafkas' with unstable version 'v1beta2'
2021-05-10 17:13:13,864 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Got a fresh list KafkaList version 88480
2021-05-10 17:13:13,865 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Starting watch at version 88480
2021-05-10 17:13:14,717 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Got a fresh list DeploymentList version 88483
2021-05-10 17:13:14,717 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Starting watch at version 88483
2021-05-10 17:13:15,466 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Got a fresh list ServiceList version 88500
2021-05-10 17:13:15,467 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Starting watch at version 88500
2021-05-10 17:13:16,234 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Got a fresh list ConfigMapList version 88501
2021-05-10 17:13:16,234 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Starting watch at version 88501
2021-05-10 17:13:17,043 ERROR [io.qua.sch.run.SimpleScheduler] (executor-thread-1) Error occured while executing task for trigger IntervalTrigger [id=1_org.bf2.operator.controllers.ManagedKafkaAgentController_ScheduledInvoker_statusUpdateLoop_2aab5e1f9f4824ecbb2c90be2ebfef1020f773cc, interval=60000]: java.lang.NullPointerException
        at org.bf2.operator.InformerManager.isReady(InformerManager.java:126)
        at org.bf2.operator.InformerManager_ClientProxy.isReady(InformerManager_ClientProxy.zig:385)
        at org.bf2.operator.controllers.ManagedKafkaAgentController.statusUpdateLoop(ManagedKafkaAgentController.java:87)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_Subclass.statusUpdateLoop$$superaccessor2(ManagedKafkaAgentController_Subclass.zig:432)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_Subclass$$function$$2.apply(ManagedKafkaAgentController_Subclass$$function$$2.zig:29)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54)
        at io.quarkus.micrometer.runtime.MicrometerCountedInterceptor.countedMethod(MicrometerCountedInterceptor.java:76)
        at io.quarkus.micrometer.runtime.MicrometerCountedInterceptor_Bean.intercept(MicrometerCountedInterceptor_Bean.zig:303)
        at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:50)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor.processWithTimer(MicrometerTimedInterceptor.java:80)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor.time(MicrometerTimedInterceptor.java:57)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor.timedMethod(MicrometerTimedInterceptor.java:48)
        at io.quarkus.micrometer.runtime.MicrometerTimedInterceptor_Bean.intercept(MicrometerTimedInterceptor_Bean.zig:303)
        at io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:41)
        at io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:41)
        at io.quarkus.arc.impl.InvocationContexts.performAroundInvoke(InvocationContexts.java:32)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_Subclass.statusUpdateLoop(ManagedKafkaAgentController_Subclass.zig:390)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_ClientProxy.statusUpdateLoop(ManagedKafkaAgentController_ClientProxy.zig:186)
        at org.bf2.operator.controllers.ManagedKafkaAgentController_ScheduledInvoker_statusUpdateLoop_2aab5e1f9f4824ecbb2c90be2ebfef1020f773cc.invokeBean(ManagedKafkaAgentController_ScheduledInvoker_statusUpdateLoop_2aab5e1f9f4824ecbb2c90be2ebfef1020f773cc.zig:45)
        at io.quarkus.arc.runtime.BeanInvoker.invoke(BeanInvoker.java:20)
        at io.quarkus.scheduler.runtime.SimpleScheduler$ScheduledTask$1.run(SimpleScheduler.java:227)
        at io.quarkus.runtime.CleanableExecutor$CleaningRunnable.run(CleanableExecutor.java:231)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2415)
        at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1452)
        at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
        at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
        at java.base/java.lang.Thread.run(Thread.java:829)
        at org.jboss.threads.JBossThread.run(JBossThread.java:501)

2021-05-10 17:13:17,092 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Got a fresh list SecretList version 88505
2021-05-10 17:13:17,092 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Starting watch at version 88505
2021-05-10 17:13:18,453 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Got a fresh list RouteList version 88511
2021-05-10 17:13:18,454 DEBUG [org.bf2.com.ResourceInformer] (Quarkus Main Thread) Starting watch at version 88511

Of course, increasing the delay on the scheduled method works, but it's not ideal.
Maybe isReady should check that the informers are not null and return false otherwise; a sketch of that guard follows.
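A minimal, self-contained sketch of that kind of guard, assuming stand-in informer fields and readiness checks (the field names and per-informer API here are illustrative, not the actual InformerManager code):

import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

// Sketch of the suggested fix: a null informer is treated as "not ready"
// instead of triggering a NullPointerException during startup.
public class NullSafeReadiness {

    // Stand-ins for the informer fields that may still be null while startup is in progress.
    private BooleanSupplier kafkaInformer;
    private BooleanSupplier deploymentInformer;
    private BooleanSupplier serviceInformer;

    public boolean isReady() {
        List<BooleanSupplier> informers =
                Arrays.asList(kafkaInformer, deploymentInformer, serviceInformer);
        // Not yet created counts the same as not yet synced.
        return informers.stream()
                .allMatch(informer -> informer != null && informer.getAsBoolean());
    }

    public static void main(String[] args) {
        NullSafeReadiness manager = new NullSafeReadiness();
        // During startup nothing has been created yet: isReady() now returns false instead of throwing.
        System.out.println("ready at startup? " + manager.isReady());

        manager.kafkaInformer = () -> true;
        manager.deploymentInformer = () -> true;
        manager.serviceInformer = () -> true;
        System.out.println("ready after startup? " + manager.isReady());
    }
}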

Upgrade to v1 crd

It appears that there are no runtime side effects from switching, as the CRD structure and version stay the same.

Perf defaults to 2.7.0 kafka

Rather than a hard-coded default, when no version is set we should use version introspection logic similar to Strimzi's to determine the latest Kafka version to use, based upon what's available via the bundle.
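For illustration, a minimal sketch of that kind of selection, assuming the bundle's supported versions are available as plain strings (the class, method, and input list are placeholders, not the actual bundle metadata API):

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative only: pick the newest Kafka version from whatever the bundle reports
// as supported, instead of hard-coding 2.7.0.
public class DefaultKafkaVersion {

    // Numeric, element-wise comparison of dotted versions, e.g. "2.10.0" > "2.9.1".
    private static final Comparator<String> VERSION_ORDER = (a, b) -> {
        int[] left = parse(a);
        int[] right = parse(b);
        for (int i = 0; i < Math.max(left.length, right.length); i++) {
            int l = i < left.length ? left[i] : 0;
            int r = i < right.length ? right[i] : 0;
            if (l != r) {
                return Integer.compare(l, r);
            }
        }
        return 0;
    };

    private static int[] parse(String version) {
        return Arrays.stream(version.split("\\.")).mapToInt(Integer::parseInt).toArray();
    }

    // Pick the newest version from the supported set; empty if the bundle reports nothing.
    static Optional<String> latest(List<String> supportedVersions) {
        return supportedVersions.stream().max(VERSION_ORDER);
    }

    public static void main(String[] args) {
        // Placeholder for the versions advertised by the bundle.
        List<String> fromBundle = List.of("2.6.0", "2.7.0", "2.8.1");
        System.out.println(latest(fromBundle).orElse("2.7.0")); // prints 2.8.1
    }
}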
