openshift / cluster-image-registry-operator Goto Github PK

View Code? Open in Web Editor NEW

58.0 58.0 113.0 162.67 MB

The image registry operator installs+maintains the internal registry on a cluster

License: Apache License 2.0

Go 99.26% Dockerfile 0.12% Shell 0.45% Makefile 0.17%

cluster-image-registry-operator's People

Contributors

Stargazers

Watchers

Forkers

legionus dmage bparees abhinavdahiya oip-rnd wking smarterclayton staebler suicidesin jhadvig runcom derekwaynecarr mrunalp adambkaplan dgoodwin flaper87 isgasho florinpeter sanchezl gabemontero joelddiaz ravisantoshgudimetla sferich888 sjenning openshift-cherrypick-robot serbrech rphillips russellb ingvagabund dulek huffmanca mandre wzheng1 freego-six boranx luis-falcon azopat xiuwang marun axfxy2010 gyohuangxin rgolangh ederst gal-zaidman openshift-bot deads2k jupierce guillaumerose csrwng cfergeau osherdp global-localhost global19 global19-atlassian-net honza akhil-rane xuanyunhui lack isabella232 bverschueren vrutkovs kwoodson hasueki openshift-ci-robot m3ntalsp00n tonyleedocker bmeng s-urbaniak menglingwei mtrmac flavianmissi mtulio bharath-b-rh josefkarasek manny27nyc eggfoobar nutanix-cloud-native lm0943111262 qjkee prb112 maysamacedo cjschaef pmtk neisw carlosrobertodevops xueqzhan abutcher joshualucas84 trilokgeer shellyyang1989 r4f4 jkyros shiftstack chiragkyal hamzy mjturek dharaneeshvrd dlom jharrington22 cdoern

cluster-image-registry-operator's Issues

s3 bucket management

Follow up to #52 to resolve the items described in:
#52 (comment)

You need to ensure that on re-sync, we are making sure the bucket exists. That means a few things:

Adding a condition on the registry resource that indicates when we have successfully created the bucket (we should not be deploying the registry until we are sure we created the bucket)
a mechanism for ensuring that if the S3 bucket name changed, the condition is cleared until we create the new bucket.

Can't update empty volumeSource to use emptyDir

The operator doesn't allow to change configuration from

  storage:
    filesystem: {}

  storage:
    filesystem:
      volumeSource:
        emptyDir: {}

because storage type change is not supported: expected storage type filesystem, but got emptydir.

/kind bug

Fatal "unknown storage backend" when the backend is empty

@RobertKrawitz reports this operator crashing on both libvirt and AWS with:

time=""2018-11-05T19:42:08Z"" level=info msg=""Cluster Image Registry Operator Version: 601f5c3-dirty""
time=""2018-11-05T19:42:08Z"" level=info msg=""Go Version: go1.10.3""
time=""2018-11-05T19:42:08Z"" level=info msg=""Go OS/Arch: linux/amd64""
time=""2018-11-05T19:42:08Z"" level=info msg=""operator-sdk Version: 0.0.6+git""
time=""2018-11-05T19:42:08Z"" level=info msg=""Metrics service cluster-image-registry-operator created""
time=""2018-11-05T19:42:08Z"" level=info msg=""generating registry custom resource""
time=""2018-11-05T19:42:08Z"" level=fatal msg=""unknown storage backend: """

From the first line of the logs, you can see that's with 601f5c3 from this repository. I'll file a pull request at least expanding the logs for the missing config-map case.

don't use GC for PVC cleanup

I'm removing most of our GC usage here:
#215

but I don't want to get into the PVC logic. Right now it appears to be using GC to ensure the PVC gets removed when the CR is removed. That won't work because the CR is cluster-scoped and GC can't span namespaces. The PVC needs to be cleaned up via finalization.

move registry custom resource to be cluster scoped

Since we only want one registry per cluster anyway (can't actually tolerate more than one) and because finalizers deadlock themselves when the finalizer and the resource being finalized run in the same namespace, we need to switch the registry resource CRD to be cluster scoped instead of namespace scoped.

/cc @deads2k

/assign @legionus

encryption being applied to user provided buckets

based on this:

cluster-image-registry-operator/pkg/storage/s3/s3.go

Lines 365 to 377 in 7f6ffaf

    
           // Enable default encryption on the bucket 
        
           _, err = svc.PutBucketEncryption(&s3.PutBucketEncryptionInput{ 
        
           	Bucket: aws.String(d.Config.Bucket), 
        
           	ServerSideEncryptionConfiguration: &s3.ServerSideEncryptionConfiguration{ 
        
           		Rules: []*s3.ServerSideEncryptionRule{ 
        
           			{ 
        
           				ApplyServerSideEncryptionByDefault: &s3.ServerSideEncryptionByDefault{ 
        
           					SSEAlgorithm: aws.String(s3.ServerSideEncryptionAes256), 
        
           				}, 
        
           			}, 
        
           		}, 
        
           	}, 
        
           })

it looks like we apply our own encryption settings to the s3 bucket even if the bucket was supplied by the user, even if the bucket already existed. This means a user can't provide their own s3 bucket w/ their own encryption settings.

Move from logrus to klog

Move from the logrus and the old github.com/golang/glog to k8s.io/klog

progressing status is wrong

both the image-registry resource and the clusteroperator resource are reporting "progressing=true" when the deployment is complete.

The deployment itself reports:

  - lastTransitionTime: 2018-10-22T15:18:03Z
    lastUpdateTime: 2018-10-22T15:18:31Z
    message: ReplicaSet "image-registry-7cbbc9dc8c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing

So either this is a misunderstanding on our part of what that means, or deployments have a bug in which they should be setting progressing to false/completed=true.

Certainly the "message: ReplicaSet "image-registry-7cbbc9dc8c" has successfully progressed." would imply this might be a deployments bug.

/assign @legionus

unncessary updates to the CR

This code:

cluster-image-registry-operator/pkg/resource/generator.go

Lines 233 to 248 in 6e2fee7

    
           	err = gen.Create() 
        
           	if err != nil { 
        
           		return fmt.Errorf("failed to create object %s: %s", Name(gen), err) 
        
           	} 
        
           	glog.Infof("object %s created", Name(gen)) 
        
           	*modified = true 
        
           	return nil 
        
           } 
        
           updated, err := gen.Update(o.DeepCopyObject()) 
        
           if err != nil { 
        
           	return fmt.Errorf("failed to update object %s: %s", Name(gen), err) 
        
           } 
        
           if updated { 
        
           	glog.Infof("object %s updated", Name(gen)) 
        
           	*modified = true

means that we are ultimately going to update the CR even if we haven't necessarily changed the CR itself as part of an event loop (we may have only updated a secondary resource).

Every time we update the CR, we trigger an event that sends us back through the generator logic, so this means at a minimum we're doing one extra event loop process every time.

use a status subresource on the registry operator resource

I thought we'd already done this..is there a reason the registry resource doesn't define and use a Status subresource?

use more unique s3 bucket names

Currently we're using a generated uuid for the bucket name:

cluster-image-registry-operator/pkg/storage/s3/s3.go

Line 161 in d7c2a4d

    
           d.Config.Bucket = fmt.Sprintf("%s-%s", clusterconfig.STORAGE_PREFIX, string(uuid.NewUUID()))

@cuppett mentioned that this isn't generally considered good enough on S3 because of the universal nature of the bucket namespace. He had some suggestions for additional values we should include in the name to ensure uniquness.

@cuppett can you please supply the details of the recommendation you were making to me last night?

Thanks.

Is it supported to use PVC as registry storage?

Is it supported to use PVC as registry storage on OCP 4 like OCP 3?
How do we set it up if yes?

By default, the registry uses s3 on aws.

# oc get configs.imageregistry.operator.openshift.io instance -o yaml | grep storageManaged -B3
    s3:
      bucket: image-registry-us-east-2-<hash>
      region: us-east-2
  storageManaged: true

Thanks.

Publish information about the registry

In order to make the internal registry discoverable by other components (the API server, for example), we should create the ConfigMap openshift-image-registry/(image-registry-)?public-info with:

our internal host name (image-registry.openshift-image-registry.svc.cluster.local:5000),
our external host name.

This ConfigMap should be readable by system:authenticated.

InvalidAccessKeyId error during upgrade

Image registry operator works fine after cluster is setup, but broken after upgrade:

oc get clusteroperator cluster-image-registry-operator -o yaml:

...
  - lastTransitionTime: 2019-01-28T15:48:03Z
    message: "Unable to apply resources: unable to sync storage configuration: InvalidAccessKeyId:
      The AWS Access Key Id you provided does not exist in our records.\n\tstatus
      code: 403, request id: 1BB1FAF4B272F4A2, host id: 5/5II1d/1rTqEEh9HbyiW+P8NM+dqzd0NOCpmTogNIjgFoRf7lSxbdyX0BWhH1B+9ILNI22u9Lw="
    status: "True"
    type: Progressing
...

image-registry pods are being restarted frequently, so builds fail with HTTP 500

rename registry operator resources

We settled on:
resource name: config(s)
groupname: imageregistry.operator.openshift.io
resource instance name: instance

operator is hotlooping when s3 storage is configured

Evidenced by the quickly scrolling list of events being handled in the operator log. Changing to emptydir storage stops the hotloop.

enable must-gather object refs

If the clusteroperator defines object refs, the must-gather tool can automatically collect relevant resources/logs/etc from the references objects/namespaces. See:

https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/76/files#diff-ef02f4dd56d4bd19d7b3e4913b9a57d5R82

This operator should be updated to define appropriate object references.

/cc @dmage @adambkaplan

operator does not upgrade node-ca daemonset

During upgrade testing, the node-ca daemonset is not upgraded.

$ oc get clusterversion
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.alpha-2019-02-08-113402   True        False         23m     Cluster version is 4.0.0-0.alpha-2019-02-08-113402

$ oc get pod -oyaml cluster-image-registry-operator-67587bc6c7-qbsm2
spec:
  containers:
  - command:
    - cluster-image-registry-operator
    env:
    - name: WATCH_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: OPERATOR_NAME
      value: cluster-image-registry-operator
    - name: IMAGE
      value: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-08-113402@sha256:c4cc6c2199cac465f128bf9a0a78a83cfae8302e86e9aeef22b740c1e45780c2
    image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-08-113402@sha256:9f31485bb394dba9264ee020d07b61f0554d4640b5c625e67904fadb02f5c3c6
    imagePullPolicy: Always
    name: cluster-image-registry-operator

$ oc get ds node-ca -oyaml
spec:
  template:
    spec:
      containers:
        image: registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-02-08-055616@sha256:c4cc6c2199cac465f128bf9a0a78a83cfae8302e86e9aeef22b740c1e45780c2  <--- should match cluster version

@derekwaynecarr @smarterclayton @bparees

TestAWSUnableToCreateBucketOnStartup InvalidAccessKeyId errors during tests

message: "Unable to apply resources: unable to sync storage configuration: InvalidAccessKeyId:
The AWS Access Key Id you provided does not exist in our records.\n\tstatus
code: 403, request id: 9F7CF2E209519A53, host id: KmxOvvgpisxuwNCxf8xm3HQSKHryoECU6n10p2YUQBiW3sytZYtUmny006BW5ypF47nYQTzVSzE="
status: "True"

use listwatchers to cache resources

we should be setting up listwatchers for all the resources we care about so that when we need to re-get them, we can get them from the cache instead of making redundant api calls.

Make test failing : deploy declared but not used

Oleg,
Since you added this test could you check this out?
Not sure how it got merged with this test failing either?

https://github.com/openshift/cluster-image-registry-operator/blob/master/test/e2e/recreate_deployment_test.go#L50

github.com/openshift/cluster-image-registry-operator/test/e2e_test

test/e2e/recreate_deployment_test.go:50:2: deploy declared but not used
vet: typecheck failures
FAIL github.com/openshift/cluster-image-registry-operator/test/e2e [build failed]
make: *** [Makefile:24: test-e2e] Error 2

Organize tests in test/e2e into well defined packages

Test timeouts (default of 10 minutes) are applied on a per package basis. If we organize the tests into proper packages we can go back to using the default timeout.

Crash-looping in installer e2e tests, /run/flannel/subnet.env missing, no matches for kind "Route" in version "route.openshift.io/v1"

I have an installer e2e job where this operator is crash-looping on AWS. More details in openshift/installer#403, but specific to this operator are this:

$ kubectl logs -n openshift-image-registry cluster-image-registry-operator-869c995bc5-ccrn9
time="2018-10-03T23:09:12Z" level=info msg="Cluster Image Registry Operator Version: c2753e9-dirty"
time="2018-10-03T23:09:12Z" level=info msg="Go Version: go1.10.3"
time="2018-10-03T23:09:12Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-10-03T23:09:12Z" level=info msg="operator-sdk Version: 0.0.6+git"
E1003 23:09:12.518994       1 memcache.go:153] couldn't get resource list for apps.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.521719       1 memcache.go:153] couldn't get resource list for authorization.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.523561       1 memcache.go:153] couldn't get resource list for build.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.525691       1 memcache.go:153] couldn't get resource list for image.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.535615       1 memcache.go:153] couldn't get resource list for network.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.538190       1 memcache.go:153] couldn't get resource list for oauth.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.662266       1 memcache.go:153] couldn't get resource list for project.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.683553       1 memcache.go:153] couldn't get resource list for quota.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.703499       1 memcache.go:153] couldn't get resource list for route.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.717847       1 memcache.go:153] couldn't get resource list for security.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.719404       1 memcache.go:153] couldn't get resource list for template.openshift.io/v1: the server is currently unable to handle the request
E1003 23:09:12.720938       1 memcache.go:153] couldn't get resource list for user.openshift.io/v1: the server is currently unable to handle the request
time="2018-10-03T23:09:13Z" level=info msg="Metrics service cluster-image-registry-operator created"
time="2018-10-03T23:09:13Z" level=info msg="Watching rbac.authorization.k8s.io/v1, ClusterRole, , 0"
time="2018-10-03T23:09:13Z" level=info msg="Watching rbac.authorization.k8s.io/v1, ClusterRoleBinding, , 0"
time="2018-10-03T23:09:13Z" level=info msg="Watching v1, ConfigMap, openshift-image-registry, 0"
time="2018-10-03T23:09:13Z" level=info msg="Watching v1, Secret, openshift-image-registry, 0"
time="2018-10-03T23:09:13Z" level=info msg="Watching v1, ServiceAccount, openshift-image-registry, 0"
time="2018-10-03T23:09:13Z" level=info msg="Watching route.openshift.io/v1, Route, openshift-image-registry, 0"
time="2018-10-03T23:09:13Z" level=error msg="failed to get resource client for (apiVersion:route.openshift.io/v1, kind:Route, ns:openshift-image-registry): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(route.openshift.io/v1, Kind=Route): no matches for kind \"Route\" in version \"route.openshift.io/v1\""
panic: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(route.openshift.io/v1, Kind=Route): no matches for kind "Route" in version "route.openshift.io/v1"

goroutine 1 [running]:
github.com/openshift/cluster-image-registry-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.Watch(0xc4206786a0, 0x15, 0x133c5d9, 0x5, 0xc420040040, 0x18, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-image-registry-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/api.go:49 +0x4a8
main.watch(0xc4206786a0, 0x15, 0x133c5d9, 0x5, 0xc420040040, 0x18, 0x0)
        /go/src/github.com/openshift/cluster-image-registry-operator/cmd/cluster-image-registry-operator/main.go:39 +0x228
main.main()
        /go/src/github.com/openshift/cluster-image-registry-operator/cmd/cluster-image-registry-operator/main.go:82 +0x496

and this:

$ kubectl describe pods -n openshift-image-registry cluster-image-registry-operator-869c995bc5-ccrn9
Name:               cluster-image-registry-operator-869c995bc5-ccrn9
Namespace:          openshift-image-registry
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-0-136-219.ec2.internal/10.0.136.219
Start Time:         Wed, 03 Oct 2018 22:45:47 +0000
Labels:             name=cluster-image-registry-operator
                    pod-template-hash=4257551671
Annotations:        openshift.io/scc=restricted
Status:             Running
IP:                 10.2.4.6
Controlled By:      ReplicaSet/cluster-image-registry-operator-869c995bc5
Containers:
  cluster-image-registry-operator:
    Container ID:  cri-o://1bd15a65433c6f0cf3674fdf522bd7355c4e42741a7efccfa328fda1fea63ed2
    Image:         registry.svc.ci.openshift.org/ci-op-lpz1gxwg/stable@sha256:61b10a249a6efcf5ca2affd605365008115c1781fbd857b503f73d7091d23fd2
    Image ID:      registry.svc.ci.openshift.org/ci-op-lpz1gxwg/stable@sha256:61b10a249a6efcf5ca2affd605365008115c1781fbd857b503f73d7091d23fd2
    Port:          60000/TCP
    Host Port:     0/TCP
    Command:
      cluster-image-registry-operator
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 03 Oct 2018 23:40:11 +0000
      Finished:     Wed, 03 Oct 2018 23:40:11 +0000
    Ready:          False
    Restart Count:  15
    Environment:
      WATCH_NAMESPACE:  openshift-image-registry (v1:metadata.namespace)
      OPERATOR_NAME:    cluster-image-registry-operator
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6p6p5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-6p6p5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6p6p5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                  Age                  From                                   Message
  ----     ------                  ----                 ----                                   -------
  Warning  FailedScheduling        58m (x310 over 1h)   default-scheduler                      0/4 nodes are available: 4 node(s) had taints that the pod didn't tolerate.
  Warning  FailedCreatePodSandBox  55m                  kubelet, ip-10-0-136-219.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-image-registry-operator-869c995bc5-ccrn9_openshift-image-registry_e112b284-c75c-11e8-ad65-1267b6294ade_0(bd2b553ae930d9afea620ac3bc9401828ae7a577b0637f77739324638d3d414e): open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  55m                  kubelet, ip-10-0-136-219.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-image-registry-operator-869c995bc5-ccrn9_openshift-image-registry_e112b284-c75c-11e8-ad65-1267b6294ade_0(d27b30b91af951d2b96fcb67b9316b6c820ae1b0d8cd2681bcb2153cba249315): open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  54m                  kubelet, ip-10-0-136-219.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-image-registry-operator-869c995bc5-ccrn9_openshift-image-registry_e112b284-c75c-11e8-ad65-1267b6294ade_0(dabf597aba822b351b50772943de49434d51ba26367877881a6a4d03c3b3c88a): open /run/flannel/subnet.env: no such file or directory
  Warning  FailedCreatePodSandBox  54m                  kubelet, ip-10-0-136-219.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-image-registry-operator-869c995bc5-ccrn9_openshift-image-registry_e112b284-c75c-11e8-ad65-1267b6294ade_0(5286b6bd6666d5ea8bd1d0e79dd35110598dd46b0f1c6488cbe7738312c71386): open /run/flannel/subnet.env: no such file or directory
  Normal   Pulling                 52m (x4 over 54m)    kubelet, ip-10-0-136-219.ec2.internal  pulling image "registry.svc.ci.openshift.org/ci-op-lpz1gxwg/stable@sha256:61b10a249a6efcf5ca2affd605365008115c1781fbd857b503f73d7091d23fd2"
  Normal   Pulled                  52m (x4 over 54m)    kubelet, ip-10-0-136-219.ec2.internal  Successfully pulled image "registry.svc.ci.openshift.org/ci-op-lpz1gxwg/stable@sha256:61b10a249a6efcf5ca2affd605365008115c1781fbd857b503f73d7091d23fd2"
  Normal   Created                 52m (x4 over 54m)    kubelet, ip-10-0-136-219.ec2.internal  Created container
  Normal   Started                 52m (x4 over 54m)    kubelet, ip-10-0-136-219.ec2.internal  Started container
  Warning  BackOff                 21s (x242 over 53m)  kubelet, ip-10-0-136-219.ec2.internal  Back-off restarting failed container

I don't know if the missing /run/flannel/subnet.env are an API server issue (the API servers are also having trouble on this cluster), or a kubelet issue, or an operator issue, but thought I'd post here in case the issue was more obvious to y'all :).

registry memory resource limit test is flaking

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/203/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/1079/

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/203/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/1073/

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/203/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/1057/

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/203/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/1049/

configuration_test.go:69: expected memory limit of 512Mi, found: 0

Given that the test creates the registry resource w/ the memory limit set, i can't see why we'd ever create a pod that didn't have the limit set, but it's the most consistent flake i'm seeing on the registry operator repo right now.

/cc @coreydaley @legionus @dmage

AWS S3: Abort incomplete multipart uploads

Configure the S3 bucket to abort multi-part uploads by default when we create the bucket

https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config

https://bugzilla.redhat.com/show_bug.cgi?id=1662132

finalizer hangs when deleting registry resource

deleting a registry resource results in a hang, it doesn't look like the finalizer is succeeding/working.

In my case I did not have a registry created, I was simply deleting the default CR that was created (the one that has no storage configured, thus has not deployed a registry).

cluster s3 config should not stomp local secret config

Today we copy s3 config from the cluster, into the secret in the image-registry namespace, and then drive that into the registry deployment.

Instead the registry operator should be:

checking if there is a local secret. If so use it.
If not, check if there is cluster-wide s3 config, if so, use it directly (set it on the registry deployment, don't copy it into a local secret)

We also need to be watching (2) for changes/resyncing it when there is no local secret.

Extract 'testframework' into openshift/library-go

We want to reuse the testframework for our the console-operator e2e tests. Is there any plan to refactor and extract it into https://github.com/openshift/library-go ?

Refactor storage driver(s) to use SyncStorage function

Drivers should use a SyncStorage function to determine what action to take (create, update, remove) based on the requested storage (spec) and current storage (status).

switch from clusteroperators.operatorstatus.openshift.io to clusteroperator.config.openshift.io

We should be creating/maintaining the new clusteroperator.config.openshift.io resource, not the old clusteroperators.operatorstatus.openshift.io one.

Also clusteroperator.config.openshift.io is not namespaced.

wrong creds log message

this is logging the wrong namespace:
https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/clusterconfig/clusterconfig.go#L123

"Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials \"kube-system/installer-cloud-credentials\": timed out waiting for the condition",

it's not looking in kube-system, it's looking in openshift-image-registry.

probably worth double checking others as well.

Image Registry Operator reporting it's status to the wrong resource?

If I run "oc get ImageRegistry -o yaml", I see the Image Registry Operator status conditions such as

status:
  conditions:
  - lastTransitionTime: 2018-12-11T00:30:49Z
    message: deployment has minimum availability
    status: "True"
    type: Available
  - lastTransitionTime: 2018-12-11T00:30:49Z
    message: everything is ready
    status: "False"
    type: Progressing
  - lastTransitionTime: 2018-12-11T00:30:41Z
    status: "False"
    type: Failing
  - lastTransitionTime: 2018-12-11T00:30:41Z
    status: "False"
    type: Removed

But shouldn't these be reported in the ClusterOperator resource?
When I run "oc get ClusterOperator" i see the status for the other operators (including the Samples Operator). And when I run "oc get ClusterOperator -o yaml" I see similar conditions to what we have in our ImageRegistry resource.

And then we should be storing things such as the StorageExists condition in the ImageRegistry resource status conditions I would think.

Thoughts?

operator allows filesystem storage change

The operator prevents changing from filesystem to another storagetype, but it doesn't prevent me from changing my filesystem volumeSource from PVC to EmptyDir (or presumably other changes). Such a change also invalidates my storage and therefore needs to be prevented.

Can not use Dockerfile

docker build -t docker.io/openshift/origin-cluster-image-registry-operator:latest .
Sending build context to Docker daemon 180.2 MB
Step 1/9 : FROM openshift/origin-release:golang-1.10
Trying to pull repository registry.access.redhat.com/openshift/origin-release ... 
Trying to pull repository docker.io/openshift/origin-release ... 
sha256:76e8479eb8c137cff06a5d35b27e5b5b5d4ce70f7a30de4ef01c5b3159a4d1ce: Pulling from docker.io/openshift/origin-release
256b176beaff: Already exists 
2bd622ac2b02: Pull complete 
Digest: sha256:76e8479eb8c137cff06a5d35b27e5b5b5d4ce70f7a30de4ef01c5b3159a4d1ce
Status: Downloaded newer image for docker.io/openshift/origin-release:golang-1.10
 ---> b017c842702b
Step 2/9 : COPY . /go/src/github.com/openshift/cluster-image-registry-operator
 ---> d442a944130b
Removing intermediate container 830fd06ddc48
Step 3/9 : RUN cd /go/src/github.com/openshift/cluster-image-registry-operator &&     go build ./cmd/cluster-image-registry-operator
 ---> Running in 5f596199bb25
 ---> beb23f9b901d
Removing intermediate container 5f596199bb25
Step 4/9 : FROM centos:7
Trying to pull repository registry.access.redhat.com/centos ... 
Trying to pull repository docker.io/library/centos ... 
sha256:6f6d986d425aeabdc3a02cb61c02abb2e78e57357e92417d6d58332856024faf: Pulling from docker.io/library/centos
256b176beaff: Already exists 
Digest: sha256:6f6d986d425aeabdc3a02cb61c02abb2e78e57357e92417d6d58332856024faf
Status: Downloaded newer image for docker.io/centos:7
 ---> 5182e96772bf
Step 5/9 : RUN useradd cluster-image-registry-operator
 ---> Running in 0af16ee1240f
 ---> af4ca97226ce
Removing intermediate container 0af16ee1240f
Step 6/9 : USER cluster-image-registry-operator
 ---> Running in 6c2ead775a45
 ---> 860d888ecc10
Removing intermediate container 6c2ead775a45
Step 7/9 : COPY --from=0 /go/src/github.com/openshift/cluster-image-registry-operator /usr/bin
Unknown flag: from
make: *** [Makefile:7: build-image] Error 1

$ docker version
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-60.git9cb56fd.fc28.x86_64
 Go version:      go1.10.3
 Git commit:      bdb8293-unsupported
 Built:           Sun Jul  8 08:29:45 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-60.git9cb56fd.fc28.x86_64
 Go version:      go1.10.3
 Git commit:      bdb8293-unsupported
 Built:           Sun Jul  8 08:29:45 2018
 OS/Arch:         linux/amd64
 Experimental:    true

Can we remove this modern option ?

regsitryresource was not recreated after deleting it

I deleted the registry resource and saw the registry deployment was cleaned up, but I did not see a new registry resource created until I restarted the registry operator.

TestCustomPVC test timeout flake

seen in https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/222/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/1205/

=== RUN   TestCustomPVC
--- FAIL: TestCustomPVC (304.29s)
	pvc_test.go:79: PersistentVolume pv-q4sfdkgprvnr27pz25pfsrqgvfcqt66x766lk8m9z66qj7hczdvswfw4p6675j5j created
	pvc_test.go:79: PersistentVolume pv-qn842l7shqhwnrt6l57k7ckk7d44gkv92n6nzrthwjwmlpnl8j7wm6kl9bc9b9xb created
	pvc_test.go:224: timed out waiting for the condition
	framework.go:44: storageclasses:
		items:
		- metadata:
		    annotations:
		      storageclass.kubernetes.io/is-default-class: "true"
		    creationTimestamp: "2019-02-27T03:08:54Z"
		    labels:
		      cluster.storage.openshift.io/owner-name: cluster-config-v1
		      cluster.storage.openshift.io/owner-namespace: kube-system
		    name: gp2
		    resourceVersion: "12396"
		    selfLink: /apis/storage.k8s.io/v1/storageclasses/gp2
		    uid: 04235027-3a3d-11e9-a148-0a177483df06
		  parameters:
		    type: gp2
		  provisioner: kubernetes.io/aws-ebs
		  reclaimPolicy: Delete
		  volumeBindingMode: WaitForFirstConsumer
		metadata:
		  resourceVersion: "40806"
		  selfLink: /apis/storage.k8s.io/v1/storageclasses
	framework.go:44: persistentvolumes:
		items:
		- metadata:
		    annotations:
		      pv.kubernetes.io/bound-by-controller: "yes"
		    creationTimestamp: "2019-02-27T03:29:30Z"
		    finalizers:
		    - kubernetes.io/pv-protection
		    name: pv-jf95fgm2v5bbzp7z5jxm2mpn2fm8kxgj5pz5r8c66kzmz222v225nr78p5kcwtw4
		    resourceVersion: "37743"
		    selfLink: /api/v1/persistentvolumes/pv-jf95fgm2v5bbzp7z5jxm2mpn2fm8kxgj5pz5r8c66kzmz222v225nr78p5kcwtw4
		    uid: e4d85967-3a3f-11e9-a659-12bbbb61cb90
		  spec:
		    accessModes:
		    - ReadWriteOnce
		    - ReadOnlyMany
		    - ReadWriteMany
		    capacity:
		      storage: 100Gi
		    claimRef:
		      apiVersion: v1
		      kind: PersistentVolumeClaim
		      name: image-registry-storage
		      namespace: openshift-image-registry
		      resourceVersion: "36217"
		      uid: eb3bbf4c-3a3f-11e9-82b8-0a177483df06
		    hostPath:
		      path: /tmp/pv-jf95fgm2v5bbzp7z5jxm2mpn2fm8kxgj5pz5r8c66kzmz222v225nr78p5kcwtw4
		      type: ""
		    persistentVolumeReclaimPolicy: Retain
		    storageClassName: gp2
		  status:
		    phase: Released
		- metadata:
		    creationTimestamp: "2019-02-27T03:29:29Z"
		    finalizers:
		    - kubernetes.io/pv-protection
		    name: pv-k76bz94w9fbsz7n6h65s87c9cw8796h6dphq2rnft7nkwr69tq7qgm62l4ktqcdb
		    resourceVersion: "36055"
		    selfLink: /api/v1/persistentvolumes/pv-k76bz94w9fbsz7n6h65s87c9cw8796h6dphq2rnft7nkwr69tq7qgm62l4ktqcdb
		    uid: e4363455-3a3f-11e9-a659-12bbbb61cb90
		  spec:
		    accessModes:
		    - ReadWriteOnce
		    - ReadOnlyMany
		    - ReadWriteMany
		    capacity:
		      storage: 100Gi
		    hostPath:
		      path: /tmp/pv-k76bz94w9fbsz7n6h65s87c9cw8796h6dphq2rnft7nkwr69tq7qgm62l4ktqcdb
		      type: ""
		    persistentVolumeReclaimPolicy: Retain
		  status:
		    phase: Available
		- metadata:
		    creationTimestamp: "2019-02-27T03:31:07Z"
		    finalizers:
		    - kubernetes.io/pv-protection
		    name: pv-q4sfdkgprvnr27pz25pfsrqgvfcqt66x766lk8m9z66qj7hczdvswfw4p6675j5j
		    resourceVersion: "37818"
		    selfLink: /api/v1/persistentvolumes/pv-q4sfdkgprvnr27pz25pfsrqgvfcqt66x766lk8m9z66qj7hczdvswfw4p6675j5j
		    uid: 1e466d1c-3a40-11e9-a659-12bbbb61cb90
		  spec:
		    accessModes:
		    - ReadWriteOnce
		    - ReadOnlyMany
		    - ReadWriteMany
		    capacity:
		      storage: 100Gi
		    hostPath:
		      path: /tmp/pv-q4sfdkgprvnr27pz25pfsrqgvfcqt66x766lk8m9z66qj7hczdvswfw4p6675j5j
		      type: ""
		    persistentVolumeReclaimPolicy: Retain
		  status:
		    phase: Available
		- metadata:
		    creationTimestamp: "2019-02-27T03:31:08Z"
		    finalizers:
		    - kubernetes.io/pv-protection
		    name: pv-qn842l7shqhwnrt6l57k7ckk7d44gkv92n6nzrthwjwmlpnl8j7wm6kl9bc9b9xb
		    resourceVersion: "37832"
		    selfLink: /api/v1/persistentvolumes/pv-qn842l7shqhwnrt6l57k7ckk7d44gkv92n6nzrthwjwmlpnl8j7wm6kl9bc9b9xb
		    uid: 1ee7f556-3a40-11e9-a659-12bbbb61cb90
		  spec:
		    accessModes:
		    - ReadWriteOnce
		    - ReadOnlyMany
		    - ReadWriteMany
		    capacity:
		      storage: 100Gi
		    hostPath:
		      path: /tmp/pv-qn842l7shqhwnrt6l57k7ckk7d44gkv92n6nzrthwjwmlpnl8j7wm6kl9bc9b9xb
		      type: ""
		    persistentVolumeReclaimPolicy: Retain
		    storageClassName: gp2
		  status:
		    phase: Available
		metadata:
		  resourceVersion: "40806"
		  selfLink: /api/v1/persistentvolumes
	framework.go:44: persistentvolumeclaims:
		items:
		- metadata:
		    creationTimestamp: "2019-02-27T03:31:09Z"
		    finalizers:
		    - kubernetes.io/pvc-protection
		    name: test-custom-pvc
		    namespace: openshift-image-registry
		    resourceVersion: "37842"
		    selfLink: /api/v1/namespaces/openshift-image-registry/persistentvolumeclaims/test-custom-pvc
		    uid: 1f876dd3-3a40-11e9-a659-12bbbb61cb90
		  spec:
		    accessModes:
		    - ReadWriteMany
		    dataSource: null
		    resources:
		      requests:
		        storage: 1Gi
		    storageClassName: gp2
		  status:
		    phase: Pending
		metadata:
		  resourceVersion: "40806"
		  selfLink: /api/v1/namespaces/openshift-image-registry/persistentvolumeclaims
	imageregistry.go:146: uninstalling the image registry...
	imageregistry.go:150: stopping the operator...
	imageregistry.go:154: deleting the image registry resource...

Uninstall procedure

We need a mechanism that will allow us to remove the registry. We have the ManagementState removed, but there is no indication of the removal progress. I can't delete the custom resource because the operator will re-bootstrap it.

I need this because I want to uninstall the operator and removed everything that was generated by the operator.

Write tests for AdditionalTrustedCA

Recently we had a regression (#205). We need to have a test for AdditionalTrustedCA to avoid such regressions in the future.

need to retry failed deploymentconfig rollouts or switch to deployments

Upstream deployments constantly try to rollout the latest state until explicitly asked not to. DeploymentConfigs give up unless triggered again. Given an operator managed deploymentconfig that should be actively reconciled, you should be retrying failed deployments since they can later succeed.

The current state in openshift/installer is a good example where a blip in the first rollout will cause the image registry to be stuck indefinitely. Switching to a deployment will resolve this problem. Alternatively you can rebuild the oc rollout logic.

Migrate from operator-sdk

Since the operator has deleted the entire API [1], we are on the unsupported version in the vendor that cannot be updated. Of course, you can live with it for now, but eventually we need to rewrite everything using some another API.

[1] operator-framework/operator-sdk@a78c102

Usage question

Guess it's more a clarification request than an issue or bug, but how does the internal pods manage to pull images from the registry?

I have a situation where I installed the registry operator, I can push and pull images from outside openshift, but when I try oc new-app namespace/image the pods simply cannot pull the image from the registry.
First there is a certificate error (Unknown issuer) that I solved by copying the image registry certificate to /etc/docker/certs.d/<host> inside each node, but then it starts to fail with

Failed to pull image [...] rpc error: code = Unknown desc = unauthorized: authentication required

The image registry pod itself implies the pods is trying to pull images anonymously:

time="2019-01-15T19:12:26.189081838Z" level=error msg="OpenShift access denied: no RBAC policy
 matched" go.version=go1.10.3 openshift.auth.user=anonymous vars.name=default/fedoraa 
vars.reference="sha256:d79ddffdce8112f111878afcbe0205ef43e3eead131399194e2cf66fa8f3e5ed"

So my question is, what is the expect way to use the registry operator? Does it require extra configurations beyond what is https://github.com/openshift/cluster-image-registry-operator/tree/master/deploy ?

rename CRD to imageregistry

@legionus Can we rename openshiftdockerregistries.dockerregistry.operator.openshift.io to "openshiftimageregistries.imageregistry.operator.openshift"?

Assign a priority class to pods

Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority

Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.

Operator crashes when storage backend cannot be detected

Seems #58 is back, seeing this on BYOR control plane cluster, no cloudprovider set:

I0111 01:01:08.021751       1 main.go:24] Cluster Image Registry Operator Version: 2e125da-dirty
I0111 01:01:08.021867       1 main.go:25] Go Version: go1.10.3
I0111 01:01:08.021872       1 main.go:26] Go OS/Arch: linux/amd64
I0111 01:01:08.049742       1 controller.go:378] waiting for informer caches to sync
I0111 01:01:09.251837       1 controller.go:387] started events processor
I0111 01:01:09.263690       1 bootstrap.go:102] generating registry custom resource
E0111 01:01:09.267640       1 storage.go:106] unknown storage backend: 
E0111 01:01:09.267778       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/panic.go:63
/usr/local/go/src/runtime/signal_unix.go:388
/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/bootstrap.go:128
/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:122
/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:213
/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:220
/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:385
/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x12918ad]

goroutine 96 [running]:
github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x144be20, 0x223d8d0)
	/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*Controller).Bootstrap(0xc42002c160, 0xc420810480, 0x1)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/bootstrap.go:128 +0x6ad
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*Controller).sync(0xc42002c160, 0x17f2570, 0xc4200c7460)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:122 +0x1e6
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*Controller).eventProcessor.func1(0xc42002c160, 0x13b60a0, 0x17bb930)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:213 +0x8f
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*Controller).eventProcessor(0xc42002c160)
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:220 +0x8e
github.com/openshift/cluster-image-registry-operator/pkg/operator.(*Controller).(github.com/openshift/cluster-image-registry-operator/pkg/operator.eventProcessor)-fm()
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:385 +0x2a
github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc420aa80a0)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc420aa80a0, 0x3b9aca00, 0x0, 0x9c295639fd35a501, 0xc4200b0de0)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc420aa80a0, 0x3b9aca00, 0xc4200b0de0)
	/go/src/github.com/openshift/cluster-image-registry-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-image-registry-operator/pkg/operator.(*Controller).Run
	/go/src/github.com/openshift/cluster-image-registry-operator/pkg/operator/controller.go:385 +0xc32

Am I missing something or are the logs not working after the refactor to use klog?

registry-ca-hostmapper should run on linux only nodes

cc @mrunalp

clusteroperator condition message improvements

slack exchange:

i’d like to get some style cleanups in both the devex team operators on the progressing message
this is your primary communication channel and the intended receiver is the admin
right now it’s “The samples operator configuration is valid” and for registry it’s something slightly more machine oriented
i’d like the message to be something more affirmative like “The latest sample images and templates are installed” or something
or even shorter “All sample resources are up to date”
the style is sentence without punctuation
upper case leading, no punctution, human intended receipient
try to avoid kube-isms and overly technical explanations

Gabe Montero [1:41 PM]
the available condition is the one that says "Samples exist in the openshift project"

Clayton Coleman [1:42 PM]
Progressing is the generic condition

Gabe Montero [1:42 PM]
the failing condition is the one that says ""The samples operator configuration is valid""

Clayton Coleman [1:43 PM]
that’s where you put your human focused summary message
Progressing must represent a summary of the other conditions
so it’s fine to have the same message on progressing as available
(it’s fine to have the same messages on available)
but available and failed should explain the details as necessary, while progressing should explain to a human what the current state is

For context see https://github.com/openshift/cluster-version-operator/blob/master/pkg/cvo/status.go which has comments and examples of how the CVO works which is what this pattern is modeled on.

openshift/cluster-version-operator#72 is docs around what to do

This came from openshift/cluster-samples-operator#69

so see that issue for additional discussion.

	// Enable default encryption on the bucket
	_, err = svc.PutBucketEncryption(&s3.PutBucketEncryptionInput{
	Bucket: aws.String(d.Config.Bucket),
	ServerSideEncryptionConfiguration: &s3.ServerSideEncryptionConfiguration{
	Rules: []*s3.ServerSideEncryptionRule{
	{
	ApplyServerSideEncryptionByDefault: &s3.ServerSideEncryptionByDefault{
	SSEAlgorithm: aws.String(s3.ServerSideEncryptionAes256),
	},
	},
	},
	},
	})

	err = gen.Create()
	if err != nil {
	return fmt.Errorf("failed to create object %s: %s", Name(gen), err)
	}
	glog.Infof("object %s created", Name(gen))
	*modified = true
	return nil
	}

	updated, err := gen.Update(o.DeepCopyObject())
	if err != nil {
	return fmt.Errorf("failed to update object %s: %s", Name(gen), err)
	}
	if updated {
	glog.Infof("object %s updated", Name(gen))
	*modified = true