Giter Site home page Giter Site logo

Comments (14)

jbw976 avatar jbw976 commented on May 27, 2024 1

Perfect @turkenh! Let's close down this issue then in favor of #63. Thank you for your support and thank you to @idallaserra for his keep troubleshooting as well! 💪

from provider-helm.

jbw976 avatar jbw976 commented on May 27, 2024

Further details about verifying that the connection secret to the GKE cluster seems OK. See the secret referenced by the ProviderConfig:

> k get providerconfig.helm.crossplane.io multik8s-cluster-gcp -o yaml
...
spec:
  credentials:
    secretRef:
      key: kubeconfig
      name: 2b80fbbb-076b-4908-9034-71db73c3cc86-gkecluster
      namespace: upbound-system
    source: Secret

Extracting the data.kubeconfig field from that secret into a file then using the file to run kubectl get node worked OK:

> kubectl -n upbound-system get secret 2b80fbbb-076b-4908-9034-71db73c3cc86-gkecluster -o jsonpath='{.data.kubeconfig}' | base64 -d > kubeconfig-gke
> k --kubeconfig=kubeconfig-gke get node
NAME                                                  STATUS   ROLES    AGE   VERSION
gke-cluster-gcp-tkbx-cluster-gcp-s59h-366d1ae5-9nxr   Ready    <none>   64m   v1.16.15-gke.7300
gke-cluster-gcp-tkbx-cluster-gcp-s59h-894e22ae-p5c9   Ready    <none>   65m   v1.16.15-gke.7300
gke-cluster-gcp-tkbx-cluster-gcp-s59h-f5400ebc-zp8n   Ready    <none>   65m   v1.16.15-gke.7300

from provider-helm.

turkenh avatar turkenh commented on May 27, 2024

It should keep retrying with the default shortWait which is 30 sec. Every time reconcile (at connect step here) failed it is being re-added to queue.

Do you see anything in provider helm controller logs? Just wondering if controller stopped working somehow?

from provider-helm.

jbw976 avatar jbw976 commented on May 27, 2024

I reproduced this issue again just now and I captured the full pod logs in the gist below. A lot of messages for this, but not sure if that's an issue:

E1231 19:10:49.562709       1 memcache.go:206] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Note that the provider-helm pod has no restarts and appears healthy (full pod details in gist)

> kcs get pod crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww
NAME                                                    READY   STATUS    RESTARTS   AGE
crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww   1/1     Running   0          16m

full gist: https://gist.github.com/jbw976/2d13e79147c2aae564c61a827389de57

Anything else I can look into? or get you access to?

from provider-helm.

jbw976 avatar jbw976 commented on May 27, 2024

Could this be a network policy or something that is blocking provider-helm in the hosted crossplane instance in Upbound Cloud from being able to reach the GKE cluster? When I successfully used the kubeconfig from the connection secret, I was doing that from my own laptop.

from provider-helm.

turkenh avatar turkenh commented on May 27, 2024

Could this be a network policy or something that is blocking provider-helm in the hosted crossplane instance in Upbound Cloud from being able to reach the GKE cluster? When I successfully used the kubeconfig from the connection secret, I was doing that from my own laptop.

We don't have egress policy that might block such kind of a traffic in Upbound Cloud, so, I don't think so.

Couple of things that I am wondering:

  • Is it 100% reproducible?
  • Does the controller stop reconciling on CR, or, keeps retrying but getting some sort of error and somehow this is not reflected to CR status. Controller logs could help here (thanks for the logs above @jbw976, but I think we would need --debug flag in the controller to better understand what is going on)
  • What happens if we restart the controller?

Otherwise, it would be helpful if I could get access to the environment next time for debugging further.

from provider-helm.

turkenh avatar turkenh commented on May 27, 2024

Hmm, interesting point in the logs you shared:

2020/12/31 19:10:47 info: skipping unknown hook: "crd-install"

which indicates that, installation of helm started at 2020/12/31 19:10:47, so, helm controller seems to get connected to remote cluster, but probably didn't get completed (yet?) because of some reason. (AFAIR, that chart contains a pre-install helm hook which might be related). So, I would definitely check what is going on on the cluster side, for example:

helm ls --all --all-namespaces
kubectl get jobs -n operators
kubectl get pods -n operators
kubectl get all -n operators
...

from provider-helm.

idallaserra avatar idallaserra commented on May 27, 2024

Dear, I am experiencing exactly the same problem described by @jbw976 .
But in my opinion the problem is not related to the helm gcp provider.
In my case the issue is due to the fact that I am testing the gcp part of platform-ref-multi-k8s and the GKE cluster created by this crossplane repo is Private!
So only some image in cache is working. In fact trying to install with plain helm, no provider helm, some example chart give me pull error. Fixing this using cloud nat and now everything works:
image

from provider-helm.

jbw976 avatar jbw976 commented on May 27, 2024

Super cool that you got it working for you @idallaserra! Can you explain a little more detail about what "using cloud nat" means? Any docs you could point me to that would help me understand the details?

It seemed to me like 2 different problems though, where my provider-helm couldn't even connect to the newly provisioned GKE cluster at all, even with my changes to enable basic auth in the GKE cluster in upbound/platform-ref-multi-k8s#6 that are now published in v0.0.4 at https://cloud.upbound.io/registry/upbound/platform-ref-multi-k8s.

For you though, your provider-helm seemed to be able to connect to the GKE cluster, but then container pulls from inside the GKE cluster were not working. That felt like a different issue to me, and I'd love to hear more about the "cloud nat" you used to get that working.

@turkenh I'll try to get more debugging details (and get you access to the repro instance) the next time I repro this. Thanks for your help so far! 🙇‍♂️

edit: ah, now i'm seeing that @turkenh mentioned the evidence for my provider-helm connecting to the cluster OK but just not finishing the job. Perhaps @idallaserra has it figured out with the "cloud nat" 💪

from provider-helm.

idallaserra avatar idallaserra commented on May 27, 2024

Dear for the cloud nat the problem is well explained here:
https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#docker_hub in "Can't pull image from public Docker Hub".
Shortly the created cluster is Private, so the node has no internet connectivity, meaning no docker pull.
Here a document, old but still valid, explaing how to do.
The best in my opinion would be the Cloud Nat support in provider-gcp (still lacking it seems to me) to permit the creation directly in Crossplane.
Hope to be useful :-)
Ivan

from provider-helm.

turkenh avatar turkenh commented on May 27, 2024

Just reproduced the issue and here is my findings:

  • Helm Release status is same as the one given in the beginning of this issue.
  • On the created GKECluster, I see:
    1. operators namespace created by provider-helm
    2. helm release is stuck as pending-install:
    helm ls -a
    NAME                      	NAMESPACE	REVISION	UPDATED                              	STATUS         	CHART                       	APP VERSION
    vault1-cluster-p7pfn-mkbc6	operators	1       	2021-01-19 14:08:47.1636178 +0000 UTC	pending-install	kube-prometheus-stack-10.1.0	0.42.1
    
    1. pre-install job in prometheus chart created and its pod stuck as ImagePullBackOff with the following events:
     Events:
       Type     Reason     Age                   From               Message
       ----     ------     ----                  ----               -------
       Normal   Scheduled  11m                   default-scheduler  Successfully assigned operators/vault1-cluster-p7pfn-mkbc6-admission-create-8m2qr to gke-vault1-cluster-p-vault1-cluster-p-d7ca7c52-3465
       Normal   Pulling    8m36s (x4 over 11m)   kubelet            Pulling image "jettech/kube-webhook-certgen:v1.2.1"
       Warning  Failed     8m21s (x4 over 10m)   kubelet            Failed to pull image "jettech/kube-webhook-certgen:v1.2.1": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
       Warning  Failed     8m21s (x4 over 10m)   kubelet            Error: ErrImagePull
       Normal   BackOff    5m57s (x14 over 10m)  kubelet            Back-off pulling image "jettech/kube-webhook-certgen:v1.2.1"
       Warning  Failed     63s (x34 over 10m)    kubelet            Error: ImagePullBackOff
    
    1. I can pull that docker image from my local
    2. Tried to run two pods with public docker images, one started running, the other failed with ImagePullBackoff

I think, this confirms @idallaserra`s observations. This stackoverflow describe the same issue and has a well-written answer.

@jbw976, wondering if there is any special reason to enable private cluster nodes in gke resource spec here.

from provider-helm.

turkenh avatar turkenh commented on May 27, 2024

And, after waiting loong enough, helm release reports following status back :)

status:
  atProvider:
    releaseDescription: 'Release "vault1-cluster-p7pfn-mkbc6" failed: failed pre-install:
      failed to deploy vault1-cluster-p7pfn-mkbc6-admission-create'
    revision: 1
    state: failed
  conditions:
  - lastTransitionTime: "2021-01-19T14:08:32Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
  - lastTransitionTime: "2021-01-19T14:08:17Z"
    reason: Unavailable
    status: "False"
    type: Ready
  failed: 1
  synced: true

from provider-helm.

jbw976 avatar jbw976 commented on May 27, 2024

@jbw976, wondering if there is any special reason to enable private cluster nodes in gke resource spec here.

No I don't think there is @turkenh. Most likely I was following another example and didn't realize the impact of that setting 😨

I'll get this patched in the reference platforms that use GKE (GCP and multi-k8s).

Is there anything to fix here in terms of provider-helm reporting back status earlier, so that people in the future in similar scenarios don't get tricked by the red herring of "connection refused"? That was showing to me as the main/obvious thing that was wrong on the Release object when I first investigated and got lead down the wrong trail.

Can that status behavior be improved? 🙏

from provider-helm.

turkenh avatar turkenh commented on May 27, 2024

Is there anything to fix here in terms of provider-helm reporting back status earlier, so that people in the future in similar scenarios don't get tricked by the red herring of "connection refused"? That was showing to me as the main/obvious thing that was wrong on the Release object when I first investigated and got lead down the wrong trail.

Can that status behavior be improved? 🙏

Yes, once we have implemented this, we will be able to report early / better instead of getting blocked until helm client returns.

from provider-helm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.