
Comments (9)

rawmind0 commented on August 29, 2024

@nazarewk, the pre-release version of the provider waited for cluster state in the same way this version does; nothing has changed there.

When a custom cluster is created, the expected state is provisioning (https://github.com/terraform-providers/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L52), and on update the expected state can be "active", "provisioning" or "pending" (https://github.com/terraform-providers/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L180).

Maybe I'm not fully understanding your issue. When you say join command, are you referring to the cluster node command that is executed on nodes? The cluster node command should be available as soon as you create the cluster, while it is still in provisioning state.

In order to clarify and reproduce the issue, could you please provide the tf file you use to create and update the cluster?


nazarewk commented on August 29, 2024

I've used https://github.com/terraform-providers/terraform-provider-rancher2/blob/d96820fa09b19bb6ff56498ab97af444fe89eb7b/rancher2/resource_rancher2_cluster.go before (this exact commit) and it worked fine.

On Monday, after updating the provider to 1.0.0 and re-creating the aws_instances responsible for etcd and controlplane all at once, rancher2_cluster was stuck in the waiting state until I skipped the resource and hardcoded the join command copied from the Rancher UI (it is used in user_data to join the cluster later). Only once the nodes were operational again did the cluster leave the waiting state.
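
To illustrate, a minimal sketch of what I mean (the resource and variable names are hypothetical; local.rancher2_join_command is the "node_command" lookup shown in the snippets below, and the role flags are appended as the Rancher UI would):

resource "aws_instance" "controlplane" {
  ami           = "${var.ami_id}"
  instance_type = "${var.instance_type}"

  # Interpolating the join command into user_data is what makes the
  # instances depend on rancher2_cluster.
  user_data = <<-EOF
    #!/bin/bash
    ${local.rancher2_join_command} --etcd --controlplane
  EOF
}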


rawmind0 commented on August 29, 2024

@nazarewk, in order to reproduce your issue, could you please provide the tf file you use to create and update the cluster? Also, could you please explain the steps you are taking?

Are you using node_template and node_pool, or deploying the AWS instances some other way? When you mention the join command, are you referring to the cluster node command that is executed on nodes?


nazarewk commented on August 29, 2024

My Terraform configuration is upwards of 4k LoC, so I can't really share it, but over the last week I have nailed down a scenario that is impossible to realize in the current form:

  1. Deploy an operational cluster using a Custom Nodes cluster (no node pools or anything like that, just plain old EC2 instances running the docker command from rancher2_cluster.<>.cluster_registration_token["node_command"]),
  2. Turn off the Custom Nodes (stop/terminate the EC2 instances),
  3. Make a modification to the cluster (e.g. turn Nginx Ingress on -> off).

The result is:

  1. rancher2_cluster is stuck in the pending/provisioning state (whichever it is, the resource's Update never finishes) indefinitely because etcd/controlplane is down,
  2. rancher2_cluster.<>.cluster_registration_token["node_command"] waits indefinitely for rancher2_cluster to finish updating,
  3. the aws_instances cannot be brought back up, because the previous command will not finish until the controlplane is up.

It is up to you to decide whether you want to support a flow like this or not.

My workaround is to edit the module's source code to remove the rancher2_cluster dependencies:

locals {
  # Append an empty map so index 0 always exists, even when the
  # cluster resource has count = 0.
  _rancher2_cluster_registration_tokens = "${flatten(concat(flatten(rancher2_cluster.this.*.cluster_registration_token), list(map())))}"
  _rancher2_cluster_registration_token  = "${local._rancher2_cluster_registration_tokens[0]}"

  #  rancher2_cluster_id   = "${join("#", rancher2_cluster.this.*.id)}"
  #  rancher2_join_command = "${lookup(local._rancher2_cluster_registration_token, "node_command", "")}"

  # Uncomment this and fix values from Rancher UI's URL and docker command while editing cluster
  rancher2_cluster_id   = "c-g8r8v"
  rancher2_join_command = "sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.3 --server https://rancher.shared-infra.global.projectdrgn.com --token example123...321example"
}

instead of:

locals {
  _rancher2_cluster_registration_tokens = "${flatten(concat(flatten(rancher2_cluster.this.*.cluster_registration_token), list(map())))}"
  _rancher2_cluster_registration_token  = "${local._rancher2_cluster_registration_tokens[0]}"

  rancher2_cluster_id   = "${join("#", rancher2_cluster.this.*.id)}"
  rancher2_join_command = "${lookup(local._rancher2_cluster_registration_token, "node_command", "")}"

  # Uncomment this and fix values from Rancher UI's URL and docker command while editing cluster
  #  rancher2_cluster_id   = "c-g8r8v"
  #  rancher2_join_command = "sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.3 --server https://rancher.shared-infra.global.projectdrgn.com --token example123...321example"
}


rawmind0 commented on August 29, 2024

@nazarewk, do you need to follow the steps in this order? I don't think this is a good way to update a cluster. When you remove all the etcd and controlplane hosts, your cluster becomes unavailable, which is why it can't be modified. It's as if you deleted and recreated it. If you edit the cluster, the changes are rolled out to the nodes automatically; there is no need to redeploy them.

Anyway, if you need to redeploy hosts, could you do it before or after updating the cluster? Also, node_command should be the same as it was; it's already in your tfstate or in an output (a sketch follows the lists below)...

  1. Deploy an operational cluster using a Custom Nodes cluster
  2. Redeploy the Custom Nodes (stop/terminate the EC2 instances and start new ones)
  3. Make a modification to the cluster (e.g. turn Nginx Ingress on -> off)

or

  1. Deploy an operational cluster using a Custom Nodes cluster
  2. Make a modification to the cluster (e.g. turn Nginx Ingress on -> off)
  3. Redeploy the Custom Nodes (stop/terminate the EC2 instances and start new ones)
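
Regarding keeping node_command around, a minimal sketch (hypothetical, reusing the _rancher2_cluster_registration_token local from the snippets above) of exposing it as an output:

output "rancher2_join_command" {
  # Stored in tfstate on every successful apply, so the last known
  # command can be read back with `terraform output` at any time.
  value = "${lookup(local._rancher2_cluster_registration_token, "node_command", "")}"
}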


nazarewk commented on August 29, 2024

The thing is that the Rancher join command is part of user_data, meaning that when both the cluster and the nodes' configuration change simultaneously, the nodes are deleted before the cluster is updated.

Let's say the modules creating the cluster and the nodes weren't applied for a while and both changed; then we have a simultaneous update of both the cluster and the nodes' user_data. That seems like a more or less natural flow in Terraform.


rawmind0 commented on August 29, 2024

> The thing is that the Rancher join command is part of user_data, meaning that when both the cluster and the nodes' configuration change simultaneously, the nodes are deleted before the cluster is updated.

Why? If the cluster node command is part of user_data and updating the cluster updates the node command, then the nodes' configuration would change after the cluster is updated.

> Let's say the modules creating the cluster and the nodes weren't applied for a while and both changed; then we have a simultaneous update of both the cluster and the nodes' user_data. That seems like a more or less natural flow in Terraform.

The problem is not that both resources changed simultaneously; the problem is that the cluster is in the unavailable state when you try to update it, and it's not possible to update a cluster that is unavailable. If the cluster's state were active or provisioning, there would be no problem updating both resources simultaneously. Have you considered redeploying the VMs first and then updating the cluster?

From the point of view of the k8s cluster lifecycle, destroying/terminating all etcd and controlplane nodes at once doesn't seem like good practice; all cluster info will be lost. It's not much different from removing the cluster and creating a new one.


nazarewk commented on August 29, 2024

It works fine when I persist /var/lib/etcd on an external volume. Also, as far as I know, Terraform destroys everything that requires destroying before it moves on to updating things.
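
For reference, a minimal sketch of the external volume setup (names, size and device are hypothetical; user_data is assumed to mount the volume at /var/lib/etcd before the agent container starts):

resource "aws_ebs_volume" "etcd" {
  availability_zone = "${var.availability_zone}"
  size              = 20
}

resource "aws_volume_attachment" "etcd" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.etcd.id}"
  instance_id = "${aws_instance.etcd.id}"

  # Keep the volume when the instance is replaced, so the new
  # instance can re-attach it and the etcd data survives.
  skip_destroy = true
}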


rawmind0 commented on August 29, 2024

Added a special resource, rancher2_cluster_sync, to wait for the cluster to become active. Please reopen the issue if needed.
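
A minimal usage sketch (assuming only the documented cluster_id argument):

resource "rancher2_cluster_sync" "this" {
  # Waits until the referenced cluster reports active.
  cluster_id = "${rancher2_cluster.this.id}"
}

Resources that should wait for the cluster to become active can then reference rancher2_cluster_sync.this.cluster_id instead of rancher2_cluster.this.id.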

