
Comments (9)

rawmind0 commented on August 29, 2024

@nazarewk, the pre-release version of the provider waited for cluster state in the same way this version does; nothing has changed there.

When a custom cluster is created, the expected state is provisioning (https://github.com/terraform-providers/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L52), and on update the expected state can be "active", "provisioning" or "pending" (https://github.com/terraform-providers/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L180).

Maybe I'm not fully understanding your issue. When you say join command, are you referring to the cluster node command that is executed on nodes? The cluster node command should be available as soon as you create the cluster, while it is still in provisioning state.

In order to clarify and reproduce the issue, could you please provide the tf file you use to create and update the cluster?


nazarewk commented on August 29, 2024

I've used https://github.com/terraform-providers/terraform-provider-rancher2/blob/d96820fa09b19bb6ff56498ab97af444fe89eb7b/rancher2/resource_rancher2_cluster.go before (this exact commit) and it worked fine.

On Monday, after updating the provider to 1.0.0 and re-creating the aws_instances responsible for etcd and controlplane all at once, rancher2_cluster was stuck in the waiting state until I skipped the resource and hardcoded the join command copied from the Rancher UI (it is used in user_data to join the cluster later). Only once the nodes were operational again did the cluster leave the waiting state.
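
To illustrate, a minimal sketch of what I mean (the resource and variable names are hypothetical; local.rancher2_join_command is the "node_command" lookup shown in the snippets below, and the role flags are appended as the Rancher UI would):

resource "aws_instance" "controlplane" {
  ami           = "${var.ami_id}"
  instance_type = "${var.instance_type}"

  # Interpolating the join command into user_data is what makes the
  # instances depend on rancher2_cluster.
  user_data = <<-EOF
    #!/bin/bash
    ${local.rancher2_join_command} --etcd --controlplane
  EOF
}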


rawmind0 commented on August 29, 2024

@nazarewk, in order to reproduce your issue, could you please provide the tf file you use to create and update the cluster? Also, could you please explain the steps you are taking?

Are you using node_template and node_pool, or deploying the AWS instances some other way? When you mention the join command, are you referring to the cluster node command that is executed on nodes?


nazarewk commented on August 29, 2024

My Terraform configuration is upwards of 4k LoC, so I can't really share it, but over the last week I have nailed down a scenario that is impossible to realize in the current form:

  1. Deploy an operational cluster using a Custom Nodes cluster (no node pools or anything like that, just plain old EC2 instances running the docker command from rancher2_cluster.<>.cluster_registration_token["node_command"]),
  2. Turn off the Custom Nodes (stop/terminate the EC2 instances),
  3. Make a modification to the cluster (e.g. turn Nginx Ingress on -> off).

The result is:

  1. rancher2_cluster is stuck in the pending/provisioning state (whichever it is, the resource's Update never finishes) indefinitely because etcd/controlplane is down,
  2. rancher2_cluster.<>.cluster_registration_token["node_command"] waits indefinitely for rancher2_cluster to finish updating,
  3. the aws_instances cannot be brought back up, because the previous command will not finish until the controlplane is up.

It is up to you to decide whether you want to support a flow like this or not.

My workaround is to edit the module's source code to remove the rancher2_cluster dependencies:

locals {
  # Append an empty map so index 0 always exists, even when the
  # cluster resource has count = 0.
  _rancher2_cluster_registration_tokens = "${flatten(concat(flatten(rancher2_cluster.this.*.cluster_registration_token), list(map())))}"
  _rancher2_cluster_registration_token  = "${local._rancher2_cluster_registration_tokens[0]}"

  #  rancher2_cluster_id   = "${join("#", rancher2_cluster.this.*.id)}"
  #  rancher2_join_command = "${lookup(local._rancher2_cluster_registration_token, "node_command", "")}"

  # Uncomment this and fix values from Rancher UI's URL and docker command while editing cluster
  rancher2_cluster_id   = "c-g8r8v"
  rancher2_join_command = "sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.3 --server https://rancher.shared-infra.global.projectdrgn.com --token example123...321example"
}

instead of:

locals {
  _rancher2_cluster_registration_tokens = "${flatten(concat(flatten(rancher2_cluster.this.*.cluster_registration_token), list(map())))}"
  _rancher2_cluster_registration_token  = "${local._rancher2_cluster_registration_tokens[0]}"

  rancher2_cluster_id   = "${join("#", rancher2_cluster.this.*.id)}"
  rancher2_join_command = "${lookup(local._rancher2_cluster_registration_token, "node_command", "")}"

  # Uncomment this and fix values from Rancher UI's URL and docker command while editing cluster
  #  rancher2_cluster_id   = "c-g8r8v"
  #  rancher2_join_command = "sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.2.3 --server https://rancher.shared-infra.global.projectdrgn.com --token example123...321example"
}


rawmind0 commented on August 29, 2024

@nazarewk, do you need to follow the steps in this order? I don't think this is a good way to update a cluster. When you remove all the etcd and controlplane hosts, your cluster becomes unavailable, which is why it can't be modified. It's as if you deleted and recreated it. If you edit the cluster, the changes are rolled out to the nodes automatically; there is no need to redeploy them.

Anyway, if you need to redeploy hosts, could you do it before or after updating the cluster? Also, node_command should be the same as it was; it's already in your tfstate or in an output (a sketch follows the lists below)...

  1. Deploy an operational cluster using a Custom Nodes cluster
  2. Redeploy the Custom Nodes (stop/terminate the EC2 instances and start new ones)
  3. Make a modification to the cluster (e.g. turn Nginx Ingress on -> off)

or

  1. Deploy an operational cluster using a Custom Nodes cluster
  2. Make a modification to the cluster (e.g. turn Nginx Ingress on -> off)
  3. Redeploy the Custom Nodes (stop/terminate the EC2 instances and start new ones)
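
Regarding keeping node_command around, a minimal sketch (hypothetical, reusing the _rancher2_cluster_registration_token local from the snippets above) of exposing it as an output:

output "rancher2_join_command" {
  # Stored in tfstate on every successful apply, so the last known
  # command can be read back with `terraform output` at any time.
  value = "${lookup(local._rancher2_cluster_registration_token, "node_command", "")}"
}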


nazarewk commented on August 29, 2024

The thing is that the Rancher join command is part of user_data, meaning that when both the cluster and the nodes' configuration change simultaneously, the nodes are deleted before the cluster is updated.

Let's say the modules creating the cluster and the nodes weren't applied for a while and both changed; then we have a simultaneous update of both the cluster and the nodes' user_data. That seems like a more or less natural flow in Terraform.


rawmind0 commented on August 29, 2024

> The thing is that the Rancher join command is part of user_data, meaning that when both the cluster and the nodes' configuration change simultaneously, the nodes are deleted before the cluster is updated.

Why? If the cluster node command is part of user_data and updating the cluster updates the node command, then the nodes' configuration would change after the cluster is updated.

> Let's say the modules creating the cluster and the nodes weren't applied for a while and both changed; then we have a simultaneous update of both the cluster and the nodes' user_data. That seems like a more or less natural flow in Terraform.

The problem is not that both resources changed simultaneously; the problem is that the cluster is in the unavailable state when you try to update it, and it's not possible to update a cluster that is unavailable. If the cluster's state were active or provisioning, there would be no problem updating both resources simultaneously. Have you considered redeploying the VMs first and then updating the cluster?

From the point of view of the k8s cluster lifecycle, destroying/terminating all etcd and controlplane nodes at once doesn't seem like good practice; all cluster info will be lost. It's not much different from removing the cluster and creating a new one.


nazarewk commented on August 29, 2024

It works fine when I persist /var/lib/etcd on an external volume. Also, as far as I know, Terraform destroys everything that requires destroying before it moves on to updating things.
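
For reference, a minimal sketch of the external volume setup (names, size and device are hypothetical; user_data is assumed to mount the volume at /var/lib/etcd before the agent container starts):

resource "aws_ebs_volume" "etcd" {
  availability_zone = "${var.availability_zone}"
  size              = 20
}

resource "aws_volume_attachment" "etcd" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.etcd.id}"
  instance_id = "${aws_instance.etcd.id}"

  # Keep the volume when the instance is replaced, so the new
  # instance can re-attach it and the etcd data survives.
  skip_destroy = true
}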


rawmind0 commented on August 29, 2024

Added a special resource, rancher2_cluster_sync, to wait for the cluster to become active. Please reopen the issue if needed.
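
A minimal usage sketch (assuming only the documented cluster_id argument):

resource "rancher2_cluster_sync" "this" {
  # Waits until the referenced cluster reports active.
  cluster_id = "${rancher2_cluster.this.id}"
}

Resources that should wait for the cluster to become active can then reference rancher2_cluster_sync.this.cluster_id instead of rancher2_cluster.this.id.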

