Comments (9)

muvaf commented on May 30, 2024

Also, I think we need similar experiments for multiple kinds. As discussed in standup, we need to know more about the behavior of terrajet with resources whose creation takes much longer than a VirtualNetwork's, like databases or Kubernetes clusters.

ulucinar commented on May 30, 2024

Experiment Setup:

On a GKE cluster with the following specs:

Machine family: General purpose e2-medium (2 vCPU, 4 GB memory)
3 workers
Control plane version - 1.20.9-gke.701

I deployed a stripped-down version of provider-tf-azure (with 22 MRs including the VirtualNetwork resource) as we would like to focus on the scalability dimension of #-of-CRs. The #-of-CRDs dimension has already been investigated here. The VirtualNetwork MRs are provisioned via a simple shell script, with the MR name and infrastructure object name suffixes being passed from the command line. An example invocation of the generator script and an example generated MR manifest look like the following:

$ ./manage-virtualnetworks.sh create $(seq 31 40)
apiVersion: virtual.azure.tf.crossplane.io/v1alpha1
kind: VirtualNetwork
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-17T12:41:34Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-40
  resourceVersion: "622141"
  uid: d6b7a148-7b1d-432d-b665-8550647e5c8f
spec:
  deletionPolicy: Delete
  forProvider:
    addressSpace:
    - 10.0.0.0/16
    dnsServers:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3
    location: East US
    name: test-40
    resourceGroupName: alper
    tags:
      experiment: "2"
  providerConfigRef:
    name: example
status:
  atProvider:
    guid: 8b4aa7b0-8256-4ea2-b6b9-c6f86d6e2857
  conditions:
  - lastTransitionTime: "2021-09-17T13:00:48Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-09-17T12:47:09Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
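
The generator script itself was not shared above; the following is a minimal sketch of what such a script could look like (the template file name, its NAME_SUFFIX placeholder, and the kubectl-based apply/delete flow are my assumptions, not the actual manage-virtualnetworks.sh):

#!/usr/bin/env bash
# Hypothetical reconstruction of a generator script like manage-virtualnetworks.sh:
# renders one VirtualNetwork manifest per name suffix and applies or deletes it.
set -euo pipefail

action="$1"; shift
case "${action}" in
  create) verb="apply" ;;
  delete) verb="delete" ;;
  *) echo "usage: $0 <create|delete> <suffix>..." >&2; exit 1 ;;
esac

for suffix in "$@"; do
  # virtualnetwork.yaml is an assumed template manifest in which NAME_SUFFIX
  # stands in for the MR name / infrastructure object name suffix.
  sed "s/NAME_SUFFIX/${suffix}/g" virtualnetwork.yaml | kubectl "${verb}" -f -
done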

I have done a set of experiments with provider-tf-azure using the free Azure VirtualNetwork resource. After an initial batch of 10 VirtualNetworks was provisioned and successfully transitioned to the Ready state, another batch of 10 VirtualNetwork MRs was provisioned, and I observed that the MRs added in the latter batch were failing to transition to the Ready state, while the workqueue depth for the resource approached ~30:

queue-metrics

The VirtualNetwork controller has a default worker count of 1 in this setup, and as can be observed in the Workqueue depth for VirtualNetworks graph, the workqueue of the VirtualNetwork controller quickly fills up with these newcomer MRs. At each reconciliation, we run a Terraform pipeline using the Terraform CLI, and each pipeline potentially forks multiple Terraform Azurerm provider plugin processes to communicate with Azure. Please note that we are not running the Terraform provider plugins as shared gRPC servers in this setup. As can be observed in the Reconciliation Times for VirtualNetworks graph, the reconciliation times of the VirtualNetwork controller, measured as the 99th percentile over the last 5m (a common SLI used to define Kubernetes API latency SLOs), are above 20 s due to the aforementioned Terraform pipelines we run at each reconciliation loop. Also, the 5 min average wait times that reconciliation requests spend in the workqueue are increasing with the number of VirtualNetworks we have in the cluster. We have a single worker routine processing the slow Terraform pipelines, including the synchronous Observation pipeline.
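
For anyone reproducing these graphs: they are based on the standard client-go workqueue and controller-runtime metrics exposed by the provider. A quick way to inspect the raw series without a Prometheus stack is to scrape the metrics endpoint directly; the deployment name and the :8080 metrics port below are assumptions about this particular setup, but the metric names are the stock ones.

# Forward the provider's metrics port locally (deployment name and port are assumed).
kubectl -n crossplane-system port-forward deployment/provider-tf-azure 8080:8080 &

# Workqueue depth and time spent waiting in the queue, per controller.
curl -s localhost:8080/metrics | grep -E 'workqueue_(depth|queue_duration_seconds)'

# Per-reconcile latency histogram that the p99-over-5m panel is computed from.
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_time_seconds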

During these experiments, I also discovered a bug in the tfcli library, where we do not consume an already available TF pipeline result even if the pipeline has had a chance to produce it. After this bug was fixed with #67, and without increasing the maximum concurrent reconciler count for the VirtualNetwork controller, all 40 VirtualNetwork MRs provisioned in the cluster could successfully transition to the Ready state. I have taken rough measurements to give an idea of the time it took for the last batch of 10 resources to transition to the Ready state:

test-31   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-31   15m
test-32   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-32   16m
test-33   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-33   16m
test-34   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-34   16m
test-35   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-35   17m
test-36   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-36   17m
test-37   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-37   17m
test-38   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-38   18m
test-39   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-39   18m
test-40   True    True     /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40   19m

Please note that I have not repeated these experiments and these results do not represent averages; however, the p99 workqueue wait times (exceeding ~8 min) look consistent with these measurements.

One notable issue is that because we are only capable of using a single worker, we currently cannot utilize the CPU and memory resources available to terrajet-based providers:
cpu-mem-metrics
This can become an issue if a Terraform-based provider is being used to provision multiple objects of the same kind. crossplane-contrib/provider-jet-azure#4 adds an option to increase the maximum number of concurrent reconcilers for a resource without modifying the default value, so that we will be able to utilize CPU & memory if needed.
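
For completeness, once such an option exists, one way to feed it to a packaged provider is through a ControllerConfig referenced from the Provider object (the ControllerConfig mechanism is standard Crossplane; the flag name is taken from the later comment in this thread, and the package reference and the value of 3 are placeholders):

cat <<EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: jet-azure-concurrency
spec:
  args:
    - --concurrent-reconciles=3   # placeholder value; tune per workload
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-jet-azure
spec:
  package: crossplane/provider-jet-azure:v0.4.0   # placeholder package reference
  controllerConfigRef:
    name: jet-azure-concurrency
EOF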

I'm planning to perform some further experiments to see whether we can utilize CPU & memory resources more efficiently to decrease transition-to-Ready times and to check some other metrics.

muvaf commented on May 30, 2024

As it can be observed in the Reconciliation Times for VirtualNetworks graph, the reconciliation times of the VirtualNetwork controller measured as the 99-th percentile over the last 5m (a common SLI used to give Kubernetes API latency SLOs) is above 20 s, due to the aforementioned Terraform pipelines we are running at each reconciliation loop.

I wonder how this compares to provider-azure and terraform apply -refresh-only. I'd expect most of the 20 s to be taken up by the Terraform operation itself.

At each reconciliation, we are running Terraform pipelines using the Terraform CLI, and each pipeline is potentially forking multiple Terraform Azurerm provider plugins to have them communicate with Azure. Please note that we are not running the Terraform provider plugins as shared gRPC servers in this setup.

We were concerned about the multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case, since the memory & CPU usage doesn't seem to increase with the number of VirtualNetworks. Do you think we need the shared gRPC server?

ulucinar commented on May 30, 2024

We were concerned about the multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case since the memory & CPU usage doesn't seem to increase with number of VirtualNetworks. Do you think we need the shared gRPC server?

I'm planning to do another set of experiments with higher concurrency using the new -c option. I expect we will utilize both CPU & memory better and decrease time-to-completion, i.e. the time to move a CR to the Ready state. Another thing to look into is introducing updates: the experiments described above did not involve asynchronous Update operations, which do not block the workers and should thus yield shorter queue wait times and higher resource utilization.

I suspect that, because of the many processes forked by the Terraform CLI, the Terraform-based providers might become CPU-bound as we increase concurrency. That's something to be explored. If this turns out to be the case, and if we are not satisfied with our level of scaling, we might want to give shared gRPC servers a try.

Thank you for taking a look into this @muvaf, very much appreciated!

muvaf commented on May 30, 2024

@ulucinar FYI, @negz let me know that we have a global rate limiter in most providers that limits the number of reconciles to 1 per second. I have a hunch that this rate limiter affects the queue more than the concurrency number does.

Broadly though, I think this issue was opened to see whether we can work with 100+ instances with varying kinds without crashing or exceeding the deadline to the point where the provider becomes useless. So answering that question should be enough to close the issue.

muvaf commented on May 30, 2024

@ulucinar it'd be great if you could share your script and provide a baseline performance statement so that we can reproduce the tests, but that's not in the original scope of this issue. Feel free to close it once we have the "100+ instances with varying kinds without crashing or exceeding the deadline" result, or some other performance statement that we can reuse repeatedly in the future to measure the effects of big changes.

ulucinar commented on May 30, 2024

Here are the results from another set of experiments involving our target of 100 MRs in this issue:

Experiment Setup:

On a GKE cluster with the following specs:

Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
3 workers
Control plane version - 1.20.9-gke.1001

Previous experiments have shown that Terraform-based providers are CPU-bound, and hence in this setup we are using nodes with higher CPU capacity in order to be able to scale up to 100 MRs.

I deployed a stripped-down version of provider-tf-azure (with 32 MRs including the VirtualNetwork and Lb resources, Docker image: ulucinar/provider-tf-azure-controller:98a23918b69a778e4910f81483b7767c56cf41e5) containing the --concurrent-reconciles command-line option, which allows the provider to better utilize node resources. After experimenting with several values for the maximum concurrent reconciles, I have chosen a value of 3 for this cluster. A total of 45 VirtualNetwork MRs and a total of 55 Lb MRs are provisioned simultaneously, with the MR name and infrastructure object name suffixes being passed from the command line. An example invocation of the generator script and an example generated MR manifest look like the following:

$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 55)
apiVersion: lb.azure.tf.crossplane.io/v1alpha1
kind: Lb
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/loadBalancers/test-1
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-27T13:38:48Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-1
  resourceVersion: "5862396"
  uid: 2b97bbb4-791b-4449-8326-ef1ee5f13eb7
spec:
  deletionPolicy: Delete
  forProvider:
    location: East US
    name: test-1
    resourceGroupName: alper
  providerConfigRef:
    name: example
status:
  atProvider: {}
  conditions:
  - lastTransitionTime: "2021-09-27T13:43:56Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2021-09-27T13:38:49Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced

As also observed in our previous experiments, CPU utilization shows a sharp increase in parallel with the increasing #-of-CRs in the cluster during the provisioning phase, and again during the de-provisioning phase. After all 100 MRs transitioned to the Ready state and the system stabilized, all 100 MRs were deleted simultaneously at 14:07:10 UTC. In the graphs below, the first CPU utilization peak at around 13:44 UTC is caused by the provisioning phase, and the second peak at around 14:13 UTC by the de-provisioning phase.

azure-object-count

node-provider

provider-cpu-usage

nodes-cpu-mem-utilizations

First, CPU utilization climbs to just over 92% as we have many concurrent asynchronous Create calls; however, the limits we have imposed on concurrency prevent saturation. We also observe increasing workqueue wait times for both kinds of resources:
avg-queue-wait-times

The following graph shows the time-to-readiness periods, i.e. the time it takes for an MR to become Ready (acquire the Ready status condition with status == True), measured from the time it is created:

ttr

Please note that these are not averages. The maximum time-to-readiness interval observed for an MR in these experiments was 670 s, and the minimum was 221 s.
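
For context, these intervals can be recomputed after the fact from the MRs themselves, since each one carries its creation timestamp and the Ready condition's last transition time. A sketch of one way to do that (assuming jq is available, that the resource plural resolves as written, and that every listed MR has become Ready exactly once):

# Time-to-readiness per VirtualNetwork MR, in seconds.
kubectl get virtualnetworks.virtual.azure.tf.crossplane.io -o json | jq -r '
  .items[]
  | . as $mr
  | ([$mr.status.conditions[] | select(.type == "Ready")][0].lastTransitionTime | fromdate) as $ready
  | ($mr.metadata.creationTimestamp | fromdate) as $created
  | "\($mr.metadata.name) \($ready - $created) s"'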

Another interesting observation is that when all the MRs are deleted at 14:07:10 UTC (i.e. they acquire a non-zero metadata.deletionTimestamp), CPU utilization starts to increase, reaching a maximum of ~94% at 14:13:30, and it takes ~1020 s (~17 min) to remove all 100 MRs from the cluster. Although the corresponding external resource is deleted via the Cloud API, it takes provider-tf-azure longer to dequeue a request from the workqueue, make an observation for the deleted resource, and remove the finalizer.

ulucinar commented on May 30, 2024

Results from another set of experiments with the native provider provider-azure:

Experiment Setup:

On a GKE cluster with the following specs:

Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
3 workers
Control plane version - 1.20.9-gke.1001

I deployed provider-azure v0.17.0. Please note that the maximum concurrent reconciler count is 1 for this version. The same script used in the previous experiments is employed with different template manifests to provision a total of 45 VirtualNetworks and a total of 205 Subnets. We cannot efficiently utilize node resources even with 250 MRs:

native-various-metrics

We will clearly benefit from increased concurrency here.

The time-to-readiness intervals are distributed as follows for these 250 MRs:

native-ttr

When all of the 250 MRs are deleted simultaneously at ~23:31:46 UTC, we observe a surge in workqueue lengths but only a slight increase in CPU utilization, and by ~23:34:30 UTC all MRs have been removed (in ~3 min). Please note that these deletion measurements are skewed, because all of the 205 deleted subnets belonged to the same virtual network. Most probably, when that virtual network was deleted, all observations for the subnets returned 404. Nevertheless, provider-azure still needs to observe each of the subnets before it removes the associated finalizers from the Subnet MRs:

native-deletion-metrics

We need to incorporate the recently proposed --max-reconcile-rate command-line option for provider-azure (by @negz in crossplane/crossplane#2595) to make a fair comparison with provider-tf-azure, which already benefits from increased concurrency & better utilization of the CPU resources as described in the above experiments.
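
As a side note on observing that deletion behavior: while a bulk deletion drains, the MRs that still carry a deletionTimestamp are exactly those waiting for a successful observation and finalizer removal, so their count can be polled with something like the following (the resource plurals are my assumption, based on provider-azure's network API group):

# Number of MRs still pending deletion, i.e. with a non-null deletionTimestamp.
kubectl get subnets.network.azure.crossplane.io,virtualnetworks.network.azure.crossplane.io -o json \
  | jq '[.items[] | select(.metadata.deletionTimestamp != null)] | length'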

ulucinar commented on May 30, 2024

We will continue scale testing of Terrajet-based providers with the latest improvements.
