Comments (9)
Also, I think we need similar experiments for multiple kinds. As discussed in the standup, we need to know more about the behavior of terrajet with resources whose creation takes much longer than VirtualNetwork, like databases or k8s clusters.
from terrajet.
Experiment Setup:
On a GKE cluster with the following specs:
Machine family: General purpose e2-medium (2 vCPU, 4 GB memory)
3 workers
Control plane version - 1.20.9-gke.701
I deployed a stripped-down version of provider-tf-azure (with 22 MRs including the VirtualNetwork resource), as we would like to focus on the #-of-CRs scalability dimension; the #-of-CRDs dimension has already been investigated here. The VirtualNetwork MRs are provisioned via a simple shell script, with the MR name and infrastructure object name suffixes passed from the command line. An example invocation of the generator script and an example generated MR manifest look like the following:
$ ./manage-virtualnetworks.sh create $(seq 31 40)
apiVersion: virtual.azure.tf.crossplane.io/v1alpha1
kind: VirtualNetwork
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-17T12:41:34Z"
  finalizers:
    - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-40
  resourceVersion: "622141"
  uid: d6b7a148-7b1d-432d-b665-8550647e5c8f
spec:
  deletionPolicy: Delete
  forProvider:
    addressSpace:
      - 10.0.0.0/16
    dnsServers:
      - 10.0.0.1
      - 10.0.0.2
      - 10.0.0.3
    location: East US
    name: test-40
    resourceGroupName: alper
    tags:
      experiment: "2"
  providerConfigRef:
    name: example
status:
  atProvider:
    guid: 8b4aa7b0-8256-4ea2-b6b9-c6f86d6e2857
  conditions:
    - lastTransitionTime: "2021-09-17T13:00:48Z"
      reason: Available
      status: "True"
      type: Ready
    - lastTransitionTime: "2021-09-17T12:47:09Z"
      reason: ReconcileSuccess
      status: "True"
      type: Synced
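The generator script itself was not shared in the thread; the sketch below is a hypothetical reconstruction of what such a script might look like. The manifest fields mirror the example above, but the script structure, the `render_vnet` function name, and the piping into kubectl are assumptions, not the actual implementation:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a VirtualNetwork MR generator (the real
# manage-virtualnetworks.sh was not shared in this thread).

# render_vnet SUFFIX: print a VirtualNetwork manifest named test-SUFFIX.
render_vnet() {
  suffix="$1"
  cat <<EOF
---
apiVersion: virtual.azure.tf.crossplane.io/v1alpha1
kind: VirtualNetwork
metadata:
  name: test-${suffix}
spec:
  forProvider:
    addressSpace:
      - 10.0.0.0/16
    location: East US
    name: test-${suffix}
    resourceGroupName: alper
  providerConfigRef:
    name: example
EOF
}

# Emit one manifest per suffix passed on the command line, e.g.:
#   ./manage-virtualnetworks.sh $(seq 31 40) | kubectl apply -f -
for s in "$@"; do
  render_vnet "$s"
done
```

Keeping the script a pure manifest emitter makes the create/delete distinction a matter of which kubectl verb the output is piped into.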
I have done a set of experiments with provider-tf-azure using the free Azure VirtualNetwork resource. After an initial batch of 10 VirtualNetworks were provisioned and successfully transitioned to the Ready state, another batch of 10 more VirtualNetwork MRs was provisioned, and I observed that the MRs added in the latter batch were failing to transition to the Ready state, while the workqueue depth for the resource approached ~30:
The VirtualNetwork controller has a default worker count of 1 in this setup, and as can be observed in the "Workqueue depth for VirtualNetworks" graph, the workqueue of the VirtualNetwork controller quickly responds to these newly added MRs. At each reconciliation, we run Terraform pipelines using the Terraform CLI, and each pipeline potentially forks multiple Terraform Azurerm provider plugins to communicate with Azure. Please note that we are not running the Terraform provider plugins as shared gRPC servers in this setup. As can be observed in the "Reconciliation Times for VirtualNetworks" graph, the reconciliation time of the VirtualNetwork controller, measured as the 99th percentile over the last 5m (a common SLI used to give Kubernetes API latency SLOs), is above 20 s due to the aforementioned Terraform pipelines we run at each reconciliation loop. Also, the 5 min average wait time that reconciliation requests spend in the workqueue increases with the number of VirtualNetworks we have in the cluster. We have a single worker routine processing the slow Terraform pipelines, including the synchronous Observation pipeline.
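For reference, figures like these can be derived from the standard controller-runtime and client-go workqueue metrics. The queries below are a sketch; the exact `controller` and `name` label values are assumptions that depend on how the controller registers itself:

```promql
# p99 reconcile duration over the last 5m for the VirtualNetwork controller
histogram_quantile(0.99,
  sum by (le) (rate(controller_runtime_reconcile_time_seconds_bucket{controller="virtualnetwork"}[5m])))

# Current workqueue depth for the same controller
workqueue_depth{name="virtualnetwork"}

# p99 time a reconcile request waits in the workqueue before being picked up
histogram_quantile(0.99,
  sum by (le) (rate(workqueue_queue_duration_seconds_bucket{name="virtualnetwork"}[5m])))
```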
During these experiments, I also discovered a bug in the tfcli library, where we do not consume an already available Terraform pipeline result even when the pipeline has had a chance to produce it. After this bug was fixed with #67, and without increasing the maximum concurrent reconciler count for the VirtualNetwork controller, all of the 40 VirtualNetwork MRs provisioned in the cluster successfully transitioned to the Ready state. I have taken rough measurements to give an idea of the time it took for the last batch of 10 resources to transition to the Ready state:
NAME      READY   SYNCED   EXTERNAL-NAME                                                                                                                AGE
test-31 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-31 15m
test-32 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-32 16m
test-33 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-33 16m
test-34 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-34 16m
test-35 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-35 17m
test-36 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-36 17m
test-37 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-37 17m
test-38 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-38 18m
test-39 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-39 18m
test-40 True True /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/virtualNetworks/test-40 19m
Please note that I have not repeated these experiments and these results do not represent averages; however, the p99 workqueue wait times (exceeding ~8 min) look consistent with these measurements.
One notable issue is that because we are only capable of using a single worker, we currently cannot utilize the CPU and memory resources available to terrajet-based providers:
This can become an issue if a Terraform-based provider is used to provision multiple objects of the same kind. crossplane-contrib/provider-jet-azure#4 adds an option to increase the maximum concurrent reconcilers for a resource without modifying the default value, so that we will be able to utilize CPU & memory if needed.
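If such an option is exposed as a command-line flag, it could be set without rebuilding the provider, e.g. via Crossplane's ControllerConfig. This is a sketch: the flag name follows the option discussed in this thread, while the package reference and object names are placeholders:

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: high-concurrency
spec:
  args:
    - --concurrent-reconciles=3   # assumed flag name, per provider-jet-azure#4
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-tf-azure
spec:
  package: example.org/provider-tf-azure:latest   # placeholder package reference
  controllerConfigRef:
    name: high-concurrency
```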
I'm planning to perform some further experiments to see whether we can utilize CPU & memory resources more efficiently to decrease transition-to-Ready times and to check some other metrics.
> As it can be observed in the Reconciliation Times for VirtualNetworks graph, the reconciliation times of the VirtualNetwork controller measured as the 99th percentile over the last 5m (a common SLI used to give Kubernetes API latency SLOs) is above 20 s, due to the aforementioned Terraform pipelines we are running at each reconciliation loop.
I wonder how this compares to provider-azure and terraform apply -refresh-only. I'd expect most of the 20 s to be taken up by the Terraform operation itself.
> At each reconciliation, we are running Terraform pipelines using the Terraform CLI, and each pipeline is potentially forking multiple Terraform Azurerm provider plugins to have them communicate with Azure. Please note that we are not running the Terraform provider plugins as shared gRPC servers in this setup.

We were concerned about the multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case, since memory & CPU usage doesn't seem to increase with the number of VirtualNetworks. Do you think we need the shared gRPC server?
> We were concerned about the multiple gRPC servers eating up resources. From the graphs you shared, it doesn't look like that's the case since the memory & CPU usage doesn't seem to increase with the number of VirtualNetworks. Do you think we need the shared gRPC server?
I'm planning to do another set of experiments with higher concurrency using the new -c option. I expect we will utilize both CPU & memory better and decrease time-to-completion, i.e., the time to move a CR to the Ready state. Another thing to look into is introducing updates (the experiments described above did not involve them). Asynchronous Update operations will not block the workers, yielding shorter queue waiting times and thus utilizing more resources.
I suspect that because of the many process forks done by the Terraform CLI, the Terraform-based providers might become CPU-bound as we increase concurrency. That's something to be explored. If this turns out to be the case, and if we are not satisfied with our level of scaling, we might want to give shared gRPC servers a try.
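For context, a shared-server setup would rely on Terraform's provider reattach mechanism: the CLI can be pointed at an already-running provider plugin instead of forking a new one per pipeline. A sketch, where the provider key follows Terraform's registry address convention but the PID and socket path are placeholders for illustration:

```shell
# Hypothetical: point the Terraform CLI at a long-running azurerm plugin
# server via the reattach mechanism, instead of forking one per pipeline.
# Pid and Addr values below are placeholders, not from the experiments.
export TF_REATTACH_PROVIDERS='{
  "registry.terraform.io/hashicorp/azurerm": {
    "Protocol": "grpc",
    "Pid": 12345,
    "Test": true,
    "Addr": {"Network": "unix", "String": "/tmp/azurerm-plugin.sock"}
  }
}'
```

With this environment variable set, subsequent terraform invocations in the same shell would attach to the running plugin server rather than spawning their own.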
Thank you for taking a look into this @muvaf, very much appreciated!
@ulucinar FYI, @negz let me know that we have a global rate limiter in most providers that limits the number of reconciles to 1 per second. I have a hunch that this rate limiter affects the queue more than the concurrency number.
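A quick back-of-the-envelope check supports that hunch: once the global limiter is the bottleneck, adding workers cannot shorten a full sweep over the MRs. The figures below are illustrative, not measured:

```shell
# Illustrative only: with a global rate limit of 1 reconcile/s,
# a full sweep over N MRs is admitted in at least N seconds,
# regardless of how many concurrent workers are configured.
mrs=100
rate_per_sec=1
echo "min seconds per full sweep: $((mrs / rate_per_sec))"
```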
Broadly though, I think this issue was opened to see whether we can work with 100+ instances of varying kinds without crashing or exceeding the deadline to the point where the provider is useless. So the answer to that question should be enough to close the issue.
@ulucinar It'd be great if you could share your script and provide a baseline performance statement so that we can reproduce the tests, but that's not in the original scope of this issue. So feel free to close this issue after the "100+ instances with varying kinds without crashing or exceeding the deadline" result, or some other performance statement that we can use repeatedly in the future to measure the effects of big changes.
Here are the results from another set of experiments involving our target of 100 MRs in this issue:
Experiment Setup:
On a GKE cluster with the following specs:
Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
3 workers
Control plane version - 1.20.9-gke.1001
Previous experiments have shown that Terraform-based providers are CPU-bound, and hence in this setup we are using nodes with higher CPU capacity in order to be able to scale up to 100 MRs.
I deployed a stripped-down version of provider-tf-azure (with 32 MRs including the VirtualNetwork and Lb resources, Docker image: ulucinar/provider-tf-azure-controller:98a23918b69a778e4910f81483b7767c56cf41e5) containing the --concurrent-reconciles command-line option, which allows the provider to better utilize node resources. After experimenting with several maximum concurrent reconciles settings, I chose a value of 3 for this cluster. A total of 45 VirtualNetwork MRs and 55 Lb MRs were provisioned simultaneously, with the MR name and infrastructure object name suffixes passed from the command line. An example invocation of the generator script and an example generated MR manifest look like the following:
$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 55)
apiVersion: lb.azure.tf.crossplane.io/v1alpha1
kind: Lb
metadata:
  annotations:
    crossplane.io/external-name: /subscriptions/038f2b7c-3265-43b8-8624-c9ad5da610a8/resourceGroups/alper/providers/Microsoft.Network/loadBalancers/test-1
    tf.crossplane.io/state: ...
  creationTimestamp: "2021-09-27T13:38:48Z"
  finalizers:
    - finalizer.managedresource.crossplane.io
  generation: 1
  name: test-1
  resourceVersion: "5862396"
  uid: 2b97bbb4-791b-4449-8326-ef1ee5f13eb7
spec:
  deletionPolicy: Delete
  forProvider:
    location: East US
    name: test-1
    resourceGroupName: alper
  providerConfigRef:
    name: example
status:
  atProvider: {}
  conditions:
    - lastTransitionTime: "2021-09-27T13:43:56Z"
      reason: Available
      status: "True"
      type: Ready
    - lastTransitionTime: "2021-09-27T13:38:49Z"
      reason: ReconcileSuccess
      status: "True"
      type: Synced
As also observed in our previous experiments, CPU utilization shows a sharp increase in parallel with the growing #-of-CRs in the cluster during the provisioning phase, and again during the de-provisioning phase. After all of the 100 MRs transitioned to the Ready state and the system stabilized, all 100 MRs were deleted simultaneously at 14:07:10 UTC. In the below graphs, the first CPU utilization peak at around 13:44 UTC is caused by the provisioning phase, and the second peak at around 14:13 UTC by the de-provisioning phase.
CPU utilization climbs to just over 92% as we have many asynchronous concurrent Create calls, but the limits we have imposed on concurrency prevent saturation. However, we also observe increasing workqueue wait times for both kinds of resources:
The following graph shows the time-to-readiness periods, i.e., the time it takes for an MR to become Ready (acquire the Ready status condition with status == True), measured from the time it is created:
Please note that these are not averages. The maximum time-to-readiness interval observed for an MR in these experiments was 670 s, and the minimum was 221 s.
Another interesting observation is that when all the MRs were deleted at 14:07:10 UTC (i.e., acquired a non-zero metadata.deletionTimestamp), CPU utilization started to increase, reaching a maximum of ~94% at 14:13:30, and it took ~1020 s (~17 min) to remove all of the 100 MRs from the cluster. Although the corresponding external resource is deleted via the Cloud API, it takes provider-tf-azure a longer time to dequeue a request from the workqueue, make an observation for the deleted resource, and remove the finalizer.
Results from another set of experiments with the native provider provider-azure:
Experiment Setup:
On a GKE cluster with the following specs:
Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
3 workers
Control plane version - 1.20.9-gke.1001
I deployed provider-azure v0.17.0. Please note that the maximum number of concurrent reconcilers is 1 for this version. The same script used in the previous experiments is employed, with different template manifests, to provision a total of 45 VirtualNetworks and 205 Subnets. We cannot efficiently utilize node resources even with 250 MRs:
We will clearly benefit from increased concurrency here.
The time-to-readiness intervals are distributed as follows for these 250 MRs:
When all of the 250 MRs were deleted simultaneously at ~23:31:46 UTC, we observed a surge in workqueue lengths but only a slight increase in CPU utilization, and by ~23:34:30 UTC all MRs had been removed (in ~3 min). Please note that these deletion measurements are skewed, because all of the 205 deleted Subnets belonged to the same virtual network; when that virtual network was deleted, observations for the subnets probably returned 404. Nevertheless, provider-azure needs to observe all of the subnets before it removes the associated finalizers from the Subnet MRs:
We need to incorporate the recently proposed --max-reconcile-rate command-line option for provider-azure (by @negz in crossplane/crossplane#2595) to make a fair comparison with provider-tf-azure, which already benefits from increased concurrency & better utilization of CPU resources, as described in the above experiments.
We will continue scale testing of Terrajet-based providers with the latest improvements.