azure / agentbaker

Agent Baker aims to provide a centralized, portable k8s agent node provisioning library, as well as rich support for different OS images with optimized k8s binaries.

License: MIT License

Makefile 0.55% Shell 25.46% PowerShell 26.99% Go 44.60% Roff 0.39% CUE 0.19% C# 0.08% Python 0.95% Batchfile 0.08% JavaScript 0.03% HCL 0.01% Groovy 0.68%
k8s agent vhd os cloud-init

agentbaker's Introduction

Agentbaker

Agentbaker is a collection of components used to provision Kubernetes nodes in Azure.

Agentbaker has a few pieces:

  • Packer templates and scripts to build VM images.
  • A set of templates and a public API to render those templates given input config.
  • An API to retrieve the latest VM image version for new clusters.

The primary consumer of Agentbaker is Azure Kubernetes Service (AKS).

AKS uses Agentbaker to provision Linux and Windows Kubernetes nodes.

Contributing

Developing agentbaker has a few prerequisites:

  • Go (at least version 1.19)
  • Make

Run make -C hack/tools install to install all development tools.

If you change code or artifacts used to generate custom data or custom script extension payloads, you should run make.

This re-runs code to embed static files in Go code, which is what will actually be used at runtime.

This additionally runs unit tests (equivalent of go test ./...) and regenerates snapshot testdata.

Style

We use golangci-lint to enforce style.

Run make -C hack/tools install to install the linter.

Run ./hack/tools/bin/golangci-lint run to run the linter.

We currently have many failures we hope to eliminate.

We have a job that runs golangci-lint on pull requests.

This job uses the linter's "no-new-issues" feature.

As long as PRs don't introduce net new issues, they should pass.

We also have a linting job to enforce commit message style.

We adhere to conventional commits.

Prefer pull requests with single commits.

To clean up in-progress commits, you can use git rebase -i to fixup commits.

See the git documentation for more details.

Testing

Most code may be tested with vanilla Go unit tests.

Snapshot

We also have snapshot data tests, which store the output of key APIs as files on disk.

We can manually verify the snapshot content looks correct.

We now have unit tests which can directly validate the content without leaving generated files on disk.

See ./pkg/agent/baker_test.go for examples (search for dynamic-config-dir to see a validation sample).

E2E

Check out the e2e directory.

Contributor License Agreement (CLA)

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

CGManifest File

A cgmanifest file is a JSON file used to register components manually when the component type is not supported by governance. The file name is "cgmanifest.json"; you can have as many as you need, and they can be anywhere in your repository.

File path: ./vhdbuilder/cgmanifest.json

Reference: https://docs.opensource.microsoft.com/tools/cg/cgmanifest.html

Package:

agentbaker's People

Contributors

abelhu, adilliadil, alexeldeib, alisonb319, andyliuliming, andyzhangx, anujmaheshwari1, bowang-666, cameronmeissner, chengliangli0918, dependabot[bot], devinwong, ganeshkumarashok, hbeberman, hdya, jaer-tsun, junjiezhang1997, mattstam, nilo19, paulgmiller, rbtr, seandougherty, shiqiantao, tobiasb-ms, tyler-lloyd, utheman, wanqingfu, wedaly, xuto2, yizhang4321

agentbaker's Issues

Windows 2022_containerd_gen2 PR check failed for missing image but VHD build succeeded

What happened:

VHD build succeeded but Windows 2022_containerd_gen2 PR check failed for missing image

What you expected to happen:

image should exist

How to reproduce it:

https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=59900982&view=logs&j=05b0285b-7424-5da1-4fe4-29b48f7b1d3a&t=e2ed4504-bf94-5207-4338-b6117c41217b

ERROR: {"error":{"code":"InvalidTemplateDeployment","message":"The template deployment 'vm_deploy_lsPbsw0MGltObZiGrK6beKtGHaF2R7pi' is not valid according to the validation procedure. The tracking id is 'c5afa5c6-eda2-49c2-8251-24950ebf7fe6'. See inner errors for details.","details":[{"code":"InvalidParameter","target":"imageReference","message":"In region eastus, the following list of Shared Image Gallery images referenced from the deployment template were not found: /subscriptions/a15c116e-99e3-4c59-aebc-8f864929b4a0/resourceGroups/akswinvhdbuilderrg/providers/Microsoft.Compute/galleries/WS2022Gen2Gallery220902/images/windows-2022-containerd/versions/1.0.0. Please check that at least one image has been replicated to eastus or that the ‘exclude from latest’ flag is set to false. Please refer to https://docs.microsoft.com/en-us/azure/virtual-machines/windows/shared-image-galleries for instructions on creating and deleting such images."}]}}

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

WS2022 gen2 build is broken by the last change

What happened:
Pipelines - Run 20221014.2_merge_61985143 logs (visualstudio.com)

./vhdbuilder/packer/test/run-test.sh: line 78: ENABLE_TRUSTED_LAUNCH: unbound variable

It is broken by 6a6e0a9#diff-a054700877653ec1c551d474ced34005de704973e7315af31eed299df486c033
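
The "unbound variable" message is what bash prints when a script running with set -u (nounset) expands a variable that was never set. A minimal sketch of one way to make such a reference safe is a default expansion. The function name and the "false" default below are assumptions for illustration, not taken from run-test.sh:

```shell
#!/usr/bin/env bash
set -u  # referencing an unset variable is now an error, as in the failing run

# Prints "true" or "false". "${VAR:-default}" is safe under `set -u` even
# when VAR was never set. The "false" default is an assumption of this sketch.
trusted_launch_enabled() {
  case "${ENABLE_TRUSTED_LAUNCH:-false}" in
    true|True|TRUE) echo "true" ;;
    *) echo "false" ;;
  esac
}

echo "trusted launch enabled: $(trusted_launch_enabled)"
```

With this shape, pipelines that never export ENABLE_TRUSTED_LAUNCH fall back to the default instead of aborting.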

What you expected to happen:
WS2022 gen2 build pass

How to reproduce it:

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Add option in e2e testing to allow secure boot to be disabled for windows

Is your feature request related to a problem?/Why is this needed
Windows doesn't allow unsigned drivers to be installed when secure boot is enabled. Being able to disable secure boot would make it possible to test in-development drivers on Windows.

Describe the solution you'd like in detail
Check an environment variable in e2e-scenario.sh and edit the UEFI settings in the VMSS template accordingly.

https://learn.microsoft.com/en-us/azure/templates/microsoft.compute/virtualmachinescalesets?pivots=deployment-language-arm-template#securityprofile-1
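
As a rough sketch of the idea, the script could derive the securityProfile fragment from an environment variable. DISABLE_SECURE_BOOT and secure_boot_profile are hypothetical names; uefiSettings.secureBootEnabled is the property documented at the link above. How the fragment would be merged into the VMSS template is left open here:

```shell
#!/usr/bin/env bash

# Prints a securityProfile JSON fragment for the VMSS template.
# DISABLE_SECURE_BOOT is a hypothetical variable name; secure boot stays
# enabled unless it is set to exactly "true".
secure_boot_profile() {
  local enabled="true"
  if [ "${DISABLE_SECURE_BOOT:-false}" = "true" ]; then
    enabled="false"
  fi
  printf '{"securityProfile":{"uefiSettings":{"secureBootEnabled":%s}}}\n' "$enabled"
}

secure_boot_profile
```

Defaulting to enabled keeps existing scenarios unchanged and makes the weaker setting an explicit opt-in.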

Describe alternatives you've considered
This is a boot option so it cannot be disabled while the VM is running.

Additional context

Update-DefenderSignatures may fail in building Windows VHD

Is your feature request related to a problem?/Why is this needed

Update-DefenderSignatures may fail in building Windows VHD

Describe the solution you'd like in detail

Ensure that Update-DefenderSignatures does not fail.

Describe alternatives you've considered

Maybe we can add a retry until it succeeds.

Additional context

Refine windows-vhd-content-test.ps1 to use the same policy to filter k8s versions in configure-windows-vhd.ps1

In configure-windows-vhd.ps1


            # Windows containerD supports Windows containerD, starting from Kubernetes 1.20
            if ($containerRuntime -eq 'containerd' -And $dir -eq "c:\akse-cache\win-k8s\") {
                $k8sMajorVersion = $fileName.split(".",3)[0]
                $k8sMinorVersion = $fileName.split(".",3)[1]
                if ($k8sMinorVersion -lt "20" -And $k8sMajorVersion -eq "v1") {
                    Write-Log "Skip to download $url for containerD is supported from Kubernets 1.20"
                    continue
                }
            }

But windows-vhd-content-test.ps1 uses two lists to validate cached k8s versions, which is inconvenient. I think it could use the same policy as configure-windows-vhd.ps1.
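
One way to share the policy is a single predicate that both scripts call. The real scripts are PowerShell; the bash sketch below (function name hypothetical) only illustrates the rule from the snippet above, and it compares the minor version as a number rather than as a string:

```shell
#!/usr/bin/env bash

# Returns 0 when a Kubernetes version should be cached for containerd and
# 1 when it should be skipped (v1.x with x < 20), mirroring the rule above.
containerd_supports_k8s_version() {
  local version="${1#v}"        # v1.19.3 -> 1.19.3
  local major="${version%%.*}"  # 1
  local rest="${version#*.}"    # 19.3
  local minor="${rest%%.*}"     # 19
  if [ "$major" -eq 1 ] && [ "$minor" -lt 20 ]; then
    return 1
  fi
  return 0
}

for v in v1.9.11 v1.19.3 v1.20.0; do
  if containerd_supports_k8s_version "$v"; then
    echo "$v: cache"
  else
    echo "$v: skip"
  fi
done
```

Keeping the rule in one function means the build script and the content test cannot drift apart when the minimum version changes.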

AKS Log Collector doesn't collect Cilium logs

What happened: Looks like there's a typo in the /var/log/ location for the cilium-cni logs.

What you expected to happen:

https://github.com/arc9693/AgentBaker/blob/master/parts/linux/cloud-init/artifacts/aks-log-collector.sh#L180C1-L181C1

collects logs in
GLOBS+=(/var/log/cillium-cni)*

but the logs seem to actually lie in
GLOBS+=(/var/log/cilium-cni)*

cf: 05-cilium.conflist

{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
      "type": "cilium-cni",
      "ipam": {
        "type": "azure-ipam"
      },
      "enable-debug": true,
      "log-file": "/var/log/cilium-cni.log"
    }
  ]
}

and

root@aks-nodepool1-46355595-vmss000000:/var/log# ls -l | grep cil
-rw-r--r-- 1 root root 9093 Apr 22 14:34 cilium-cni.log
root@aks-nodepool1-46355595-vmss000000:/var/log#
root@aks-nodepool1-46355595-vmss000000:/var/log#
root@aks-nodepool1-46355595-vmss000000:/var/log#
root@aks-nodepool1-46355595-vmss000000:/var/log# cat cilium-cni.log
level=debug msg="Processing CNI ADD request &skel.CmdArgs{ContainerID:"44e5cbfa7bafb255dc016271a4ba04cf1fd7b8a2ca65f8abddc89f4caee4453d", Netns:"/var/run/netns/cni-6fbe0e49-2928-1345-3510-0cf3375c4702", IfName:"eth0", Args:"K8S_POD_NAMESPACE=kube-system;K8S_POD_NAME=coredns-767bfbd4fb-pm4vz;K8S_POD_INFRA_CONTAINER_ID=44e5cbfa7bafb255dc016271a4ba04cf1fd7b8a2ca65f8abddc89f4caee4453d;K8S_POD_UID=fafdc602-4edc-40ae-aa51-074e5faf5cd1;IgnoreUnknown=1", Path:"/opt/cni/bin", StdinData:[]uint8{0x7b, 0x22, 0......
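
The effect of the typo can be reproduced in isolation. The collector enables nullglob, so a glob that matches nothing expands to nothing and the misspelled entry fails silently. The sketch below uses a temporary directory in place of /var/log, and count_matches is a hypothetical helper:

```shell
#!/usr/bin/env bash
shopt -s nullglob  # as in the collector: unmatched globs expand to nothing

# Prints how many paths in a directory start with the given prefix.
count_matches() {
  local dir="$1" prefix="$2"
  local matches=("$dir"/"$prefix"*)
  echo "${#matches[@]}"
}

demo_dir="$(mktemp -d)"          # stand-in for /var/log
touch "$demo_dir/cilium-cni.log"
echo "cillium-cni*: $(count_matches "$demo_dir" cillium-cni) match(es)"
echo "cilium-cni*:  $(count_matches "$demo_dir" cilium-cni) match(es)"
rm -rf "$demo_dir"
```

Because the empty expansion is not an error, nothing in the collector's output signals that the Cilium log was skipped.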

How to reproduce it:

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Why aren't new builds pushed to Azure locations (Public)

Why is AKS not showing the node images as applicable?

The latest image available for my cluster in EastUS is two node images old (i.e. 2021.08.26).

While there aren't any CVEs affecting our current node images, it is frustrating to find that a node image announced in the release notes is not available in common regions!

Is there any way to grab the latest image? Something like a --force-pull for node images?

Explicit configuration interface

Moving some private discussion I've had with @alexeldeib into a public ticket to increase bus factor.

Is your feature request related to a problem?/Why is this needed
Our team uses a custom AKS image which has a few dependencies which are currently provided by AgentBaker's customnodedata:

We currently depend on the following files to be provided via nodecustomdata:
/etc/default/kubelet
/var/lib/kubelet/bootstrap-kubeconfig 
/etc/kubernetes/certs/ca.crt

Within those files we depend on:
/etc/default/kubelet:

  • node labels
  • kubelet command line parameters

/var/lib/kubelet/bootstrap-kubeconfig:

  • cluster-server:
      name: localcluster
      cluster:
        certificate-authority: /etc/kubernetes/certs/ca.crt
        server: https://dev-azure-westus-xxx-cc782afe.hcp.westus.azmk8s
  • token:
      - name: kubelet-bootstrap
        user:
          token: "sbiizf.avcjfgfj5h3oni"
  • ca.crt - we depend on the whole file for the bootstrap kube-config.

We explicitly prevent the CustomScriptExtension from running by touching the /opt/azure/containers/provision.complete file which CSE checks prior to running. We don't want CSE to run because it does node level configuration which conflicts with our own.

Describe the solution you'd like in detail
Ideally, the interface in AgentBaker we come up with:

  • Allows us to test that these files are present prior to CSE running 
  • Allows us to confirm that CSE does not run if the following file is already set: /opt/azure/containers/provision.complete 
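
The two checks above could be sketched as small shell predicates that a test harness calls. This is an illustration under assumptions, not how CSE is implemented: the function names are hypothetical, and the marker path is the one quoted in this issue, overridable so the sketch runs without root:

```shell
#!/usr/bin/env bash

# Path taken from the issue text; override it to try the sketch unprivileged.
PROVISION_COMPLETE_FILE="${PROVISION_COMPLETE_FILE:-/opt/azure/containers/provision.complete}"

# Succeeds (exit 0) only when the marker file is absent, i.e. provisioning
# would still need to run.
should_run_provisioning() {
  [ ! -f "$PROVISION_COMPLETE_FILE" ]
}

# Prints every path from the argument list that is missing or empty.
missing_files() {
  local path
  for path in "$@"; do
    [ -s "$path" ] || echo "$path"
  done
}

missing_files /etc/default/kubelet \
  /var/lib/kubelet/bootstrap-kubeconfig \
  /etc/kubernetes/certs/ca.crt
```

An empty result from missing_files means all three files named in this issue are present and non-empty.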

There are a few options which stand out to me:

  1. Write a test just confirming that those files are populated in nodedata
  2. Write a new file interface (bootstrap.cfg) which either:
    a. Has those files included in it
    b. Has the fields we need only included
  3. Some mechanism to fetch these dynamically from within the node, so we can completely remove the dependency on data provided by AgentBaker (though we’d shift the dependency to requiring that the fields are accessible somehow)

After discussions with @alexeldeib - there's a preference to not have the data provided via customdata.

Describe alternatives you've considered

Additional context
We've had a number of incidents due to us not having a clear contract around node configuration, so hoping to work with y'all to get one defined. Thanks in advance!

5 sysctls do not correctly respect AKS defaults when using custom node config

What happened:
When a node pool is created in an AKS cluster using custom node config, and not all of the following sysctls are explicitly provided:

  • net.core.somaxconn
  • net.ipv4.tcp_max_syn_backlog
  • net.ipv4.neigh.default.gc_thresh1
  • net.ipv4.neigh.default.gc_thresh2
  • net.ipv4.neigh.default.gc_thresh3

the AKS default values for the omitted sysctls are not applied.

What you expected to happen:
All 5 sysctls should be assigned to their default AKS value when not explicitly overwritten:

  • net.core.somaxconn=16384
  • net.ipv4.tcp_max_syn_backlog=16384
  • net.ipv4.neigh.default.gc_thresh1=4096
  • net.ipv4.neigh.default.gc_thresh2=8192
  • net.ipv4.neigh.default.gc_thresh3=16384

How to reproduce it:
Create an AKS cluster or node pool using custom node config and omit one of the mentioned values from the supplied custom node config. The defaults will not be applied correctly.

How it can be fixed manually:

Explicitly set all the values to their AKS default or value you need, for example:

  • net.core.somaxconn=16384
  • net.ipv4.tcp_max_syn_backlog=16384
  • net.ipv4.neigh.default.gc_thresh1=4096
  • net.ipv4.neigh.default.gc_thresh2=8192
  • net.ipv4.neigh.default.gc_thresh3=16384

More info about defaults and custom node config here
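
The expected behavior amounts to a per-key merge: start from the AKS defaults listed above and let each user-supplied value override only its own key. A minimal sketch of that merge follows; merge_sysctls is a hypothetical name, and this is not how AgentBaker renders the sysctl file:

```shell
#!/usr/bin/env bash

# AKS defaults as listed in this issue.
AKS_SYSCTL_DEFAULTS="net.core.somaxconn=16384
net.ipv4.tcp_max_syn_backlog=16384
net.ipv4.neigh.default.gc_thresh1=4096
net.ipv4.neigh.default.gc_thresh2=8192
net.ipv4.neigh.default.gc_thresh3=16384"

# $1: newline-separated key=value overrides (may be empty).
# Prints one key=value line per default key, preferring the override if any.
merge_sysctls() {
  local overrides="$1" line key value
  while IFS= read -r line; do
    key="${line%%=*}"
    value="$(printf '%s\n' "$overrides" | awk -F= -v k="$key" '$1 == k { print $2; exit }')"
    echo "${key}=${value:-${line#*=}}"
  done <<< "$AKS_SYSCTL_DEFAULTS"
}

merge_sysctls "net.core.somaxconn=1024"
```

In this sketch, overriding one key leaves the other four at their defaults, which is the behavior the issue asks for. Keys outside the five defaults are ignored here for brevity.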

A wrong storage account name is introduced by a mistake

What happened:
In https://github.com/Azure/AgentBaker/pull/2583/files,
--name vhds is changed to --name vhd for generating traditional SAS token with CLASSIC_SA_CONNECTION_STRING

What you expected to happen:
vhds is still used.

How to reproduce it:
It breaks publishing Windows images.

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Only exclude UDP source port 65330 in AKS Windows nodes

Is your feature request related to a problem?/Why is this needed

The Networking team recommended we use an excluded port range rather than reducing the dynamic range. The latter (reducing the dynamic range) could cause issues with port exhaustion.

Rec:
netsh int ipv4 Add excludedportrange protocol=udp startport=65330 numberofports=1 store=persistent

Describe the solution you'd like in detail

Run netsh int ipv4 Add excludedportrange protocol=udp startport=65330 numberofports=1 store=persistent instead of Invoke-Executable -Executable "netsh.exe" -ArgList @("int", "ipv4", "set", "dynamicportrange", "udp", "49152", "16178")

Describe alternatives you've considered

Additional context

seems unnecessary to force PR to be up-to-date

PRs easily get out of date and need to be rebased onto master every few hours.
This makes the PR lifecycle significantly longer, as auto-merge is still blocked most of the time.

Is there anything in the config that can be optimized?
Thanks!

provision_source.sh: Syntax error: "(" unexpected (expecting "}")

CRI 230819802 - provisioned a new AKS Cluster 1.18.14 - and we can see the following in the Cloud-init-output.log:


 root@aks-agentpool-54615928-vmss000000:/var/lib/cloud/instance/scripts# sh runcmd
 + . /opt/azure/containers/provision_source.sh
 + ERR_SYSTEMCTL_START_FAIL=4
 …
 + ERR_TELEPORTD_INSTALL_ERR=151
 + gawk match($0, /^(ID_LIKE=(coreos)|ID=(.*))$/, a) { print toupper(a[2] a[3]); exit }
 + sort -r /etc/lsb-release /etc/os-release
 + OS=UBUNTU
 + UBUNTU_OS_NAME=UBUNTU
 + RHEL_OS_NAME=RHEL
 + COREOS_OS_NAME=COREOS
 + KUBECTL=/usr/local/bin/kubectl
 + DOCKER=/usr/bin/docker
 + export GPU_DV=450.51.06
 + export GPU_DEST=/usr/local/nvidia
 + NVIDIA_DOCKER_VERSION=2.0.3
 + DOCKER_VERSION=1.13.1-1
 + NVIDIA_CONTAINER_RUNTIME_VERSION=2.0.0
 + NVIDIA_DOCKER_SUFFIX=docker18.09.2-1
 **runcmd: 324: /opt/azure/containers/provision_source.sh: Syntax error: "(" unexpected (expecting "}")**
 

If we use bash instead, that part of the code works fine:


root@aks-agentpool-54615928-vmss000000:/var/lib/cloud/instance/scripts# bash runcmd
+ . /opt/azure/containers/provision_source.sh
++ ERR_SYSTEMCTL_START_FAIL=4
…
++ ERR_TELEPORTD_INSTALL_ERR=151
+++ gawk 'match($0, /^(ID_LIKE=(coreos)|ID=(.*))$/, a) { print toupper(a[2] a[3]); exit }'
+++ sort -r /etc/lsb-release /etc/os-release
++ OS=UBUNTU
++ UBUNTU_OS_NAME=UBUNTU
++ RHEL_OS_NAME=RHEL
++ COREOS_OS_NAME=COREOS
++ KUBECTL=/usr/local/bin/kubectl
++ DOCKER=/usr/bin/docker
++ export GPU_DV=450.51.06
++ GPU_DV=450.51.06
++ export GPU_DEST=/usr/local/nvidia
++ GPU_DEST=/usr/local/nvidia
++ NVIDIA_DOCKER_VERSION=2.0.3
++ DOCKER_VERSION=1.13.1-1
++ NVIDIA_CONTAINER_RUNTIME_VERSION=2.0.0
++ NVIDIA_DOCKER_SUFFIX=docker18.09.2-1
+ aptmarkWALinuxAgent hold
++ date
++ hostname
+ echo Mon Mar 8 09:55:08 UTC 2021,aks-agentpool-54615928-vmss000000, startAptmarkWALinuxAgent hold
Mon Mar 8 09:55:08 UTC 2021,aks-agentpool-54615928-vmss000000, startAptmarkWALinuxAgent hold
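
The two traces differ only in the interpreter, which suggests the script relies on bash-only syntax that sh rejects (on Ubuntu, sh is dash). A minimal sketch of a guard that fails with a clear message instead of a syntax error follows. require_bash is a hypothetical name, and the guard is written in POSIX syntax so that sh can parse it before reaching any bash-only code:

```shell
# Placed at the top of a script that needs bash. BASH_VERSION is set by
# bash itself and is empty or unset in other shells such as dash.
require_bash() {
  if [ -z "${BASH_VERSION:-}" ]; then
    echo "this script must be run with bash, not sh" >&2
    return 1
  fi
}

require_bash || exit 1
echo "running under bash ${BASH_VERSION}"
```

This does not make the script portable. It only turns an obscure parse error deep in the file into an immediate, readable failure.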

Images >= 2023.01.20 have duplicated /etc/machine-id

What happened:
Since image version 2023.01.20 /etc/machine-id is not being randomized per VM like it should be. cleanup-vhd.sh should empty the file during the build process but that doesn't seem to be working.

I deployed clusters using multiple image versions with 3 nodes each and got the machine IDs for each cluster:

2023.01.10:
1fbcae4dc8b24dd29ba2761ffa1975c0
b41324ebbd694f51aa9b5b1a104ba9f9
7b9cf358e3aa40a7a974946a29224a15

2023.01.19:
f012255033074f0984576b0b51b9f848
c853ec1fcab643b792d20cf9c9908627
8077caa0cab84596bfa084179cc083d3

2023.01.20:
de1dd9ede40041b7bceb409b0a3b12cb
de1dd9ede40041b7bceb409b0a3b12cb
de1dd9ede40041b7bceb409b0a3b12cb

2023.01.25:
c82bf5de25c44e56a41d3c75b74967d6
c82bf5de25c44e56a41d3c75b74967d6
c82bf5de25c44e56a41d3c75b74967d6

2023.01.26:
4813896e0be44568a53fb331dcb4af79
4813896e0be44568a53fb331dcb4af79
4813896e0be44568a53fb331dcb4af79

2023.02.01:
8b30ea1c06764b83aef59f790313e369
8b30ea1c06764b83aef59f790313e369
8b30ea1c06764b83aef59f790313e369

What you expected to happen:
/etc/machine-id should be different on every AKS node. This is visible in .status.nodeInfo.machineID:

$ kubectl explain node.status.nodeInfo
KIND:     Node
VERSION:  v1

RESOURCE: nodeInfo <Object>

DESCRIPTION:
     Set of ids/uuids to uniquely identify the node. More info:
     https://kubernetes.io/docs/concepts/nodes/node/#info

     NodeSystemInfo is a set of ids/uuids to uniquely identify the node.

How to reproduce it:
Deploy an AKS cluster with multiple nodes on image version >= 2023.01.20, look at machineID.
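
Checking for the problem can be reduced to finding repeated lines in a list of machine IDs. A small sketch, with duplicate_machine_ids as a hypothetical helper name; the sample IDs are the 2023.01.19 and 2023.01.20 values reported above:

```shell
#!/usr/bin/env bash

# Reads one machine ID per line on stdin and prints each ID that occurs
# more than once. Empty output means every node has a distinct ID.
duplicate_machine_ids() {
  sort | uniq -d
}

# Against a live cluster the input could come from, for example:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.machineID}{"\n"}{end}'
printf '%s\n' \
  f012255033074f0984576b0b51b9f848 \
  de1dd9ede40041b7bceb409b0a3b12cb \
  de1dd9ede40041b7bceb409b0a3b12cb \
  de1dd9ede40041b7bceb409b0a3b12cb | duplicate_machine_ids
```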

Anything else we need to know?:

Environment:

  • AgentBaker version: whatever's in RP release v20230115
  • Kubernetes version (use kubectl version): v1.24.9
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Allow us to turn off Mariner's equivalent of unattended upgrade

Is your feature request related to a problem?/Why is this needed
We can turn it off on Ubuntu. If Mariner has the same functionality, we should allow it to be turned off.
I don't think Windows has an equivalent. @AbelHu.

Describe the solution you'd like in detail
Honor NodeBootstrapingConfig's DisableUnattendedUpgrade and turn any automatic updates off.

Describe alternatives you've considered
Daemonsets and remediators are fine but there is always a race with new images if we don't do it here.

Additional context

#2175

make generate is broken by new code $(which awk) in #3963

What happened:
make generate is broken by #3963

shellcheck installed
Running shellcheck...

In ./vhdbuilder/packer/install-dependencies.sh line 111:
AWK_PATH=$(which awk)
           ^---^ SC2230: which is non-standard. Use builtin 'command -v' instead.

For more information:
  https://www.shellcheck.net/wiki/SC2230 -- which is non-standard. Use builti...
make[1]: *** [Makefile:78: validate-shell] Error 1
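
The change shellcheck asks for is a drop-in replacement of which with the shell builtin command -v. A minimal sketch of that form, where tool_path is a hypothetical wrapper added only to make the example reusable:

```shell
#!/usr/bin/env bash

# Prints the resolved path of a command, or fails with status 1 if the
# command is not found. `command -v` is a shell builtin, unlike `which`.
tool_path() {
  command -v "$1"
}

AWK_PATH="$(tool_path awk)" || echo "awk not found" >&2
echo "AWK_PATH=${AWK_PATH}"
```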

What you expected to happen:
Expect that there is no error.

How to reproduce it:
Run make generate

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Check whether docker is installed or not before get events for docker to avoid error

Check whether docker is installed or not before get events for docker to avoid below error

get-eventlog : No matches found
At C:\k\debug\collect-windows-logs.ps1:60 char:1
+ get-eventlog -LogName Application -Source Docker | Select-Object Inde ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (:) [Get-EventLog], ArgumentException
    + FullyQualifiedErrorId : GetEventLogNoEntriesFound,Microsoft.PowerShell.Commands.GetEventLogCommand

Clean up update-node-labels code

The following PR switched off the update-node-labels.service:
#671

We need to clean up after this change, whether by removing the associated service files or by finding a backup approach that would allow us to turn it back on without causing the same issue the previous service had with kubectl calls (specifically discovery calls).

Add more Container Platform logs to collect-windows-logs.ps1

Add more containerd logs to collect-windows-logs.ps1 to help troubleshooting for Container Platform team.

We would like the following data to be collected by default by the collect-windows-logs.ps1 script when ICMs are filed. This will reduce some back and forth with the customer to collect additional data for us to look into:

  1. cd '.\Program Files\containerd'
  2. .\containerd.exe --v
  3. .\crictl.exe pods
  4. .\ctr.exe c ls
  5. .\shimdiag.exe list
  6. for each row got from (5), run the following and share outputs got for each:
       .\shimdiag.exe stacks <id got from each row in (5)> 
  7. .\ctr.exe snapshot ls
  8. .\crictl images
  9. get-process containerd-shim-runhcs-v1
  10. get-process CExecSvc
  11. Get-Process vmcompute

[BUG]: aks-log-collector.sh creates a large ip_netns_commands.txt which lead to ephemeral-storage issues #4148

What happened: cf Azure/AKS#4148

Describe the bug
We have observed that ip_netns_commands.txt files (example location: /tmp/tmp.4FKbTfOrn4/collect) on our AKS cluster nodes sometimes grow to many GBs. When the size reaches around 90GB, nodes start having issues with ephemeral storage (The node was low on resource: ephemeral-storage.), pods are evicted, and multiple other issues appear.

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# ls -lh ip_netns_commands.txt
-rw-r--r-- 1 root root 38G Mar 7 13:19 ip_netns_commands.txt
root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# fuser -v ip_netns_commands.txt
USER PID ACCESS COMMAND
/tmp/tmp.4FKbTfOrn4/collect/ip_netns_commands.txt:
root 83233 F.... ip
root 109814 F.... ip
root 560481 F.... ip
root 1133085 F.... ip
root 1133086 F.... ip
root 1133087 F.... ip
root 1134797 F.... ip
root 1737066 F.... ip
root 1737172 F.... ip
root 1737210 F.... ip
root 1737451 F.... ip

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# pstree -aps 83233
systemd,1
└─aks-log-collect,61814 /opt/azure/containers/aks-log-collector.sh
└─ip,83233 -all netns exec /bin/bash -x -c...
└─ip,109814 -all netns exec /bin/bash -x -c...
└─ip,1737066 -all netns exec /bin/bash -x -c...
└─ip,1737172 -all netns exec /bin/bash -x -c...
└─ip,1737210 -all netns exec /bin/bash -x -c...
└─ip,1737451 -all netns exec /bin/bash -x -c...
└─ip,560481 -all netns exec /bin/bash -x -c...
└─ip,1133085 -all netns exec /bin/bash -x -c...
└─bash,1147264 -x -c...
└─ss,1147282 -anoempiO --cgroup

root@aks-apps5--vmss000014:/tmp/tmp.4FKbTfOrn4/collect# head --lines 20 /opt/azure/containers/aks-log-collector.sh
#! /bin/bash

# AKS Log Collector

# This script collects information and logs that are useful to AKS engineering
# for support and uploads them to the Azure host via a private API. These log
# bundles are available to engineering when customers open a support case and
# are especially useful for troubleshooting failures of networking or
# kubernetes daemons.

# This script runs via a systemd unit and slice that limits it to low CPU
# priority and 128MB RAM, to avoid impacting other system functions.

# Log bundle upload max size is limited to 100MB
MAX_SIZE=104857600

# Shell options - remove non-matching globs, don't care about case, and use
# extended pattern matching
shopt -s nullglob nocaseglob extglob

AKS 1.28.5
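
One possible mitigation, sketched here under assumptions rather than taken from aks-log-collector.sh, is to bound both the runtime and the output size of each collected command so that a hung or chatty command cannot fill the disk. run_capped is a hypothetical helper and the limits shown are illustrative:

```shell
#!/usr/bin/env bash

# run_capped <seconds> <max_bytes> <command...>
# Stops the command after <seconds> and keeps at most <max_bytes> of its
# combined stdout and stderr.
run_capped() {
  local seconds="$1" max_bytes="$2"
  shift 2
  timeout "$seconds" "$@" 2>&1 | head -c "$max_bytes"
}

# `yes` would write forever; the cap keeps only the first 64 bytes.
run_capped 5 64 yes | wc -c
```

In the collector, the redirect into ip_netns_commands.txt would then receive a bounded stream, whatever the wrapped ip command does.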

Secret store CSI driver AKV Provider is failing on GPU nodes

AKV Provider is failing to start due to permission issue on GPU nodes with error:
Error: failed to create containerd task: OCI runtime create failed: container_linux.go:344: starting container process caused "chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied": unknown

We deploy with following security context:

securityContext:
  runAsUser: 0
  capabilities:
    drop:
    - ALL

/opt/azure/containers/provision.sh needs to respect DNS failover mechanism

Hi team, I am testing DNS failover during AKS creation. An unreachable IP is set as my primary DNS server of AKS and the secondary DNS server is good.

Then the CSE of VMSS failed to be provisioned.

Error messages:
Enable failed: failed to execute command: command terminated with exit status=124 [stdout] { "ExitCode": "124", "Output": "ookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++ '[' 60 -eq 100 ']'\n++ sleep 1\n++ for i in $(seq 1 $retries)\n++ timeout 10 nslookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++ '[' 61 -eq 100 ']'\n++ sleep 1\n++ for i in $(seq 1 $retries)\n++ timeout 10 nslookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++'[' 79 -eq 100 ']'\n++ sleep 1\n++ for i in $(seq 1 $retries)\n++ timeout 10 nslookup deaaks901-deaaks-c4ab-hu74r11b.hcp.eastasia.azmk8s.io\n++ 

After some investigation, we found the issue may relate to the script /opt/azure/containers/provision.sh

It seems that the timeout setting in this script doesn't respect the DNS failover mechanism: based on our tests, nslookup needs 15 seconds to complete the failover.

test command:
time nslookup google.com

Could you please extend the nslookup timeout in this script to 20 seconds? It would be a useful enhancement for this project. Thank you!
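
The loop visible in the error output (seq over retries, timeout 10 around nslookup, sleep 1) can be sketched with the per-attempt timeout as a parameter. This is a reconstruction for illustration, not the code in provision.sh, and retry_with_timeout is a hypothetical name:

```shell
#!/usr/bin/env bash

# retry_with_timeout <retries> <sleep_seconds> <timeout_seconds> <command...>
# Runs the command up to <retries> times, killing each attempt after
# <timeout_seconds>. Returns 0 on the first success, 1 if every attempt fails.
retry_with_timeout() {
  local retries="$1" wait_sleep="$2" attempt_timeout="$3" i
  shift 3
  for i in $(seq 1 "$retries"); do
    if timeout "$attempt_timeout" "$@"; then
      return 0
    fi
    if [ "$i" -eq "$retries" ]; then
      return 1
    fi
    sleep "$wait_sleep"
  done
  return 1
}

# A command that needs 2 seconds never succeeds with a 1 second limit,
# however many retries are allowed, but succeeds once the limit exceeds it.
retry_with_timeout 2 0 1 sleep 2 || echo "1s limit: all attempts timed out"
retry_with_timeout 2 0 3 sleep 2 && echo "3s limit: succeeded"
```

This mirrors the reported situation: if failover takes about 15 seconds, a 10 second limit makes every attempt fail regardless of the retry count, while a 20 second limit lets the first attempt finish.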

Log Collector Addition Triggering Azure Defender for Containers

What happened:
On upgrading to the latest image / AKS 1.27.7, the cloud-init run has caused 50+ Medium Severity Alerts from Defender for Containers for kubelet config file access

Introduced by #3991

What you expected to happen:
Log collectors should not trigger Microsoft's own runtime protection detections and should be sanity checked against them beforehand.

How to reproduce it:
During an upgrade to image 202402.07.0, have Microsoft Defender for Containers runtime protection enabled.

Potential Fixes
Rewriting this might keep it from triggering Microsoft's detections. Otherwise, some support in routing this to the AKS Security team at Microsoft, so they can optimize their alerts, would be appreciated. We can ignore the alerts under set conditions for now, but other customers are surely impacted as well.

Environment:

FYI @phealy @cameronmeissner

AgentBaker E2E pipeline fails to create new cluster and reports "suite_test.go:32: failed to get aks cluster"

What happened:
Since the resource group of AgentBaker E2E pipeline is created on a test sub, it will be deleted after a certain time (3 days). After the resource group is deleted, current AgentBaker E2E pipeline fails to create a new cluster and reports "suite_test.go:32: failed to get aks cluster".

What you expected to happen:
AgentBaker E2E pipeline recreates a new resource group and cluster, then runs test successfully.

How to reproduce it:
Delete the resource group manually and run the AgentBaker E2E pipeline.

Anything else we need to know?:
It's a bit strange that I noticed a new cluster was created yesterday by an old branch with bash version of AgentBaker E2E pipeline https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=70508837&view=results. But the cluster was deleted in 24 hours. Not sure whether there is any new change for cleaning resource automatically.

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

[need help]fix spelling error in this repo

Any volunteers to fix the following spelling errors in this project?

Error: ./vhdbuilder/notice.txt:1435: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:1749: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:1766: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:3837: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:7707: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:8199: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:8273: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:8318: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:8363: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:9795: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:10752: mke ==> make
Error: ./vhdbuilder/notice.txt:10754: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:10774: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:10871: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:10875: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:11888: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:12021: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:12203: mke ==> make
Error: ./vhdbuilder/notice.txt:12205: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:12225: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:12421: explicitely ==> explicitly
Error: ./vhdbuilder/notice.txt:15116: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:17091: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:17405: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:17422: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:19493: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:23363: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:23855: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:23929: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:23974: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:24019: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:25451: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:26408: mke ==> make
Error: ./vhdbuilder/notice.txt:26410: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:26430: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:26527: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:26531: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:27544: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:27677: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:27859: mke ==> make
Error: ./vhdbuilder/notice.txt:27861: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:27881: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:28077: explicitely ==> explicitly
Error: ./vhdbuilder/notice.txt:30772: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:33194: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:33211: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:34945: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:35025: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:36155: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:40403: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:40536: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:41149: synopsys ==> synopsis
Error: ./vhdbuilder/notice.txt:41242: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:41316: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:41361: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:41406: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:42864: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:43886: mke ==> make
Error: ./vhdbuilder/notice.txt:43888: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:43908: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:44021: explictly ==> explicitly
Error: ./vhdbuilder/notice.txt:44057: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:44061: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:45074: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:45207: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:45389: mke ==> make
Error: ./vhdbuilder/notice.txt:45391: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:45411: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:45607: explicitely ==> explicitly
Error: ./vhdbuilder/notice.txt:48282: synopsys ==> synopsis
Error: ./vhdbuilder/notice.txt:48866: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:48927: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:50547: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:51356: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:51373: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:54566: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:58493: MERCHANTIBILITY ==> MERCHANTABILITY
Error: ./vhdbuilder/notice.txt:59398: rouines ==> routines
Error: ./vhdbuilder/notice.txt:60589: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:60722: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:61073: rouines ==> routines
Error: ./vhdbuilder/notice.txt:62089: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:62163: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:62208: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:62253: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:64514: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:65685: mke ==> make
Error: ./vhdbuilder/notice.txt:65687: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:65707: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:65804: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:65808: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:66306: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:66439: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:66621: mke ==> make
Error: ./vhdbuilder/notice.txt:66623: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:66643: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:69581: MERCHANTIBILITY ==> MERCHANTABILITY
Error: ./vhdbuilder/notice.txt:69934: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:69995: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:71918: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:72727: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:72744: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:75937: mantained ==> maintained
Error: ./vhdbuilder/notice.txt:79864: MERCHANTIBILITY ==> MERCHANTABILITY
Error: ./vhdbuilder/notice.txt:80769: rouines ==> routines
Error: ./vhdbuilder/notice.txt:81960: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:82093: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:82444: rouines ==> routines
Error: ./vhdbuilder/notice.txt:83460: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:83534: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:83579: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:83624: packe ==> packed, packet
Error: ./vhdbuilder/notice.txt:85885: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:87056: mke ==> make
Error: ./vhdbuilder/notice.txt:87058: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:87078: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:87175: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:87179: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:87677: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:87810: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:87992: mke ==> make
Error: ./vhdbuilder/notice.txt:87994: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:88014: merchantibility ==> merchantability
Error: ./vhdbuilder/notice.txt:90952: MERCHANTIBILITY ==> MERCHANTABILITY
Error: ./vhdbuilder/notice.txt:91305: ists ==> its, lists
Error: ./vhdbuilder/notice.txt:91366: Troup ==> Troupe
Error: ./vhdbuilder/packer/init-variables.sh:60: hasnt ==> hasn't
Error: ./vhdbuilder/packer/install-dependencies.sh:407: upsteam ==> upstream
Error: ./vhdbuilder/packer/configure-windows-vhd.ps1:29: attemping ==> attempting
Error: ./vhdbuilder/packer/configure-windows-vhd.ps1:99: upates ==> updates
Error: ./vhdbuilder/packer/configure-windows-vhd.ps1:342: provisiong ==> provisioning
Error: ./vhdbuilder/packer/test/windows-vhd-content-test.ps1:123: pathced ==> patched
Error: ./vhdbuilder/packer/test/windows-vhd-content-test.ps1:131: instad ==> instead
Error: ./vhdbuilder/packer/test/linux-vhd-content-test.sh:41: downlaoded ==> downloaded
Error: ./vhdbuilder/packer/test/linux-vhd-content-test.sh:56: downlaoded ==> downloaded
Error: ./vhdbuilder/publish/Marketplace/new-sku-and-add-image-version.sh:81: pubilsher ==> publisher
Error: ./vhdbuilder/scripts/linux/ubuntu/tool_installs_ubuntu.sh:109: usuable ==> usable
Error: ./vhdbuilder/scripts/linux/mariner/tool_installs_mariner.sh:49: doesnt ==> doesn't, does not
Error: ./vhdbuilder/scripts/linux/mariner/tool_installs_mariner.sh:53: pacakge ==> package
Error: ./parts/linux/cloud-init/nodecustomdata.yml:774: priviledged ==> privileged
Error: ./parts/linux/cloud-init/artifacts/cse_helpers.sh:10: Timout ==> Timeout
Error: ./parts/linux/cloud-init/artifacts/cse_install.sh:42: invalide ==> invalid
Error: ./parts/linux/cloud-init/artifacts/cse_install.sh:49: invalide ==> invalid
Error: ./parts/linux/cloud-init/artifacts/cse_config.sh:6: chage ==> change, charge
Error: ./parts/linux/cloud-init/artifacts/cse_config.sh:7: chage ==> change, charge
Error: ./parts/linux/cloud-init/artifacts/cse_main.sh:63: boostrapping ==> bootstrapping
Error: ./parts/linux/cloud-init/artifacts/init-aks-custom-cloud.sh:42: usuable ==> usable
Error: ./parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh:126: versons ==> versions
Error: ./parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh:165: overriden ==> overridden
Error: ./parts/windows/kuberneteswindowssetup.ps1:263: packge ==> package
Error: ./e2e/e2e-scenario.sh:67: chaning ==> chaining, changing
Error: ./e2e/e2e_test.go:17: Seperate ==> Separate
Error: ./e2e/e2e_test.go:65: couldnt ==> couldn't
Error: ./e2e/e2e_test.go:72: couldnt ==> couldn't
Error: ./apiserver/apiserver.go:44: cancelation ==> cancellation
Error: ./pkg/templates/templates_generated.go:553: chage ==> change, charge
Error: ./pkg/templates/templates_generated.go:554: chage ==> change, charge
Error: ./pkg/templates/templates_generated.go:1138: Timout ==> Timeout
Error: ./pkg/templates/templates_generated.go:1457: invalide ==> invalid
Error: ./pkg/templates/templates_generated.go:1464: invalide ==> invalid
Error: ./pkg/templates/templates_generated.go:1909: boostrapping ==> bootstrapping
Error: ./pkg/templates/templates_generated.go:2674: usuable ==> usable
Error: ./pkg/templates/templates_generated.go:4350: versons ==> versions
Error: ./pkg/templates/templates_generated.go:4389: overriden ==> overridden
Error: ./pkg/templates/templates_generated.go:5344: priviledged ==> privileged
Error: ./pkg/templates/templates_generated.go:5783: packge ==> package
Error: ./pkg/agent/baker_test.go:60: te ==> the, be, we, to
Error: ./pkg/agent/baker_test.go:495: verison ==> version
Error: ./pkg/agent/baker_test.go:635: te ==> the, be, we, to
Error: ./pkg/agent/baker_test.go:843: hel ==> help, hell, heal
Error: ./pkg/agent/baker_test.go:844: hel ==> help, hell, heal
Error: ./pkg/agent/const.go:30: privides ==> provides
Error: ./pkg/agent/datamodel/types_test.go:45: properities ==> properties
Error: ./pkg/agent/datamodel/types_test.go:47: properities ==> properties
Error: ./pkg/agent/datamodel/types.go:106: wil ==> will, well
Error: ./pkg/agent/datamodel/types.go:194: optionaly ==> optionally
Error: ./staging/cse/windows/kubeletfunc.ps1:291: avalible ==> available
Error: ./staging/cse/windows/containerdfunc.ps1:118: avalaible ==> available
Error: ./staging/cse/windows/containerdfunc.ps1:136: Stoping ==> Stopping
Error: ./staging/cse/windows/azurecnifunc.ps1:64: lenght ==> length
Error: ./staging/cse/windows/azurecnifunc.ps1:65: execptions ==> exceptions
Error: ./staging/cse/windows/kubernetesfunc.ps1:54: depencencies ==> dependencies
Error: ./staging/cse/windows/configfunc.ps1:123: shiped ==> shipped
Error: ./staging/cse/windows/provisioningscripts/cleanupnetwork.ps1:34: becuase ==> because
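
For anyone picking this up, it may help to see how the findings are spread across files before deciding where to start. The following is a minimal Python sketch, assuming every finding follows the `Error: <path>:<line>: <typo> ==> <suggestions>` shape shown in the report above. The function name `tally_by_file` and the regular expression are made up for illustration and are not part of the repository.

```python
import re
from collections import Counter

# Matches report lines such as:
#   Error: ./e2e/e2e_test.go:17: Seperate ==> Separate
LINE_RE = re.compile(
    r"^Error: (?P<path>\S+?):(?P<line>\d+): (?P<typo>\S+) ==> (?P<fix>.+)$"
)


def tally_by_file(report: str) -> Counter:
    """Count spelling findings per file in a report shaped like the one above."""
    counts = Counter()
    for raw in report.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:  # lines that are not findings are skipped
            counts[match.group("path")] += 1
    return counts


# A few lines copied from the report, used only as sample input.
sample = """\
Error: ./vhdbuilder/notice.txt:1435: Troup ==> Troupe
Error: ./vhdbuilder/notice.txt:1749: ists ==> its, lists
Error: ./e2e/e2e_test.go:17: Seperate ==> Separate
"""

for path, count in tally_by_file(sample).most_common():
    print(f"{count:4d}  {path}")
```

Run against the full report, a tally like this makes it easy to see that most findings sit in ./vhdbuilder/notice.txt, and that the entries under ./pkg/templates/templates_generated.go repeat the typos listed under ./parts, so fixing the source files and regenerating may be enough for those.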

collect-windows-logs.ps1 threw an error when using kubectl.exe to collect logs

What happened:
collect-windows-logs.ps1 threw the error below.


Path C:\Windows\Minidump does not exist
Collecting logs from C:\Windows\SystemTemp
d-----         3/12/2024  11:00 AM                px4lqsv0.2fu
Collecting the information of the node and pods by kubectl
c:\k\kubectl.exe : WARNING: This version information is deprecated and will be replaced with the output from kubectl 
version --short.  Use --output=yaml|json to get the full version.
At C:\k\debug\collect-windows-logs.ps1:336 char:24
+ ...  function kubectl { c:\k\kubectl.exe --kubeconfig c:\k\config $args }
+                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (WARNING: This v...e full version.:String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError
All logs collected: ...
**********************
Windows PowerShell transcript end
End time: 20240312110015
**********************

What you expected to happen:
No error.

How to reproduce it:

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

CSEResultFilePath does not produce any file

Despite providing CSEResultFilePath, nothing is created at that path.

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version): 1.27.9
  • OS (e.g. from /etc/os-release): Windows19
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Avoid downloading any files when running `collect-windows-logs.ps1`

Is your feature request related to a problem?/Why is this needed

When collect-windows-logs.ps1 runs, it calls collectlogs.ps1, which downloads many files from GitHub.

DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/dumpVfpPolicies.ps1" -Destination $BaseDir\dumpVfpPolicies.ps1
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/hns.psm1" -Destination $BaseDir\hns.psm1
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/starthnstrace.cmd" -Destination $BaseDir\starthnstrace.cmd
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/starthnstrace.ps1" -Destination $BaseDir\starthnstrace.ps1
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/startpacketcapture.cmd" -Destination $BaseDir\startpacketcapture.cmd
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/startpacketcapture.ps1" -Destination $BaseDir\startpacketcapture.ps1
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/stoppacketcapture.cmd" -Destination $BaseDir\stoppacketcapture.cmd
DownloadFile -Url "https://raw.githubusercontent.com/$GithubSDNRepository/master/Kubernetes/windows/debug/portReservationTest.ps1" -Destination $BaseDir\portReservationTest.ps1

Describe the solution you'd like in detail

Use DownloadFileOverHttp instead of DownloadFile in collectlogs.ps1.

Describe alternatives you've considered

Additional context

Deprecated images in the VHD

I believe these images are deprecated. Can someone from AKS double-check and remove them if needed?

mcr.microsoft.com/k8s/core/pause 
mcr.microsoft.com/k8s/flexvolume/blobfuse-flexvolume
mcr.microsoft.com/k8s/flexvolume/keyvault-flexvolume
mcr.microsoft.com/oss/kubernetes/exechealthz

Check whether the folder C:\k\azurecni\netconf exists before creating it

Check whether the folder C:\k\azurecni\netconf exists before creating it

2021-04-27T03:49:28.7639517+00:00: Installing Azure VNet plugins


    Directory: C:\k\azurecni


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----        4/27/2021   3:49 AM                bin
mkdir : An item with the specified name C:\k\azurecni\netconf already exists.
At C:\AzureData\windows\windowsazurecnifunc.ps1:32 char:5
+     mkdir $AzureCNIConfDir
+     ~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ResourceExists: (C:\k\azurecni\netconf:String) [New-Item], IOException
    + FullyQualifiedErrorId : DirectoryExist,Microsoft.PowerShell.Commands.NewItemCommand
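
The failing script is PowerShell, where the usual guard is a Test-Path check before mkdir. The same check-then-create pattern is sketched below in Python purely as an illustration; `ensure_dir` is a hypothetical helper and the temporary directory stands in for C:\k\azurecni.

```python
import os
import tempfile


def ensure_dir(path: str) -> bool:
    """Create `path` (and any missing parents). Return True if it was created."""
    if os.path.isdir(path):
        return False  # already present, nothing to do and no error raised
    # exist_ok=True also covers the case where another process creates the
    # directory between the check above and this call.
    os.makedirs(path, exist_ok=True)
    return True


# Demonstration in a throwaway directory so nothing on the real system changes.
with tempfile.TemporaryDirectory() as root:
    conf_dir = os.path.join(root, "azurecni", "netconf")
    print(ensure_dir(conf_dir))  # first call creates the directory
    print(ensure_dir(conf_dir))  # second call is a no-op instead of a failure
```

The point of the sketch is only that a second provisioning run should treat an existing directory as success rather than surfacing a ResourceExists error.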

Only the kubelet cached in the VHD can be picked for a specific version

When building the VHD, we extract the kubelet binary to /usr/local/bin/ (see install-dependencies.sh). As a result, when CSE is asked to install a specific kubelet from a user-supplied URL, such as a hotfix, cse_install.sh skips the download because a kubelet binary with the same Kubernetes version already exists. This reduces flexibility: only the cached kubelet can be used regardless of the user input, and we have to build a new VHD in order to include the hotfixed kubelet.
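
To make the difference concrete, here is a small Python sketch that contrasts the behaviour described above with one possible alternative policy. It is not taken from cse_install.sh: the function names, the "cache"/"download" return values, and the example URL are all assumptions made for illustration.

```python
from typing import Optional


def skip_if_cached(cached_version: Optional[str], wanted_version: str,
                   custom_url: Optional[str]) -> str:
    """Behaviour as described in this issue: a version match always wins."""
    if cached_version == wanted_version:
        return "cache"  # the custom URL is ignored, so a hotfix is skipped
    return "download"


def prefer_custom_url(cached_version: Optional[str], wanted_version: str,
                      custom_url: Optional[str]) -> str:
    """One possible alternative: an explicit URL overrides the cache."""
    if custom_url:
        return "download"  # honour the user-supplied binary, e.g. a hotfix
    if cached_version == wanted_version:
        return "cache"
    return "download"


# Hypothetical hotfix URL; the host is deliberately not a real one.
hotfix = "https://example.invalid/kubelet-hotfix.tar.gz"
print(skip_if_cached("1.27.9", "1.27.9", hotfix))
print(prefer_custom_url("1.27.9", "1.27.9", hotfix))
```

Under the second policy the cache still serves the common case with no custom URL, while a hotfix no longer requires a new VHD build. Whether that trade-off is acceptable is for the maintainers to decide.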

Windows: Use vnet resource group as build_resource_group_name instead of creating temporary resource group

Is your feature request related to a problem?/Why is this needed

no

Describe the solution you'd like in detail

https://www.packer.io/plugins/builders/azure/arm#build_resource_group_name
I tried to use the pre-created resource group VNET_RESOURCE_GROUP_NAME as build_resource_group_name for Packer, but it did not work; this needs investigation. If it works, all resources created by Packer can live in the same pre-created resource group, and deleting that group with --no-wait will save time. Currently we need to wait for the Windows build VM to be deleted before deleting the pre-created resource group VNET_RESOURCE_GROUP_NAME.

Describe alternatives you've considered

Additional context

Details: version v0.20211030.0 is not supported for AgentBaker

What happened: I was attempting to scale our Kubernetes node pool through the Azure portal UI when it failed with the following error:
Details: version v0.20211030.0 is not supported for AgentBaker

What you expected to happen: I was expecting it to scale as it has in the past

How to reproduce it: Attempting to scale will result in the same error each time.

Anything else we need to know?: None that I can think of

Environment:

  • AgentBaker version: v0.20211030.0
  • Kubernetes version (use kubectl version): 1.21.2

Please help to confirm when $Env:ProgramFiles\containerd\diag.ps1 will exist for collecting containerd hyperv logs

Please help to confirm when $Env:ProgramFiles\containerd\diag.ps1 will exist for collecting containerd hyperv logs in the Windows node.

powershell C:\k\debug\collect-windows-logs.ps1

Collecting containerd hyperv logs
Containerd hyperv logs not avalaible

ls $Env:ProgramFiles\containerd\diag.ps1

ls : Cannot find path 'C:\Program Files\containerd\diag.ps1' because it does not exist.
At line:1 char:1
+ ls $Env:ProgramFiles\containerd\diag.ps1
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (C:\Program Files\containerd\diag.ps1:String) [Get-ChildItem], ItemNotFoundException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetChildItemCommand

Env:

NAME                                STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
aks-nodepool1-17963920-vmss000000   Ready    agent   14m     v1.20.2   10.240.0.4    <none>        Ubuntu 18.04.5 LTS               5.4.0-1046-azure   containerd://1.5.0-beta.git31a0f92df+azure
aksnpwin000000                      Ready    agent   8m24s   v1.20.2   10.240.0.35   <none>        Windows Server 2019 Datacenter   10.0.17763.1879    containerd://1.4.4+unknown

Missing documentation/examples

It would be nice to have a few examples (for instance under docs/examples) showing how one could run all these pipelines outside VSTS and build their own images. For now this project looks like one-legged open source, where an external consumer would have challenges reusing it.

No longer usable outside MSFT or as a standalone project

What happened:

After the introduction of #3792, this project is not usable outside Microsoft infrastructure. We use this project to do security scans of Azure image structure and configurations, run tests, and overlay configuration for some of our BYO nodes.

Now, to build Windows nodes, you need access to an MSFT managed identity and private packages.

What you expected to happen:

Open source stays open :)

How to reproduce it:

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

vmssCSE exit status 50 with a firewall blocking AAAA/IPV6 record

Wanted your quick opinion on the exit status 50 with AKS vmssCSE below. Incident 218837180

Customer has rules to block AAAA/IPv6 records on their firewall, and they were facing vmssCSE throwing exit status 50 (ERR_OUTBOUND_CONN_FAIL).
I’m looking at the AgentBaker code, but I’m not finding any special dependency on IPv6.
Does that ring a bell?
Is there any reason why we’d force calling mcr.microsoft.com through AAAA/IPv6?
Is there any way to fall back to IPv4 if we fail multiple times in a row (which would indicate the customer has a special config blocking IPv6)?
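
The fallback asked about here could look roughly like the Python sketch below. It is an illustration of the idea only and does not reflect how AgentBaker's CSE scripts perform the outbound check: `connect`, `connect_with_ipv4_fallback`, the `attempts` default, and the stub connector are all hypothetical.

```python
import socket
from typing import Callable


def connect(host: str, port: int, family: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection using only addresses of the given family."""
    last_err = None
    for af, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, family, socket.SOCK_STREAM):
        sock = socket.socket(af, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err or OSError("no usable address found")


def connect_with_ipv4_fallback(host: str, port: int, attempts: int = 3,
                               connector: Callable = connect):
    """Try the default resolution `attempts` times, then force IPv4 once."""
    for _ in range(attempts):
        try:
            return connector(host, port, socket.AF_UNSPEC)
        except OSError:
            continue  # repeated failures may indicate blocked IPv6
    return connector(host, port, socket.AF_INET)


# Stub connector that simulates a network where only IPv4 works, so the
# demonstration below runs without touching the real network.
def ipv4_only(host, port, family, timeout=5.0):
    if family == socket.AF_INET:
        return "connected over IPv4"
    raise OSError("simulated IPv6 failure")


print(connect_with_ipv4_fallback("example.invalid", 443, connector=ipv4_only))
```

In real use the default `connect` would be left in place, for example `connect_with_ipv4_fallback("mcr.microsoft.com", 443)`. Injecting the connector keeps the retry policy testable without a firewall that blocks IPv6.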

The file with the same name as the cached one won't be downloaded for Windows CSE

What happened:

During the CSE phase, DownloadFileOverHttp does not download the file from the remote server and uses the cached file instead when the remote file has the same name as the cached one, even if their contents differ.

What you expected to happen:
The remote file should be downloaded unless it is identical to the cached file (for example, unless it has the same MD5).
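
A content-based cache check along these lines is sketched below in Python. DownloadFileOverHttp itself is PowerShell, so this is only an illustration of the comparison. It assumes the expected MD5 of the remote file is known up front (for example, published next to the download), which the issue does not state; `md5_of` and `should_download` are hypothetical names.

```python
import hashlib
import os
import tempfile


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_download(cache_path: str, expected_md5: str) -> bool:
    """Return True unless a cached file exists and its MD5 matches expected_md5."""
    if not os.path.isfile(cache_path):
        return True
    return md5_of(cache_path).lower() != expected_md5.lower()


# Demonstration with a throwaway file standing in for the cached artifact.
with tempfile.TemporaryDirectory() as root:
    cached = os.path.join(root, "package.zip")
    with open(cached, "wb") as fh:
        fh.write(b"old contents")
    same = hashlib.md5(b"old contents").hexdigest()
    different = hashlib.md5(b"new contents").hexdigest()
    print(should_download(cached, same))       # same name, same content
    print(should_download(cached, different))  # same name, different content
```

With a check like this, a matching file name alone is no longer enough to reuse the cached copy.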

How to reproduce it:

Anything else we need to know?:

Environment:

  • AgentBaker version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
