Azure Monitor for Containers

About

This repository contains the source code for the Azure Monitor for containers Linux and Windows agents.

Questions?

Feel free to contact the engineering team owners if you have any questions about this repository or project.

Prerequisites

Common

Note: If you are using WSL2, make sure you have cloned the code onto the Ubuntu filesystem, not onto Windows.

WSL2

Linux

  • Ubuntu 14.04 or higher to build the Linux agent.
  • Docker to build the Docker image for the Linux agent

Note: If you are using WSL2, you can skip installing Docker since Docker for Windows will be used.

Windows

Repo structure

The general directory structure is:

├── .pipelines/                               - files related to azure devops ci and cd pipelines
├── build/                                    - files related to compiling and building the code
│   ├── version                               - build version used for docker provider and go shared object(so) files
│   ├── common/                               - common to both windows and linux installers
│   │   ├── installer                         - files related to installer
│   │   │   ├── scripts/                      - script files related to configmap parsing
│   ├── linux/                                - Makefile and installer files for the Docker Provider
│   │   ├── Makefile                          - Makefile to build the docker provider
│   │   ├── installer                         - files related to installer
│   │   │   ├── bundle/                       - shell scripts to create the shell bundle
│   │   │   ├── conf/                         - plugin configuration files
│   │   │   ├── datafiles/                    - data files for the installer
│   │   │   ├── scripts/                      - script files related to livenessprobe, tomlparser, etc.
│   │   │   ├── InstallBuilder/               - python script files for the install builder
│   ├── windows/                              - scripts to build the .net and go code
│   │   ├── Makefile.ps1                      - powershell script to build the .NET and Go code and copy the files to the amalogswindows directory
│   │   ├── installer                         - files related to installer
│   │   │   ├── conf/                         - fluentd, fluent-bit and out_oms plugin configuration files
│   │   │   ├── scripts/                      - script files related to livenessprobe, filesystemwatcher, keepCertificateAlive, etc.
│   │   │   ├── certificategenerator/         - .NET code to generate the self-signed certificate for the Windows agent
├── charts/                                   - helm charts
│   ├── azuremonitor-containers/              - azure monitor for containers helm chart used for non-AKS clusters
├── alerts/                                   - alert queries
├── kubernetes/                               - files related to Linux and Windows Agent for Kubernetes
│   ├── linux/                                - scripts to build the Docker image for Linux Agent
│   │   ├── dockerbuild                       - script to build docker provider, docker image and publish docker image
│   │   ├── Dockerfile.multiarch              - Dockerfile for the Linux Agent container image
│   │   ├── main.sh                           - Linux Agent container entry point
│   │   ├── setup.sh                          - setup file for Linux Agent Container Image
│   │   ├── acrworkflows/                     - ACR workflows for the Linux Agent container image
│   │   ├── defaultpromenvvariables           - default environment variables for Prometheus scraping
│   │   ├── defaultpromenvvariables-rs        - cluster level default environment variables for Prometheus scraping
│   │   ├── defaultpromenvvariables-sidecar   - cluster level default environment variables for Prometheus scraping in sidecar
│   ├── windows/                              - scripts to build the Docker image for Windows Agent
│   │   ├── dockerbuild                       - script to build the code and docker image, and publish the docker image
│   │   ├── acrworkflows/                     - ACR workflows for the Windows Agent container image
│   │   ├── Dockerfile                        - Dockerfile for the Windows Agent container image
│   │   ├── main.ps1                          - Windows Agent container entry point
│   │   ├── setup.ps1                         - setup file for Windows Agent Container Image
│   ├── ama-logs.yaml                         - kubernetes yaml for both Linux and Windows Agent
│   ├── container-azm-ms-agentconfig.yaml     - kubernetes yaml for agent configuration
├── scripts/                                  - scripts for onboarding, troubleshooting and preview scripts related to Azure Monitor for containers
│   ├── troubleshoot/                         - scripts for troubleshooting of Azure Monitor for containers onboarding issues
│   ├── onboarding/                           - scripts related to Azure Monitor for containers onboarding
│   ├── preview/                              - scripts related to preview features
│   ├── build/                                - scripts related to build, such as installing pre-requisites
│   ├── deployment/                           - scripts related to deployment
│   ├── release/                              - scripts related to release
├── source/                                   - source code
│   ├── plugins/                              - plugins source code
│   │   ├── go/                               - out_oms plugin code in Go
│   │   ├── ruby/                             - plugin code in Ruby
│   │   │   ├── health/                       - code for the health feature
│   │   │   ├── lib/                          - lib for the Application Insights Ruby SDK (application_insights gem)
│   │   │   ...                               - in, out and filter plugin code in Ruby
├── test/                                     - source code for tests
│   ├── e2e/                                  - e2e tests to validate agent and e2e workflow(s)
│   ├── unit-tests/                           - unit tests code
│   ├── scenario/                             - scenario tests code
├── !_README.md                               - this file
├── .gitignore                                - git config file with include/exclude file rules
├── LICENSE                                   - License file
├── Rakefile                                  - Rake file to trigger ruby plugin tests
├── ReleaseProcess.md                         - Release process instructions
└── ReleaseNotes.md                           - Release notes for the release of the Azure Monitor for containers agent

Branches

  • We use a single branch that contains all the code in development, and we release from this branch as well.
  • The ci_prod branch contains the codebase version in development.

To contribute: create your private branch off of ci_prod, make changes, and use a pull request to merge back to ci_prod. Pull requests must be approved by at least one engineering team member.
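The contribution flow can be sketched as follows (the branch name is a placeholder, matching the angle-bracket convention used elsewhere in this document):

```shell
# create a private branch off of ci_prod
git checkout ci_prod
git pull
git checkout -b <your-feature-branch>
# ...make and commit your changes...
git push -u origin <your-feature-branch>
# then open a pull request from <your-feature-branch> back to ci_prod
```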

Authoring code

We recommend using Visual Studio Code for authoring. Windows 10 with the Ubuntu app can be used for both Windows and Linux agent development; we recommend cloning the code onto the Ubuntu app so that you don't need to worry about LF vs CRLF line-ending issues.

Building code

Linux Agent

Install Pre-requisites

  1. Install go1.18.3, dotnet, powershell, docker and the build dependencies needed to build the Go code for both Linux and Windows platforms
bash ~/Docker-Provider/scripts/build/linux/install-build-pre-requisites.sh
  2. Verify that python, docker and golang are installed properly, and that the PATH and GOBIN environment variables include the Go bin path. If the Go environment variables were not set by the install-build-pre-requisites.sh script, run the following commands to set them
    export PATH=$PATH:/usr/local/go/bin
    export GOBIN=/usr/local/go/bin

  3. If you want to use Docker on WSL2, verify the following configuration settings on your Ubuntu app
    echo $DOCKER_HOST
    # if DOCKER_HOST is not set or does not have the value tcp://localhost:2375, set DOCKER_HOST via this command
    echo "export DOCKER_HOST=tcp://localhost:2375" >> ~/.bashrc && source ~/.bashrc
    # on Docker Desktop for Windows, make sure Docker is running in Linux containers mode and "Expose daemon on tcp://localhost:2375 without TLS" is enabled

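If you set PATH manually, a small guard like the following avoids appending /usr/local/go/bin repeatedly (e.g. on every new shell). path_contains is an illustrative helper, not part of the repo scripts:

```shell
# check whether a directory is already on PATH before appending it
path_contains() { case ":$PATH:" in *":$1:"*) return 0;; *) return 1;; esac; }
path_contains /usr/local/go/bin || export PATH="$PATH:/usr/local/go/bin"
```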

Build Docker Provider Shell Bundle and Docker Image and Publish Docker Image

Note: If you are using WSL2, ensure Docker for Windows is running in Linux containers mode on your Windows machine to build the Linux agent image successfully.

Note: The format of the imagetag is ci<release><MMDDYYYY>. Possible values for release are test, dev, preview, dogfood, prod, etc. Please use MCR URLs when building internally.
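For example, a tag for a dev build on the current date could be composed like this (a sketch; "dev" is one of the example release values above):

```shell
# compose an image tag of the form ci<release><MMDDYYYY>
release="dev"
imagetag="ci${release}$(date +%m%d%Y)"
echo "$imagetag"
```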

Preferred way: You can build and push images for multiple architectures. This is powered by docker buildx. Directly use the docker buildx commands below (the MCR images to use as arguments can be found in our internal wiki).

# multiple platforms
cd ~/Docker-Provider
docker buildx build --platform linux/arm64/v8,linux/amd64 -t <repo>/<imagename>:<imagetag> --build-arg IMAGE_TAG=<imagetag> --build-arg CI_BASE_IMAGE=<ciimage> --build-arg GOLANG_BASE_IMAGE=<golangimage> -f kubernetes/linux/Dockerfile.multiarch --push .

# single platform
cd ~/Docker-Provider
docker buildx build --platform linux/amd64 -t <repo>/<imagename>:<imagetag> --build-arg IMAGE_TAG=<imagetag> --build-arg CI_BASE_IMAGE=<ciimage> --build-arg GOLANG_BASE_IMAGE=<golangimage> -f kubernetes/linux/Dockerfile.multiarch --push .

Using the build and publish script

cd ~/Docker-Provider/kubernetes/linux/dockerbuild
sudo docker login # if you want to publish the image to acr then login to acr via `docker login <acr-name>`
# build the provider and docker image, and publish the docker image
bash build-and-publish-docker-image.sh --image <repo>/<imagename>:<imagetag> --ubuntu <ubuntu image url> --golang <golang image url>
cd ~/Docker-Provider/kubernetes/linux/dockerbuild
sudo docker login # if you want to publish the image to acr then login to acr via `docker login <acr-name>`
# build and publish using docker buildx
bash build-and-publish-docker-image.sh --image <repo>/<imagename>:<imagetag> --ubuntu <ubuntu image url> --golang <golang image url> --multiarch

You can also build and push images for multiple architectures. This is powered by docker buildx

cd ~/Docker-Provider/kubernetes/linux/dockerbuild
sudo docker login # if you want to publish the image to acr then login to acr via `docker login <acr-name>`
# build and publish using docker buildx
bash build-and-publish-docker-image.sh --image <repo>/<imagename>:<imagetag> --multiarch

or directly use the docker buildx commands

# multiple platforms
cd ~/Docker-Provider
docker buildx build --platform linux/arm64/v8,linux/amd64 -t <repo>/<imagename>:<imagetag> --build-arg IMAGE_TAG=<imagetag> -f kubernetes/linux/Dockerfile.multiarch --push .

# single platform
cd ~/Docker-Provider
docker buildx build --platform linux/amd64 -t <repo>/<imagename>:<imagetag> --build-arg IMAGE_TAG=<imagetag> -f kubernetes/linux/Dockerfile.multiarch --push .

If you prefer to build the docker provider shell bundle and the image separately, you can follow the instructions below.

Build Docker Provider shell bundle
cd ~/Docker-Provider/build/linux
make
Build and Push Docker Image
cd ~/Docker-Provider/kubernetes/linux/
docker build -t <repo>/<imagename>:<imagetag> --build-arg IMAGE_TAG=<imagetag> --build-arg CI_BASE_IMAGE=<ciimage> .
docker push <repo>/<imagename>:<imagetag>

Windows Agent

To build the Windows agent, you will have to build the .NET and Go code and the Docker image for the Windows agent. The Docker image for the Windows agent can only be built on a Windows machine running Docker for Windows in Windows containers mode, but the .NET and Go code can be built on Windows, Linux or WSL2.

Install Pre-requisites

Install the pre-requisites for the OS platform you will be using to build the Windows agent code.

Option 1 - Using Windows Machine to Build the Windows agent

powershell # launch powershell with elevated admin on your windows machine
Set-ExecutionPolicy -ExecutionPolicy bypass # set the execution policy
cd %userprofile%\Docker-Provider\scripts\build\windows # based on your repo path
.\install-build-pre-requisites.ps1 #

Option 2 - Using WSL2 to Build the Windows agent

powershell # launch powershell with elevated admin on your windows machine
Set-ExecutionPolicy -ExecutionPolicy bypass # set the execution policy
net use z: \\wsl$\Ubuntu-16.04 # map the network drive of the ubuntu app to windows
cd z:\home\sshadmin\Docker-Provider\scripts\build\windows # based on your repo path
.\install-build-pre-requisites.ps1 #

Build Windows Agent code and Docker Image

Note: The format of the Windows agent imagetag is win-ci<release><MMDDYYYY>. Possible values for release are test, dev, preview, dogfood, prod, etc.

Option 1 - Using Windows Machine to Build the Windows agent

Execute the instructions below in an elevated command prompt to build the Windows agent code and docker image, and publish the image to ACR or Docker Hub.

cd %userprofile%\Docker-Provider\kubernetes\windows\dockerbuild # based on your repo path
docker login # if you want to publish the image to acr then login to acr via `docker login <acr-name>`
powershell -ExecutionPolicy bypass  # switch to powershell if you are not on powershell already
.\build-and-publish-docker-image.ps1 -image <repo>/<imagename>:<imagetag> # trigger build code and image and publish docker hub or acr
Developer Build optimizations

If you do not want to build the image from scratch every time you make changes during development, you can choose to build the docker images that are separated out by:

  • Base image and dependencies including agent bootstrap(setup.ps1)
  • Agent conf and plugin changes

To do this, the very first time you start developing you need to execute the instructions below in an elevated PowerShell prompt. This builds the base image (ama-logs-win-base) with all the package dependencies.

cd %userprofile%\Docker-Provider\kubernetes\windows\dockerbuild # based on your repo path
docker login # if you want to publish the image to acr then login to acr via `docker login <acr-name>`
powershell -ExecutionPolicy bypass  # switch to powershell if you are not on powershell already
.\build-dev-base-image.ps1  # builds base image and dependencies

And then run the script to build the image consisting of code and conf changes.

.\build-and-publish-dev-docker-image.ps1 -image <repo>/<imagename>:<imagetag> # trigger build code and image and publish docker hub or acr
By default, a multi-arch docker image will be built, but if you want to generate a test image with either the ltsc2019 or ltsc2022 base image, follow the instructions below.

For building image with base image version ltsc2019
.\build-and-publish-dev-docker-image.ps1 -image <repo>/<imagename>:<imagetag> -windowsBaseImageVersion ltsc2019

For building image with base image version ltsc2022
.\build-and-publish-dev-docker-image.ps1 -image <repo>/<imagename>:<imagetag> -windowsBaseImageVersion ltsc2022


For the subsequent builds, you can just run -

.\build-and-publish-dev-docker-image.ps1 -image <repo>/<imagename>:<imagetag> # trigger build code and image and publish docker hub or acr
By default, a multi-arch docker image will be built, but if you want to generate a test image with either the ltsc2019 or ltsc2022 base image, follow the instructions below.

For building image with base image version ltsc2019
.\build-and-publish-dev-docker-image.ps1 -image <repo>/<imagename>:<imagetag> -windowsBaseImageVersion ltsc2019

For building image with base image version ltsc2022
.\build-and-publish-dev-docker-image.ps1 -image <repo>/<imagename>:<imagetag> -windowsBaseImageVersion ltsc2022
Note: If you have changes in setup.ps1 and want to test those changes, uncomment the section for setup.ps1 in the Dockerfile-dev-image file.

Option 2 - Using WSL2 to Build the Windows agent

On WSL2, Build Certificate Generator Source code and Out OMS Go plugin code
cd ~/Docker-Provider/build/windows # based on your repo path on WSL2 Ubuntu app
pwsh #switch to powershell
.\Makefile.ps1 # trigger build and publish of .net and go code

On Windows machine, build and Push Docker Image

Note: The Docker image for the Windows container can only be built on Windows, so you will have to execute the commands below on Windows, either by accessing the network share or by copying the published bits in the amalogswindows directory under the kubernetes directory onto the Windows machine.

net use z: \\wsl$\Ubuntu-16.04 # map the network drive of the ubuntu app to windows
cd z:\home\sshadmin\Docker-Provider\kubernetes\windows # based on your repo path
docker build -t <repo>/<imagename>:<imagetag> --build-arg IMAGE_TAG=<imagetag> .
docker push <repo>/<imagename>:<imagetag>

Azure DevOps Build Pipeline

Navigate to https://github-private.visualstudio.com/microsoft/_build?definitionId=444&_a=summary to see Linux and Windows Agent build pipelines. These pipelines are configured with CI triggers for ci_prod.

Docker images will be pushed to the CDPX ACR repos, and these need to be retagged and pushed to the corresponding ACR or Docker Hub. Only the onboarded Azure AD AppId has permission to pull the images from the CDPX ACRs.
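The retagging step is roughly the following (registry and image names are placeholders; the actual CDPX registry names come from the pipeline):

```shell
docker pull <cdpx-acr>.azurecr.io/<imagename>:<cdpx-tag>
docker tag <cdpx-acr>.azurecr.io/<imagename>:<cdpx-tag> <target-acr>.azurecr.io/<imagename>:<new-tag>
docker push <target-acr>.azurecr.io/<imagename>:<new-tag>
```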

Please reach out to the agent engineering team if you need access to it.

Onboarding feature branch

Here are the instructions to onboard a feature branch to the Azure DevOps pipeline:

  1. Navigate to https://github-private.visualstudio.com/microsoft/_apps/hub/azurecdp.cdpx-onboarding.cdpx-onboarding-tab
  2. Select the repository "docker-provider" from the repository drop down
  3. Click on "validate repository"
  4. Select your feature branch from the Branch drop down
  5. Select the Operating system as "Linux" and Build type as "buddy"
  6. Create the build definition
  7. Enable the continuous integration trigger on the build definition

This will create the build definition for the Linux agent. Repeat the above steps, except this time select the Operating system as "Windows", to onboard the pipeline for the Windows agent.

Azure DevOps Release Pipeline

The ci_prod branch is integrated with an Azure DevOps release pipeline. With this, for every commit to the ci_prod branch, the latest bits are automatically deployed to the DEV AKS clusters in the Build subscription.

When releasing the agent, we have a separate Azure DevOps pipeline which needs to be run to publish the image to prod MCR and our PROD AKS clusters.

For development, the agent image will be in the format mcr.microsoft.com/azuremonitor/containerinsights/cidev:Major.Minor.Patch-CommitAheadCount-. The image tag for Windows will be win-Major.Minor.Patch-CommitAheadCount-. For releases, the agent image will be in the format mcr.microsoft.com/azuremonitor/containerinsights/ciprod:Major.Minor.Patch. The image tag for Windows will be win-Major.Minor.Patch.
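A quick sanity check of an image reference against the release format above might look like this (the tag value is a hypothetical Major.Minor.Patch example, not a real release):

```shell
# verify a prod image reference matches ciprod:Major.Minor.Patch
image="mcr.microsoft.com/azuremonitor/containerinsights/ciprod:3.1.4"
echo "$image" | grep -Eq 'ciprod:[0-9]+\.[0-9]+\.[0-9]+$' && echo "valid release tag"
```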

Navigate to https://github-private.visualstudio.com/microsoft/_release?_a=releases&view=all to see the release pipelines.

Update Kubernetes yamls

Navigate to the kubernetes directory and update the yamls with the latest docker image of the Linux and Windows agents and any other relevant updates.

Deployment and Validation

For our single branch ci_prod, the latest yaml with the latest agent image (which is automatically built by the Azure DevOps pipeline) is automatically deployed onto the CIDEV AKS clusters in the build subscription. So you can use the CIDEV AKS cluster for E2E validation. Similarly, you can set up build and release pipelines for your feature branch.

Testing MSI Auth Mode Using Yaml

  1. Enable the Monitoring addon with Managed Identity Auth Mode using the Portal, CLI or a Template
  2. Get the MSI token (which is valid for 24 hrs.) value via kubectl get secrets -n kube-system aad-msi-auth-token -o=jsonpath='{.data.token}'
  3. Disable the Monitoring addon via az aks disable-addons -a monitoring -g <rgName> -n <clusterName>
  4. Deploy the ARM template with enabled = false to create the DCR, DCR-A and link the workspace to the Portal

Note - Make sure to update the parameter values in the existingClusterParam.json file and have enabled = false in the template file: az deployment group create --resource-group <ResourceGroupName> --template-file ./existingClusterOnboarding.json --parameters @./existingClusterParam.json

  5. Uncomment the MSI auth related yaml lines, and replace all the placeholder values, the MSI token value and the image tag in ama-logs.yaml
  6. Deploy ama-logs.yaml via kubectl apply -f ama-logs.yaml > Note: use the image toggle for release E2E validation
  7. Validate E2E that the LA & Metrics data flows, and other scenarios

E2E Tests

For executing tests

  1. Deploy ama-logs.yaml with your agent image. In the yaml, make sure the ISTEST environment variable is set to true if it is not set already
  2. Update the Service Principal CLIENT_ID, CLIENT_SECRET and TENANT_ID placeholder values, and apply e2e-tests.yaml to execute the tests

    Note: Service Principal requires reader role on log analytics workspace and cluster resource to query LA and metrics

    cd ~/Docker-Provider/test/e2e # based on your repo path
    kubectl apply -f e2e-tests.yaml # this will trigger job to run the tests in sonobuoy namespace
    kubectl get po -n sonobuoy # to check the pods and jobs associated to tests
    
  3. Download [sonobuoy](https://github.com/vmware-tanzu/sonobuoy/releases) on your dev box to view the results of the tests
    results=$(sonobuoy retrieve) # downloads tar file which has logs and test results
    sonobuoy results $results # get the summary of the results
    tar -xzvf <downloaded-tar-file> # extract downloaded tar file and look for pod logs, results and other k8s resources if there are any failures
    

For adding new tests

  1. Add the test python file with your test code under the tests directory
  2. Build the docker image; it is recommended to use ACR & MCR
 cd ~/Docker-Provider/test/e2e/src # based on your repo path
 docker login <acr> -u <user> -p <pwd> # login to acr
 docker build -f ./core/Dockerfile -t <repo>/<imagename>:<imagetag> .
 docker push <repo>/<imagename>:<imagetag>
  3. Update the existing agentest image tag in e2e-tests.yaml & conformance.yaml with the newly built image tag with the MCR repo

Scenario Tests

The clusters used in the release pipeline already have the yamls under test\scenario deployed. Make sure to validate these scenarios. If you have new interesting scenarios, please add/update them.

Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [email protected] with any additional questions or comments.


docker-provider's Issues

[Feature Request] Ability to set PriorityClass for OMSAgent in AKS

I am not sure if this is the correct repository to open this issue, but:

It would be great if it was possible to somehow configure a priority-class for the omsagent within an AKS-Cluster.

Currently, that does not seem possible - or at least I could find no documentation whatsoever about it.

In our scenario, we want to give the cluster monitoring a somewhat higher priority than most of the other services - so that in case of an error, we won't be flying blind.

The only workaround I can think of is to use a global default class - however that is not really feasible as there might be other, unimportant pods without a PriorityClass within the cluster that might suddenly be ranked way higher than they should be.

Is such a feature possible?

OMS Agent high memory usage

Hello,

Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (1 daemonset - 2 instances, 1 replicaset - 1 instance), and each instance uses 300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.

Reopening because the last issue was closed automatically.
#624

collect_all_kube_events is not working correctly

Collecting all kube events is not working correctly: only the first 4000 events are recorded when taking Normal events into account. The loop for fetching events is missing the logic for when collect_all_kube_events is enabled.

The code at this line: https://github.com/microsoft/Docker-Provider/blob/ci_prod/source/plugins/ruby/in_kube_events.rb#L115 should look like:

if @collectAllKubeEvents
  continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}")
else
  continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?fieldSelector=type!=Normal&limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}")
end
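The effect of the fix can be illustrated with a small shell sketch of the two query paths. events_query is an illustrative helper mirroring the plugin's URL construction, not code from the repo:

```shell
# build the events query path depending on the collect_all_kube_events setting
events_query() {  # $1 = collect_all flag, $2 = chunk size, $3 = continuation token
  if [ "$1" = "true" ]; then
    echo "events?limit=$2&continue=$3"
  else
    echo "events?fieldSelector=type!=Normal&limit=$2&continue=$3"
  fi
}
events_query true 4000 ""    # all events, when collect_all_kube_events is enabled
events_query false 4000 ""   # only non-Normal events (the current behavior)
```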

Divergence between ARM templates and Portal for K8s alarms

I am in the process of implementing the alarms based on the ARM templates. Comparing the alarms created via the template, it seems the metrics chosen and thresholds are different than what the portal creates for the corresponding alarms in the "Recommended alerts (Preview)" pane.

Maybe it is not wrong, but I would like to understand why the difference.

Examples:
Alarm on the portal: "(New) Container CPU %"
Description: "Average CPU percent is greater than the configured threshold (default is 95%)"
Metric used on the portal: cpuThresholdExceeded > 0
Metric used on the ARM template: cpuExceededPercentage > 95

Alarm on the portal: "(New) Container working set memory %"
Description: "Average working set memory percent is greater the configured threshold (default is 95%)"
Metric used on the portal: memoryworkingsetthresholdviolated > 0
Metric used on the ARM template: memoryWorkingSetExceededPercentage > 95

How to configure the secured endpoint for metric collection

From this doc: https://github.com/microsoft/Docker-Provider/blob/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml#L70

we can configure the endpoint, then the agent can collect the metrics from my endpoint. but my endpoint is secured. something like this:

curl -u admin:password http://automation-controller-service.ansible-automation-platform:8080/api/v2/metrics

You need to use the username and password to access it. I cannot find any configuration option for this. Do you know how to provide the auth info for my endpoint? Thanks.

Performance metrics are traced for completed jobs

We run more than a thousand short-living jobs on our cluster every day. These jobs stay for some time in the "Completed" state. As a result, we reach our Log Analytics size quota much earlier than expected, because performance metrics are written every minute for every completed job along with other pods (as far as I understand, in_kube_perfinventory.rb is responsible for that).

Can completed pods be excluded from performance traces?

Thanks

retry cadvisor port determination

We should add retries in main.sh here:

#Setting environment variable for CAdvisor metrics to use port 10255/10250 based on curl request
echo "Making wget request to cadvisor endpoint with port 10250"
#Defaults to use port 10255
cAdvisorIsSecure=false
RET_CODE=`wget --server-response https://$NODE_IP:10250/stats/summary --no-check-certificate --header="Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" 2>&1 | awk '/^  HTTP/{print $2}'`
if [ $RET_CODE -eq 200 ]; then
      cAdvisorIsSecure=true
fi
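A retry wrapper could look roughly like this. This is a sketch only: probe_cadvisor wraps the wget call above, and retry_probe, the retry count and the sleep interval are illustrative names and values, not code from main.sh:

```shell
# probe the secure cadvisor endpoint once, printing the HTTP status code
probe_cadvisor() {
  wget --server-response "https://$NODE_IP:10250/stats/summary" --no-check-certificate \
    --header="Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" 2>&1 \
    | awk '/^  HTTP/{print $2}'
}

# retry the probe up to $1 times, printing true on a 200 response, else false
retry_probe() {
  local max=$1 i ret
  for i in $(seq 1 "$max"); do
    ret=$(probe_cadvisor)
    if [ "$ret" = "200" ]; then
      echo true
      return 0
    fi
    sleep 1
  done
  echo false
}

cAdvisorIsSecure=$(retry_probe 5)
```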

omiagent (OMI-1.4.2-5) keeps crashing

The omiagent process keeps crashing every minute and is filling up the filesystem.
I have traced the problem to the docker-cimprov-1.0.0-32 provider. Please see below for details.

Version:

# /opt/omi/bin/omiserver -v
/opt/omi/bin/omiserver: OMI-1.4.2-5 - Wed Jul 25 10:59:15 PDT 2018 

omiserver.log:

2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40012 Priority=INFO (S)Socket: 0x1b1d850, closing connection (mask 2)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40033 Priority=INFO Selector_RemoveHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40011 Priority=INFO (E)done with receiving msg(0x1a52618:4099:EnumerateInstancesReq:e005)
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40039 Priority=INFO New request received: command=(EnumerateInstancesReq), namespace=(root/cimv2), class=(Container_DaemonEvent)
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5dff88, handler=0x1a56c40, name=null
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40005 Priority=INFO _SendRequestToAgent msg(0x1a576a8:15:BinProtocolNotification:13), from original operationId: 0 to 13
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92034,92034] INFO: null(0): EventId=40005 Priority=INFO _SendRequestToAgent msg(0x1a576a8:4099:EnumerateInstancesReq:14), from original operationId: e005 to 14
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40011 Priority=INFO (S)done with receiving msg(0x1b2db28:36:VerifySocketConn:0)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40011 Priority=INFO (S)done with receiving msg(0x1b2ed68:34:CreateAgentMsg:0)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40012 Priority=INFO (S)Socket: 0x1b1d850, closing connection (mask 2)
2018/10/11 15:22:27 [92025,92025] INFO: null(0): EventId=40033 Priority=INFO Selector_RemoveHandler: selector=0x5e2168, handler=0x1b1d850, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:27 [92025,92025] WARNING: null(0): EventId=30209 Priority=WARNING child process with PID=[92081] terminated abnormally
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40011 Priority=INFO (E)done with receiving msg(0x1a54908:4:PostResultMsg:14)
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40032 Priority=INFO Selector_AddHandler: selector=0x5dff88, handler=0x1a525e0, name=BINARY_SERVER_CONNECTION
2018/10/11 15:22:37 [92034,92034] INFO: null(0): EventId=40028 Priority=INFO (E)Socket: 0x1a54aa0, Connection Closed while reading header

core.92081 :

[New LWP 92081]
[New LWP 92082]
[New LWP 92083]
[New LWP 92085]
[New LWP 92089]
[New LWP 92088]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/omi/bin/omiagent 9 10 --destdir / --providerdir /opt/omi/lib --loglevel IN'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fc0fdb13240 in std::allocator<std::pair<std::string const, unsigned long long> >::~allocator() () from /opt/omi/lib/libcontainer.so
(gdb) bt
#0  0x00007fc0fdb13240 in std::allocator<std::pair<std::string const, unsigned long long> >::~allocator() () from /opt/omi/lib/libcontainer.so
#1  0x00007fc0fdb13fcb in std::_Miter_base<mi::Container_ContainerStatistics_Class*>::iterator_type std::__miter_base<mi::Container_ContainerStatistics_Class*>(mi::Container_ContainerStatistics_Class*) () from /opt/omi/lib/libcontainer.so
#2  0x00007fc0fdb12986 in __gnu_cxx::__normal_iterator<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > >*, std::vector<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > >, std::allocator<std::map<std::string, unsigned long long, std::less<std::string>, std::allocator<std::pair<std::string const, unsigned long long> > > > > >::operator*() const () from /opt/omi/lib/libcontainer.so
#3  0x00007fc0fdb31db8 in ensure(printbuffer*, unsigned long) () from /opt/omi/lib/libcontainer.so
#4  0x0000000000408707 in ?? ()
#5  0x0000000000404952 in ?? ()
#6  0x0000000000468f9d in ?? ()
#7  0x0000000000465227 in ?? ()
#8  0x00000000004632d3 in ?? ()
#9  0x000000000044c092 in ?? ()
#10 0x000000000046205b in ?? ()
#11 0x00000000004632d3 in ?? ()
#12 0x000000000044f37a in ?? ()
#13 0x000000000044fff8 in ?? ()
#14 0x000000000046b23d in ?? ()
#15 0x0000000000404ffc in ?? ()
#16 0x00007fc10845c3d5 in __libc_start_main (main=0x4052b0, argc=9, ubp_av=0x7ffc84310a08, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc843109f8)
    at ../csu/libc-start.c:274
#17 0x0000000000404699 in ?? ()
#18 0x00007ffc843109f8 in ?? ()
#19 0x000000000000001c in ?? ()
#20 0x0000000000000009 in ?? ()
#21 0x00007ffc84310f36 in ?? ()
#22 0x00007ffc84310f4c in ?? ()
#23 0x00007ffc84310f4e in ?? ()
#24 0x0000000000000000 in ?? ()

[Feature request] Fork container logs based on content

This might not be the right repo for this issue... please do point me at a more appropriate place if not!

Is there a way to filter application logs from an AKS cluster based on the log contents, so that logs with, e.g., a specific JSON field can go to a different Azure Monitor workspace?

We have a deployment where we need to send application audit logs - which just go into the container log stream with a specific flag in the JSON log body - to a space with tighter access controls and longer log retention than the main bulk of the application logs.

As far as I can tell, all container logs just get forwarded through the OMS agent into a single table in a configured workspace - there's no way to customize this on an AKS cluster, apart from the config options in kubernetes/container-azm-ms-agentconfig.yaml, which only allow excluding logs from specific namespaces.

Is there a way to get the agent to fork the container logs based on, e.g., fluentd configuration? Or to deploy some additional containers onto the cluster to intercept the logs and do this filtering?
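As far as I know the managed agent doesn't expose this today, but a self-managed fluentd stage (e.g. a sidecar or a separate DaemonSet) could fork logs by content. A minimal sketch, assuming a JSON field named audit_flag, the fluent-plugin-rewrite-tag-filter plugin, and a hypothetical second output via fluent-plugin-azure-loganalytics (placeholder workspace credentials):

```conf
<match kubernetes.**>
  @type rewrite_tag_filter
  <rule>
    # records whose audit_flag field is "true" get retagged for audit routing
    key audit_flag
    pattern /^true$/
    tag audit.${tag}
  </rule>
  <rule>
    # everything else keeps flowing under an app.* tag
    key log
    pattern /.*/
    tag app.${tag}
  </rule>
</match>

<match audit.**>
  # hypothetical: ship audit-tagged records to the locked-down workspace
  @type azure-loganalytics
  customer_id AUDIT_WORKSPACE_ID
  shared_key  AUDIT_WORKSPACE_KEY
  log_type    ApplicationAuditLog
</match>
```

The app.* stream would then be matched by a separate output, so the two destinations can have independent retention and access controls.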

Memory usage

Hello,

Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (2 from a DaemonSet, 1 from a ReplicaSet), and each instance uses ~300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.

Disabling stdout without ConfigMap

Hello,

I am running Container Insights on a machine that does not have Kubernetes, and I want to disable sending logs from the stdout stream.

In the documentation (https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-agent-config), the only way to do that is by using a Kubernetes ConfigMap. As I do not have Kubernetes installed, I tried to configure this by setting the environment variable AZMON_LOG_EXCLUSION_REGEX_PATTERN=stdout (referenced in some files under /build/windows/installer/conf/), but had no success.

I also tried writing a settings file at /etc/config/settings/log-data-collection-settings, but had no success.

Is there any way to disable sending stdout logs to the Azure workspace?

Prometheus integration - scraping API server metrics

Is it possible to scrape AKS API server metrics using https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration?

As far as I know, authentication (a bearer token) is required to get /metrics from the API server, and I cannot see how this can be set in the monitoring agent config file https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration#prometheus-scraping-settings.

In a standard Prometheus deployment this can be configured via the bearer_token_file setting.
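For reference, in a standard Prometheus server deployment the scrape job would look roughly like this (the token/CA paths are the usual in-cluster service-account defaults):

```yaml
scrape_configs:
  - job_name: kube-apiserver
    scheme: https
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # keep only the default/kubernetes endpoints (the API server itself)
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: default;kubernetes
        action: keep
```

The question is whether the agent's config file has any equivalent of bearer_token_file / tls_config for its scrape URLs.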

OMS Agent high memory usage

Hello,

Is there any way to reduce omsagent memory consumption in the Kubernetes cluster? For just 2 nodes it runs 3 instances of omsagent (2 from a DaemonSet, 1 from a ReplicaSet), and each instance uses ~300 MB of RAM. This is the most demanding service in my cluster, and it is just a monitoring tool.

Why is the ReplicaSet even required? It just adds one more instance to a node where the DaemonSet already created one.

Reopening because the last issue was closed without a solution (the fix it was closed with addressed a different problem that arose after the original issue was created).
#694

Missing information in syslog warning: Container image name (%s) is improperly formed and could not be parsed in SetRepositoryImageTag

It appears that when an image has a tag, a warning is logged in InventoryQuery::SetImageRepositoryImageTag at syslog(LOG_WARNING, "Container image name (%s) is improperly formed and could not be parsed in SetRepositoryImageTag", properties.c_str());

The syslog message does not contain the image name: "Container image name () is improperly...". All Docker images have names, so the name is available and should be included in the message.
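For context, a tagged image reference parses cleanly, so the name should be available at the warning site. A hypothetical, simplified Ruby sketch of the repository/image/tag split (deliberately ignoring registry ports and digests):

```ruby
# Hypothetical, simplified split of "repository/image:tag".
# Real references can also carry a registry port or a digest, which this
# sketch deliberately ignores.
def split_repository_image_tag(name)
  repository, rest = name.include?("/") ? name.split("/", 2) : ["", name]
  image, tag = rest.split(":", 2)
  { "Repository" => repository, "Image" => image, "ImageTag" => tag || "latest" }
end

p split_repository_image_tag("myregistry.io/app:1.2")
p split_repository_image_tag("ubuntu")
```

Whatever the parse outcome, the raw name string is in hand and could be interpolated into the warning instead of an empty placeholder.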

Unable to specify settings for prometheus in ConfigMap

I am currently configuring the ConfigMap that is used by the OMS-agent pods. What I want to achieve is sending Prometheus metrics to a log analytics workspace.
For this I am following this Microsoft docs page.
On that page we can see this:

prometheus.io/scrape: "true"
prometheus.io/path: "/mymetrics"
prometheus.io/port: "8000"
prometheus.io/scheme: "http"

And then in the table under Cluster-wide we have keys like these:

  • prometheus.io/scrape
  • prometheus.io/path
  • ...

So, my understanding is that a user can specify what port e.g. the OMS-agent has to look at in the annotations of an application pod.
In my case, I have a pod that has an annotation: prometheus.io/port=8900. And the default that is mentioned in the documentation is 9102.

I tried to specify the following TOML in the ConfigMap:

prometheus-data-collection-settings: |-
    [prometheus_data_collection_settings.cluster]
        interval = "1m"
        fieldpass = ["platform_user_sessions", "platform_connection_bus"]
        monitor_kubernetes_pods = true
        monitor_kubernetes_pods_namespaces = ["dev-group-apps"]
        prometheus.io/port = 8900

Once the ConfigMap is read by the OMS-agent, I get the following error in the logs of the pod:

"config::error::Exception while parsing config map for prometheus config: \nparse error on value \"/\" (error), using defaults, please check config map for errors"

When commenting the prometheus.io/port = 8900, it is parsed successfully.

I started to look in the source code to find the error, and what it does when it successfully parses the configmap.
There I bumped into these statements:

interval = parsedConfig[:prometheus_data_collection_settings][:cluster][:interval]
fieldPass = parsedConfig[:prometheus_data_collection_settings][:cluster][:fieldpass]
fieldDrop = parsedConfig[:prometheus_data_collection_settings][:cluster][:fielddrop]
urls = parsedConfig[:prometheus_data_collection_settings][:cluster][:urls]
kubernetesServices = parsedConfig[:prometheus_data_collection_settings][:cluster][:kubernetes_services]
# Remove below 4 lines after phased rollout
monitorKubernetesPods = parsedConfig[:prometheus_data_collection_settings][:cluster][:monitor_kubernetes_pods]
monitorKubernetesPodsNamespaces = parsedConfig[:prometheus_data_collection_settings][:cluster][:monitor_kubernetes_pods_namespaces]
kubernetesLabelSelectors = parsedConfig[:prometheus_data_collection_settings][:cluster][:kubernetes_label_selector]
kubernetesFieldSelectors = parsedConfig[:prometheus_data_collection_settings][:cluster][:kubernetes_field_selector]

There is no prometheus.io key read from parsedConfig, so I am definitely doing something wrong.

How are we able to specify that the OMS-agent has to look for a different port in the prometheus.io/port annotation, or is my understanding of this completely wrong?
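On the parse error itself: prometheus.io/port = 8900 is not a valid bare TOML key (the `.` and `/` would need quoting), which likely explains why the parser bails out. And per the annotation table quoted above, the port appears to be read from the monitored pod's own annotations rather than from the ConfigMap. A hypothetical pod snippet:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                    # hypothetical pod in a monitored namespace
  namespace: dev-group-apps
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8900"    # the agent scrapes this port instead of the 9102 default
spec:
  containers:
    - name: my-app
      image: my-app:1.0
```

With monitor_kubernetes_pods = true and the namespace listed in monitor_kubernetes_pods_namespaces, the ConfigMap itself should not need any prometheus.io/port entry.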

[Feature request] Setting Mem_Buf_limit of td-agent-bit

In an AKS environment, logs are lost when containers output a large volume of logs.
I manually changed Mem_Buf_Limit from its default value in the container of omsagent's DaemonSet, and this improved the situation.
It would be very helpful if the Mem_Buf_Limit setting of td-agent-bit could be changed using the ConfigMap.

# cat /etc/opt/microsoft/docker-cimprov/td-agent-bit.conf
[INPUT]
    Name tail
    Tag oms.container.log.la.*
    Path ${AZMON_LOG_TAIL_PATH}
    DB /var/log/omsagent-fblogs.db
    DB.Sync Off
    Parser cri
    Mem_Buf_Limit 10m <------------------------------- I would like to change this parameter
    Rotate_Wait 20
    Refresh_Interval 30
    Path_Key filepath
    Skip_Long_Lines On
    Ignore_Older 5m
    Exclude_Path ${AZMON_CLUSTER_LOG_TAIL_EXCLUDE_PATH}

ContainerInventory table continuously populated with duplicated entries every few seconds, costing a lot

Issue

Hi, we have two clusters running Container Insights and are paying hundreds of pounds each month for the Log Analytics bill. It has become the most expensive part of our Azure bill. Looking into this, the ContainerInventory table is flooded with many messages a second. On the surface, OMS agent appears to be sending the same messages many times. This seems to be happening on both clusters.

Please could you let us know how we can reduce the volume of data that OMS agent sends to Log Analytics for this table?

To reproduce

The OMS version in use is microsoft/oms:ciprod10162018-2.

Example:
The following Log Analytics query shows the ContainerInventory table contains almost 3 times as much data as any other table.

Usage
| where IsBillable == true
| summarize Quantity=sum(Quantity) by SourceSystem, DataType, Solution
| order by Quantity desc

The result shows the top row has a SourceSystem of OMS, a DataType of ContainerInventory and a Solution of ContainerInsights. We're ingesting 8GB per week on our test environment just to that table.

Running the following query on that table helps show the suspected duplicate entries. Note that kube-dns was picked as a common example but the problem occurs for every container.

ContainerInventory
| where TimeGenerated > ago(1d) and ContainerHostname startswith "kube-dns"
| order by TimeGenerated desc

The result shows lots of rows with unique values in the TimeGenerated [UTC] column, but duplicate values in all other columns. A good indicator is to check the values of the CreatedTime [UTC] and StartedTime [UTC] columns at the end - they seem to be exactly the same for many different values of TimeGenerated. This implies that the same Kubernetes events are being reported by the OMS agent to Log Analytics many times over.
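A query along these lines (using only the columns mentioned above) can quantify the suspected duplication by counting rows whose content is otherwise identical:

```kusto
ContainerInventory
| where TimeGenerated > ago(1d)
| summarize Copies = count() by ContainerHostname, CreatedTime, StartedTime
| where Copies > 1
| order by Copies desc
```

High Copies counts for unchanged CreatedTime/StartedTime pairs would confirm the same inventory record is being re-ingested repeatedly.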

Please could you let us know how we can reduce the volume of data that OMS agent sends to Log Analytics for this ContainerInventory table as the cost impact is currently a problem?

Background

We are deploying the OMS agent using the addon in the ARM template:

"addonProfiles": {
  "omsagent": {
    "enabled": true,
    "config": {
      "logAnalyticsWorkspaceResourceID": "[parameters('logAnalyticsWorkspaceResourceId')]"
    }
  }
}

We have five OMS pods running as a result (there are currently four nodes in the test cluster this is taken from):

omsagent-d9cvb                          1/1     Running   1          20h
omsagent-drz85                          1/1     Running   3          7d
omsagent-p6jts                          1/1     Running   4          7d
omsagent-rs-ccf8b9699-9976m             1/1     Running   0          2d
omsagent-swwpq                          1/1     Running   4          7d

Thanks!

Helm deployment does not on-board Azure Monitor for containers

Following the instructions from the charts/azuremonitor-containers URL, the Helm deployment does not on-board Azure Monitor for containers.


Expected behaviour: Following the instructions would automatically on-board Azure Monitor for containers.

Environment: AKS
Kubernetes: 1.19.11

Step 1 and 2 complete successfully with a "Log Analytics Workspace" and "ContainerInsights(iob-dev-westeurope-akstest-workspace)" solution created in the same resource group as the AKS cluster.

Step 3 fails with the error "No k8s-master VMs or VMSSes found in the specified resource group:iob-dev-westeurope-akstest-rg-aks", but looking at the script I am not sure this applies to AKS.

Helm deployment completes without error.

helm upgrade --install --values=values.yaml azmon-containers microsoft/azuremonitor-containers --namespace kube-system
Release "azmon-containers" does not exist. Installing it now.
W0721 09:12:19.676421 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0721 09:12:19.834125 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0721 09:12:22.908504 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0721 09:12:23.088834 20988 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
NAME: azmon-containers
LAST DEPLOYED: Wed Jul 21 09:12:18 2021
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
azmon-containers deployment is complete.

Log output from omsagent

kubectl logs omsagent-48t4s -n kube-system
not setting customResourceId
Making curl request to oms endpint with domain: opinsights.azure.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl request to oms endpoint succeeded.
****************Start Config Processing********************
Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log'
****************End Config Processing********************
****************Start Config Processing********************
****************Start NPM Config Processing********************
config::npm::Successfully substituted the NPM placeholders into /etc/opt/microsoft/docker-cimprov/telegraf.conf file for DaemonSet
config::Starting to substitute the placeholders in td-agent-bit.conf file for log collection
config::Successfully substituted the placeholders in td-agent-bit.conf file
****************Start Prometheus Config Processing********************
config::No configmap mounted for prometheus custom config, using defaults
****************End Prometheus Config Processing********************
****************Start MDM Metrics Config Processing********************
****************End MDM Metrics Config Processing********************
****************Start Metric Collection Settings Processing********************
****************End Metric Collection Settings Processing********************
Making wget request to cadvisor endpoint with port 10250
Wget request using port 10250 succeeded. Using 10250
Making curl request to cadvisor endpoint /pods with port 10250 to get the configured container runtime on kubelet
configured container runtime on kubelet is : containerd
set caps for ruby process to read container env from proc
aks-system1-34726002-vmss000000
 * Starting periodic command scheduler cron
   ...done.
docker-cimprov 16.0.0.0
DOCKER_CIMPROV_VERSION=16.0.0.0
*** activating oneagent in legacy auth mode ***
setting mdsd workspaceid & key for workspace:68299338-cb11-46a8-a42e-977e476105e4
azure-mdsd 1.10.1-build.master.213
starting mdsd in legacy auth mode in main container...
*** starting fluentd v1 in daemonset
starting fluent-bit and setting telegraf conf file for daemonset
since container run time is containerd update the container log fluentbit Parser to cri from docker
nodename: aks-system1-34726002-vmss000000
replacing nodename in telegraf config
checking for listener on tcp #25226 and waiting for 30 secs if not..
File Doesnt Exist. Creating file...
Fluent Bit v1.6.8
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

waitforlisteneronTCPport found listener on port:25226 in 5 secs
checking for listener on tcp #25228 and waiting for 30 secs if not..
Routing container logs thru v2 route...
waitforlisteneronTCPport found listener on port:25228 in 10 secs
Telegraf 1.18.0 (git: HEAD ac5c7f6a)
2021-07-21T08:12:52Z I! Starting Telegraf 1.18.0
td-agent-bit 1.6.8
stopping rsyslog...
 * Stopping enhanced syslogd rsyslogd
   ...done.
getting rsyslog status...
 * rsyslogd is not running

Can you confirm that the on-boarding of Azure Monitor for containers should have occurred and how to troubleshoot this further?

enable-monitoring.sh fails if az output format not set to json

The bash script to enable monitoring for Arc clusters fails if the az CLI output format is not set to json.

An example when output format is set to table:

$ bash enable-monitoring.sh --resource-id $azureArcClusterResourceId --kube-context $kubeContext --workspace-id $logAnalyticsWorkspaceResourceId
...
validating cluster identity
cluster identity type: result -------------- systemassigned
-e only supported cluster identity is systemassigned for Azure ARC K8s cluster type

The script's parsing of az command output only works for single-line output; explicitly passing -o json (or --query ... -o tsv for single values) on each az invocation would make it robust against the user's configured default.

In addition, a script does not fit our automation workflows; it would be useful to have ARM/Terraform code to configure the workspace, followed by instructions to install the Helm chart natively.

100% CPU usage

We have noticed that the OMI provider for Docker tends to cause the Docker daemon (dockerd) to spin to 100% CPU.

We think the issue is related to statistics metrics being queried too aggressively, as this is a known pitfall of the Docker stats system.

OMS Agent (Windows) in kube-system: 100%+ CPU usage

Been experiencing this issue for some time, and not just with one client.
We are running Sitecore Containers in Azure.
After raising a support request, the advice from MS Support was:

"After discussed with our container product team, seems it’s an known issue for windows containerd. And now the windows contained is an opensource and maintained by community, which means any issue regarding contained issue, we have to raise an issue to the community for the tracking. Thanks for your understanding!"

So I am raising the issue here for help. There are obviously issues with restarts, as you can see in the image, though it seems that is being investigated separately by MS.

So any assistance on this would be appreciated.

Helm installation failure

I'm facing a Helm issue while installing the Azure Arc enabled Kubernetes agent in my Kubernetes cluster, following the document below:
https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-enable-arc-enabled-clusters

At the end of the ps1 installation script, it shows a not-found message: Error: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2: not found.
I assume the ps1 script (and also the bash script) has the wrong version number.
After changing the version number to 2.8.1 by editing the script, the installation finished properly.

The log output of installation:

...
Helm version : version.BuildInfo{Version:"v3.5.3", GitCommit:"041ce5a2c17a58be0fcd5f5e16fb3e7e95fea622", GitTreeState:"dirty", GoVersion:"go1.15.8"}
Installing or upgrading if exists, Azure Monitor for containers HELM chart ...
pull the chart from mcr.microsoft.com
pull the chart from mcr.microsoft.com
2.8.2: Pulling from mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers
Error: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2: not found
export the chart from local cache to current directory
Error: Chart not found: mcr.microsoft.com/azuremonitor/containerinsights/preview/azuremonitor-containers:2.8.2
helmChartRepoPath is : ./azuremonitor-containers
using provided kube-context: minikube
Release "azmon-containers-release-1" does not exist. Installing it now.
Error: path "./azuremonitor-containers" not found
Successfully enabled Azure Monitor for containers for cluster: /subscriptions/************/resourceGroups/************/providers/Microsoft.Kubernetes/connectedClusters/************
Proceed to https://aka.ms/azmon-containers to view your newly onboarded Azure Managed cluster

And, how I downloaded the ps1 script (same as the document's guide):
Invoke-WebRequest https://aka.ms/enable-monitoring-powershell-script -OutFile enable-monitoring.ps1

PS:
If Azure Arc enabled Kubernetes is already GA, I believe the URL no longer needs to include "preview".
I would appreciate it if this URL were fixed.
Thank you.

Some Kubernetes events are not being forwarded to OMS

We've seen cases where Kubernetes life-cycle events for Pods (e.g. Killed) could be seen in output from kubectl get events, but did not show up in OMS log analytics.

It looks like there may be an issue with the way previously seen events are being tracked. In in_kube_events.rb, the uuid from the event is used to track which events have already been seen. If the uuid is already in the KubeEventsStateFile, then the event is skipped; otherwise it's routed to the registered outputs.

The issue is that for some events (e.g. Pod events), the uuid does not change when the event occurs again for the same Pod. Instead the count and lastTimestamp property values are updated.

Here's an example of a Pod where there were multiple Killing events. In this case the uuid is fb90522d-a65a-11e7-bafb-000d3a36fbf1. The first event has count: 970

{
      "metadata": {
        "name": "liveness-exec.14e9557aa4eccb1d",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/events/liveness-exec.14e9557aa4eccb1d",
        "uid": "fb90522d-a65a-11e7-bafb-000d3a36fbf1",
        "resourceVersion": "3815919",
        "creationTimestamp": "2017-10-01T03:45:35Z"
      },
      "involvedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "liveness-exec",
        "uid": "9089e7fd-a3be-11e7-9da0-000d3a36fbf1",
        "apiVersion": "v1",
        "resourceVersion": "3302183",
        "fieldPath": "spec.containers{liveness}"
      },
      "reason": "Killing",
      "message": "(events with common reason combined)",
      "source": {
        "component": "kubelet",
        "host": "k8s-agentpool1-39011252-2"
      },
      "firstTimestamp": "2017-10-01T03:45:35Z",
      "lastTimestamp": "2017-10-05T00:37:45Z",
      "count": 970,
      "type": "Normal"
}

A subsequent Killing event for the same Pod (count:971) followed but would have been skipped because it has the same uuid as the first event.

  {
      "metadata": {
        "name": "liveness-exec.14e9557aa4eccb1d",
        "namespace": "default",
        "selfLink": "/api/v1/namespaces/default/events/liveness-exec.14e9557aa4eccb1d",
        "uid": "fb90522d-a65a-11e7-bafb-000d3a36fbf1",
        "resourceVersion": "3816442",
        "creationTimestamp": "2017-10-01T03:45:35Z"
      },
      "involvedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "liveness-exec",
        "uid": "9089e7fd-a3be-11e7-9da0-000d3a36fbf1",
        "apiVersion": "v1",
        "resourceVersion": "3302183",
        "fieldPath": "spec.containers{liveness}"
      },
      "reason": "Killing",
      "message": "(events with common reason combined)",
      "source": {
        "component": "kubelet",
        "host": "k8s-agentpool1-39011252-2"
      },
      "firstTimestamp": "2017-10-01T03:45:35Z",
      "lastTimestamp": "2017-10-05T00:43:30Z",
      "count": 971,
      "type": "Normal"
    }

We've tried an experiment: concatenating the count property with the uuid property to construct the eventId used for tracking seen events (another option would have been to use the lastTimestamp property). That seems to resolve the issue of events being skipped. However, we also see periods of time where events are not forwarded, but that seems unrelated to this.
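A minimal sketch of that experiment (hypothetical helper name; the real change would live in in_kube_events.rb's seen-event tracking):

```ruby
# Derive the tracking id from both metadata.uid and count, so a combined
# event whose uid stays fixed but whose count advances is treated as new.
def event_tracking_id(event)
  uid   = event["metadata"]["uid"]
  count = event["count"] || 1
  "#{uid}:#{count}"
end

first  = { "metadata" => { "uid" => "fb90522d-a65a-11e7-bafb-000d3a36fbf1" }, "count" => 970 }
second = { "metadata" => { "uid" => "fb90522d-a65a-11e7-bafb-000d3a36fbf1" }, "count" => 971 }

# uid alone would collide; uid + count distinguishes the repeat event
puts event_tracking_id(first)
puts event_tracking_id(second)
```

Using lastTimestamp instead of count would work the same way, since both fields advance each time the combined event recurs.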

Critical Vulnerability (CVE-2016-7954)

I work on an internal security team, and one of our tools flagged an older critical vulnerability for:
mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03262021

Bundler 1.x might allow remote attackers to inject arbitrary Ruby code into an application by leveraging a gem name collision on a secondary source.

Installed Resource: bundler 1.10.6
Fixed Version: 1.11.0rc1
Published by NVD 2016-12-22
CVSS Score NVD CVSSv3: 9.8

Remediation
Upgrade package bundler to version 1.11.0rc1 or above.

Would it be possible for this project to upgrade the bundler to 1.11.0rc1 or above to remediate this issue?

Prometheus logs not showing up in Log Analytics Workspace

My pods have had the following annotations for a few weeks:

  • prometheus.io/path: /metrics
  • prometheus.io/port: '8900'
  • prometheus.io/scrape: 'true'

I have deployed a ConfigMap with the following settings, also a few weeks ago:

  prometheus-data-collection-settings: |-
    [prometheus_data_collection_settings.cluster]
        interval = "1m"
        fieldpass = ["mendix_concurrent_user_sessions", "mendix_connection_bus", "mendix_current_request_duration_seconds_bucket", "mendix_current_request_duration_seconds_count", "mendix_current_request_duration_seconds_sum", "mendix_jvm_memory_bytes", "mendix_jvm_memory_pool_bytes", "mendix_license_count", "mendix_named_users", "mendix_runtime_requests_total", "mendix_threadpool_handling_external_requests"]
        monitor_kubernetes_pods = true
        monitor_kubernetes_pods_namespaces = ["dev-apps"]

    [prometheus_data_collection_settings.node]
        interval = "1m"

This Microsoft Learn article tells me where I need to look for querying Prometheus logs, which in turn points me to this article.

When I write my query in the Log Analytics Workspace e.g.:

InsightsMetrics
| where Namespace contains "prometheus"
| summarize by Name

I only get the following results

  • volume_manager_total_volumes
  • kubelet_runtime_operations_total
  • process_cpu_seconds_total
  • process_resident_memory_bytes
  • kubelet_running_pods

So, where are the other metrics that can be seen in my ConfigMap's fieldpass property? Am I missing something?

How to fix 'Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only' error in windows server 2019 with DockerProvider service

I installed Docker on Windows Server 2019 with DockerProvider,
using this code:

Install-Module DockerProvider
Install-Package Docker -ProviderName DockerProvider -RequiredVersion preview
[Environment]::SetEnvironmentVariable("LCOW_SUPPORTED", "1", "Machine")

After that I installed Docker Compose with this code:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
Invoke-WebRequest "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-Windows-x86_64.exe" -UseBasicParsing -OutFile $Env:ProgramFiles\Docker\docker-compose.exe

After that I used this docker-compose file:

version: "3.5"

services:


  rabbitmq:
    # restart: always
    image: rabbitmq:3-management
    container_name: rabbitmq
    ports:
      - 5672:5672
      - 15672:15672
    networks:
      - myname
    # network_mode: host
    volumes: 
      - rabbitmq:/var/lib/rabbitmq

 

networks:
  myname:
    name: myname-network

volumes:
  rabbitmq:
    driver: local

Everything is OK up to here,
but after I call the http://localhost:15672/ URL in my browser,
RabbitMQ crashes and I see this error in docker logs <container-id>:

Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only

This .yml file works correctly in Docker for Windows,
but after running the file on Windows Server, I see this error.

Security concern: Omsagent pod running as root user

Hi,

In a project I'm part of there is a security concern that the omsagent pods, deployed into an AKS cluster, run as the root user. The agent mounts /var/log from the node (kubelet), accessing the logs, effectively running as a root process on the node. We understand that consuming /var/log requires root.

The question is, how much additional hardening has Microsoft done with the omsagent, and can we apply additional hardening that makes it "secure enough"? It would be nice to get a point of view on the matter.

The "attack vector" is through the /var/log filesystem, if an attacker manages to mount files into this directory somehow. It would also require the attacker to break into the omsagent.

Trace-data:

rune@Azure:~$ kubectl get pods -n kube-system | grep -i omsagent
omsagent-7cb7z                               1/1     Running   0          18h
omsagent-rs-7c7b6c8d5b-h2zvh                 1/1     Running   0          18h
rune@Azure:~$ kubectl exec omsagent-rs-7c7b6c8d5b-h2zvh -n kube-system -- id
uid=0(root) gid=0(root) groups=0(root)
rune@Azure:~$ kubectl exec -it -n kube-system omsagent-rs-7c7b6c8d5b-h2zvh -- /bin/sh
# ps -aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.0  0.0  18504  3140 ?        Ss   Jun15   0:00 /bin/bash /opt/main.sh
root         30  0.0  0.0   6704   116 ?        S    Jun15   0:00 inotifywait /etc/config/settings --daemon --recursive --outfile /opt/inotifyoutput.txt --event create,delete --format %e : %T --timefmt +%s
syslog      223  0.0  0.0 129672  4208 ?        Ssl  Jun15   0:02 /usr/sbin/rsyslogd
omsagent    263  0.6  0.7 394724 55132 ?        Sl   Jun15   6:48 /opt/microsoft/omsagent/ruby/bin/ruby /opt/microsoft/omsagent/bin/omsagent-32ed830c-9fbe-4a37-8b4d-a990f2a873f8 -d /var/opt/microsoft/omsagent/32ed830c-9fbe-4a37-8b4d-a990f2a873f8/run/omsagent.pid --no-supervisor -o /var/opt/micros
root        294  0.0  0.0  28356  2676 ?        Ss   Jun15   0:00 /usr/sbin/cron
root        338  0.0  0.6 150128 46848 ?        Sl   Jun15   0:04 /opt/td-agent-bit/bin/td-agent-bit -c /etc/opt/microsoft/docker-cimprov/td-agent-bit-rs.conf -e /opt/td-agent-bit/bin/out_oms.so
root        348  0.0  0.5 198028 38512 ?        Sl   Jun15   0:21 /opt/telegraf --config /etc/opt/microsoft/docker-cimprov/telegraf-rs.conf
root        369  0.0  0.0   4536   768 ?        S    Jun15   0:00 sleep inf
root      57390  0.0  0.0   4628   772 pts/0    Ss+  06:26   0:00 /bin/sh
root      59470  0.0  0.0   4628   820 pts/1    Ss   07:02   0:00 /bin/sh
root      59477  0.0  0.0  34404  2856 pts/1    R+   07:03   0:00 ps -aux
# exit
rune@Azure:~$
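In case it helps the discussion, here is a minimal, hypothetical hardening sketch for a log-collecting DaemonSet. These are standard Kubernetes pod-spec fields; the container name is a placeholder, whether the agent still functions with all of these restrictions is an open question, and on AKS the addon manages the official manifest itself:

```yaml
# Hypothetical hardening sketch - NOT the official omsagent manifest.
# The collector only needs to read host logs, so /var/log is mounted read-only.
containers:
  - name: log-collector              # placeholder name
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true   # may break agents that write local state
      capabilities:
        drop: ["ALL"]
    volumeMounts:
      - name: host-log
        mountPath: /var/log
        readOnly: true
volumes:
  - name: host-log
    hostPath:
      path: /var/log
```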

Aggregation Type/Aggregation Granularity not populating in portal page

Aggregation Type and Aggregation Granularity are not populating on the portal page, and alert emails are not triggering. However, I do see the values in the exported template.

The alerts were created using an SPN account, and I am trying to view them through my enterprise subscription. I suspect I am missing some access, or that there is a restriction at my organization level. Kindly let me know whether any additional RBAC is needed at the cluster level.

Critical vulnerability (CVE-2017-10906)

Hi, I am part of an InfoSec team and we detected the following critical vulnerability in image: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod02232021

Please validate. Thanks in advance!

Critical vulnerability (CVE-2017-10906)

Description: Escape sequence injection vulnerability in Fluentd versions 0.12.29 through 0.12.40 may allow an attacker to change the terminal UI or execute arbitrary commands on the device via unspecified vectors.
Installed Resource: fluentd 0.12.40
Fixed Version: 0.12.41
Published by NVD: 2017-12-08
CVSS Score: NVD CVSSv2 10.0
Remediation: Upgrade package fluentd to version 0.12.41 or above.
Full Path: /opt/microsoft/omsagent/ruby/lib/ruby/gems/2.4.0/specifications/fluentd-0.12.40.gemspec

OMS Agent does not respect namespace exclusion of custom configmap container-azm-ms-agentconfig

Version: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01312022
Platform: AKS

I'm using the proposed configuration template [1], and it seems to be loaded (according to the logs). The omsagent pods still do not respect the arrays of excluded namespaces. If I disable stdout and stderr collection entirely, that takes effect, but namespace filtering (while collection is enabled) does not work.

1 - https://github.com/microsoft/Docker-Provider/blob/ci_dev/kubernetes/container-azm-ms-agentconfig.yaml

Extending Container_HostInventory

Hi,
Please add to Container_HostInventory the additional properties returned by the generic API: Containers, ContainersRunning, ContainersPaused, ContainersStopped, and Images.

wrong version for "microsoft/oms:win-ciprod10272020"?

https://github.com/microsoft/Docker-Provider/blob/ci_prod/ReleaseNotes.md#version-microsoftomswin-ciprod10272020-version-mcrmicrosoftcomazuremonitorcontainerinsightsciprodwin-ciprod10052020-windows

Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10052020 (windows)

Should it be "Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10272020"?
@vishiy

oms agent stopped working

oms agent stopped collecting prometheus metrics with log:
2021-04-20T02:17:49Z E! [inputs.prometheus] Unable to watch resources: Get "https://XXXXXXXXXXXX.hcp.westeurope.azmk8s.io:443/api/v1/pods?watch=true": context canceled
2021-04-20T02:17:49Z E! [telegraf] Error running agent: input plugins recorded 1 errors
End Telegraf Run in Test Mode**********
starting fluent-bit and setting telegraf conf file for replicaset
nodename: aks-main-38921269-vmss000000
replacing nodename in telegraf config
File Doesnt Exist. Creating file...
Fluent Bit v1.6.8

  • Copyright (C) 2019-2020 The Fluent Bit Authors
  • Copyright (C) 2015-2018 Treasure Data
  • Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
  • https://fluentbit.io

Telegraf 1.18.0 (git: HEAD ac5c7f6a)
2021-04-20T02:17:49Z I! Starting Telegraf 1.18.0
td-agent-bit 1.6.8
stopping rsyslog...

  • Stopping enhanced syslogd rsyslogd
    ...done.
    getting rsyslog status...
  • rsyslogd is not running

image tag: mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod03262021

we had to manually restart the pod to see prom metrics being collected again

Using relabel_config with OMS agent

Hello!

Is there a way to provide a custom Prometheus config (and relabel_config in particular) to the OMS agent? The AWS equivalent of the OMS agent, ADOT Collector, has this feature. We'd rather not self-host Prometheus since the OMS agent is so convenient, but this is a bit of a blocker for us.

Thanks!
Michael

Metric Alerts

Following up on #645: are the Insights.Container/nodes and Insights.Container/containers namespaces deprecated? I too am not seeing any telemetry for them. Specifically, it seems that the following

namespace: Insights.Container/nodes
metrics: cpuUsagePercentage, memoryWorkingSetPercentage

are replaced in favor of

namespace: Microsoft.ContainerService/managedClusters
metrics: node_cpu_usage_percentage, node_memory_rss_percentage

Run fluentd with a custom configuration

I am trying to run fluentd with a custom configuration instead of the predefined one from https://github.com/microsoft/Docker-Provider/blob/ci_prod/build/linux/installer/conf/kube.conf. In particular, I am interested in changing the run_interval for the kube_events plugin.
I was thinking that the configuration from https://github.com/microsoft/Docker-Provider/blob/ci_prod/charts/azuremonitor-containers/templates/ama-logs-rs-configmap.yaml#L5 is the way to go. However, I couldn't see any place where that configuration is used, apart from some if-statements.
I would like to know: what is the purpose of that configmap, and is there any way to change the fluentd configuration?
What I did in the end was change the run_interval directly in /etc/fluent/kube.conf, but I was thinking there should be a better way to achieve this without a workaround.

[Feature request] Configure reporting interval

Hello, monitoring metrics per container is so expensive for us that we would either not use it at all or, preferably, tune down the monitoring volume. We don't really need metrics every minute; collecting every 2 minutes would be fine. So it would be nice if there were a way to configure the interval at which container metrics are logged.

Why this repo name?

From the "About" and also the contents, this repo focuses on AzMon for containers. This makes the current name "Docker-Provider" quite bizarre and confusing.

Issue with azuremonitor-containers helm chart

Hi guys,

I am getting the below error when I try to install the azuremonitor-containers Helm chart on an Arc-enabled Kubernetes cluster, version 1.16.

Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "AzureClusterIdentityRequest" in version "clusterconfig.azure.com/v1beta1"

It looks like it does not support Kubernetes version 1.16. Could you please help with this? Thanks.

[Feature Request] - Ability to detect and monitor Docker swarm services

Provider currently gives:
CLASS=Container_ContainerInventory:CIM_ManagedElement
CLASS=Container_ContainerStatistics:CIM_StatisticalData:CIM_ManagedElement
CLASS=Container_DaemonEvent:CIM_ManagedElement
CLASS=Container_ImageInventory:CIM_ManagedElement
CLASS=Container_ContainerLog:CIM_ManagedElement
CLASS=Container_HostInventory:CIM_ManagedElement
CLASS=Container_Process:CIM_ManagedElement

It would be nice to monitor swarm services

/etc/opt/omi/conf# docker service ls
ID                  NAME                            MODE                REPLICAS            IMAGE                            PORTS
dkq2zy0opjvg        monitoring_telegraf             global              2/2                 telegraf:latest

/etc/opt/omi/conf# docker service ps dkq2zy0opjvg
ID                  NAME                                            IMAGE               NODE                DESIRED STATE       CURRENT STATE         ERROR               PORTS
hzstnm31un4y        monitoring_telegraf.mgnljv2zuh74cfhn3n3z0znck   telegraf:latest     node01               Running             Running 11 days ago                
qms85linbux6        monitoring_telegraf.8k70kl8rzbz0zqu8345wivmpo   telegraf:latest     node02               Running             Running 11 days ago  

[Security] Helm-Chart-Update needed for "Omigod" vulnerability

Due to CVE-2021-38645, CVE-2021-38649, CVE-2021-38648, and CVE-2021-38647 ( https://msrc-blog.microsoft.com/2021/09/16/additional-guidance-regarding-omi-vulnerabilities-within-azure-vm-management-extensions/ ), a new version of the ciprod-Docker-Image was released ( mcr.microsoft.com/azuremonitor/containerinsights/ciprod:microsoft-oms-latest , from the same website).

The Helm chart Docker-Provider/charts/azuremonitor-containers/ (version 2.8.1) is incompatible with this image. There are (at least) two problems:

  1. The health check script /opt/livenessprobe.sh does not exist in the new image. This causes the deployments created by the Helm chart to be restarted roughly every 3-4 minutes due to the failing health check.
  2. Even when I remove the Kubernetes health check altogether, no data gets uploaded to Log Analytics, yet there is no error message in the logs - in fact, according to the logs, everything should be fine. When I return to the previous version used by the Helm chart (same image name, tag "ciprod02232021"), everything starts working again, but the security vulnerability is back.

The helm chart in the "ci_prod" branch (2.8.3) also points to a vulnerable version of ciprod (ciprod04222021)

Because of this, we urgently need an updated Helm chart.

Metrics from insights/containers & insights/pods are no longer available

I am unable to see metrics from insights/containers & insights/pods namespaces in the Metrics explorer inside my AKS cluster.
We are using mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod08052021 image version.

In the past I was able to see metrics and I was using oomKilledContainerCount.
Looking over the release notes I didn't see specific changes targeting this area.

Thanks in advance.

Does OMS Agent log individual container cpu/mem usage?

I have a bunch of Grafana dashboards showing graphs for container_cpu_usage_seconds_total and container_memory_usage_bytes filtered by our applications. I've had a look, and while I'm exporting the Prometheus data into Log Analytics, I don't think the OMS agent exports any container_* stats?

I am trying to create a metric chart on a dashboard which shows container utilisation. Is this data retrievable from Container Insights? I can't seem to find anything at the container level, only the node level.
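The nearest thing I've been able to sketch myself is querying the Perf table directly; the ObjectName and CounterName values below are assumptions from poking around my workspace, so please correct me if they're wrong:

```kusto
// Assumed schema: Container Insights writes per-container samples to Perf
// with ObjectName "K8SContainer" - treat these names as unverified.
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SContainer"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| summarize AvgValue = avg(CounterValue)
    by InstanceName, CounterName, bin(TimeGenerated, 5m)
```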

Thanks

OMS agent does not exclude namespaces after configuration

This is a duplicate of #737, in which I posted a comment (as no solution is mentioned there) but got no answer yet.
Since it is a closed issue, I don't think many people will look into it, so let me post a new issue.

I have an AKS cluster set up with Container Insights enabled.
My Log Analytics workspace contains a lot of logs that I don't use, so I want to limit the collected logs.

I created the ConfigMap based on this template in the kube-system namespace, and it is deployed to my cluster.
When calling kubectl edit configmap container-azm-ms-agentconfig -n kube-system I get the following:
[Screenshot: ConfigMap edit]

I have 3 separate namespaces: kube-system, grafana-namespace and apps-namespace.
I only want to capture the last one's logs.

While checking the logs of one of the omsagent-... pods, I see the following:

Both stdout & stderr log collection are turned off for namespaces: '*.csv2,*_kube-system_*.log,*_grafana-namespace_*.log'
****************End Config Processing********************
****************Start Config Processing********************
config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values
config::Successfully parsed mounted config map

So the configuration itself has no errors and is applied properly.

Now when I check the Log Analytics workspace, I still get data referring to kube-system and grafana-namespace.
For example, this query returns results for the last 5 minutes, even though the ConfigMap has been deployed for a week or so:

KubePodInventory
| where Namespace == "kube-system"

According to the Microsoft docs, the ConfigMap should reduce log ingestion if you exclude a namespace.

The main question is: what did I do wrong in the configuration, or am I wrong in thinking that KubePodInventory shouldn't contain any data for the excluded namespaces?
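My working assumption (which may be wrong, hence this issue) is that exclude_namespaces only filters stdout/stderr into ContainerLog, while inventory tables such as KubePodInventory are collected regardless. For completeness, this is the sanity check I used to see which namespaces still produce log lines (it assumes ContainerID is present in both tables, which may not hold in every schema version):

```kusto
// Sanity check: which namespaces still have stdout/stderr log lines?
ContainerLog
| where TimeGenerated > ago(30m)
| join kind=inner (
    KubePodInventory
    | distinct ContainerID, Namespace
  ) on ContainerID
| summarize LogLines = count() by Namespace
```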
