servicemeshinterface / smi-metrics
Expose SMI Metrics
Home Page: https://smi-spec.io
License: Apache License 2.0
PR #54 adds support for making edge queries cross-namespace.
Istio's edge queries should support the same, so that the experience is consistent!
Right now, Prometheus-related code lives in the linkerd
pkg. As more meshes are added, they should be able to reuse it.
When there are no requests happening for a particular component, the default latency values come back negative, as shown here, which is not desirable.
Would zero values be a better default?
Right now, the spec defines a CRD, but we only use it for output, and this repo is very similar to kubernetes-metrics. There's a bit of confusion about the approach here, especially because it differs from the implementations of the other SMI specs.
More documentation about the use of the metrics.smi-spec.io
APIService would be helpful for users.
smi-metrics fails to run on Kubernetes 1.19 clusters with the error: error trying to reach service: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=
The v1alpha1.metrics.smi-spec.io
APIService fails on Kubernetes clusters with version 1.15, marked as FailedDiscoveryCheck:
conditions:
- lastTransitionTime: "2019-06-28T16:20:22Z"
message: 'failing or missing response from https://10.100.82.130:443: bad status
from https://10.100.82.130:443: 404'
reason: FailedDiscoveryCheck
status: "False"
type: Available
Looks like this has something to do with the service not exposing information about itself, as the pods and services run successfully.
The images are using my personal Docker account right now; that should probably be something more general. @slack any good places we could put the images? I'm happy to create an SMI Docker Hub account.
Right now there are no tests for Istio. To make the development process easier, adding tests would be the way to go!
Consul Connect depends on Prometheus for metrics. There is an issue in the Connect repo to add Kubernetes metadata as labels, but it's still in the works; that would make this process easy.
If not, we can talk to the Kubernetes API to get the pod metadata and then query Prometheus with that.
From the base of the repo I am receiving this error:
helm template chart --set adapter=linkerd | kubectl apply -f -
apiservice.apiregistration.k8s.io/v1alpha1.metrics.smi-spec.io configured
Error from server (Invalid): error when creating "STDIN": Secret "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": ConfigMap "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": ServiceAccount "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": RoleBinding.rbac.authorization.k8s.io "RELEASE-NAME-smi-metrics" is invalid: subjects[0].name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": Service "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')
Error from server (Invalid): error when creating "STDIN": Deployment.apps "RELEASE-NAME-smi-metrics" is invalid: [metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.spec.serviceAccountName: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]
Are there any scripts around to update the RELEASE-NAME variable?
Right now, the response latency queries use the same template with different percentiles, i.e. 99, 90, etc.
This can be templated so that the configuration files are more readable.
Tilt is currently broken, mainly because there are now multiple Helm configurations based on the helm chart.
We need a way to let users create the relevant workflow. One option is to use a mesh
env variable and add the relevant helm options based on the env value.
https://docs.tilt.dev/api.html#modules.os.environ
Let's get Travis up and running for PRs and master (and add the requisite badges).
Now that 1.15 is out, all the k8s dependencies should be bumped. This might be an issue with deislabs/smi-sdk-go that will make this more difficult.
It is desirable to serve more than one API version at a time to allow clients a deprecation window in which they can gracefully upgrade to the next API version. How should this be accomplished?
One way would be to duplicate most of the controller code such that the copy uses the v1alpha2 types and adds any v1alpha2 functionality while the original uses the v1alpha1 types. The router would then pick the v1alpha1 or v1alpha2 handler depending on the API version of the request. Users would need two APIService objects (one for each version), both pointing at the same service.
Another way would be to simply upgrade the controller code to v1alpha2 and drop support for v1alpha1. Users would then run two different releases of the smi-metrics controller, one at the v1alpha1 release and one at the v1alpha2 release. They would create two different APIService objects and have each one point to the appropriate controller.
Now that GitHub Actions is out of beta and most projects are already using it,
I think it's better if we move.
Currently we publish the installation Helm charts as a GitHub release, but the charts in the repo are templated and thus not directly usable, causing confusion like #42, etc.
I think the best way is to update the release workflow to also update the relevant files in the repo, so that users can use the charts dir directly when they clone the repo.
Right now, make build
fails with fatal: no tag exactly matches.
Replace the tagging and image-name functionality with env variables that users can pass.
For Istio to run, it requires updates to some custom resources. It would be easier for users if there were a helm instruction that changes these values during installation.
There are common situations where you need to request edge information across a Kind within a namespace. For example, if I want to create an edges graph I need information about every deployment in the namespace. Right now, the only way to do this is to request edges for each kind found in the namespace. Instead, I propose we add a special edges endpoint.
For example, /apis/metrics.smi-spec.io/v1alpha1/namespaces/default/deployments/
could have an edges endpoint of /apis/metrics.smi-spec.io/v1alpha1/namespaces/default/deployments/$edges.
The special character is required to avoid any overlap with a possible resource name.
Any thoughts on this proposal?
Once #3 is done, the Prometheus queries will be pluggable, and Istio also stores its aggregate data in Prometheus.
With installation configuration specific to Istio (Prometheus URL, etc.) and a new ConfigMap of Prometheus queries, we should be able to get metrics from Istio.
cc @grampelberg
Right now, the queries are present in the queries.go file and are specific to Linkerd.
To make this repo support more implementations, we need a way to make the Prometheus queries pluggable. This can be solved by having each implementation provide a ConfigMap with its queries.
The corresponding ConfigMap is installed and dynamically loaded as a configuration field into Viper during initialisation.
Istio has now moved to a new telemetry model, where the metrics are not configurable yet and Istio only returns some default metrics.
There are also some changes to the default metrics, like fluxcd/flagger#478.
This can act as a placeholder issue for the Telemetry v2 discussion.
I have been noticing some discrepancies between the values reported by smi-metrics and those from Prometheus directly.
For example:
kubectl get --raw /apis/metrics.smi-spec.io/v1alpha1/namespaces/linkerd/deployments/linkerd-web/ | jq
{
  "kind": "TrafficMetrics",
  "apiVersion": "metrics.smi-spec.io/v1alpha1",
  "metadata": {
    "name": "linkerd-web",
    "namespace": "linkerd",
    "selfLink": "/apis/metrics.smi-spec.io/v1alpha1/namespaces/linkerd/deployments/linkerd-web",
    "creationTimestamp": "2020-01-28T23:58:10Z"
  },
  "timestamp": "2020-01-28T23:58:10Z",
  "window": "30s",
  "resource": {
    "kind": "Deployment",
    "namespace": "linkerd",
    "name": "linkerd-web"
  },
  "edge": {
    "direction": "from",
    "resource": null
  },
  "metrics": [
    {
      "name": "p99_response_latency",
      "unit": "ms",
      "value": "195500m"
    },
    {
      "name": "p90_response_latency",
      "unit": "ms",
      "value": "155"
    },
    {
      "name": "p50_response_latency",
      "unit": "ms",
      "value": "58333m"
    },
    {
      "name": "success_count",
      "value": "25498m"
    },
    {
      "name": "failure_count",
      "value": "0"
    }
  ]
}
Debug logs from the same time:
time="2020-01-29T00:00:37Z" level=debug msg="querying prometheus" query="sum(\n increase(\n response_total{\n classification=\"success\",\n namespace=~\"linkerd\",\n deployment=~\"linkerd-web\",\n dst_deployment=~\".+\"\n }[30s]\n )\n ) by (\n deployment,\n dst_deployment,\n namespace,\n dst_namespace\n )"
time="2020-01-29T00:00:37Z" level=debug msg="query results" query="sum(\n increase(\n response_total{\n classification=\"success\",\n namespace=~\"linkerd\",\n deployment=~\"linkerd-web\",\n dst_deployment=~\".+\"\n }[30s]\n )\n ) by (\n deployment,\n dst_deployment,\n namespace,\n dst_namespace\n )" result="{deployment=\"linkerd-web\", dst_deployment=\"linkerd-controller\", dst_namespace=\"linkerd\", namespace=\"linkerd\"} => 32.998350082495875 @[1580256037.374]"
I'm struggling to understand what might be causing this. It only happens occasionally; I was able to get reasonable values with repeated queries to the APIService. (One possible factor: Prometheus's increase() extrapolates samples to the window boundaries, so a short 30s window can yield noisy, non-integer counts like the 32.99 above.)
Maybe this is why:
https://github.com/deislabs/smi-metrics/blob/4109ac7c83538ad13cb5681bbefd1aa91209a244/pkg/prometheus/client.go#L102
Currently, even though we build and release the Helm chart after each release through the same GitHub workflow, we don't publish the chart anywhere. Users who want to try the component have to download it and generate the template. Publishing it somewhere would make this much easier for users.
@stefanprodan suggested using https://github.com/marketplace/actions/helm-publisher, which feels like the simplest option and does the job pretty well. I'm planning to create a gh-pages
branch with the current release Helm packages and also update the CI to do this automatically whenever we perform a release. WDYT?
Maesh is a new service mesh from Containous, based on the Traefik proxy. It also uses Prometheus internally for metrics; adding smi-metrics support would be great!
A default window of 30 seconds may not be desired for every query. This should be made configurable with an optional parameter on every endpoint. For example,
/apis/metrics.smi-spec.io/v1alpha1/namespaces/{Namespace}/{Kind}/{ResourceName}
should support an optional query parameter ?window=30s.
This would allow the window to be set per request.
Additionally, the window size should also be globally configurable in the ConfigMap.