servicemeshinterface / smi-metrics
Expose SMI Metrics
Home Page: https://smi-spec.io
License: Apache License 2.0
PR #54 adds support for making edge queries cross-namespace.
Istio's edge queries should support the same, so that the experience is consistent!
Right now, Prometheus-related code lives in the linkerd
pkg. As more meshes are added, they should be able to reuse it.
When there are no requests happening for a particular component, the default latency values come back negative, as shown here, which is not desirable.
Would zero values be a better default?
Right now, the spec defines a CRD, but we only use it for output, and this repo is very similar to kubernetes-metrics. There's a bit of confusion about the approach here, especially because it differs from the implementations of the other SMI specs.
More documentation about the use of the metrics.smi-spec.io
APIService would be helpful for users.
smi-metrics fails to run on Kubernetes 1.19 clusters with the error: error trying to reach service: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=
The v1alpha1.metrics.smi-spec.io
APIService fails on Kubernetes clusters with version 1.15, marked as FailedDiscoveryCheck:
conditions:
- lastTransitionTime: "2019-06-28T16:20:22Z"
message: 'failing or missing response from https://10.100.82.130:443: bad status
from https://10.100.82.130:443: 404'
reason: FailedDiscoveryCheck
status: "False"
type: Available
Looks like this has something to do with the service not exposing information about itself, as the pods and services run successfully.
The images are using my personal Docker account right now; that should probably be something more general. @slack any good places we could put the images? I'm happy to create an SMI Docker Hub account.
Right now there are no tests for Istio. To make the development process easier, adding tests would be the way to go!
Consul Connect depends on Prometheus for metrics. There is an issue in the Connect repo to add Kubernetes metadata as labels, but it's still in the works; that would make this process easy.
If not, we can talk to the Kubernetes API to get the pod metadata and then query Prometheus with that.
From the base of the repo I am receiving this error:
helm template chart --set adapter=linkerd | kubectl apply -f -
apiservice.apiregistration.k8s.io/v1alpha1.metrics.smi-spec.io configured
Error from server (Invalid): error when creating "STDIN": Secret "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": ConfigMap "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": ServiceAccount "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": RoleBinding.rbac.authorization.k8s.io "RELEASE-NAME-smi-metrics" is invalid: subjects[0].name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Error from server (Invalid): error when creating "STDIN": Service "RELEASE-NAME-smi-metrics" is invalid: metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')
Error from server (Invalid): error when creating "STDIN": Deployment.apps "RELEASE-NAME-smi-metrics" is invalid: [metadata.name: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.spec.serviceAccountName: Invalid value: "RELEASE-NAME-smi-metrics": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]
Are there any scripts around to update the RELEASE-NAME variable?
Right now, the response latency queries use the same template with different percentiles, i.e. 99, 90, etc.
This can be templated so that the configuration files are more readable.
Tilt is currently broken, mainly because there are now multiple Helm configurations based on the helm chart.
We need a way to let users create the relevant workflow. One option is to use a mesh
env variable and add the relevant helm options based on the env value.
https://docs.tilt.dev/api.html#modules.os.environ
Let's get Travis up and running for PRs and master (and add the requisite badges).
Now that 1.15 is out, all the k8s dependencies should be bumped. This might be an issue with deislabs/smi-sdk-go that will make this more difficult.
It is desirable to serve more than one API version at a time to allow clients a deprecation window in which they can gracefully upgrade to the next API version. How should this be accomplished?
One way would be to duplicate most of the controller code such that the copy uses the v1alpha2 types and adds any v1alpha2 functionality while the original uses the v1alpha1 types. The router would then pick the v1alpha1 or v1alpha2 handler depending on the API version of the request. Users would need two APIService objects (one for each version), both pointing at the same service.
Another way would be to simply upgrade the controller code to v1alpha2 and drop support for v1alpha1. Users would then run two different releases of the smi-metrics controller, one at the v1alpha1 release and one at the v1alpha2 release. They would create two different APIService objects and have each one point to the appropriate controller.
Now that GitHub Actions is out of beta and most projects are already using it,
I think it's better if we move.
Currently we publish the installation Helm charts as a GitHub release, but the charts in the repo are templated and thus not directly usable, causing confusion like #42, etc.
I think the best way is to update the release workflow to also update the relevant files in the repo, so that users can use the charts dir directly when they clone the repo.
Right now, make build
fails with fatal: no tag exactly matches.
Replace the tagging and image-name functionality with env variables that users can pass.
For Istio to run, it requires updates to some custom resources. It would be easier for users if there were a helm instruction that changes these values during installation.
There are common situations where you need to request edge information across a Kind within a namespace. For example, if I want to create an edges graph I need information about every deployment in the namespace. Right now, the only way to do this is to request edges for each kind found in the namespace. Instead, I propose we add a special edges endpoint.
For example, /apis/metrics.smi-spec.io/v1alpha1/namespaces/default/deployments/
could have an edges endpoint of /apis/metrics.smi-spec.io/v1alpha1/namespaces/default/deployments/$edges.
The special character is required to avoid any overlap with a possible resource name.
Any thoughts on this proposal?
Once #3 is done, the Prometheus queries will be pluggable, and Istio also stores its aggregate data in Prometheus.
With installation configuration specific to Istio (Prometheus URL, etc.) and a new ConfigMap of Prometheus queries, we should be able to get metrics from Istio.
cc @grampelberg
Right now, the queries are present in the queries.go file and are specific to Linkerd.
To make this repo support more implementations, we need a way to make the Prometheus queries pluggable. This can be solved by having each implementation provide a ConfigMap with its queries.
The corresponding ConfigMap is installed and dynamically loaded as a configuration field into Viper during initialisation.
Istio has now moved to a new telemetry model, where the metrics are not configurable yet and Istio only returns some default metrics.
There are also some changes to the default metrics, like fluxcd/flagger#478.
This can act as a placeholder issue for the Telemetry v2 discussion.
I have been noticing some discrepancies between the values reported by smi-metrics and those from Prometheus directly.
For example:
kubectl get --raw /apis/metrics.smi-spec.io/v1alpha1/namespaces/linkerd/deployments/linkerd-web/ | jq
{
  "kind": "TrafficMetrics",
  "apiVersion": "metrics.smi-spec.io/v1alpha1",
  "metadata": {
    "name": "linkerd-web",
    "namespace": "linkerd",
    "selfLink": "/apis/metrics.smi-spec.io/v1alpha1/namespaces/linkerd/deployments/linkerd-web",
    "creationTimestamp": "2020-01-28T23:58:10Z"
  },
  "timestamp": "2020-01-28T23:58:10Z",
  "window": "30s",
  "resource": {
    "kind": "Deployment",
    "namespace": "linkerd",
    "name": "linkerd-web"
  },
  "edge": {
    "direction": "from",
    "resource": null
  },
  "metrics": [
    {
      "name": "p99_response_latency",
      "unit": "ms",
      "value": "195500m"
    },
    {
      "name": "p90_response_latency",
      "unit": "ms",
      "value": "155"
    },
    {
      "name": "p50_response_latency",
      "unit": "ms",
      "value": "58333m"
    },
    {
      "name": "success_count",
      "value": "25498m"
    },
    {
      "name": "failure_count",
      "value": "0"
    }
  ]
}
Debug logs from the same time:
time="2020-01-29T00:00:37Z" level=debug msg="querying prometheus" query="sum(\n increase(\n response_total{\n classification=\"success\",\n namespace=~\"linkerd\",\n deployment=~\"linkerd-web\",\n dst_deployment=~\".+\"\n }[30s]\n )\n ) by (\n deployment,\n dst_deployment,\n namespace,\n dst_namespace\n )"
time="2020-01-29T00:00:37Z" level=debug msg="query results" query="sum(\n increase(\n response_total{\n classification=\"success\",\n namespace=~\"linkerd\",\n deployment=~\"linkerd-web\",\n dst_deployment=~\".+\"\n }[30s]\n )\n ) by (\n deployment,\n dst_deployment,\n namespace,\n dst_namespace\n )" result="{deployment=\"linkerd-web\", dst_deployment=\"linkerd-controller\", dst_namespace=\"linkerd\", namespace=\"linkerd\"} => 32.998350082495875 @[1580256037.374]"
I'm struggling to understand what might be causing this. It only happens occasionally; I was able to get reasonable values with repeated queries to the APIService. (One possible factor: Prometheus's increase() extrapolates samples to the window boundaries, so a short 30s window can yield noisy, non-integer counts like the 32.99 above.)
Maybe this is why:
https://github.com/deislabs/smi-metrics/blob/4109ac7c83538ad13cb5681bbefd1aa91209a244/pkg/prometheus/client.go#L102
Currently, even though we build and release the Helm chart after each release through the same GitHub workflow, we don't publish the chart anywhere. Users who want to try the component have to download it and generate the template. Publishing it somewhere would make this much easier for users.
@stefanprodan suggested using https://github.com/marketplace/actions/helm-publisher, which feels like the simplest option and does the job pretty well. I'm planning to create a gh-pages
branch with the current release Helm packages and also update the CI to do this automatically whenever we perform a release. WDYT?
Maesh is a new service mesh from Containous, based on the Traefik proxy. It also uses Prometheus internally for metrics; adding smi-metrics support would be great!
A default window of 30 seconds may not be desired for every query. This should be made configurable with an optional parameter on every endpoint. For example,
/apis/metrics.smi-spec.io/v1alpha1/namespaces/{Namespace}/{Kind}/{ResourceName}
should support an optional query parameter ?window=30s.
This would allow the window to be set per request.
Additionally, the window size should also be globally configurable in the ConfigMap.