observatorium / api

The Observatorium API

License: Apache License 2.0

observability monitoring tracing prometheus jaeger thanos loki

api's Introduction

Observatorium


Configuration for a Multi-Tenant, Flexible, Scalable Observability Backend

Observatorium allows you to run and operate a multi-tenant, easy-to-operate, scalable, open source observability system on Kubernetes. This system lets you ingest, store and use common observability signals like metrics, logs and traces. Observatorium is a "meta project" that allows you to manage, integrate and combine multiple well-established existing projects such as Thanos, Loki, Tempo/Jaeger and Open Policy Agent under a single consistent system with well-defined tenancy APIs and signal correlation capabilities.

As active maintainers of and contributors to the underlying projects, we created a reference configuration, with extra software that connects those open source solutions into one unified and easy-to-use service. It fills the gaps between those projects, adding the consistency, multi-tenancy, security and resiliency pieces that are needed for a robust backend.

Read more in the High Level Architecture docs.

Context

As the Red Hat Monitoring Team, we have been focusing on observability software and concepts since the CoreOS acquisition. From the beginning, one of our main goals was to establish stable in-cluster metric collection, querying, and alerting for OpenShift clusters. With the growth of managed OpenShift (OSD) clusters, the scope of the team's goals has expanded: we had to develop a scalable, global metric stack that can run in local as well as central locations for monitoring and telemetry purposes. We also worked together with the Red Hat Logging and Tracing teams to implement something similar for logging and tracing. We're also working on Continuous Profiling aspects.

From the very beginning our teams have leveraged Open Source to accomplish all those goals. We believe that working with the communities is the best way to build long-term, successful systems, share knowledge and establish solid APIs. You might not have seen us, but members of our teams have been actively maintaining and contributing to major Open Source standards and projects like Prometheus, Thanos, Loki, Grafana, kube-state-metrics (KSM), prometheus-operator, kube-prometheus, Alertmanager, cluster-monitoring-operator (CMO), OpenMetrics, Jaeger, ConProf, Cortex, CNCF SIG Observability, Kubernetes SIG Instrumentation and more.

What's Included

  • Observatorium is primarily defined in Jsonnet, which allows great flexibility and reusability. The main configuration resources are stored in the components directory, and they import further official resources like kube-thanos.

  • We are aware that not everybody speaks Jsonnet, and not everybody has their own GitOps pipeline, so we designed alternative deployments based on the main Jsonnet resources. The Operator project delivers a plain Kubernetes Operator that operates Observatorium.

NOTE: Observatorium is a set of cloud-native, mostly stateless components that generally do not require special operating logic. For those operations that do require automation, specialized controllers were designed. Use the Operator only if it is your primary installation mechanism or if you don't have a CI pipeline.

NOTE2: The Operator is under heavy development. There are already plans to streamline its usage and redesign the current CustomResourceDefinition in the next version. Yet, it is currently used in production by several larger users, so any changes will be made with care.

  • The Thanos Receive Controller is a Kubernetes controller written in Go that distributes essential tenancy configuration to the desired pods.

  • The API is the facade of the Observatorium service. It's a lightweight proxy written in Go that helps with multi-tenancy (isolation, cross-tenant requests, rate limiting, roles, tracing). This proxy should be used for all external traffic to Observatorium.

  • OPA-AMS is our Go library for integrating Open Policy Agent with the Red Hat authorization service for a smooth OpenShift experience.

  • up is a useful Go service that periodically queries Observatorium and outputs vital metrics on the healthiness and performance of the Observatorium read path over time.

  • token-refresher is a simple Go CLI for performing the OIDC refresh flow.

Getting Started

Status: Work In Progress

While the metrics and logging parts, using Thanos and Loki, are used in production at Red Hat, the documentation, full design, user guides and support for different configurations are still in progress.

Stay Tuned!

Missing something or not sure?

Let us know! Visit our Slack channel or open a GitHub issue!

api's People

Contributors

aminesnow, brancz, bwplotka, clyang82, coleenquadros, dependabot[bot], douglascamata, esnible, jessicalins, jpkrohling, kakkoyun, krasi-georgiev, matej-g, metalmatze, morvencao, nitinpatil1992, nyza99, onprem, pavolloffay, pb82, periklis, philipgough, red-gv, rubenvp8510, saswatamcode, shwetaap, spaparaju, squat, tareqmamari, xperimental


api's Issues

oidcConfig: is ClientSecret used?

When navigating through the code to try to understand the OIDC flow, I haven't found whether we are making use of the ClientSecret field from the oidcConfig struct (and, consequently, from the tenants file, for example here and here).

For the login handler, it seems like just ClientID is used.

Same for the callback handler when verifying here.

Is ClientSecret used somewhere else? If not, we could clean up the code a bit to avoid confusion.

Add documentation

There is a need for more documentation:

  • What is this repo about? High level description.
  • What are the APIs exposed and how to use them. (Metrics read/receive, Logs...)
  • RBAC - tenants.
  • TLS, mTLS

Consider adopting the Prometheus JSON format for error responses

Referring to: https://prometheus.io/docs/prometheus/latest/querying/api/#format-overview

We chatted about this with @saswatamcode (off the back of #318) and we find the current approach of mixed JSON (returned for 2xx responses) and plain-text (returned for 4xx and 5xx) responses unsatisfactory.

Ideally, we should enhance the API spec to specify that error responses are also JSON and adapt the API to return the Prometheus JSON format for errors. This would have the added benefit of using the well-known Prometheus format, which would make the user experience nicer, especially in the downstream web UI. For example, right now the Thanos UI will fail to show an error response, since it expects JSON and cannot parse a plain-text response.
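For reference, a response in the Prometheus format would wrap the error in a JSON envelope along these lines (illustrative values, not an actual API response):

{
  "status": "error",
  "errorType": "bad_data",
  "error": "invalid parameter \"query\": unknown function \"foo\""
}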

What claim name in OpenID tokens is used for representing groups?

We are trying to connect directly to the API with, for instance, Grafana (but it could be other clients for that matter). When working directly with a user it works by setting the "sub" value in the token to the name of the user. During authorization the API then maps the username of the logged-in user to an entry in the RBAC. We use Forward OAuth Identity in Grafana and it works fine.
We do, however, not know how to make it work with groups. What field in the token should be used for groups (or roles)? We have tried the most obvious, "groups", for instance: "groups": ["/grouptenant1"]. We have also tried to remove the path, for example: "groups": ["grouptenant1"].
As a last resort we tried to add the group membership to the sub field. Since it then becomes an array and the code expects a string, it of course fails.
So in short: how do we present group membership in the token so that the API can use it for authorization?
We use Keycloak to generate the token.
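For illustration only: based on the checkAuth logic quoted in a later issue below, the API reads whichever claim the tenant's groupClaim setting names. Assuming groupClaim is configured as "groups" (an assumption, not a confirmed default), the relevant part of the token payload would be:

{
  "sub": "grafana-user",
  "groups": ["grouptenant1"]
}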

Rules API: Response code is internal server error even if caused by bad request

We currently always return 500 if (un)marshalling of the rules YAML provided by the user fails. However, more often than not, the unmarshalling fails due to a user error, for example because the provided YAML is not valid. We should be able to detect these cases and, when they happen, return 400 instead. (The current behaviour can also lead to false positives in monitoring, since it artificially increases the number of 500 server responses, even though these are user errors.)
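A minimal sketch of the proposed behaviour, assuming the handler reads the raw YAML body before forwarding it (handler and variable names are illustrative, not the actual code):

package main

import (
	"fmt"
	"io"
	"net/http"

	"gopkg.in/yaml.v2"
)

// rulesHandler is a hypothetical handler: YAML that fails to unmarshal is the
// client's fault, so we answer 400; only failures past that point stay 500.
func rulesHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "failed to read request body", http.StatusInternalServerError)
		return
	}
	var doc map[string]interface{}
	if err := yaml.Unmarshal(body, &doc); err != nil {
		// Malformed YAML from the user: a bad request, not a server error.
		http.Error(w, fmt.Sprintf("invalid rules YAML: %v", err), http.StatusBadRequest)
		return
	}
	// ...forward the validated payload upstream; upstream failures remain 500s.
	w.WriteHeader(http.StatusOK)
}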

HTTP handler monitoring middleware isn't the topmost in the middleware stack

While working on #295, I noticed that the HTTP metrics middleware isn't mounted at the top of the stack. This may cause some HTTP requests to not be properly instrumented when a given middleware writes the response and stops the chain execution.

For instance, here we mount the authentication middlewares for each tenant. If the tenant is using OIDC authentication, the OIDC middleware can finish the request early for a number of reasons without calling the next middlewares, for example if the Authorization header is invalid. When this happens, the rest of the middleware stack never gets executed.

Proposed solution

Mount the HTTP handler monitoring middleware at the top of the stack to ensure that all the HTTP requests get instrumented.
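A minimal sketch of that ordering with a chi router (the middleware constructors are placeholders for the real instrumentation and authentication middlewares):

package main

import (
	"net/http"

	"github.com/go-chi/chi"
)

// newRouter registers the instrumentation middleware first, so it wraps
// everything below it, including authentication middlewares that may write a
// response and stop the chain early.
func newRouter(instrument, authenticate func(http.Handler) http.Handler) http.Handler {
	r := chi.NewRouter()
	r.Use(instrument)   // topmost: observes every request, even rejected ones
	r.Use(authenticate) // may answer 401 and never call the next handler
	r.Get("/api/metrics/v1/{tenant}/api/v1/query", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return r
}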

Liveness and Readiness probes fail with TLS configuration.

While working on PR-273 I got a working API Pod that never gets marked as ready.

The description revealed that both Liveness and Readiness probes fail:

 Warning  Unhealthy  2m52s (x3 over 3m52s)   kubelet, kind-control-plane  Liveness probe failed: HTTP probe failed with statuscode: 503
 Warning  Unhealthy  2m46s (x19 over 4m15s)  kubelet, kind-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503

I had to work around it by enabling TCP probes to get the Pod running, but this is not a suitable replacement as it does not actually check the application.

The deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: api
    app.kubernetes.io/instance: observatorium-xyz
    app.kubernetes.io/name: observatorium-api
    app.kubernetes.io/part-of: observatorium
    app.kubernetes.io/version: master-2020-05-27-v0.1.1-67-gb2eb292
  name: observatorium-xyz-observatorium-api
  namespace: observatorium
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: api
      app.kubernetes.io/instance: observatorium-xyz
      app.kubernetes.io/name: observatorium-api
      app.kubernetes.io/part-of: observatorium
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/component: api
        app.kubernetes.io/instance: observatorium-xyz
        app.kubernetes.io/name: observatorium-api
        app.kubernetes.io/part-of: observatorium
        app.kubernetes.io/version: master-2020-05-27-v0.1.1-67-gb2eb292
    spec:
      containers:
      - args:
        - --web.listen=0.0.0.0:8080
        - --web.internal.listen=0.0.0.0:8081
        - --logs.read.endpoint=http://127.0.0.1
        - --logs.write.endpoint=http://127.0.0.1
        - --metrics.read.endpoint=http://observatorium-xyz-cortex-query-frontend.observatorium.svc.cluster.local:9090
        - --metrics.write.endpoint=http://observatorium-xyz-thanos-receive.observatorium.svc.cluster.local:19291
        - --log.level=warn
        - --tls-cert-file=/mnt/certs/server.pem
        - --tls-private-key-file=/mnt/certs/server.key
        - --tls-client-ca-file=/mnt/certs/ca.pem
        - --tls-reload-interval=1m
        image: quay.io/observatorium/observatorium:master-2020-05-27-v0.1.1-67-gb2eb292
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /live
            port: 8081
            scheme: HTTP
          periodSeconds: 30
        name: observatorium-api
        ports:
        - containerPort: 8081
          name: internal
        - containerPort: 8080
          name: public
        readinessProbe:
          failureThreshold: 12
          httpGet:
            path: /ready
            port: 8081
            scheme: HTTP
          periodSeconds: 5
      volumes:
      - name: observatorium-api-tls-secret
        secret:
          secretName: observatorium-api-tls-secret

The issue should probably be handled in observatorium-api code.

Tracing: Update to latest OpenTelemetry version

We are currently on 0.18.0, but in the meantime there have already been a couple of stable releases, the latest one being 1.3.0. Since the package organization, the public API, etc. have changed between versions, this will require some manual 'reshuffling'.

[RFE] Ephemeral storage support for development environments

I have access to an on-demand OpenShift environment for testing (QuickLAB), but it does not support persistent storage. The ability to have ephemeral storage support (emptyDir or whatever) would be useful for development purposes. I would not expect to see this recommended or used in a production environment.

(Not sure what, if any, code changes would be required to support this, or if this is just a documentation problem.)

Thanks!

The observatorium-api manifest should allow more granular TLS configuration.

In a recent PR to observatorium/configuration, I added the option to mount a secret with a cert file, private key file, and ca file.

The first two are required to enable TLS for the Observatorium API gateway, while providing the latter essentially enables mTLS.

As a follow-up to a discussion with @brancz, we aimed to support both TLS and mTLS in plain manifests and the Observatorium operator.

This can be achieved by separating the sensitive information into a secret and the CA file into a ConfigMap (and mounting both). The ConfigMap should be optional.

In the case where users provide a secret that contains the cert file and private key but no CA file in a ConfigMap, the observatorium-api manifest will reject it due to manifest validations. Those validations should be fixed, since it should essentially be all withTLS parameters or none.

generate-tls-cert does not parse cmd line arguments

The generate-tls-cert tool is used in the integration tests for this repository, and since those don't run on top of Kubernetes, the defaults work fine.

However, for other use cases, such as testing TLS-based deployments as done in PR-273, the aforementioned defaults won't work:

caller=level.go:63 ts=2020-05-31T08:04:29.825399028Z name=up level=error caller=main.go:184 component=reader msg="failed to query" err="query request failed: Post https://observatorium-xyz-observatorium-api.observatorium.svc.cluster.local:8080/api/metrics/v1/api/v1/query: x509: certificate is valid for localhost, not observatorium-xyz-observatorium-api.observatorium.svc.cluster.local"

generate-tls-cert should process command line arguments to allow custom certificates.

jsonnet lib generates bogus tenant YAML when mTLS is set

When a tenant specifies an mTLS node with configMapName and caKey, the resulting tenants.yaml contains two entries for the same tenant, one with .mTLS.caPath and one with the remaining options. This causes the API to break with the following message:

$ kubectl logs -n observatorium observatorium-xyz-observatorium-api-6d7b475569-jjd82
2020/10/23 14:21:32 failed to parse CA certificate PEM for tenant "test"

Here's the full secret that is generated:

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: api
    app.kubernetes.io/instance: observatorium-xyz
    app.kubernetes.io/name: observatorium-api
    app.kubernetes.io/part-of: observatorium
    app.kubernetes.io/version: master-2020-09-08-v0.1.1-142-g61a908f
  name: observatorium-xyz-observatorium-api
  namespace: observatorium
stringData:
  tenants.yaml: |-
    "tenants":
    - "id": "1610b0c3-c509-4592-a256-a1871353dbfa"
      "mTLS":
        "caPath": "/var/run/mtls/test/service-ca.crt"
      "name": "test"
    - "id": "1610b0c3-c509-4592-a256-a1871353dbfa"
      "mTLS":
        "caKey": "service-ca.crt"
        "configMapName": "service-ca-tls"
      "name": "test"
      "oidc":
        "clientID": "test"
        "clientSecret": "ZXhhbXBsZS1hcHAtc2VjcmV0"
        "issuerCAPath": "/var/run/mtls/test/service-ca.crt"
        "issuerURL": "https://dex.dex.svc.cluster.local:5556/dex"
        "usernameClaim": "email"

From what I could see, here's the offending code:
https://github.com/observatorium/observatorium/blob/7c57706b8380a63abe4713f92300c29a9a30f335/jsonnet/lib/observatorium-api.libsonnet#L295-L302

Unfortunately, my knowledge of jsonnet is insufficient to fix this issue.

"make test-interactive" gives "exec format error" on MacOS

"make test-interactive" gives "exec format error" on MacOS.

Almost certainly the problem is that the Makefile builds observatorium_api for the host OS, but then packages that OS-specific binary in the Linux-specific container image.

Workaround:

rm ./observatorium-api
OS=linux GOARCH=amd64 make test-interactive

Output without the work-around

16:13:17 Starting observatorium_api
16:13:18 observatorium_api: standard_init_linux.go:228: exec user process caused: exec format error
16:13:52 
Error: No such object: e2e_interactive-observatorium_api

    interactive_test.go:33: interactive_test.go:33:
        
         unexpected error: docker container observatorium_api failed to start: exit status 1

OIDC authenticator skips identity check if no username or group claim are present

The oidcAuthenticator.checkAuth function has logic that can facilitate the introduction of security issues.

It properly verifies the OIDC token (checks issuer and signature), but if there's no username or group claim configured for a tenant it will happily do no identity check and consider the token authorized.

So it enables something like this to happen:

  1. Tenant A and B use the same OIDC server to authenticate (will pass issuer and signature check).
  2. Tenant B has bad configuration in Observatorium, without username or group claims.
  3. Tenant A can send data to Tenant B (no identity check will happen).

The current logic is as follows:

func (a oidcAuthenticator) checkAuth(ctx context.Context, token string) (context.Context, string, int, codes.Code) {
    // Check the issuer and signature.
    idToken, err := a.verifier.Verify(oidc.ClientContext(ctx, a.client), token)
    ...
    // If there is a configured username claim for this tenant's authenticator, check it.
    if a.config.usernameClaim != "" {
        ...
    }
    ...
    // If there is a group claim for this tenant's authenticator, check it.
    if a.config.GroupClaim != "" {
        ...
    }
    // Bad logic here: no username or group claim for this tenant? Authorized without identity check!
    return ctx, "", http.StatusOK, codes.OK
}

I don't think it poses a security risk at the moment, but I do believe we should take preventive action and enforce that all tenants configure either a username or a group claim to avoid problems.

By default, we should consider everything unauthorized.
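A minimal sketch of such a preventive check at tenant-configuration load time (function and parameter names are assumptions, not the actual code):

package main

import "fmt"

// validateOIDCClaims refuses a tenant whose OIDC config has neither a
// username claim nor a group claim, so checkAuth can never fall through to an
// unconditional "authorized".
func validateOIDCClaims(tenant, usernameClaim, groupClaim string) error {
	if usernameClaim == "" && groupClaim == "" {
		return fmt.Errorf("tenant %q: either usernameClaim or groupClaim must be configured", tenant)
	}
	return nil
}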

Consolidate `golangci-lint` enabled linters in the config

It seems that we are currently enabling all linters in our config; however, not all of them are useful to us, for example:

  • There are plenty of nolint:gochecknoglobals and nolint:gochecknoinits directives, even though we are aware of the reasons we're using globals / init() in particular source files in our code base
  • exhaustivestruct also seems to be recommended only for special cases, but we're enabling it without (I think) a good reason

The ideal outcome of resolving this issue: we understand which linters cause more noise than they add value, and those linters are disabled.

Adding a Postgres database dependency to test with Docker?

So far we have all tests running straight-up binaries. I don't think that's feasible with Postgres anymore.
I'm happy to be proven otherwise, but I think running a simple Docker container makes a lot more sense for this going forward.

Is this even a big deal to begin with, or should we try to make it work without Docker?

WDYT?

test-interactive: Run thanos-rule-syncer

Currently, when running make test-interactive we can create/list rules via the api/v1/rules/raw endpoint.

However, we are not able to see those rules synced in the api/v1/rules endpoint, because we don't run an instance of thanos-rule-syncer.

Adding this would improve the user experience in general and allow a more complete testing flow (suggested by @kyoto).

Start versioning API releases

We discussed this in the last community meeting, in the context of now having a release process for obsctl (observatorium/obsctl#26).

There were a couple of ideas floating around:

  • Having versioned releases could encourage higher adoption, i.e. show we're serious about progressing towards a stable release
  • It would also keep us accountable for moving towards the stable-release goal
  • It could force us to be more mindful about what features we're adding and define what we want to have in a stable release
  • It could give us an easier way to deprecate features (e.g. we still have a /legacy path): we could indicate this by e.g. saying we'll drop a certain feature in 3 minor releases

To accomplish this we need:

  • Agree on versioning (most probably semantic versioning) and the starting version (0.1.0?)
  • Have a release process, release schedule

Builds are slow

Builds are slow. make test-interactive takes two minutes for me, with over 90 seconds spent on RUN git update-index --refresh; make build, even if the change is just one line of code. I have seen it take over 5 minutes.

The make build part is fast enough.

Updating the index is perhaps needed, but we want to get as much as possible downloaded before the ADD . /opt, which invalidates Docker's cache.

Downloading everything every time causes trouble when there is a GitHub problem; today a make test-interactive failed with:

#11 77.93 	github.com/spf13/afero: github.com/spf13/afero@v1.2.2: Get "https://proxy.golang.org/github.com/spf13/afero/@v/v1.2.2.zip": read tcp 172.17.0.2:38556->142.250.65.209:443: read: connection reset by peer

Accept only specified methods on endpoints

In the process of looking at #182 (comment) I discovered that we currently accept any method on all endpoints, meaning users can use e.g. the DELETE method on a query endpoint. This is not correct.

We should respond only to methods which are applicable to the endpoint, and otherwise respond with 405, as sketched below.
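A minimal sketch with a chi router, which answers 405 for any method not explicitly registered on a route (paths and handler names are illustrative):

package main

import (
	"net/http"

	"github.com/go-chi/chi"
)

func newMetricsRouter(queryHandler, receiveHandler http.HandlerFunc) http.Handler {
	r := chi.NewRouter()
	// Registering per-method handlers makes chi reply 405 Method Not Allowed
	// for anything else (e.g. DELETE on the query endpoint).
	r.Get("/api/metrics/v1/{tenant}/api/v1/query", queryHandler)
	r.Post("/api/metrics/v1/{tenant}/api/v1/receive", receiveHandler)
	return r
}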

Get rid of GOARCH=amd64 in Makefile

As far as I can tell, observatorium builds fine on linux ppc64le.
Is there any reason to force GOARCH to amd64 in the Makefile?

observatorium: deps main.go $(wildcard *.go) $(wildcard */*.go)
	CGO_ENABLED=0 GOOS=$(OS) GOARCH=amd64 GO111MODULE=on GOPROXY=https://proxy.golang.org go build -a -ldflags '-s -w' -o $@ .

TestRulesAPI failed due to missing the object e2e_rules_api-rules-minio

The test TestRulesAPI failed while we were using a CI pipeline to containerize observatorium-api, breaking the pipeline. The two sections of logs below are related; may I ask for your support regarding this?

11:01:20 rules-minio: Pull complete
11:01:20 rules-minio: Digest: sha256:a150b34a87e44bb8d9a862c22dd2a737e1918990c6a9f0513f9314c6270921ab
11:01:20 rules-minio: Status: Downloaded newer image for minio/minio:RELEASE.2021-07-27T02-40-15Z
11:01:20 opa: {"current_version":"0.31.0","download_opa":"https://openpolicyagent.org/downloads/v0.37.1/opa_linux_amd64","latest_version":"0.37.1","level":"info","msg":"OPA is out of date.","release_notes":"https://github.com/open-policy-agent/opa/releases/tag/v0.37.1","time":"2022-02-03T11:01:20Z"}
11:01:20 Ports for container e2e_logs_read_write_tail-opa >> Local ports: map[http:8181] Ports available from host: map[http:49190]
11:01:20 opa: {"client_addr":"172.22.0.1:38834","level":"info","msg":"Received request.","req_id":1,"req_method":"GET","req_path":"/health","time":"2022-02-03T11:01:20Z"}
11:01:20 opa: {"client_addr":"172.22.0.1:38834","level":"info","msg":"Sent response.","req_id":1,"req_method":"GET","req_path":"/health","resp_bytes":2,"resp_duration":3.09471,"resp_status":200,"time":"2022-02-03T11:01:20Z"}
11:01:20 Starting loki
11:01:20 loki: Unable to find image 'grafana/loki:2.3.0' locally
11:01:21 rules-minio: useradd: UID 0 is not unique

Error: No such object: e2e_rules_api-rules-minio

=== CONT  TestRulesAPI
    services.go:57: services.go:57:
        
         unexpected error: failed to get mapping for port as container e2e_rules_api-rules-minio exited: exit status 1: docker container rules-minio failed to start: exit status 1
        
11:01:52 Killing rules-minio
11:01:52 Error response from daemon: Cannot kill container: e2e_rules_api-rules-minio: No such container: e2e_rules_api-rules-minio

11:01:52 Unable to kill service rules-minio : exit status 1
11:01:52 Killing opa
11:01:53 Killing gubernator
11:01:53 Killing dex
--- FAIL: TestRulesAPI (53.75s)
FAIL
FAIL	github.com/observatorium/api/test/e2e	95.152s
FAIL
Makefile:93: recipe for target 'test-e2e' failed
make: *** [test-e2e] Error 1

Only one CA certificate is supported in the mTLS CA file

Currently, when loading certificates from the mTLS CA file, only the first certificate is loaded to verify incoming client certificates. If there is more than one certificate in the CA file, the others are ignored:

api/main.go, lines 239 to 251 in cbff1da:

block, _ := pem.Decode(t.MTLS.RawCA)
if block == nil {
	skip.Log("tenant", t.Name, "err", "failed to parse CA certificate PEM")
	tenantsCfg.Tenants[i] = nil
	continue
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
	skip.Log("tenant", t.Name, "err", fmt.Sprintf("failed to parse CA certificate: %v", err))
	tenantsCfg.Tenants[i] = nil
	continue
}
t.MTLS.ca = cert

It's a very common scenario that more than one certificate is required, e.g. assume thousands of clients are talking to the API gateway. When the CA certificate is renewed, there will be a period when some clients use certificates signed by the new CA cert, while others still use certificates signed by the old CA cert.
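A minimal sketch of loading every certificate from the CA file into a pool instead of only the first PEM block (names are illustrative; the real code would feed the pool into tls.Config.ClientCAs):

package main

import (
	"crypto/x509"
	"fmt"
)

// caPoolFromPEM appends all certificates found in rawCA to a pool, so client
// certificates signed by either an old or a renewed CA can be verified.
func caPoolFromPEM(tenant string, rawCA []byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	if ok := pool.AppendCertsFromPEM(rawCA); !ok {
		return nil, fmt.Errorf("tenant %q: no valid CA certificates found", tenant)
	}
	return pool, nil
}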

Make gRPC interceptors support Authz/Authn

We can't implement the custom authz logic for Observatorium in OpenTelemetry, so we will need to proxy the authorized gRPC requests. Unfortunately, in order to provide authz, we also need authn. This means we will need to build a gRPC auth interceptor proxy into Observatorium. The proxies must be application-agnostic, that is, they shouldn't know anything about the application-level proto for the endpoints to which they are forwarding. They should only look at frame headers for auth information. A rough sketch of such an interceptor follows.
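A hedged sketch of what such an application-agnostic interceptor could look like (the checkAuth callback and the metadata key are assumptions for illustration, not the planned design):

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// authInterceptor only inspects request metadata (headers), never the proto
// payload, and delegates the actual authn/authz decision to checkAuth.
func authInterceptor(checkAuth func(ctx context.Context, token string) error) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		md, ok := metadata.FromIncomingContext(ctx)
		if !ok || len(md.Get("authorization")) == 0 {
			return nil, status.Error(codes.Unauthenticated, "missing authorization metadata")
		}
		if err := checkAuth(ctx, md.Get("authorization")[0]); err != nil {
			return nil, status.Error(codes.PermissionDenied, "not authorized")
		}
		return handler(ctx, req)
	}
}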

Rules API: Validate tenant rules

As mentioned by @squat here:

"Note: this is a distinct problem from ensuring that the rules written by
a tenant are guaranteed to only include data from that tenant.
Currently, the rules from a tenant can select data from any tenant. This
is a distinct problem and should absolutely be tackled in a follow up."

Idea: Fail-fast if clientID/clientSecret missing

Observatorium-API refuses to start if IssuerURL is empty:

https://github.com/observatorium/api/blob/main/authentication/oidc.go#L78

It would also be helpful if Observatorium-API failed to start without a clientID or clientSecret. I spent about 20 minutes struggling because I had left clientID blank.

With a bogus tenant.yaml containing:

  oidc:
    clientID: ""
    "clientSecret": ""
    issuerURL: "https:/myoidc.apps.observability-d.mycluster.p1.openshiftapps.com/auth/realms/myrealm"

Observatorium came up fine, but gRPC clients who sent valid headers saw:

ERROR:
  Code: InvalidArgument
  Message: failed to authenticate: oidc: invalid configuration, clientID must be provided or SkipClientIDCheck must be set

That message makes sense only to Observatorium implementors, not to clients. I assume HTTP clients delivering logs would see something similar.
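A minimal sketch of the suggested fail-fast check at startup (assuming the client-secret flow is in use; names are illustrative, not the actual code):

package main

import "fmt"

// validateOIDC rejects an OIDC tenant configuration with an empty clientID or
// clientSecret at startup, instead of surfacing a confusing error on every
// request later on.
func validateOIDC(tenant, issuerURL, clientID, clientSecret string) error {
	if issuerURL == "" {
		return fmt.Errorf("tenant %q: issuerURL must not be empty", tenant)
	}
	if clientID == "" {
		return fmt.Errorf("tenant %q: clientID must not be empty", tenant)
	}
	if clientSecret == "" {
		return fmt.Errorf("tenant %q: clientSecret must not be empty", tenant)
	}
	return nil
}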

Trace Read tenancy architecture and implementation

Now that #252 has merged we can query traces through Observatorium.

Unfortunately, we don't have a full story for queries because Jaeger doesn't support tenancy yet. (For writing traces we handle tenancy at the OTel collector level, see observatorium/observatorium#461.)

In order to make #267 meaningful we need to decide how to handle tenancy for queries. I see three approaches and would like to start a discussion about the correct one.

  • Do nothing, and wait for the Jaeger community to implement jaegertracing/jaeger#3427
  • Allow templates beyond simple hostnames in -trace.read.endpoint. Using this approach we could say -trace.read.endpoint="jaeger-{tenant}:14250", and Observatorium would route queries to the appropriate Jaeger instance (sketched below).
  • Create an additional reverse proxy which forwards in the same way, perhaps by generating an Envoy or Nginx config from jsonnet, taking the role for queries that OTel takes for posting traces.

@pavolloffay has suggested the template endpoint approach would be best. (When Jaeger gets multi-tenancy we would simply stop supplying templates to -trace.read.endpoint.)

I'd like to get this in before #267 so that I can test properly.
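A minimal sketch of the templated-endpoint idea (the second option above); the placeholder syntax is as proposed in this issue, the helper itself is purely illustrative:

package main

import "strings"

// traceReadEndpointFor expands a per-tenant endpoint template, e.g.
// template "jaeger-{tenant}:14250" and tenant "team-a" yield "jaeger-team-a:14250".
func traceReadEndpointFor(template, tenant string) string {
	return strings.ReplaceAll(template, "{tenant}", tenant)
}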

Integration test for logs tail/write keeps failing silently

It seems that this particular test keeps failing permanently, without an explicit failure message. This applies to running it both locally and on CircleCI (see e.g. a recent run here - https://app.circleci.com/pipelines/github/observatorium/api/87/workflows/d8c59c8d-f3c8-4b45-b0f3-50700da87d7a/jobs/357 - but since it exits with code 0, I assume that is why it shows up as green in CI).

Relevant excerpt from the output (from a locally run make test-integration):

-------------------------------------------
- Logs Read/Write tests: OK               -
-------------------------------------------
-------------------------------------------
- Logs Tail/Write tests                   -
-------------------------------------------
level=debug name=observatorium ts=2021-07-13T10:31:43.501425403Z caller=instrumentation.go:36 request=mgera.remote.csb/JKeZgC35Pg-000080 proto=HTTP/2.0 method=POST status=404 content=application/json path=/api/logs/v1/test-oidc/api/v1/push duration=678.177µs bytes=19
Can't use SSL_get_servername
depth=1 CN = observatorium
verify return:1
depth=0 CN = localhost
verify return:1
level=debug name=observatorium ts=2021-07-13T10:31:43.508231948Z caller=instrumentation.go:36 request=mgera.remote.csb/JKeZgC35Pg-000081 proto=HTTP/1.1 method=GET status=404 content= path=/api/logs/v1/test-oidc/api/v1/tail duration=146.55µs bytes=19
websocat: WebSocketError: Received unexpected status code
websocat: error running
level=info name=observatorium ts=2021-07-13T10:31:43.509220823Z caller=main.go:354 msg="caught interrupt"                                                                                                         
level=info name=observatorium ts=2021-07-13T10:31:43.509247696Z caller=main.go:598 msg="shutting down the HTTP server"                                                                                            
{"level":"info","msg":"Shutting down...","time":"2021-07-13T12:31:43+02:00"}
level=error ts=2021-07-13T10:31:43.509597488Z caller=transfer.go:192 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"                                                 
{"level":"info","msg":"Server shutdown.","time":"2021-07-13T12:31:43+02:00"}

Bogus log message "maxprocs: No GOMAXPROCS change to reset%!(EXTRA []interface {}=[])"

To reproduce, run api with --log.level debug, then shut down api with control-c.

level=info name=observatorium ts=2022-02-10T16:44:15.914672Z caller=main.go:393 msg="caught interrupt"
level=info name=observatorium ts=2022-02-10T16:44:15.914752Z caller=main.go:665 msg="shutting down the HTTP server"
level=debug name=observatorium ts=2022-02-10T16:44:15.917624Z caller=main.go:364 msg="maxprocs: No GOMAXPROCS change to reset%!(EXTRA []interface {}=[])"

Probing Liveness/Readiness is failing when tls-config is enabled.

Apparently, when enabling tls-config, the /-/healthy and /-/ready endpoints require a client certificate, just like the write and read endpoints. Therefore, when deployed on k8s/OCP the pod fails to deploy.

oc describe output:

Events:
  Type     Reason     Age               From                         Message
  ----     ------     ----              ----                         -------
  Normal   Scheduled  <unknown>         default-scheduler            Successfully assigned observatorium/observatorium-xyz-observatorium-api-gateway-78d458758c-pvtbx to crc-w6th5-master-0
  Normal   Pulled     1m                kubelet, crc-w6th5-master-0  Container image "quay.io/observatorium/observatorium:master-2020-04-16-v0.1.0-2-g2a4aa00" already present on machine
  Normal   Created    1m                kubelet, crc-w6th5-master-0  Created container observatorium-api-gateway
  Normal   Started    1m                kubelet, crc-w6th5-master-0  Started container observatorium-api-gateway
  Warning  Unhealthy  20s               kubelet, crc-w6th5-master-0  Liveness probe failed: HTTP probe failed with statuscode: 400
  Warning  Unhealthy  3s (x7 over 33s)  kubelet, crc-w6th5-master-0  Readiness probe failed: HTTP probe failed with statuscode: 400

oc logs output:

level=debug name=observatorium ts=2020-04-23T12:52:53.512151038Z caller=main.go:94 msg="maxprocs: Leaving GOMAXPROCS=[6]: CPU quota undefined"
level=info name=observatorium ts=2020-04-23T12:52:53.51238815Z caller=main.go:109 msg="starting observatorium"
level=info name=observatorium ts=2020-04-23T12:52:53.512610768Z caller=config.go:27 protocol=HTTP msg="enabling server side TLS"
level=info name=observatorium ts=2020-04-23T12:52:53.535028952Z caller=config.go:68 protocol=HTTP msg="server TLS client verification enabled"

Observatorium-API with TLS difficulties on MacOS

Observatorium-API requires TLS 1.3, which is not the default on MacOS. In addition, the error message on Mac is a bit confusing:

First, I ran make test-interactive.
(Note that this test prints many lines of output, then the important bit with the ports, then endless lines of server logs. The docs should highlight that the user of this test needs to hunt through the output looking for the port listing.)

Opening http://127.0.0.1:63256 in browser.

You're all set up!
========================================
Observatorium API on host machine: 		127.0.0.1:63330 
Observatorium internal server on host machine: 	127.0.0.1:63331 
Thanos Query on host machine: 			127.0.0.1:63256 
Loki on host machine: 				127.0.0.1:63298 

I wanted to test the Observatorium API.

curl 127.0.0.1:63330
Client sent an HTTP request to an HTTPS server.

(It might be worthwhile to put https:// on the endpoints output above.)

curl https://127.0.0.1:63330/ 
curl: (35) error:1400442E:SSL routines:CONNECT_CR_SRVR_HELLO:tlsv1 alert protocol version

After three hours of hair-pulling and openssl s_client debugging, the problem became obvious:

curl --tlsv1.3 https://127.0.0.1:63330/                        
curl: (4) LibreSSL was built without TLS 1.3 support

Observatorium-API requires TLS 1.3, and MacOS curl doesn't output anything useful.

Following the instructions at https://learnings.bolmaster2.com/posts/curl-openssl-tlsv1.3-on-macos.html I was able to get an openssl-based curl and verify that Observatorium-API is fine:

brew install curl-openssl
/usr/local/opt/curl/bin/curl --insecure https://127.0.0.1:63330/
{
  "paths": [
    "/api/logs/v1/{tenant}/*",
    "/api/metrics/v1/{tenant}/*",
    "/api/v1/{tenant}/*",
    "/oidc/{tenant}/*",
    "/{tenant}"
  ]
}

My suggestion is to either configure Observatorium-API to tolerate TLS 1.2 or include a section for Mac users explaining that "SSL routines:CONNECT_CR_SRVR_HELLO:tlsv1 alert protocol version" means that a third-party openssl-based curl is needed.

My Chrome could have handled this: I went to https://127.0.0.1:63330/ in Chrome and was prompted for a cert. If I had clicked 'cancel' I would have seen that everything was fine, but I was uncertain which cert to supply so I didn't try.
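If tolerating TLS 1.2 turns out to be acceptable, the relevant knob would be the server's minimum TLS version, roughly as below (where exactly this tls.Config lives in the codebase is an assumption):

package main

import "crypto/tls"

// serverTLSConfig lowers the minimum accepted version to TLS 1.2 so that
// stock MacOS curl (LibreSSL without TLS 1.3) can connect.
func serverTLSConfig() *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12, // instead of tls.VersionTLS13
	}
}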

Add OpenAPI spec for logs

Problem statement

The current OpenAPI spec is missing support for the observatorium/api logs API. This makes writing client code (e.g. in obsctl) error-prone and cumbersome. In addition, the logs API remains undocumented, even though the OpenAPI spec can be used to generate user-friendly documentation (see observatorium/observatorium#483).

Proposed solution

Add support for the following logs API routes in the OpenAPI spec:

  • POST /api/logs/v1/<tenant>/loki/api/v1/push
  • GET /api/logs/v1/<tenant>/loki/api/v1/labels
  • GET /api/logs/v1/<tenant>/loki/api/v1/label/<name>/values/
  • GET /api/logs/v1/<tenant>/loki/api/v1/series
  • GET /api/logs/v1/<tenant>/loki/api/v1/query_range
  • GET /api/logs/v1/<tenant>/loki/api/v1/query
  • GET /api/logs/v1/<tenant>/loki/api/v1/tail

Note: The logs handler also supports /api/prom/* routes for backward compatibility with Grafana 6. We propose to omit these calls from the OpenAPI spec and limit our support vector to Loki's V1 API only.

cc @nyza99

Support compression for the public gRPC API

#199 introduced a gRPC ingestion API for OTLP (traces). The API does not seem to support compression (the OTel collector uses gzip by default); a sketch of a possible server-side fix follows the collector config below.

2022-03-18T11:41:19.208Z	error	exporterhelper/queued_retry.go:183	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "name": "otlp", "error": "Permanent error: rpc error: code = Unimplemented desc = grpc: Decompressor is not installed for grpc-encoding \"gzip\"", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/traces.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/queued_retry_inmemory.go:118
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/internal/bounded_memory_queue.go:78
2022-03-18T11:41:29.208Z	error	exporterhelper/queued_retry.go:183	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "name": "otlp", "error": "Permanent error: rpc error: code = Unimplemented desc = grpc: Decompressor is not installed for grpc-encoding \"gzip\"", "dropped_items": 49}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/traces.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/queued_retry_inmemory.go:118
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
	go.opentelemetry.io/collector@v0.47.0/exporter/exporterhelper/internal/bounded_memory_queue.go:78

OTELcol config

# kubectl port-forward service/observatorium-xyz-observatorium-api 8090:8090 -n observatorium
# docker run --rm -it --net=host -v $PWD:/tmp/conf ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector:0.47.0 --config /tmp/conf/otelconf.yaml
extensions:

receivers:
  otlp:
    protocols:
      grpc:
      http:
  jaeger:
    protocols:
      thrift_binary:
      thrift_compact:
      thrift_http:
      grpc:

processors:

exporters:
  otlp:
    endpoint: localhost:8090
    compression: gzip
    headers:
      X-Tenant: "eyJhbGciOiJSUzI1NiIsImtpZCI6InB1YmxpYzo3ZTg1MTliYS03Y2M3LTQwM2EtYWE1OC0zZWEwYTdlOWUyNjkiLCJ0eXAiOiJKV1QifQ.eyJhdWQiOltdLCJjbGllbnRfaWQiOiJ1c2VyIiwiZXhwIjoxNjQ3NjA2OTUzLCJleHQiOnt9LCJpYXQiOjE2NDc2MDMzNTMsImlzcyI6Imh0dHA6Ly8xNzIuMTcuMC4xOjQ0NDQvIiwianRpIjoiZTgzZTY5MjEtMzM2Yy00MThjLWJkZTQtZDljZGVmZWM0Yjc0IiwibmJmIjoxNjQ3NjAzMzUzLCJzY3AiOltdLCJzdWIiOiJ1c2VyIn0.UiY5ejk9oa5QXa5xhtXVZ8-1N21x-jipwMJrC7t_fN_TbaMUSgphwcviCwHq_4hpRy1ApHsSNA4eNYwRS5uC7z0rL_gu1SVE9HpBZRrB3zXdUDovZ_qhBNxvstwAXCQKEZPYx6kH0bqk-LAqEOAJj3WRoYI5v8Pg2jwY0UbVXfCerlGiG5ez5yoGwPw4s9ixfd93EAI6jJCmUGp2bv-LucuQoP_e3Mjz6XtGvRriig9686DQ0WyVom_1mkImRZcNM_eQYbQ444WmT4RjipSIZ-O3RDT4q9nRYtLxwWgLVlSiq5e2Ei-imK4kVeuskxC3As5JS3YrQT1bIHacZWnMpZ6b7Jy4Mtkb0xyY7YFYORWouaEOdXOJWOABtsGjNe40NS2kzNnMrtL1bRIzILNyeNC1peDXcnEIVi4zT8E_KhE3Q9fwQ49u02m0uCl__VtNy6Md6Pz4FEb-9-aj-sfmPjsNnWrU3mXkR1OaezV00bwb7kkckYSEYR4WSRcaxEklZt30a6Gq_84nur2XjGDal_gDn47SmfMiw-gdZHZP82qKWt8ONBimoYpDAVrYhi0jWLl1pbE7HxcUnE3wVw_eA62YuumalvnBiVFzGa-Tzt4vPtW__ECNNpDJ3L18g1R3drGvuFat"
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      exporters: [otlp]
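A minimal sketch of a possible server-side fix, assuming the API constructs its gRPC server with grpc.NewServer: blank-importing the gzip encoding package registers a gzip compressor/decompressor with grpc-go, which is enough for the server to accept grpc-encoding: gzip.

package main

import (
	"google.golang.org/grpc"

	// Registering the gzip codec; no further server options are needed for
	// the server to decompress gzip-compressed requests.
	_ "google.golang.org/grpc/encoding/gzip"
)

func newGRPCServer() *grpc.Server {
	return grpc.NewServer()
}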

superfluous response.WriteHeader error messages in the API server logs

We are running Observatorium API server version quay.io/observatorium/observatorium:master-2021-01-29-v0.1.1-195-gfde740a in our test environment. It seems that the application intermittently generates the following errors (see below). Unfortunately it is not clear what kind of request caused this error or whether it had any impact on the clients (Observatorium is being accessed from multiple applications). I apologize for the limited information but hope that it will provide some clues.

2021/03/15 12:51:36 http: superfluous response.WriteHeader call from github.com/go-chi/chi/middleware.Recoverer.func1.1 (recoverer.go:31)
Panic: net/http: abort Handler
goroutine 586073 [running]:
runtime/debug.Stack(0x1f, 0x0, 0x0)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x25
github.com/go-chi/chi/middleware.Recoverer.func1.1(0xc00020e300, 0xee88a0, 0xc000358000)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/recoverer.go:28 +0x1bf
panic(0xcb75c0, 0xc00018a890)
	/usr/local/go/src/runtime/panic.go:969 +0x175
net/http/httputil.(*ReverseProxy).ServeHTTP(0xc00007b590, 0x7f402fc84538, 0xc0000aee80, 0xc00020e700)
	/usr/local/go/src/net/http/httputil/reverseproxy.go:338 +0x163d
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x7f402fc84538, 0xc0000aee40, 0xc00020e700)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:196 +0xe9
net/http.HandlerFunc.ServeHTTP(0xc000327440, 0x7f402fc84538, 0xc0000aee40, 0xc00020e700)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func1(0x7f402fc84538, 0xc0000aed80, 0xc00020e700)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:68 +0x11c
net/http.HandlerFunc.ServeHTTP(0xc000327710, 0x7f402fc84538, 0xc0000aed80, 0xc00020e700)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerRequestSize.func1(0x7f402fc84538, 0xc0000aed40, 0xc00020e700)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:163 +0xe9
net/http.HandlerFunc.ServeHTTP(0xc0003279e0, 0x7f402fc84538, 0xc0000aed40, 0xc00020e700)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:100 +0xda
net/http.HandlerFunc.ServeHTTP(0xc000327bf0, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/authorization.WithAuthorizers.func1.1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/opt/authorization/http.go:39 +0x346
net/http.HandlerFunc.ServeHTTP(0xc00032c300, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi.(*ChainHandler).ServeHTTP(0xc00032c340, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/chain.go:31 +0x52
github.com/go-chi/chi.(*Mux).routeHTTP(0xc000071d40, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/mux.go:425 +0x28b
net/http.HandlerFunc.ServeHTTP(0xc0002c5cb0, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi.(*Mux).ServeHTTP(0xc000071d40, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e700)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/mux.go:70 +0x50c
net/http.StripPrefix.func1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/usr/local/go/src/net/http/server.go:2081 +0x1a2
net/http.HandlerFunc.ServeHTTP(0xc000327320, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/usr/local/go/src/net/http/server.go:2042 +0x44
main.stripTenantPrefix.func1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/opt/main.go:736 +0x182
net/http.HandlerFunc.ServeHTTP(0xc000333740, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi.(*Mux).Mount.func1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/mux.go:292 +0x122
net/http.HandlerFunc.ServeHTTP(0xc000330ba0, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/ratelimit.combine.func1.1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/opt/ratelimit/http.go:84 +0x2ae
net/http.HandlerFunc.ServeHTTP(0xc000330d00, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/authentication.WithTenantHeader.func1.1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/opt/authentication/http.go:44 +0x197
net/http.HandlerFunc.ServeHTTP(0xc000333830, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e600)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/authentication.(*OIDCProvider).Middleware.func1.1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e500)
	/opt/authentication/oidc.go:269 +0x6cf
net/http.HandlerFunc.ServeHTTP(0xc000194520, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e500)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/authentication.WithTenantMiddlewares.func1.1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e500)
	/opt/authentication/http.go:97 +0x115
net/http.HandlerFunc.ServeHTTP(0xc000330d20, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e500)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/authentication.WithTenant.func1(0x7f402fa4e088, 0xc0000aed00, 0xc00020e400)
	/opt/authentication/http.go:32 +0x1e7
net/http.HandlerFunc.ServeHTTP(0xc000330d40, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e400)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi.(*ChainHandler).ServeHTTP(0xc00032d480, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e400)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/chain.go:31 +0x52
github.com/go-chi/chi.(*Mux).routeHTTP(0xc000095560, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e400)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/mux.go:425 +0x28b
net/http.HandlerFunc.ServeHTTP(0xc0004322e0, 0x7f402fa4e088, 0xc0000aed00, 0xc00020e400)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/observatorium/observatorium/server.Logger.func1.1(0xee88a0, 0xc000358000, 0xc00020e400)
	/opt/server/instrumentation.go:19 +0x20e
net/http.HandlerFunc.ServeHTTP(0xc00041cf30, 0xee88a0, 0xc000358000, 0xc00020e400)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi/middleware.(*throttler).ServeHTTP(0xc00041ccc0, 0xee88a0, 0xc000358000, 0xc00020e400)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/throttle.go:94 +0x350
github.com/go-chi/chi/middleware.Timeout.func1.1(0xee88a0, 0xc000358000, 0xc00020e300)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/timeout.go:45 +0x1cf
net/http.HandlerFunc.ServeHTTP(0xc00043a760, 0xee88a0, 0xc000358000, 0xc00020e300)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi/middleware.StripSlashes.func1(0xee88a0, 0xc000358000, 0xc00020e300)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/strip.go:25 +0xee
net/http.HandlerFunc.ServeHTTP(0xc00043a780, 0xee88a0, 0xc000358000, 0xc00020e300)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi/middleware.Recoverer.func1(0xee88a0, 0xc000358000, 0xc00020e300)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/recoverer.go:35 +0x83
net/http.HandlerFunc.ServeHTTP(0xc00043a7a0, 0xee88a0, 0xc000358000, 0xc00020e300)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi/middleware.RealIP.func1(0xee88a0, 0xc000358000, 0xc00020e300)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/realip.go:34 +0x9d
net/http.HandlerFunc.ServeHTTP(0xc00043a7c0, 0xee88a0, 0xc000358000, 0xc00020e300)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi/middleware.RequestID.func1(0xee88a0, 0xc000358000, 0xc00020e200)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/middleware/request_id.go:72 +0x1e8
net/http.HandlerFunc.ServeHTTP(0xc00043a7e0, 0xee88a0, 0xc000358000, 0xc00020e200)
	/usr/local/go/src/net/http/server.go:2042 +0x44
github.com/go-chi/chi.(*Mux).ServeHTTP(0xc000095560, 0xee88a0, 0xc000358000, 0xc00020e100)
	/go/pkg/mod/github.com/go-chi/[email protected]+incompatible/mux.go:82 +0x2d1
net/http.serverHandler.ServeHTTP(0xc0003721c0, 0xee88a0, 0xc000358000, 0xc00020e100)
	/usr/local/go/src/net/http/server.go:2843 +0xa3
net/http.(*conn).serve(0xc000390640, 0xeeb0a0, 0xc0000ae980)
	/usr/local/go/src/net/http/server.go:1925 +0x8ad
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2969 +0x36c

tenant logout handler

There is no way to log out of Observatorium.

I propose a login/logout handler /oidc/{tenant}/logout to be added near https://github.com/observatorium/api/blob/main/authentication/oidc.go#L160. A similar handler will also be needed for /openshift.

This handler would set the tenant OIDC cookie to "" with an expiration of 1-1-1970, and then redirect to some Observatorium path, triggering the OIDC login flow to let the user log in as someone else.

(Update: Lucas suggests reusing the login endpoint /oidc/{tenant}/login, and it seems that the functionality I need is implemented there, as long as the authentication is OIDC.)

I tried Lucas's suggestion locally. If my tenant is using OIDC, I can implement "logout" by having the trace UI visit /v1/traces/{tenant}/login, which I've redirected to /oidc/{tenant}/login. (This is oidc.go's handlerPrefix+loginRoute.)

Unfortunately this is the wrong URL if the tenant is using OpenShift -- for those I need to redirect to /openshift/tenant/login. So perhaps Observatorium could have an /{tenant}/login endpoint that redirects based on the tenant's auth provider?

If this is desired I can probably implement it.
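A minimal sketch of the proposed handler (the cookie name and redirect target are assumptions; the real cookie name is whatever the OIDC middleware sets):

package main

import (
	"net/http"
	"time"
)

// logoutHandler clears the tenant's OIDC cookie and redirects to a protected
// path, which re-triggers the OIDC login flow.
func logoutHandler(cookieName, redirectTo string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{
			Name:    cookieName,
			Value:   "",
			Path:    "/",
			Expires: time.Unix(0, 0), // 1-1-1970, as described above
			MaxAge:  -1,
		})
		http.Redirect(w, r, redirectTo, http.StatusFound)
	}
}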

CircleCI: Seemingly random failure of `generate` job

Recently, we have noticed an uptick in the number of failed generate jobs in the CI workflow (example: https://app.circleci.com/pipelines/github/observatorium/api/455/workflows/33bd7a06-a1ce-4c7a-a291-fddef416bc51/jobs/2211). In the early phase of the run, the CI will complain about a missing configmap file:

/go/bin/jsonnet-v0.16.0 -m examples/manifests examples/main.jsonnet | xargs -I{} sh -c 'cat {} | /go/bin/gojsontoyaml-v0.0.0-20200602132005-3697ded27e8c > {}.yaml && rm -f {}' -- {
cat: examples/manifests/configmap: No such file or directory

Subsequently, the run fails because the final change set does not include the expected configmap. We were not able to reproduce this issue locally, since the configmap is always present. This requires further investigation.

Wrong log "metrics.ui.endpoint is not specified"

In the observatorium log on startup, there is an info-level message about a missing argument:

level=info msg="--metrics.ui.endpoint is not specified, UI will not be accessible"

The log is wrong, as the mentioned argument is set on the pod:

Args:
      --web.listen=0.0.0.0:8080
      --metrics.ui.endpoint=http://observatorium-xyz-observatorium-api-gateway-thanos-query.observatorium.svc.cluster.local:9090
      --metrics.read.endpoint=http://observatorium-xyz-cortex-query-frontend.observatorium.svc.cluster.local:9090/api/v1
      --metrics.write.endpoint=http://observatorium-xyz-thanos-receive.observatorium.svc.cluster.local:19291/api/v1/receive
      --log.level=warn
