envoyproxy / xds-relay
Caching, aggregation, and relaying for xDS compliant clients and origin servers
License: Apache License 2.0
This API lets users alter the log level while the server is running. One obstacle is that altering the log level isn't currently feasible, because the logger is initialized once today and passed as a parameter to other components.
Pasted from the tech spec:
POST /log_level/<level>
Change the log level of xDS Aggregator.
Arguments
level [mandatory] : The verbosity of the logs, one of DEBUG, WARN, or ERROR.
For example, if upstream client is named and initialized first, with orchestrator initialized later, then orchestrator gets named as a subscope, upstream client.orchestrator, which is undesired.
The contents of With() are also being passed to subsequent log lines, producing undesired side effects. In the example below, creating watch outputs server address information that was never intended to be printed:
{"level":"info","ts":1589230516.361459,"logger":"upstream client","msg":"Initiating upstream connection","address":"127.0.0.1:18000"}
{"level":"info","ts":1589230516.3624716,"logger":"upstream client.orchestrator","msg":"Initializing server","address":"127.0.0.1:18000","address":"127.0.0.1:9991"}
{"level":"info","ts":1589230516.4036033,"logger":"upstream client.orchestrator","msg":"creating watch","address":"127.0.0.1:18000","address":"127.0.0.1:9991"}
When onCancel is called from go-control-plane to the Orchestrator, the Orchestrator removes management of this watch, but the watch persists in the Cache until the TTL expires. We could manage resources more effectively if a DeleteWatch(req *v2.DiscoveryRequest) function existed.
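A minimal sketch of the proposed addition, assuming the v2 discovery types from go-control-plane (the interface name is hypothetical):

package cache

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// WatchDeleter is a hypothetical extension to the cache interface.
type WatchDeleter interface {
	// DeleteWatch removes the watch associated with req immediately,
	// instead of leaving it in the cache until the TTL expires.
	DeleteWatch(req *v2.DiscoveryRequest) error
}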
This test seems to be failing more frequently:
=== RUN TestXdsClientGetsIncrementalResponsesFromUpstreamServer
2020/05/12 20:31:22 management server listening on 19001
TestXdsClientGetsIncrementalResponsesFromUpstreamServer: upstream_client_test.go:47:
Error Trace: upstream_client_test.go:47
Error: Setup failed: %s
Test: TestXdsClientGetsIncrementalResponsesFromUpstreamServer
Messages: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:19001: connect: connection refused"
And,
=== RUN TestServerShutdownShouldCloseResponseChannel
2020/05/12 18:43:45 listen tcp :19001: bind: address already in use
FAIL github.com/envoyproxy/xds-relay/integration 7.554s
FAIL
make: *** [integration-tests] Error 1
The cache is in the format aggregated_key: {Response, Requests[]}. Request is a pointer to the entire request object, which means that subsequent (N)ACKs (which increment the nonce) mutate the cache. This is not breaking, because we clean up prior requests, but there is room for optimization here so that we don't constantly mutate the cache on (N)ACKs.
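One option is to store a deep copy of each request, so nonce updates on the caller's object never reach the cache. A minimal sketch, assuming the v2 discovery types and golang/protobuf (the entry and cache types here are hypothetical):

package cache

import (
	"sync"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	"github.com/golang/protobuf/proto"
)

type entry struct {
	Response *v2.DiscoveryResponse
	Requests []*v2.DiscoveryRequest
}

type cache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

// AddRequest stores a deep copy of the request so that subsequent (N)ACK
// nonce updates on the caller's object do not mutate the cached entry.
func (c *cache) AddRequest(key string, req *v2.DiscoveryRequest) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	e.Requests = append(e.Requests, proto.Clone(req).(*v2.DiscoveryRequest))
}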
On cache evict, the orchestrator currently closes the upstream watch and all downstream watches for the key that's being evicted. This could be problematic if there are many downstream connections for the same aggregated key: it will cause a "thundering herd" problem where all downstreams attempt to re-establish a connection with xds-relay at once. We need to evaluate how to handle this.
Use Cobra or other tooling so that we can start up xds-relay via command line flags or a config file. For example:
xds-relay -c bootstrap.yml
xds-relay -v INFO --upstream-url example.com
Initially support log level, upstream URL, server address, and port.
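A minimal sketch using Cobra that covers the flags above (the flag names and defaults are assumptions):

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var configFile, logLevel, upstreamURL, serverAddr string
	var serverPort uint16

	rootCmd := &cobra.Command{
		Use:   "xds-relay",
		Short: "Caching, aggregation, and relaying for xDS clients and origin servers",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Hand off to the server bootstrap here; flags would override config values.
			fmt.Printf("config=%s level=%s upstream=%s addr=%s:%d\n",
				configFile, logLevel, upstreamURL, serverAddr, serverPort)
			return nil
		},
	}
	rootCmd.Flags().StringVarP(&configFile, "config", "c", "", "path to the bootstrap config file")
	rootCmd.Flags().StringVarP(&logLevel, "log-level", "v", "INFO", "log verbosity: DEBUG, INFO, WARN, or ERROR")
	rootCmd.Flags().StringVar(&upstreamURL, "upstream-url", "", "origin server URL")
	rootCmd.Flags().StringVar(&serverAddr, "server-addr", "0.0.0.0", "server listen address")
	rootCmd.Flags().Uint16Var(&serverPort, "server-port", 8888, "server listen port")

	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}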
There is a potential race condition between the delete operation and the two operations (get response channel, add response) here:
xds-relay/internal/app/orchestrator/orchestrator.go
Lines 340 to 341 in 37af3c5
Currently, during graceful shutdown, we wait for all clients to disconnect before killing the server. If we send one of the two signals to the server process, the graceful shutdown process blocks on the call to server.GracefulStop. Here's what we see in the logs:
{"level":"info","ts":1595981344.7454,"msg":"received interrupt signal:%!(EXTRA string=interrupt)","json":{}}
{"level":"info","ts":1595981344.745436,"msg":"initiating admin server shutdown","json":{}}
{"level":"info","ts":1595981344.745485,"msg":"initiating grpc graceful stop","json":{}}
The intent of adding a TTL to the cache was to prevent stale responses from lingering in the cache, more so than a traditional cache TTL meant to reduce memory. The TTL should therefore only be extended when SetResponse is called on an existing item in the cache; it should not change on AddRequest, DeleteRequest, or any changes to other values in the Resource struct.
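A minimal sketch of that rule, assuming a Resource struct holding an expiration timestamp (the field and method names are hypothetical):

package cache

import (
	"time"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

type Resource struct {
	Resp           *v2.DiscoveryResponse
	Requests       []*v2.DiscoveryRequest
	ExpirationTime time.Time
}

// SetResponse is the only operation that refreshes the TTL.
func (r *Resource) SetResponse(resp *v2.DiscoveryResponse, ttl time.Duration) {
	r.Resp = resp
	r.ExpirationTime = time.Now().Add(ttl)
}

// AddRequest intentionally leaves ExpirationTime untouched.
func (r *Resource) AddRequest(req *v2.DiscoveryRequest) {
	r.Requests = append(r.Requests, req)
}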
The resources in the DiscoveryResponse are still in their marshalled any.Any form when fetched with the cache_dump API (https://github.com/envoyproxy/xds-relay/pull/74/files#r426102814). The cache_dump API should inspect the type_url to unmarshal the Any variable into the correct resource type and make the output more readable: https://envoyproxy.slack.com/archives/CTULFMN91/p1590087624055200?thread_ts=1590087384.054900&cid=CTULFMN91.
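A minimal sketch of that expansion, assuming the newer google.golang.org/protobuf API and that the resource types are linked into the binary so the type registry can resolve them:

package admin

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/anypb"
)

// expandAny resolves the concrete message type from the type_url and
// renders it as human-readable JSON for the cache_dump output.
func expandAny(a *anypb.Any) ([]byte, error) {
	msg, err := a.UnmarshalNew() // looks up the type via the global registry
	if err != nil {
		return nil, err
	}
	return protojson.Marshal(msg)
}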
It seems go-control-plane delegates handling of (N)ACKs to the cache implementation. For every request, the server will try to cancel the last watch and create a new watch (https://github.com/envoyproxy/go-control-plane/blob/master/pkg/server/v2/server.go#L402). This causes duplicate pushes in xds-relay.
Environment
This should be easy to reproduce; istiod can be replaced with any other control plane.
config.yaml
admin:
  address:
    address: 0.0.0.0
    port_value: 8889
server:
  address:
    address: 0.0.0.0
    port_value: 8888
origin_server:
  address:
    address: istiod
    port_value: 15010
logging:
  level: DEBUG
metrics_sink:
  statsd:
    root_prefix: abc
    flush_interval: 30s
    address:
      address: 127.0.0.1
      port_value: 12345
cache:
  ttl: 30s
  max_entries: 100000
rules.yaml
fragments:
- rules:
  - match:
      request_type_match:
        types:
        - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
    result:
      and_result:
        result_predicates:
        - request_node_fragment:
            # field: 2
            field: NODE_CLUSTER
            action:
              exact: true
        - string_fragment: eds
  - match:
      request_type_match:
        types:
        - "type.googleapis.com/envoy.api.v2.Cluster"
    result:
      and_result:
        result_predicates:
        - request_node_fragment:
            # field: 2
            field: NODE_CLUSTER
            action:
              exact: true
        - string_fragment: cds
Notice the %s and %v verbs that are left unexpanded:
{"level":"debug","ts":1589243912.031295,"msg":"respond %s%v version %q with version %q[type.googleapis.com/envoy.api.v2.ClusterLoadAssignment [cluster-v3-3] v3]"}
{"level":"debug","ts":1589243912.0461426,"msg":"open watch %d for %s%v from nodeID %q, version %q[47 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment [cluster-v3-3] test-id v3]"}
We recently added support to go-control-plane to "pass through" DiscoveryResponses received upstream directly to the downstream Envoy clients and should add that support to xds-relay.
This is the component that is responsible for routing between the xDS clients and origin servers, calling mapper, cache, etc. and overall request/response management.
A few xdsclient tests could benefit from adding a stats framework and asserting on those stats.
ref: #46 (comment)
xds-relay should configure server keepalive settings (a wiring sketch follows the recommendations below). This will help the gRPC server handle stale streams and connections from sidecars.
https://godoc.org/google.golang.org/grpc/keepalive#ServerParameters
type ServerParameters struct {
// MaxConnectionIdle is a duration for the amount of time after which an
// idle connection would be closed by sending a GoAway. Idleness duration is
// defined since the most recent time the number of outstanding RPCs became
// zero or the connection establishment.
MaxConnectionIdle time.Duration // The current default value is infinity.
// MaxConnectionAge is a duration for the maximum amount of time a
// connection may exist before it will be closed by sending a GoAway. A
// random jitter of +/-10% will be added to MaxConnectionAge to spread out
// connection storms.
MaxConnectionAge time.Duration // The current default value is infinity.
// MaxConnectionAgeGrace is an additive period after MaxConnectionAge after
// which the connection will be forcibly closed.
MaxConnectionAgeGrace time.Duration // The current default value is infinity.
// After a duration of this time if the server doesn't see any activity it
// pings the client to see if the transport is still alive.
// If set below 1s, a minimum value of 1s will be used instead.
Time time.Duration // The current default value is 2 hours.
// After having pinged for keepalive check, the server waits for a duration
// of Timeout and if no activity is seen even after that the connection is
// closed.
Timeout time.Duration // The current default value is 20 seconds.
}
MaxConnectionIdle: should be a few hours
MaxConnectionAge: the default (infinity) works well
MaxConnectionAgeGrace: a few minutes
Time: 2 minutes
Timeout: the default (20s) works well
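A minimal sketch of applying those values with grpc-go's keepalive package (the durations are the suggestions above, not requirements):

package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle:     3 * time.Hour,   // close idle sidecar connections
		MaxConnectionAgeGrace: 5 * time.Minute, // grace period after MaxConnectionAge
		Time:                  2 * time.Minute, // ping inactive clients
		// MaxConnectionAge and Timeout keep their defaults (infinity and 20s).
	}))
}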
Currently, we just error out. Implement configurable retry and backoff logic to re-attempt opening the stream on failures.
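A minimal sketch of exponential backoff, assuming a caller-supplied open function (the bounds are assumptions; a real version would make them configurable and add jitter):

package upstream

import (
	"context"
	"time"
)

// openStreamWithRetry re-attempts open with exponential backoff until it
// succeeds or the context is cancelled.
func openStreamWithRetry(ctx context.Context, open func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second
	for {
		if err := open(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}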
Currently we force close all downstream connections, forcing downstream clients to issue a retry; alternatively, we could fold that work into this issue as well.
Create a gRPC client for forwarding xDS requests to the control plane server. This is just a pass-through of the first xDS request if a stream is not already open to the control plane, similar to how Envoy currently forwards requests to the management server.
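A minimal sketch of opening an ADS stream with go-control-plane's v2 discovery service (the address handling and insecure transport are assumptions):

package upstream

import (
	"context"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2"
	"google.golang.org/grpc"
)

// forwardFirstRequest dials the origin server, opens an aggregated
// discovery stream, and passes the first request through.
func forwardFirstRequest(ctx context.Context, addr string, req *v2.DiscoveryRequest) (*v2.DiscoveryResponse, error) {
	conn, err := grpc.Dial(addr, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	stream, err := discovery.NewAggregatedDiscoveryServiceClient(conn).StreamAggregatedResources(ctx)
	if err != nil {
		return nil, err
	}
	if err := stream.Send(req); err != nil {
		return nil, err
	}
	return stream.Recv()
}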
The cache currently stores the raw discovery response. Store the marshaled resources alongside it so that we don't have to re-serialize on every downstream fan-out.
Set up Azure Pipelines for:
--- FAIL: TestGoldenPath (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x164d819]
goroutine 7 [running]:
testing.tRunner.func1.1(0x171cd40, 0x1f507e0)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:940 +0x2f5
testing.tRunner.func1(0xc0002ac5a0)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:943 +0x3f9
panic(0x171cd40, 0x1f507e0)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/runtime/panic.go:969 +0x166
github.com/envoyproxy/xds-relay/internal/pkg/util/testutils.AssertCounterValue(0xc0002ac5a0, 0xc00052fd70, 0xc00003cba0, 0x22, 0x1)
/Users/runner/work/xds-relay/xds-relay/go/src/github.com/envoyproxy/xds-relay/internal/pkg/util/testutils/stats.go:14 +0xa9
github.com/envoyproxy/xds-relay/internal/app/orchestrator.TestGoldenPath(0xc0002ac5a0)
/Users/runner/work/xds-relay/xds-relay/go/src/github.com/envoyproxy/xds-relay/internal/app/orchestrator/orchestrator_test.go:159 +0x88e
testing.tRunner(0xc0002ac5a0, 0x186ff80)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:1042 +0x357
FAIL github.com/envoyproxy/xds-relay/internal/app/orchestrator 0.223s
Discussion: https://envoyproxy.slack.com/archives/CTULFMN91/p1594079452076500
The logs are only piped to stderr. Since we take the file sink path in the bootstrap config, we can also pipe the logs to a file.
Pasted from the tech spec:
GET /stats
Output all server stats.
Output Sample
cache::key::helloworld_production_lds::rq_count: 4
cache::key::foobar_production_cds::rq_count: 3
controlplane::connections::cx_active::1
controlplane::connections::cx_connect_fail::0
controlplane::connections::cx_destroy::0
controlplane::connections::cx_total::1
controlplane::requests::rq_1xx::0
controlplane::requests::rq_2xx::2
controlplane::requests::rq_3xx::0
controlplane::requests::rq_4xx::0
controlplane::requests::rq_5xx::0
controlplane::requests::rq_active::2
controlplane::requests::rq_error::0
controlplane::requests::rq_success::0
controlplane::requests::rq_timeout::0
controlplane::requests::rq_total::2
controlplane::hostname::mycontrolplane.lyft.net
downstream::connections::cx_active::7
downstream::connections::cx_connect_fail::0
downstream::connections::cx_total::7
downstream::requests::rq_1xx::0
downstream::requests::rq_2xx::7
downstream::requests::rq_3xx::0
downstream::requests::rq_4xx::0
downstream::requests::rq_5xx::0
downstream::requests::rq_active::7
downstream::requests::rq_error::0
downstream::requests::rq_success::0
downstream::requests::rq_timeout::0
downstream::requests::rq_total::7
server::live: 1
server::memory_allocated: 16802608
server::memory_heap_size: 34603008
server::total_connections: 239
server::uptime: 132786
server::version: 11541598
See: https://github.com/envoyproxy/xds-relay/pull/53/files#diff-2958638164db92742469d2bb08d0001aR122-R128
It is possible for a discovery request to not get bucketed into any aggregation rule defined at server bootstrap. Rather than dropping such a request, we'd like to continue to propagate it upstream.
Pasted from the tech spec:
GET /server_info
Output server configuration. This is the bootstrap information.
Output Sample
{
  "address": {
    "socket_address": {
      "address": "0.0.0.0",
      "port_value": 9991,
      "protocol": "TCP"
    }
  },
  "control_plane_cluster": {
    "name": "my-control-plane",
    "load_assignment": {
      "cluster_name": "my-control-plane",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "my-control-plane.lyft.net",
                    "port_value": 80,
                    "protocol": "TCP"
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "logging": {
    "path": "/var/log/xds-aggregator",
    "level": "WARN"
  }
}
Refer to the tech spec on Cache.
xDS Relay will maintain its own in-memory cache to keep the most recent response for a given aggregated key.
Each key will maintain the following fields:
Implement the Cache interface and API for state-of-the-world xDS requests. This is a proposed set of APIs; a Go sketch follows the list.
Exists(key: string): bool
- Returns true if the key exists.
Fetch(key: string): *v3.Response
- Returns the cached response if it exists, otherwise nil.
IsStreamOpen(key: string): bool
- Returns false if no stream is open against the upstream control plane for the given key.
SetStreamOpen(open: bool)
- Sets whether there is a stream open against the upstream control plane.
Set(resp: v3.Response, key: string): ([]*v3.Request, err)
- Sets the cached response and returns the list of open watches.
Clear(key: string)
- Discards the existing watches. This is meant to be used in tandem with the API above.
Set(req: v3.Request, key: string): (bool, err)
- Adds the watch to the cache. Returns whether a stream is open.
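A minimal Go sketch of that proposal. Since Go has no method overloading, the two Set variants get distinct names here; SetResponse and AddRequest are stand-ins, and the v3 discovery types from go-control-plane are assumed:

package cache

import (
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

type Cache interface {
	// Exists returns true if the key exists.
	Exists(key string) bool
	// Fetch returns the cached response if it exists, otherwise nil.
	Fetch(key string) *discoveryv3.DiscoveryResponse
	// IsStreamOpen reports whether a stream is open against the upstream
	// control plane for the given key.
	IsStreamOpen(key string) bool
	// SetStreamOpen records whether a stream is open against the upstream
	// control plane.
	SetStreamOpen(open bool)
	// SetResponse stores the response and returns the open watches to fan
	// the response out to.
	SetResponse(key string, resp *discoveryv3.DiscoveryResponse) ([]*discoveryv3.DiscoveryRequest, error)
	// Clear discards the existing watches; used in tandem with SetResponse.
	Clear(key string)
	// AddRequest adds the watch to the cache and reports whether a stream
	// is open.
	AddRequest(key string, req *discoveryv3.DiscoveryRequest) (bool, error)
}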
I saw a test failure for one of the e2e tests on my local fork after merging master/upstream.
TestSnapshotCacheSingleEnvoyAndXdsRelayServer: e2e_test.go:106:
Timed out after 1.000s.
Error: Unexpected non-nil/non-zero extra argument at index 1:
<int>: 9
--- FAIL: TestSnapshotCacheSingleEnvoyAndXdsRelayServer (1.23s)
FAIL
FAIL github.com/envoyproxy/xds-relay/integration 2.266s
FAIL
make: *** [Makefile:25: e2e-tests] Error 1
##[error]Process completed with exit code 2.
The flakiness is likely due to timeouts.
Support the freeform 'metadata' field in the aggregation rules.
API: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/base.proto
Include:
The xdsclient today doesn't have the correct gRPC params. A few params we should explore are: (link)
Add an integration test framework to simulate the downstream / xds-relay / upstream workflow.
go-control-plane has a docker image that sets up a very basic control plane server and Envoy sidecar; use it for reference.
Given an xDS request and the aggregation rules (#6), define the functionality to parse the request and generate the cache key. Each cache key maps to a set of requests that expect the same response from the upstream control plane.
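A hypothetical sketch of such key derivation; the cluster-plus-type shape and the "_" separator are assumptions that echo the eds/cds string fragments in the rules.yaml above, not the project's actual mapper:

package mapper

import (
	"fmt"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// shortName abbreviates a type URL the way the aggregation rules above do.
func shortName(typeURL string) string {
	switch typeURL {
	case "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment":
		return "eds"
	case "type.googleapis.com/envoy.api.v2.Cluster":
		return "cds"
	default:
		return typeURL
	}
}

// GetKey derives a cache key so that requests sharing the key expect the
// same upstream response.
func GetKey(req *v2.DiscoveryRequest) (string, error) {
	if req.GetNode() == nil {
		return "", fmt.Errorf("request has no node")
	}
	return fmt.Sprintf("%s_%s", req.GetNode().GetCluster(), shortName(req.GetTypeUrl())), nil
}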
When using the /cache/ admin endpoint, the expiration time is printed as "0001-01-01T00:00:00Z", and some parts of the config are still in their marshalled format, e.g.:
"TypedConfig": {
  "type_url": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager",
  "value": "EgxpbmdyZXNzX2h0dHAqhwEKEmV..."
}
See #75 for more details.
The cache_dump admin endpoint should support regex keys, i.e. dumping resources with keys that match the given regex from the cache.
Define and implement Aggregation “Keyer” API / Protobufs.
These are the aggregation rules necessary for users of xds-relay to define how to aggregate xDS requests.
Things to keep in mind are versioning, and extensibility to support matching on future request fields.
A possible implementation exists here along with more contextual detail.
See #75 for more details.
The cache_dump endpoint should dump everything in the cache if no key is specified.
Now that we have a stats framework (#61), we should add stats to existing workflows in the cache, orchestrator, upstream, etc. components.
If unspecified, we currently assume that the management server is running on port 10, but that's a registered port.
Define a standalone process / tool for validating the .yaml bootstrap configuration in #6.
This tooling is meant to be run by end users of xDS Relay in their CI or test systems.
Envoy has a route checker tool that can be used for inspiration.
Pasted from the tech spec:
GET /cache/<key>
Output the contents of the xDS Aggregator Cache.
Arguments
key [optional] : The cache key. Regex suffixes are supported.
Output Sample
Note: The DiscoveryResponse listed below is expanded into the DiscoveryRequest and DiscoveryResponse proto formats. It’s not listed here to keep the document compact.
Query: GET /cache/foo_production*
{
  "foo_production_lds": {
    "response": *v3.DiscoveryResponse,
    "watchers": [ *v3.DiscoveryRequest, *v3.DiscoveryRequest, ... ],
    "stream_open": false
  },
  "foo_production_eds": {
    "response": nil,
    "watchers": [ *v3.DiscoveryRequest ],
    "stream_open": true
  }
}
Within the orchestrator, if setting the response in the cache fails, we may need to issue a retry upstream. Currently the fallback is to rely on a future response, but that isn't ideal: otherwise clients are perpetually waiting and will not re-request.