
xds-relay's People

Contributors

eapolinario, greut, jessicayuen, jyotimahapatra, lisaludique, mattklein123, nak3, pcalley, samrabelachew

xds-relay's Issues

Admin REST API: set log level

This API lets users alter the log level while the server is running. One obstacle is that changing the level at runtime isn't straightforward today: the logger is initialized once at startup and passed as a parameter to other components.

Pasted from the tech spec:

POST /log_level/<level>
Change the log level of xDS Aggregator.

Arguments
level [mandatory]: The verbosity of the logs, one of DEBUG, WARN, or ERROR.
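
One way to make the level mutable at runtime is to build the logger around a shared zap.AtomicLevel and flip it from the admin handler. A minimal sketch, assuming that refactor; the handler name and wiring are illustrative, not the project's actual code:

package admin

import (
	"fmt"
	"net/http"
	"strings"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// NewLogLevelHandler returns a handler for POST /log_level/<level>. The
// zap.AtomicLevel is assumed to be shared with the logger at construction
// time, so updates take effect immediately across all components.
func NewLogLevelHandler(level zap.AtomicLevel) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		// Expect a path of the form /log_level/<level>.
		requested := strings.TrimPrefix(r.URL.Path, "/log_level/")
		var l zapcore.Level
		switch strings.ToUpper(requested) {
		case "DEBUG":
			l = zapcore.DebugLevel
		case "WARN":
			l = zapcore.WarnLevel
		case "ERROR":
			l = zapcore.ErrorLevel
		default:
			http.Error(w, "level must be one of DEBUG, WARN, ERROR", http.StatusBadRequest)
			return
		}
		level.SetLevel(l)
		fmt.Fprintf(w, "log level set to %s\n", l)
	}
}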

Logger is concatenating named scope and subsequent `With()` lines

For example,
If the upstream client logger is named and initialized first, and the orchestrator logger is initialized later, the orchestrator gets named as a subscope of the upstream client (`upstream client.orchestrator`), which is undesired.

The contents of `With()` are also being carried into subsequent log lines, producing undesired side effects. In the example below, creating a watch outputs server address information that wasn't intended to be printed.

{"level":"info","ts":1589230516.361459,"logger":"upstream client","msg":"Initiating upstream connection","address":"127.0.0.1:18000"}
{"level":"info","ts":1589230516.3624716,"logger":"upstream client.orchestrator","msg":"Initializing server","address":"127.0.0.1:18000","address":"127.0.0.1:9991"}
{"level":"info","ts":1589230516.4036033,"logger":"upstream client.orchestrator","msg":"creating watch","address":"127.0.0.1:18000","address":"127.0.0.1:9991"}

Add `DeleteWatch` in Cache

When onCancel is called from go-control-plane, the Orchestrator stops managing the watch, but the watch persists in the Cache until the TTL expires. We could manage resources more effectively if a DeleteWatch(req *v2.DiscoveryRequest) function existed.
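
A rough sketch of what the proposed API could look like, assuming a cache keyed by aggregated key with a per-entry watch set; the type and field names below are illustrative, not the project's actual internals:

package cache

import (
	"fmt"
	"sync"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// Illustrative internal layout; the real Resource struct differs.
type resource struct {
	watches map[*v2.DiscoveryRequest]struct{}
}

type cache struct {
	mu        sync.Mutex
	resources map[string]*resource
}

// DeleteWatch removes a single request's watch without evicting the cached
// response, so the Orchestrator can release resources as soon as onCancel
// fires rather than waiting for the TTL.
func (c *cache) DeleteWatch(aggregatedKey string, req *v2.DiscoveryRequest) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	r, ok := c.resources[aggregatedKey]
	if !ok {
		return fmt.Errorf("no cache entry for key %q", aggregatedKey)
	}
	delete(r.watches, req)
	return nil
}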

Implement .yaml to Aggregation proto converter

Blocked by #4

Users should be able to bootstrap their initial configuration using .yaml. Define the transformation rules necessary to convert their bootstrap configuration into the proto defined in #4 .
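
One possible conversion path is YAML → JSON → proto unmarshalling. A sketch under the assumption that the aggregation proto from #4 is generated as a Go type; KeyerConfiguration and its import path are placeholders:

package config

import (
	// Placeholder import path for the generated aggregation proto from #4.
	aggregationv1 "github.com/envoyproxy/xds-relay/pkg/api/aggregation/v1"

	"google.golang.org/protobuf/encoding/protojson"
	"sigs.k8s.io/yaml"
)

// YAMLToProto converts a user-supplied .yaml aggregation file into the
// strongly typed proto by going through JSON, so protojson can enforce the
// proto schema and reject unknown fields.
func YAMLToProto(yamlBytes []byte) (*aggregationv1.KeyerConfiguration, error) {
	jsonBytes, err := yaml.YAMLToJSON(yamlBytes)
	if err != nil {
		return nil, err
	}
	var config aggregationv1.KeyerConfiguration
	if err := protojson.Unmarshal(jsonBytes, &config); err != nil {
		return nil, err
	}
	return &config, nil
}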

Integration tests are exhibiting flaky behavior

Seems to be failing more frequently:

=== RUN   TestXdsClientGetsIncrementalResponsesFromUpstreamServer
2020/05/12 20:31:22 management server listening on 19001
    TestXdsClientGetsIncrementalResponsesFromUpstreamServer: upstream_client_test.go:47: 
        	Error Trace:	upstream_client_test.go:47
        	Error:      	Setup failed: %s
        	Test:       	TestXdsClientGetsIncrementalResponsesFromUpstreamServer
        	Messages:   	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:19001: connect: connection refused"

And,

=== RUN   TestServerShutdownShouldCloseResponseChannel
2020/05/12 18:43:45 listen tcp :19001: bind: address already in use
FAIL	github.com/envoyproxy/xds-relay/integration	7.554s
FAIL
make: *** [integration-tests] Error 1

Ignore nonce in request caching?

Cache is in the format aggregated_key: {Response, Requests[]}

Request is a pointer to the entire request object, which means that subsequent (N)ACKs (which increment the nonce) mutate the cache.

This is not breaking because we clean up prior requests, but there is room for optimization here to not constantly mutate the cache on (N)ACKs.
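
One way to avoid the mutation is to store a copy of the request with the volatile fields cleared rather than a pointer to the live object. A sketch only; whether version_info should also be stripped depends on how the cache keys requests:

package cache

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	"github.com/golang/protobuf/proto"
)

// normalizeRequest returns a copy of the request that is safe to keep in the
// cache: the nonce (and the ACKed version) change on every (N)ACK, so they
// are cleared to keep cache entries stable.
func normalizeRequest(req *v2.DiscoveryRequest) *v2.DiscoveryRequest {
	copied := proto.Clone(req).(*v2.DiscoveryRequest)
	copied.ResponseNonce = ""
	copied.VersionInfo = ""
	return copied
}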

Handle potential thundering herd problem on cache evict

On cache evict, the orchestrator currently closes the upstream watch and all downstream watches for the key being evicted. This could be problematic if there are a lot of downstream connections for the same aggregated key: it causes a "thundering herd" problem where all downstreams attempt to re-establish a connection with xds-relay at once. We need to evaluate how to handle this.

Parse env vars and config on server bootstrap

Use Cobra or other tooling so that we can start xds-relay via command-line flags or a config file, e.g.:

  • xds-relay -c bootstrap.yml
  • xds-relay -v INFO --upstream-url example.com

Initially support log level, upstream URL, server address, and port.
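
A minimal Cobra sketch covering those flags; the flag names and defaults are illustrative, not the final CLI surface:

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var (
		configFile  string
		logLevel    string
		upstreamURL string
	)
	rootCmd := &cobra.Command{
		Use:   "xds-relay",
		Short: "Relay and aggregate xDS traffic between Envoys and an upstream control plane",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Bootstrap config (if given) is parsed first; flags override it.
			fmt.Printf("config=%s level=%s upstream=%s\n", configFile, logLevel, upstreamURL)
			return nil
		},
	}
	rootCmd.Flags().StringVarP(&configFile, "config-file", "c", "", "path to bootstrap configuration")
	rootCmd.Flags().StringVarP(&logLevel, "log-level", "v", "INFO", "log level: DEBUG, INFO, WARN, ERROR")
	rootCmd.Flags().StringVar(&upstreamURL, "upstream-url", "", "address of the upstream control plane")
	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}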

Shutdown blocks waiting for all clients to be disconnected

Currently, during graceful shutdown, we wait for all clients to be disconnected before killing the server.

If we send one of the two signals to the server process, graceful shutdown blocks on the call to server.GracefulStop. Here's what we see in the logs:

{"level":"info","ts":1595981344.7454,"msg":"received interrupt signal:%!(EXTRA string=interrupt)","json":{}}
{"level":"info","ts":1595981344.745436,"msg":"initiating admin server shutdown","json":{}}
{"level":"info","ts":1595981344.745485,"msg":"initiating grpc graceful stop","json":{}}

Extend cache TTL only on changes to Response

The intent of adding a TTL to the cache was to prevent stale responses from lingering, rather than to bound memory like a traditional cache TTL. Therefore the TTL should only be extended when SetResponse is called on an existing item in the cache; it should not change on AddRequest, DeleteRequest, or changes to other values in the Resource struct.
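
A sketch of the intended behavior, assuming each cache entry carries its own expiry; the field and method names are illustrative, not the real Resource struct:

package cache

import (
	"sync"
	"time"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// Illustrative entry layout (lookup/creation of missing entries is omitted).
type entry struct {
	resp      *v2.DiscoveryResponse
	watches   map[*v2.DiscoveryRequest]struct{}
	expiresAt time.Time
}

type cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*entry
}

// SetResponse refreshes the TTL because the cached payload actually changed.
func (c *cache) SetResponse(key string, resp *v2.DiscoveryResponse) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e := c.entries[key]
	e.resp = resp
	e.expiresAt = time.Now().Add(c.ttl)
}

// AddRequest deliberately leaves expiresAt untouched: adding or removing
// watchers says nothing about the freshness of the response.
func (c *cache) AddRequest(key string, req *v2.DiscoveryRequest) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key].watches[req] = struct{}{}
}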

Properly handle (N)ACKs

It seems go-control-plane delegates handling of (N)ACKs to the cache implementation. For every request, the server will try to cancel the last watch and create a new one (https://github.com/envoyproxy/go-control-plane/blob/master/pkg/server/v2/server.go#L402). This causes duplicate pushes in xds-relay.

Environment

It should be easy to reproduce; istiod can be replaced with any other control plane.

config.yaml

admin:
  address:
    address: 0.0.0.0
    port_value: 8889
server:
  address:
    address: 0.0.0.0
    port_value: 8888
origin_server:
  address:
    address: istiod
    port_value: 15010
logging:
  level: DEBUG
metrics_sink:
  statsd:
    root_prefix: abc
    flush_interval: 30s
    address:
      address: 127.0.0.1
      port_value: 12345
cache:
  ttl: 30s
  max_entries: 100000

rules.yaml

fragments:
  - rules:
    - match:
        request_type_match:
          types:
          - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
      result:
        and_result:
          result_predicates:
          - request_node_fragment:
              # field: 2
              field: NODE_CLUSTER
              action:
                exact: true
          - string_fragment: eds
    - match:
        request_type_match:
          types:
          - "type.googleapis.com/envoy.api.v2.Cluster"
      result:
        and_result:
          result_predicates:
          - request_node_fragment:
              # field: 2
              field: NODE_CLUSTER
              action:
                exact: true
          - string_fragment: cds

e2e tests are printing raw formatted strings in debug mode

Notice the unexpanded %s and %v verbs:

{"level":"debug","ts":1589243912.031295,"msg":"respond %s%v version %q with version %q[type.googleapis.com/envoy.api.v2.ClusterLoadAssignment [cluster-v3-3]  v3]"}
{"level":"debug","ts":1589243912.0461426,"msg":"open watch %d for %s%v from nodeID %q, version %q[47 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment [cluster-v3-3] test-id v3]"}

Use go-control-plane's PassthroughResponse

We recently added support to go-control-plane to "pass through" DiscoveryResponses received from upstream directly to the downstream Envoy clients; we should add that support to xds-relay.

Add orchestrator interface

This is the component that is responsible for routing between the xDS clients and origin servers, calling mapper, cache, etc. and overall request/response management.

Configure grpc keepalive settings

xds-relay should configure server keepalive settings. It will help the grpc server handle stale streams and connections from sidecars.
https://godoc.org/google.golang.org/grpc/keepalive#ServerParameters

type ServerParameters struct {
    // MaxConnectionIdle is a duration for the amount of time after which an
    // idle connection would be closed by sending a GoAway. Idleness duration is
    // defined since the most recent time the number of outstanding RPCs became
    // zero or the connection establishment.
    MaxConnectionIdle time.Duration // The current default value is infinity.
    // MaxConnectionAge is a duration for the maximum amount of time a
    // connection may exist before it will be closed by sending a GoAway. A
    // random jitter of +/-10% will be added to MaxConnectionAge to spread out
    // connection storms.
    MaxConnectionAge time.Duration // The current default value is infinity.
    // MaxConnectionAgeGrace is an additive period after MaxConnectionAge after
    // which the connection will be forcibly closed.
    MaxConnectionAgeGrace time.Duration // The current default value is infinity.
    // After a duration of this time if the server doesn't see any activity it
    // pings the client to see if the transport is still alive.
    // If set below 1s, a minimum value of 1s will be used instead.
    Time time.Duration // The current default value is 2 hours.
    // After having pinged for keepalive check, the server waits for a duration
    // of Timeout and if no activity is seen even after that the connection is
    // closed.
    Timeout time.Duration // The current default value is 20 seconds.
} 

  • MaxConnectionIdle: should be a few hours
  • MaxConnectionAge: the default (infinity) works well
  • MaxConnectionAgeGrace: a few minutes
  • Time: 2 minutes
  • Timeout: the default (20s) works well
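
A sketch applying the suggested values above when constructing the gRPC server; the durations are the recommendations from this issue, not library defaults:

package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionIdle:     3 * time.Hour,   // recycle idle sidecar streams
			MaxConnectionAgeGrace: 5 * time.Minute, // drain window after GoAway
			Time:                  2 * time.Minute, // ping clients to detect dead transports
			// MaxConnectionAge and Timeout are left at their defaults
			// (infinity and 20s respectively).
		}),
	)
}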

Add client interface to upstream control plane

Create a gRPC client for xDS request forwarding to the control plane server. This is just a pass through of the first xDS request if a stream is not already opened on the control plane. This should be similar to how Envoy currently forwards requests to the management server.

Flaky behavior in unit test

--- FAIL: TestGoldenPath (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x164d819]
goroutine 7 [running]:
testing.tRunner.func1.1(0x171cd40, 0x1f507e0)
	/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:940 +0x2f5
testing.tRunner.func1(0xc0002ac5a0)
	/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:943 +0x3f9
panic(0x171cd40, 0x1f507e0)
	/Users/runner/hostedtoolcache/go/1.14.4/x64/src/runtime/panic.go:969 +0x166
github.com/envoyproxy/xds-relay/internal/pkg/util/testutils.AssertCounterValue(0xc0002ac5a0, 0xc00052fd70, 0xc00003cba0, 0x22, 0x1)
	/Users/runner/work/xds-relay/xds-relay/go/src/github.com/envoyproxy/xds-relay/internal/pkg/util/testutils/stats.go:14 +0xa9
github.com/envoyproxy/xds-relay/internal/app/orchestrator.TestGoldenPath(0xc0002ac5a0)
	/Users/runner/work/xds-relay/xds-relay/go/src/github.com/envoyproxy/xds-relay/internal/app/orchestrator/orchestrator_test.go:159 +0x88e
testing.tRunner(0xc0002ac5a0, 0x186ff80)
	/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
	/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:1042 +0x357
FAIL	github.com/envoyproxy/xds-relay/internal/app/orchestrator	0.223s

Discussion: https://envoyproxy.slack.com/archives/CTULFMN91/p1594079452076500

Pipe logs to filesink

The logs are currently only piped to stderr.
Since we take the file sink path in the bootstrap config, we can also pipe the logs to a file.
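
A minimal sketch, assuming the bootstrap config exposes a file path for the sink (as in the logging.path field of the /server_info sample elsewhere in this document): zap can write to several output paths at once.

package log

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newLogger builds a production logger that writes to stderr and, when a
// path is configured, to the file sink as well.
func newLogger(level zapcore.Level, filePath string) (*zap.Logger, error) {
	cfg := zap.NewProductionConfig()
	cfg.Level = zap.NewAtomicLevelAt(level)
	cfg.OutputPaths = []string{"stderr"}
	cfg.ErrorOutputPaths = []string{"stderr"}
	if filePath != "" {
		cfg.OutputPaths = append(cfg.OutputPaths, filePath)
		cfg.ErrorOutputPaths = append(cfg.ErrorOutputPaths, filePath)
	}
	return cfg.Build()
}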

Admin REST API: stats

Pasted from the tech spec:

GET /stats
Output all server stats.

Output Sample


cache::key::helloworld_production_lds::rq_count: 4
cache::key::foobar_production_cds::rq_count: 3
controlplane::connections::cx_active::1
controlplane::connections::cx_connect_fail::0
controlplane::connections::cx_destroy::0
controlplane::connections::cx_total::1
controlplane::requests::rq_1xx::0
controlplane::requests::rq_2xx::2
controlplane::requests::rq_3xx::0
controlplane::requests::rq_4xx::0
controlplane::requests::rq_5xx::0
controlplane::requests::rq_active::2
controlplane::requests::rq_error::0
controlplane::requests::rq_success::0
controlplane::requests::rq_timeout::0
controlplane::requests::rq_total::2
controlplane:::hostname::mycontrolplane.lyft.net
downstream::connections::cx_active::7
downstream::connections::cx_connect_fail::0
downstream::connections::cx_total::7
downstream::requests::rq_1xx::0
downstream::requests::rq_2xx::7
downstream::requests::rq_3xx::0
downstream::requests::rq_4xx::0
downstream::requests::rq_5xx::0
downstream::requests::rq_active::7
downstream::requests::rq_error::0
downstream::requests::rq_success::0
downstream::requests::rq_timeout::0
downstream::requests::rq_total::7
server::live: 1
server::memory_allocated: 16802608
server::memory_heap_size: 34603008
server::total_connections: 239
server::uptime: 132786
server::version: 11541598

Admin REST API: config dump

Pasted from the tech spec:

GET /server_info
Output server configuration. This is the bootstrap information.

Output Sample


{
  "address": {
    "socket_address": {
      "address": "0.0.0.0",
      "port_value": 9991,
      "protocol": "TCP"
    }
  },
  "control_plane_cluster": {
    "name": "my-control-plane",
    "load_assignment": {
      "cluster_name": "my-control-plane",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "my-control-plane.lyft.net",
                    "port_value": 80,
                    "protocol": "TCP"
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "logging": {
    "path": "/var/log/xds-aggregator",
    "level": "WARN"
  }
}

Implement State-of-the-world xDS cache API

Refer to the tech spec on Cache.

xDS Relay will maintain its own in-memory cache to keep the most recent response for each aggregated key.

Each key will maintain the following fields:

  • resp: the most recent discovery response from the control plane
  • watchers: The open watches (requests) awaiting a response.
  • stream_open: Identifies whether there is already a connection open with go-control-plane for this aggregated key. If the value is false, the first request will be propagated to the control plane.

Implement Cache interface and API for the state-of-the-world xDS Requests. This is a proposed set of APIs.

Exists(key: string): bool
- Return true if the key exists, false otherwise.

Fetch(key: string): *v3.Response
- Return the cached response if it exists, otherwise nil.

IsStreamOpen(key: string): bool
- Return false if no stream is opened against the upstream control plane for the given key.

SetStreamOpen(open: bool)
- Sets whether there is a stream open against the upstream control plane.

Set(resp: v3.Response, key: string): ([]*v3.Request, err)
- Set the cache response and return the list of open watches.

Clear(key: string)
- Discards the existing watches. This is meant to be used in tandem with the API above.

Set(req: v3.Request, key: string): (bool, err)
- Adds the watch to the cache. Returns whether a stream is open.
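
The proposal above written out as a Go interface for concreteness; this is a sketch, with the two Set overloads split into SetResponse and AddRequest since Go has no overloading, and the v2 discovery protos used purely for illustration:

package cache

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// Cache is a sketch of the proposed state-of-the-world cache API.
type Cache interface {
	// Exists returns true if the key is present in the cache.
	Exists(key string) bool
	// Fetch returns the cached response, or nil if none has been set.
	Fetch(key string) *v2.DiscoveryResponse
	// IsStreamOpen reports whether an upstream stream is open for the key.
	IsStreamOpen(key string) bool
	// SetStreamOpen records whether an upstream stream is open for the key.
	SetStreamOpen(key string, open bool)
	// SetResponse stores the latest upstream response and returns the open
	// watches that should be fanned out to.
	SetResponse(key string, resp *v2.DiscoveryResponse) ([]*v2.DiscoveryRequest, error)
	// AddRequest registers a downstream watch and reports whether an upstream
	// stream is already open for the key.
	AddRequest(key string, req *v2.DiscoveryRequest) (bool, error)
	// Clear discards the existing watches for the key.
	Clear(key string)
}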

e2e tests exhibiting flaky behavior

I saw a test failure for one of the e2e tests on my local fork after merging master/upstream.

TestSnapshotCacheSingleEnvoyAndXdsRelayServer: e2e_test.go:106: 
        Timed out after 1.000s.
        Error: Unexpected non-nil/non-zero extra argument at index 1:
        	<int>: 9
--- FAIL: TestSnapshotCacheSingleEnvoyAndXdsRelayServer (1.23s)
FAIL
FAIL	github.com/envoyproxy/xds-relay/integration	2.266s
FAIL
make: *** [Makefile:25: e2e-tests] Error 1
##[error]Process completed with exit code 2.

The flakiness is likely due to timeouts.

Set up integration test bed

Add an integration test framework to simulate the downstream / xds-relay / upstream workflow.

go-control-plane has a Docker image that sets up a very basic control plane server and Envoy sidecar; use it for reference.

Implement xDS request -> cache key mapper

Given xDS Request and aggregation rules (#6), define the functionality to parse the request and generate the cache key.

Each cache key maps to a set of requests that expect the same response from the upstream control plane.
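
An illustrative shape for the mapper; the interface name and signature are assumptions, not the final API:

package mapper

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// Mapper derives the aggregated cache key for an incoming xDS request by
// evaluating the aggregation rules against request fields such as the node
// cluster and the requested type URL (e.g. producing keys like "mycluster_eds").
type Mapper interface {
	GetKey(request v2.DiscoveryRequest) (string, error)
}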

Cache output formatting

When using the /cache/ admin endpoint, the expiration time is printed as "0001-01-01T00:00:00Z", and some parts of the config are still in their marshalled format, e.g.:

"TypedConfig": {
  "type_url": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager",
  "value": "EgxpbmdyZXNzX2h0dHAqhwEKEmV..."

Define Proto for request aggregation rules

Define and implement Aggregation “Keyer” API / Protobufs.

These are the aggregation rules necessary for users of xds-relay to define how to aggregate xDS requests.

Things to keep in mind are versioning, and extensibility to support matching on future request fields.

A possible implementation exists here along with more contextual detail.

Admin REST API: cache dump

Pasted from the tech spec:

GET /cache/<key>
Output the contents of the xDS Aggregator Cache.

Arguments
key [optional]: The cache key. Regex suffixes are supported.

Output Sample
Note: The DiscoveryResponse listed below is expanded into the DiscoveryRequest and DiscoveryResponse proto formats. It’s not listed here to keep the document compact.
Query: GET /cache/foo_production*

{ 
   "foo_production_lds": { 
      "response": *v3.DiscoveryResponse,
      "watchers": [ *v3.DiscoveryRequest, *v3.DiscoveryRequest, ... ]
      "stream_open": false,
   },
   "foo_production_eds":{ 
      "response": nil,
      "watchers": [ *v3.DiscoveryRequest ]
      "stream_open": true,
   },
}

Handle cache set failure scenario

Within the orchestrator, if the cache set fails when setting the response, we may need to issue a retry upstream. Currently the fallback is to rely on a future response, but that isn't ideal.
