envoyproxy / xds-relay
Caching, aggregation, and relaying for xDS compliant clients and origin servers
License: Apache License 2.0
This API lets users alter the log level while the server is running. One obstacle is that altering the log level isn't currently feasible, because the logger is initialized once today and passed as a parameter to other components.
Pasted from the tech spec:
POST /log_level/<level>
Change the log level of xDS Aggregator.
Arguments
level [mandatory] : The verbosity of the logs, one of DEBUG, WARN, or ERROR.
For example, if upstream client is named and initialized first, with orchestrator initialized later, then orchestrator gets named as a subscope, upstream client.orchestrator, which is undesired.
The contents of With() are also being passed to subsequent log lines, producing undesired side effects. In the example below, creating watch outputs server address information that was never intended to be printed:
{"level":"info","ts":1589230516.361459,"logger":"upstream client","msg":"Initiating upstream connection","address":"127.0.0.1:18000"}
{"level":"info","ts":1589230516.3624716,"logger":"upstream client.orchestrator","msg":"Initializing server","address":"127.0.0.1:18000","address":"127.0.0.1:9991"}
{"level":"info","ts":1589230516.4036033,"logger":"upstream client.orchestrator","msg":"creating watch","address":"127.0.0.1:18000","address":"127.0.0.1:9991"}
When onCancel is called from go-control-plane to the Orchestrator, the Orchestrator removes management of this watch, but the watch persists in the Cache until the TTL expires. We could manage resources more effectively if a DeleteWatch(req *v2.DiscoveryRequest) function existed.
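A minimal sketch of the proposed addition, assuming the v2 discovery types from go-control-plane (the interface name is hypothetical):

package cache

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// WatchDeleter is a hypothetical extension to the cache interface.
type WatchDeleter interface {
	// DeleteWatch removes the watch associated with req immediately,
	// instead of leaving it in the cache until the TTL expires.
	DeleteWatch(req *v2.DiscoveryRequest) error
}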
This test seems to be failing more frequently:
=== RUN TestXdsClientGetsIncrementalResponsesFromUpstreamServer
2020/05/12 20:31:22 management server listening on 19001
TestXdsClientGetsIncrementalResponsesFromUpstreamServer: upstream_client_test.go:47:
Error Trace: upstream_client_test.go:47
Error: Setup failed: %s
Test: TestXdsClientGetsIncrementalResponsesFromUpstreamServer
Messages: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:19001: connect: connection refused"
And,
=== RUN TestServerShutdownShouldCloseResponseChannel
2020/05/12 18:43:45 listen tcp :19001: bind: address already in use
FAIL github.com/envoyproxy/xds-relay/integration 7.554s
FAIL
make: *** [integration-tests] Error 1
The cache is in the format aggregated_key: {Response, Requests[]}. Request is a pointer to the entire request object, which means that subsequent (N)ACKs (which increment the nonce) mutate the cache. This is not breaking, because we clean up prior requests, but there is room for optimization here so that we don't constantly mutate the cache on (N)ACKs.
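One option is to store a deep copy of each request, so nonce updates on the caller's object never reach the cache. A minimal sketch, assuming the v2 discovery types and golang/protobuf (the entry and cache types here are hypothetical):

package cache

import (
	"sync"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	"github.com/golang/protobuf/proto"
)

type entry struct {
	Response *v2.DiscoveryResponse
	Requests []*v2.DiscoveryRequest
}

type cache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

// AddRequest stores a deep copy of the request so that subsequent (N)ACK
// nonce updates on the caller's object do not mutate the cached entry.
func (c *cache) AddRequest(key string, req *v2.DiscoveryRequest) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	e.Requests = append(e.Requests, proto.Clone(req).(*v2.DiscoveryRequest))
}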
On cache evict, the orchestrator currently closes the upstream watch and all downstream watches for the key that's being evicted. This could be problematic if there are many downstream connections for the same aggregated key: it will cause a "thundering herd" problem where all downstreams attempt to re-establish a connection with xds-relay at once. We need to evaluate how to handle this.
Use Cobra or other tooling so that we can start up xds-relay via command line flags or a config file. For example:
xds-relay -c bootstrap.yml
xds-relay -v INFO --upstream-url example.com
Initially support log level, upstream URL, server address, and port.
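A minimal sketch using Cobra that covers the flags above (the flag names and defaults are assumptions):

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var configFile, logLevel, upstreamURL, serverAddr string
	var serverPort uint16

	rootCmd := &cobra.Command{
		Use:   "xds-relay",
		Short: "Caching, aggregation, and relaying for xDS clients and origin servers",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Hand off to the server bootstrap here; flags would override config values.
			fmt.Printf("config=%s level=%s upstream=%s addr=%s:%d\n",
				configFile, logLevel, upstreamURL, serverAddr, serverPort)
			return nil
		},
	}
	rootCmd.Flags().StringVarP(&configFile, "config", "c", "", "path to the bootstrap config file")
	rootCmd.Flags().StringVarP(&logLevel, "log-level", "v", "INFO", "log verbosity: DEBUG, INFO, WARN, or ERROR")
	rootCmd.Flags().StringVar(&upstreamURL, "upstream-url", "", "origin server URL")
	rootCmd.Flags().StringVar(&serverAddr, "server-addr", "0.0.0.0", "server listen address")
	rootCmd.Flags().Uint16Var(&serverPort, "server-port", 8888, "server listen port")

	if err := rootCmd.Execute(); err != nil {
		os.Exit(1)
	}
}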
There is a potential race condition between the delete operation and the two operations (get response channel, add response) here:
xds-relay/internal/app/orchestrator/orchestrator.go
Lines 340 to 341 in 37af3c5
Currently, during graceful shutdown, we wait for all clients to disconnect before killing the server. If we send one of the two signals to the server process, the graceful shutdown process blocks on the call to server.GracefulStop. Here's what we see in the logs:
{"level":"info","ts":1595981344.7454,"msg":"received interrupt signal:%!(EXTRA string=interrupt)","json":{}}
{"level":"info","ts":1595981344.745436,"msg":"initiating admin server shutdown","json":{}}
{"level":"info","ts":1595981344.745485,"msg":"initiating grpc graceful stop","json":{}}
The intent of adding a TTL to the cache was to prevent stale responses from lingering in the cache, more so than a traditional cache TTL meant to reduce memory. The TTL should therefore only be extended when SetResponse is called on an existing item in the cache; it should not change on AddRequest, DeleteRequest, or any changes to other values in the Resource struct.
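A minimal sketch of that rule, assuming a Resource struct holding an expiration timestamp (the field and method names are hypothetical):

package cache

import (
	"time"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

type Resource struct {
	Resp           *v2.DiscoveryResponse
	Requests       []*v2.DiscoveryRequest
	ExpirationTime time.Time
}

// SetResponse is the only operation that refreshes the TTL.
func (r *Resource) SetResponse(resp *v2.DiscoveryResponse, ttl time.Duration) {
	r.Resp = resp
	r.ExpirationTime = time.Now().Add(ttl)
}

// AddRequest intentionally leaves ExpirationTime untouched.
func (r *Resource) AddRequest(req *v2.DiscoveryRequest) {
	r.Requests = append(r.Requests, req)
}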
The resources in the DiscoveryResponse are still in their marshalled any.Any form when fetched with the cache_dump API (https://github.com/envoyproxy/xds-relay/pull/74/files#r426102814). The cache_dump API should inspect the type_url to unmarshal the Any variable into the correct resource type and make the output more readable: https://envoyproxy.slack.com/archives/CTULFMN91/p1590087624055200?thread_ts=1590087384.054900&cid=CTULFMN91.
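A minimal sketch of that expansion, assuming the newer google.golang.org/protobuf API and that the resource types are linked into the binary so the type registry can resolve them:

package admin

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/anypb"
)

// expandAny resolves the concrete message type from the type_url and
// renders it as human-readable JSON for the cache_dump output.
func expandAny(a *anypb.Any) ([]byte, error) {
	msg, err := a.UnmarshalNew() // looks up the type via the global registry
	if err != nil {
		return nil, err
	}
	return protojson.Marshal(msg)
}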
It seems go-control-plane delegates handling of (N)ACKs to the cache implementation. For every request, the server will try to cancel the last watch and create a new watch (https://github.com/envoyproxy/go-control-plane/blob/master/pkg/server/v2/server.go#L402). This causes duplicate pushes in xds-relay.
Environment
This should be easy to reproduce; istiod can be replaced with any other control plane.
config.yaml
admin:
  address:
    address: 0.0.0.0
    port_value: 8889
server:
  address:
    address: 0.0.0.0
    port_value: 8888
origin_server:
  address:
    address: istiod
    port_value: 15010
logging:
  level: DEBUG
metrics_sink:
  statsd:
    root_prefix: abc
    flush_interval: 30s
    address:
      address: 127.0.0.1
      port_value: 12345
cache:
  ttl: 30s
  max_entries: 100000
rules.yaml
fragments:
- rules:
  - match:
      request_type_match:
        types:
        - "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment"
    result:
      and_result:
        result_predicates:
        - request_node_fragment:
            # field: 2
            field: NODE_CLUSTER
            action:
              exact: true
        - string_fragment: eds
  - match:
      request_type_match:
        types:
        - "type.googleapis.com/envoy.api.v2.Cluster"
    result:
      and_result:
        result_predicates:
        - request_node_fragment:
            # field: 2
            field: NODE_CLUSTER
            action:
              exact: true
        - string_fragment: cds
Notice the %s and %v verbs that are left unexpanded:
{"level":"debug","ts":1589243912.031295,"msg":"respond %s%v version %q with version %q[type.googleapis.com/envoy.api.v2.ClusterLoadAssignment [cluster-v3-3] v3]"}
{"level":"debug","ts":1589243912.0461426,"msg":"open watch %d for %s%v from nodeID %q, version %q[47 type.googleapis.com/envoy.api.v2.ClusterLoadAssignment [cluster-v3-3] test-id v3]"}
We recently added support to go-control-plane to "pass through" DiscoveryResponses received upstream directly to the downstream Envoy clients and should add that support to xds-relay.
This is the component that is responsible for routing between the xDS clients and origin servers, calling mapper, cache, etc. and overall request/response management.
A few xdsclient tests could benefit from adding a stats framework and asserting on those stats.
ref: #46 (comment)
xds-relay should configure server keepalive settings (a wiring sketch follows the recommendations below). This will help the gRPC server handle stale streams and connections from sidecars.
https://godoc.org/google.golang.org/grpc/keepalive#ServerParameters
type ServerParameters struct {
// MaxConnectionIdle is a duration for the amount of time after which an
// idle connection would be closed by sending a GoAway. Idleness duration is
// defined since the most recent time the number of outstanding RPCs became
// zero or the connection establishment.
MaxConnectionIdle time.Duration // The current default value is infinity.
// MaxConnectionAge is a duration for the maximum amount of time a
// connection may exist before it will be closed by sending a GoAway. A
// random jitter of +/-10% will be added to MaxConnectionAge to spread out
// connection storms.
MaxConnectionAge time.Duration // The current default value is infinity.
// MaxConnectionAgeGrace is an additive period after MaxConnectionAge after
// which the connection will be forcibly closed.
MaxConnectionAgeGrace time.Duration // The current default value is infinity.
// After a duration of this time if the server doesn't see any activity it
// pings the client to see if the transport is still alive.
// If set below 1s, a minimum value of 1s will be used instead.
Time time.Duration // The current default value is 2 hours.
// After having pinged for keepalive check, the server waits for a duration
// of Timeout and if no activity is seen even after that the connection is
// closed.
Timeout time.Duration // The current default value is 20 seconds.
}
MaxConnectionIdle: should be a few hours
MaxConnectionAge: the default (infinity) works well
MaxConnectionAgeGrace: a few minutes
Time: 2 minutes
Timeout: the default (20s) works well
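A minimal sketch of applying those values with grpc-go's keepalive package (the durations are the suggestions above, not requirements):

package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle:     3 * time.Hour,   // close idle sidecar connections
		MaxConnectionAgeGrace: 5 * time.Minute, // grace period after MaxConnectionAge
		Time:                  2 * time.Minute, // ping inactive clients
		// MaxConnectionAge and Timeout keep their defaults (infinity and 20s).
	}))
}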
Currently, we just error out. Implement configurable retry and backoff logic to re-attempt opening the stream on failures.
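A minimal sketch of exponential backoff, assuming a caller-supplied open function (the bounds are assumptions; a real version would make them configurable and add jitter):

package upstream

import (
	"context"
	"time"
)

// openStreamWithRetry re-attempts open with exponential backoff until it
// succeeds or the context is cancelled.
func openStreamWithRetry(ctx context.Context, open func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second
	for {
		if err := open(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}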
Currently we force close all downstream connections, forcing downstream clients to issue a retry; alternatively, we could fold that work into this issue as well.
Create a gRPC client for forwarding xDS requests to the control plane server. This is just a pass-through of the first xDS request if a stream is not already open to the control plane, similar to how Envoy currently forwards requests to the management server.
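A minimal sketch of opening an ADS stream with go-control-plane's v2 discovery service (the address handling and insecure transport are assumptions):

package upstream

import (
	"context"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2"
	"google.golang.org/grpc"
)

// forwardFirstRequest dials the origin server, opens an aggregated
// discovery stream, and passes the first request through.
func forwardFirstRequest(ctx context.Context, addr string, req *v2.DiscoveryRequest) (*v2.DiscoveryResponse, error) {
	conn, err := grpc.Dial(addr, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	stream, err := discovery.NewAggregatedDiscoveryServiceClient(conn).StreamAggregatedResources(ctx)
	if err != nil {
		return nil, err
	}
	if err := stream.Send(req); err != nil {
		return nil, err
	}
	return stream.Recv()
}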
The cache currently stores the raw discovery response. Store the marshaled resources alongside it so that we don't have to re-serialize on every downstream fan-out.
Set up Azure Pipelines for:
--- FAIL: TestGoldenPath (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x164d819]
goroutine 7 [running]:
testing.tRunner.func1.1(0x171cd40, 0x1f507e0)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:940 +0x2f5
testing.tRunner.func1(0xc0002ac5a0)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:943 +0x3f9
panic(0x171cd40, 0x1f507e0)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/runtime/panic.go:969 +0x166
github.com/envoyproxy/xds-relay/internal/pkg/util/testutils.AssertCounterValue(0xc0002ac5a0, 0xc00052fd70, 0xc00003cba0, 0x22, 0x1)
/Users/runner/work/xds-relay/xds-relay/go/src/github.com/envoyproxy/xds-relay/internal/pkg/util/testutils/stats.go:14 +0xa9
github.com/envoyproxy/xds-relay/internal/app/orchestrator.TestGoldenPath(0xc0002ac5a0)
/Users/runner/work/xds-relay/xds-relay/go/src/github.com/envoyproxy/xds-relay/internal/app/orchestrator/orchestrator_test.go:159 +0x88e
testing.tRunner(0xc0002ac5a0, 0x186ff80)
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
/Users/runner/hostedtoolcache/go/1.14.4/x64/src/testing/testing.go:1042 +0x357
FAIL github.com/envoyproxy/xds-relay/internal/app/orchestrator 0.223s
Discussion: https://envoyproxy.slack.com/archives/CTULFMN91/p1594079452076500
The logs are only piped to stderr. Since we take the file sink path in the bootstrap config, we can also pipe the logs to a file.
Pasted from the tech spec:
GET /stats
Output all server stats.
Output Sample
cache::key::helloworld_production_lds::rq_count: 4
cache::key::foobar_production_cds::rq_count: 3
controlplane::connections::cx_active::1
controlplane::connections::cx_connect_fail::0
controlplane::connections::cx_destroy::0
controlplane::connections::cx_total::1
controlplane::requests::rq_1xx::0
controlplane::requests::rq_2xx::2
controlplane::requests::rq_3xx::0
controlplane::requests::rq_4xx::0
controlplane::requests::rq_5xx::0
controlplane::requests::rq_active::2
controlplane::requests::rq_error::0
controlplane::requests::rq_success::0
controlplane::requests::rq_timeout::0
controlplane::requests::rq_total::2
controlplane::hostname::mycontrolplane.lyft.net
downstream::connections::cx_active::7
downstream::connections::cx_connect_fail::0
downstream::connections::cx_total::7
downstream::requests::rq_1xx::0
downstream::requests::rq_2xx::7
downstream::requests::rq_3xx::0
downstream::requests::rq_4xx::0
downstream::requests::rq_5xx::0
downstream::requests::rq_active::7
downstream::requests::rq_error::0
downstream::requests::rq_success::0
downstream::requests::rq_timeout::0
downstream::requests::rq_total::7
server::live: 1
server::memory_allocated: 16802608
server::memory_heap_size: 34603008
server::total_connections: 239
server::uptime: 132786
server::version: 11541598
See: https://github.com/envoyproxy/xds-relay/pull/53/files#diff-2958638164db92742469d2bb08d0001aR122-R128
It is possible for a discovery request to not get bucketed into any aggregation rule defined at server bootstrap. Rather than dropping such a request, we'd like to continue to propagate it upstream.
Pasted from the tech spec:
GET /server_info
Output server configuration. This is the bootstrap information.
Output Sample
{
  "address": {
    "socket_address": {
      "address": "0.0.0.0",
      "port_value": 9991,
      "protocol": "TCP"
    }
  },
  "control_plane_cluster": {
    "name": "my-control-plane",
    "load_assignment": {
      "cluster_name": "my-control-plane",
      "endpoints": [
        {
          "lb_endpoints": [
            {
              "endpoint": {
                "address": {
                  "socket_address": {
                    "address": "my-control-plane.lyft.net",
                    "port_value": 80,
                    "protocol": "TCP"
                  }
                }
              }
            }
          ]
        }
      ]
    }
  },
  "logging": {
    "path": "/var/log/xds-aggregator",
    "level": "WARN"
  }
}
Refer to the tech spec on Cache.
xDS Relay will maintain its own in-memory cache to keep the most recent response for a given aggregated key.
Each key will maintain the following fields:
Implement the Cache interface and API for state-of-the-world xDS requests. This is a proposed set of APIs; a Go sketch follows the list.
Exists(key: string): bool
- Returns true if the key exists.
Fetch(key: string): *v3.Response
- Returns the cached response if it exists, otherwise nil.
IsStreamOpen(key: string): bool
- Returns false if no stream is open against the upstream control plane for the given key.
SetStreamOpen(open: bool)
- Sets whether there is a stream open against the upstream control plane.
Set(resp: v3.Response, key: string): ([]*v3.Request, err)
- Sets the cached response and returns the list of open watches.
Clear(key: string)
- Discards the existing watches. This is meant to be used in tandem with the API above.
Set(req: v3.Request, key: string): (bool, err)
- Adds the watch to the cache. Returns whether a stream is open.
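A minimal Go sketch of that proposal. Since Go has no method overloading, the two Set variants get distinct names here; SetResponse and AddRequest are stand-ins, and the v3 discovery types from go-control-plane are assumed:

package cache

import (
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

type Cache interface {
	// Exists returns true if the key exists.
	Exists(key string) bool
	// Fetch returns the cached response if it exists, otherwise nil.
	Fetch(key string) *discoveryv3.DiscoveryResponse
	// IsStreamOpen reports whether a stream is open against the upstream
	// control plane for the given key.
	IsStreamOpen(key string) bool
	// SetStreamOpen records whether a stream is open against the upstream
	// control plane.
	SetStreamOpen(open bool)
	// SetResponse stores the response and returns the open watches to fan
	// the response out to.
	SetResponse(key string, resp *discoveryv3.DiscoveryResponse) ([]*discoveryv3.DiscoveryRequest, error)
	// Clear discards the existing watches; used in tandem with SetResponse.
	Clear(key string)
	// AddRequest adds the watch to the cache and reports whether a stream
	// is open.
	AddRequest(key string, req *discoveryv3.DiscoveryRequest) (bool, error)
}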
I saw a test failure for one of the e2e tests on my local fork after merging master/upstream.
TestSnapshotCacheSingleEnvoyAndXdsRelayServer: e2e_test.go:106:
Timed out after 1.000s.
Error: Unexpected non-nil/non-zero extra argument at index 1:
<int>: 9
--- FAIL: TestSnapshotCacheSingleEnvoyAndXdsRelayServer (1.23s)
FAIL
FAIL github.com/envoyproxy/xds-relay/integration 2.266s
FAIL
make: *** [Makefile:25: e2e-tests] Error 1
##[error]Process completed with exit code 2.
The flakiness is likely due to timeouts.
Support the freeform 'metadata' field in the aggregation rules.
API: https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/base.proto
Include:
The xdsclient today doesn't have the correct gRPC params. A few params we should explore are: (link)
Add an integration test framework to simulate the downstream / xds-relay / upstream workflow.
go-control-plane has a docker image that sets up a very basic control plane server and Envoy sidecar; use it for reference.
Given an xDS request and the aggregation rules (#6), define the functionality to parse the request and generate the cache key. Each cache key maps to a set of requests that expect the same response from the upstream control plane.
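A hypothetical sketch of such key derivation; the cluster-plus-type shape and the "_" separator are assumptions that echo the eds/cds string fragments in the rules.yaml above, not the project's actual mapper:

package mapper

import (
	"fmt"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
)

// shortName abbreviates a type URL the way the aggregation rules above do.
func shortName(typeURL string) string {
	switch typeURL {
	case "type.googleapis.com/envoy.api.v2.ClusterLoadAssignment":
		return "eds"
	case "type.googleapis.com/envoy.api.v2.Cluster":
		return "cds"
	default:
		return typeURL
	}
}

// GetKey derives a cache key so that requests sharing the key expect the
// same upstream response.
func GetKey(req *v2.DiscoveryRequest) (string, error) {
	if req.GetNode() == nil {
		return "", fmt.Errorf("request has no node")
	}
	return fmt.Sprintf("%s_%s", req.GetNode().GetCluster(), shortName(req.GetTypeUrl())), nil
}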
When using the /cache/ admin endpoint, the expiration time is printed as "0001-01-01T00:00:00Z", and some parts of the config are still in their marshalled format, e.g.:
"TypedConfig": {
  "type_url": "type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager",
  "value": "EgxpbmdyZXNzX2h0dHAqhwEKEmV..."
}
See #75 for more details.
The cache_dump admin endpoint should support regex keys, i.e. dumping resources with keys that match the given regex from the cache.
Define and implement Aggregation “Keyer” API / Protobufs.
These are the aggregation rules necessary for users of xds-relay to define how to aggregate xDS requests.
Things to keep in mind are versioning, and extensibility to support matching on future request fields.
A possible implementation exists here along with more contextual detail.
See #75 for more details.
The cache_dump endpoint should dump everything in the cache if no key is specified.
Now that we have a stats framework (#61), we should add stats to existing workflows in the cache, orchestrator, upstream, etc. components.
If unspecified, we currently assume that the management server is running on port 10, but that's a registered port.
Define a standalone process / tool for validating the .yaml bootstrap configuration in #6.
This tooling is meant to be run by end users of xDS Relay in their CI or test systems.
Envoy has a route checker tool that can be used for inspiration.
Pasted from the tech spec:
GET /cache/<key>
Output the contents of the xDS Aggregator Cache.
Arguments
key [optional] : The cache key. Regex suffixes are supported.
Output Sample
Note: The DiscoveryResponse listed below is expanded into the DiscoveryRequest and DiscoveryResponse proto formats. It’s not listed here to keep the document compact.
Query: GET /cache/foo_production*
{
  "foo_production_lds": {
    "response": *v3.DiscoveryResponse,
    "watchers": [ *v3.DiscoveryRequest, *v3.DiscoveryRequest, ... ],
    "stream_open": false
  },
  "foo_production_eds": {
    "response": nil,
    "watchers": [ *v3.DiscoveryRequest ],
    "stream_open": true
  }
}
Within the orchestrator, if setting the response in the cache fails, we may need to issue a retry upstream. Currently the fallback is to rely on a future response, but that isn't ideal: otherwise clients are perpetually waiting and will not re-request.