The aws-app-mesh-roadmap from aws

Simplify external service egress traffic setup

As discussed in #74, the current way to model an external service that a service within the mesh can route to is to model the external service as a VirtualNode. For example, suppose you have two services named Service-A and Service-B, where Service-B is an external service (e.g. gitlab) hosted at the DNS name gitlab.my-intranet.com. If you want the VirtualNode representing Service-A to be able to egress traffic to Service-B, you would model your mesh configuration as:

Service-A:

{
    "meshName": "foo",
    "virtualNodeName": "service-a",
    "spec": {
        "listeners": [
            {
                "portMapping": {
                    "port": 8080,
                    "protocol": "http"
                }
            }
        ],
        "serviceDiscovery": {
            "dns": {
                "serviceName": "service-a.foo-mesh.local"
            }
        },
        "backends": [
            "gitlab.my-intranet.com"
        ]
    }
}

Service-B:

{
    "meshName": "foo",
    "virtualNodeName": "service-b",
    "spec": {
        "listeners": [
            {
                "portMapping": {
                    "port": 80,
                    "protocol": "tcp"
                },
                "healthCheck": {
                    "protocol": "tcp",
                    "healthyThreshold": 2,
                    "unhealthyThreshold": 2,
                    "timeoutMillis": 2000,
                    "intervalMillis": 5000
                }
            }
        ],
        "serviceDiscovery": {
            "dns": {
                "serviceName": "gitlab.my-intranet.com"
            }
        }
    }
}

The VirtualNode model contains many specifications which would not normally apply to an external service not within the control of the mesh (such as backends), while others still do (such as health checks).

This issue is to track the investigation of a general simplification of modeling external entities within the mesh.
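
To make the duplication concrete, here is a minimal sketch of a helper that builds the Service-B style VirtualNode document for any external DNS name. The function name `external_service_node` and its parameters are hypothetical, and the health check values are simply copied from the example above; this is not an App Mesh API.

```python
import json


def external_service_node(mesh_name, node_name, dns_name, port, protocol="tcp"):
    """Build a VirtualNode document for a service that lives outside the mesh.

    Hypothetical helper: it mirrors the Service-B example and omits "backends",
    which does not normally apply to an external service.
    """
    return {
        "meshName": mesh_name,
        "virtualNodeName": node_name,
        "spec": {
            "listeners": [
                {
                    "portMapping": {"port": port, "protocol": protocol},
                    # Health checks still apply to external services.
                    "healthCheck": {
                        "protocol": protocol,
                        "healthyThreshold": 2,
                        "unhealthyThreshold": 2,
                        "timeoutMillis": 2000,
                        "intervalMillis": 5000,
                    },
                }
            ],
            "serviceDiscovery": {"dns": {"serviceName": dns_name}},
        },
    }


node = external_service_node("foo", "service-b", "gitlab.my-intranet.com", 80)
print(json.dumps(node, indent=4))
```

Only the DNS name, port, and protocol vary between external services in this sketch, which suggests how small a dedicated external-service model could be.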

Enable ECS with other networking modes

In ECS, the awsvpc networking mode doesn't meet everyone's needs. Enabling other network modes would be great.

End to end encryption of traffic with ACM managed certs

Reusing a VirtualNode name results in stale Envoy configuration

Summary

When deleting and re-creating a VirtualNode with the same name (e.g. my-virtual-node under a Mesh named my-mesh), an Envoy identified as that VirtualNode name (e.g. mesh/my-mesh/virtualNode/my-virtual-node) which connects shortly after re-creating the VirtualNode may receive the previous VirtualNode's configuration. This may commonly occur if the Envoy is connected in less than 10 seconds from the time the VirtualNode was re-created.

This issue is closely related to #49.

We are working on a solution for this bug that will occasionally check for this state and request that the Envoy reconnect to receive updated configuration.

Steps to reproduce

Create a VirtualNode
Connect an Envoy identified as that VirtualNode name (e.g. mesh/my-mesh/virtualNode/my-virtual-node) and ensure it's working properly
Disconnect the Envoy
Delete the VirtualNode
Create a new VirtualNode using the same name and connect an Envoy within a few seconds after creation (or at the same time)

Expected behavior: The Envoy receives the configuration for the new VirtualNode
Actual behavior: The Envoy receives the configuration for the previous VirtualNode by the same name

Work-arounds

When creating the new VirtualNode, create it using a different name which has never been used.
When identifying the Envoy, set APPMESH_VIRTUAL_NODE_NAME to the VirtualNode's UID as returned by the CreateVirtualNode API, instead of the ARN or truncated resource name.
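
The first work-around can be sketched as follows, assuming a random suffix is an acceptable way to get a name that has not been used before. Both function names are hypothetical; the `mesh/<mesh>/virtualNode/<name>` identifier format is the one quoted in the summary above.

```python
import uuid


def fresh_virtual_node_name(base_name):
    """Return base_name plus a random 8-character hex suffix.

    A collision with an earlier name is extremely unlikely but not impossible.
    """
    suffix = uuid.uuid4().hex[:8]
    return f"{base_name}-{suffix}"


def envoy_node_id(mesh_name, virtual_node_name):
    """Format the identifier an Envoy uses for a given VirtualNode."""
    return f"mesh/{mesh_name}/virtualNode/{virtual_node_name}"


new_name = fresh_virtual_node_name("my-virtual-node")
print(envoy_node_id("my-mesh", new_name))
```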

Commit X-Ray tracer plugin to Envoy upstream

We are developing an Envoy tracing driver for AWS X-Ray (https://github.com/awslabs/aws-app-mesh-examples/issues/5). This tracing driver will be upstreamed to Envoy.

ECS integration with App Mesh in the ECS console

Integration with Consul Service Discovery

The primary mechanism will be via AWS Cloud Map. We will work with HashiCorp to build a two-way sync between Consul and Cloud Map.

Bring Envoy from official release

You should be able to use the official release of Envoy.

Region expansion

Expand to all AWS regions.

Current region list

HTTP Header based routing

Open Source the App Mesh Envoy Image build, release, and validation tools

As part of #16, we want customers to be able to:

Build our canonical images from source
Bring your own Envoy (BYOE!) to our existing build scripts

Integration with EKS

The integration will happen primarily with a controller running in the customer's cluster on the master instances, managed by EKS. The controller will watch the Kubernetes API of the customer's cluster and react to certain objects being created or modified. It will create the necessary components in AppMesh and CloudMap.

Initial support will be for a single AppMesh mesh and a single CloudMap namespace per cluster (though many clusters can share a mesh/namespace). Customers can provide an existing mesh/namespace as well.

An additional component that will be used on the customer's worker nodes is the App Mesh CNI (https://github.com/awslabs/aws-app-mesh-examples/issues/15). Its responsibility is to enter the network namespace of a new pod and set up iptables rules to route incoming and outgoing traffic through Envoy. This takes the place of an init container, and is preferred to avoid having to run privileged containers altogether.

Optionally, a mutating admission webhook could be employed to inject envoy as a sidecar container into pods that are launched in the cluster.

Accessing host instance

Hello, I was wondering what would be the correct way to access the host on which a service is running. It would be cumbersome to model each host as a TCP virtual node and virtual service and add them all as backends to each service to account for dynamic task placement.

My initial thought is to add them to the APPMESH_EGRESS_IGNORED_IP since that's how the metadata services are reached, but I wanted to know if there was a better approach.

Thanks!

Retry Policy

A Retry Policy in App Mesh enables clients to protect themselves from intermittent network failures, or intermittent server-side failures. A Retry Policy is an immutable entity in App Mesh that allows users to specify the conditions under which a retry is attempted, including HTTP status codes that will trigger a retry. A Retry Policy also has parameters specifying how many times to retry, and the timeout to use per retry.

Once a Retry Policy is created, it can be attached to one or more Virtual Nodes as part of the backends. Each backend in a Virtual Node can have its own retry policy.
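
As an illustration of the parameters described above, here is a minimal client-side sketch of a retry loop. The `RetryPolicy` fields and the `call_with_retries` function are hypothetical names chosen for this example and are not the App Mesh API; the `send` callable stands in for a real HTTP call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen mirrors the "immutable entity" wording
class RetryPolicy:
    retry_on_status: frozenset  # HTTP status codes that trigger a retry
    max_retries: int            # retries allowed after the first attempt
    per_try_timeout_s: float    # timeout handed to each attempt


def call_with_retries(send, policy):
    """Call send(timeout) until it succeeds or retries run out.

    send must return an HTTP status code or raise TimeoutError.
    Returns (last_status, attempts); last_status is None after a timeout.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            status = send(policy.per_try_timeout_s)
        except TimeoutError:
            status = None  # a per-try timeout counts as a retryable failure
        retryable = status is None or status in policy.retry_on_status
        if not retryable or attempts > policy.max_retries:
            return status, attempts


# Usage: two intermittent 503s followed by a success.
responses = iter([503, 503, 200])
policy = RetryPolicy(frozenset({502, 503, 504}), max_retries=2, per_try_timeout_s=1.0)
status, attempts = call_with_retries(lambda timeout: next(responses), policy)
print(status, attempts)
```

In App Mesh the retry would be performed by Envoy rather than by application code; the loop only shows how the three parameters interact.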

Integration with AWS Lambda

Envoy configuration valid for 7 days after VirtualNode is deleted

Summary

When deleting a VirtualNode in a Mesh, the resulting Envoy configuration for that VirtualNode will remain available to an Envoy which identifies itself as that VirtualNode name (e.g. mesh/my-mesh/virtualNode/my-virtual-node). Envoys which are connected to the Envoy Management Service endpoint identified as that VirtualNode will remain connected and may receive improper configuration.

Note: Other Envoys identified as separate VirtualNodes, who may have previously relied on the deleted VirtualNode as part of a backend definition, will be updated with the correct configuration.

The period that this configuration is available after deleting a VirtualNode is approximately 7 days. We are working to reduce this time.

Steps to reproduce
Scenario 1: A connected Envoy remains connected after deletion of the VirtualNode

Create a VirtualNode
Connect an Envoy which is identified as that VirtualNode
Delete the VirtualNode

Expected behavior: The Envoy no longer receives configuration updates and is disconnected from the Envoy Management Service endpoint.
Actual behavior: The Envoy remains connected and may receive improper configuration.

Scenario 2: An Envoy connects after deletion of the VirtualNode

Create a VirtualNode
Connect an Envoy which is identified as that VirtualNode and ensure it's working properly
Disconnect the Envoy
Delete the VirtualNode
Reconnect the Envoy

Expected behavior: The Envoy is not allowed to connect to the Envoy Management Service, and receives an appropriate error code (e.g. NOT_FOUND)
Actual behavior: The Envoy remains connected and may receive improper configuration.

Work-around

Make sure your Envoys are disconnected, and the associated ECS tasks, EKS pods, or applications running on EC2 are not serving traffic, then delete the VirtualNode.

Use App Mesh for ingress routing

Hosted EDS implementation with AWS Cloud Map

Details: AWS Cloud Map will act as a cross-service registry for service endpoints and metadata. ECS already integrates with Cloud Map, and we plan to build an EKS connector to Cloud Map.

Mutual TLS with customer provided certificates

Update 2020-09-18: See #34 (comment) for API proposal
Update 2020-11-23: Available in preview. See #34 (comment)
Update 2021-02-04: Generally Available. See #34 (comment)

Wrong URL on website.

I'm not sure if this is the right place to open this issue, but the App Mesh website sends you to an invalid URL when you click on "Get started with AWS App Mesh". It sends you to:
https://us-west-2.awsc-integ.aws.amazon.com/appmesh/get-started?region=us-west-2

CloudFormation

Add App Mesh to CloudFormation so that customers can easily automate App Mesh setup.

Fix 503 responses from Envoy when adding a new route

Describe the bug

See envoyproxy/envoy#5174 for the detailed description.

Platform
ALL

Expected behavior
Should not return 503 on route change.

Additional context
envoyproxy/envoy#5174

Updates to routes have no effect on running envoys

What happened?

Updates to routes have no effect on running envoys

What you expected to happen?

Tasks must be restarted for route changes to take effect.

How to reproduce it (as minimally and precisely as possible)?

Not easy to reproduce as it happens sporadically. Seems to be an issue with the xDS protocol.

Create two virtual-nodes with a route between them (A→B).
Verify the route is working by making requests to A.
Create a new version for B (say B2).
Update the route by changing weights.
Check if A is now routing some traffic to B2 based on weights.

Observed that it does not, though you may need to try to change the routes a few times and leave A running for an extended period of time (>30mins).

Issue gets resolved if I restart A as it gets new configuration.
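
The check in the last reproduction step can be made less subjective with a small script. This is a sketch under assumptions: the `weight` field name, the tolerance value, and the way request counts are collected per VirtualNode are all hypothetical.

```python
def expected_shares(weighted_targets):
    """Convert route weights into the expected fraction of traffic per VirtualNode."""
    total = sum(target["weight"] for target in weighted_targets)
    return {target["virtualNode"]: target["weight"] / total for target in weighted_targets}


def split_matches(weighted_targets, observed_counts, tolerance=0.1):
    """Return True if the observed request counts are within tolerance of the weights."""
    total = sum(observed_counts.values())
    if total == 0:
        return False
    expected = expected_shares(weighted_targets)
    return all(
        abs(observed_counts.get(node, 0) / total - share) <= tolerance
        for node, share in expected.items()
    )


targets = [{"virtualNode": "B", "weight": 1}, {"virtualNode": "B2", "weight": 1}]
stale = split_matches(targets, {"B": 100})             # A still sends everything to B
updated = split_matches(targets, {"B": 52, "B2": 48})  # A picked up the new weights
print(stale, updated)
```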

Enable custom filters for Envoy

Customers should be able to build their own filters into Envoy and we should allow config of those filters.

Traffic Mirroring (Shadowing)

Tell us about your request
This feature request is for implementing traffic mirroring (also referred to as shadowing). Traffic mirroring allows one service to send the same traffic to more than one upstream service while still only using a single upstream service for the authoritative response. Other services which are receiving mirrored traffic can be tested for bugs and performance regressions prior to serving real traffic and becoming the authoritative upstream.

Which integration(s) is this request for?
All

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When working with microservices, developers and infrastructure engineers often need to test their new versions against real traffic before shifting all live traffic over to the new version. This increases confidence in code changes and allows teams to find bugs during periods of change.

Are you currently working around this issue?
App Mesh does not currently support traffic mirroring, so teams may work around the issue by replaying old traffic patterns from previously collected logs.

Additional context
Envoy Proxy supports traffic mirroring on routes.
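
The behavior requested here can be sketched in a few lines. This is an illustration of the mirroring idea only, not how Envoy or App Mesh implements it; `mirror_request` and the callables passed to it are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mirror_request(request, primary, shadows, executor):
    """Send request to primary and copy it to every shadow upstream.

    Only the primary's response is returned; shadow responses and shadow
    failures are discarded so they can never affect the caller.
    """
    def safe_call(shadow):
        try:
            shadow(request)
        except Exception:
            pass  # a failing shadow must not break real traffic

    for shadow in shadows:
        executor.submit(safe_call, shadow)
    return primary(request)


# Usage: the shadow just records what it saw.
seen = []
with ThreadPoolExecutor(max_workers=2) as pool:
    response = mirror_request("GET /color", lambda r: "v1:" + r, [seen.append], pool)
# Leaving the with-block waits for the mirrored calls to finish.
print(response, seen)
```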

Tag Based Resources

Implement tagging of App Mesh resources so that our customers can have a consistent management and authorization experience.

Open-source App Mesh Envoy Management Service (EMS)

The Envoy Management Service (EMS) is App Mesh's managed Aggregated Discovery Service (ADS) that customer Envoys connect to for dynamic configuration.

Emit DogStatsD-compatible metrics

App Mesh will allow customers to enable DogStatsD metrics to a local sidecar agent or remote agent. The agent must be capable of ingesting DogStatsD metrics; examples include the CloudWatch agent and the DataDog agent. The metrics will be tagged/dimensioned to allow for aggregating metrics across multiple endpoints as desired.
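
For readers unfamiliar with the wire format, here is a minimal sketch of a tagged DogStatsD datagram and how it could be sent to a local agent. The metric name and tags are made up for illustration, and the agent address assumes the usual DogStatsD default of UDP port 8125 on localhost.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None):
    """Format one metric as name:value|type|#key:value,... (tags are optional)."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram to a DogStatsD-compatible agent (address is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


line = format_dogstatsd(
    "upstream_rq_completed", 1, "c",
    {"mesh": "AppMeshExample", "virtual_node": "colorgateway-vn"},
)
print(line)
```

Because the mesh and VirtualNode are carried as tags rather than embedded in the metric name, an agent can aggregate the same metric across endpoints.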

Circuit Breaker Policy

Custom Envoy

Will it be possible to deploy a custom-built Envoy binary to App Mesh, provided of course that it is compatible with App Mesh (e.g. with SigV4 etc.)? Custom filters for Envoy require a custom Envoy.

End to end encryption of traffic with customer provided certs

App Mesh Console

[BUG] Version numbers in Envoy resource names cause connection failures, changing metric names

Describe the bug
Resources vended via the Envoy Management Service contain unique version numbers that may change when new configuration is generated. This may cause TCP connections proxied by Envoy to fail, or HTTP connections to be prematurely drained due to resource replacement. It also causes some of Envoy's generated metrics to contain the version numbers, which makes it difficult to track a given statistic through Envoy configuration changes.

Platform
All

To Reproduce
Steps to reproduce the TCP connection failure behavior:

Create a service mesh with a gateway VirtualNode and TCP and HTTP backend VirtualNodes
Set up your gateway source code to call the TCP backend VirtualNode on a new request
Make a request to your gateway VirtualNode and note that the call to the TCP backend fails

Steps to reproduce the metrics behavior:

Create any service mesh on App Mesh (e.g. the colorapp example provided in this repository)
Note that some metrics generated by Envoy are appended with unique version numbers
Make any change to a VirtualNode or Route and note that some metric names will change

Expected behavior

The gateway VirtualNode with a proxied TCP connection successfully establishes a connection.
Metrics generated by Envoy do not contain unique version numbers that change when new configuration is vended to the Envoy Proxy.

Additional Context
Here are some examples of statistics generated by Envoy which have unique version numbers in them:

$ curl http://colorgateway.appmesh-example.local:9901/stats
...
cluster.cds|egress|AppMeshExample|colorgateway-vn|colorteller-black-vn|http|9080|22459664.external.upstream_rq_completed: 1
...
http.ingress.AppMeshExample.colorgateway-vn.rds.rds|ingress|AppMeshExample|colorgateway-vn|http|9080|31467114.config_reload: 1
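
Until the names are stable, one way to follow a statistic across configuration changes is to strip the version number on the consumer side. The sketch below assumes the version is always the last pipe-delimited numeric segment directly before a dot, as in the two samples above; that is an inference from these examples, not a documented format.

```python
import re

# Matches "|<digits>" only when it is immediately followed by a dot,
# so the port segment (e.g. "|9080|") is left untouched.
_VERSION_SEGMENT = re.compile(r"\|\d+(?=\.)")


def stable_stat_name(stat_name):
    """Remove the version segment from an Envoy stat name."""
    return _VERSION_SEGMENT.sub("", stat_name)


raw = "cluster.cds|egress|AppMeshExample|colorgateway-vn|colorteller-black-vn|http|9080|22459664.external.upstream_rq_completed"
print(stable_stat_name(raw))
```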

Cookie based routing

VPC Endpoint/Private Link for App Mesh Envoy xDS API

It would be great not to have to set up an IGW, NATGW or HTTP proxy for communication between Envoy sidecars and the App Mesh xDS API (e.g. appmesh-envoy-management.us-east-1.amazonaws.com, see here).

Updating mesh egress filter does not update running Envoys

Summary

When updating the value of the new mesh egress filter, any Envoy which is currently connected to the App Mesh ADS endpoint will not immediately receive the updated setting. The Envoy will receive the updated configuration after a maximum period of 30 minutes, or after the Envoy disconnects and reconnects to the ADS endpoint.

Steps to reproduce

Set the mesh egress filter type to ALLOW_ALL to allow all external egress traffic from VirtualNodes.
Connect an Envoy identified by a VirtualNode for that Mesh.
Update the mesh egress filter type to DROP_ALL to disallow external egress traffic from VirtualNodes.

Expected behavior: The Envoy should receive the updated setting within a matter of seconds.
Actual behavior: The Envoy does not receive the updated setting until it disconnects and reconnects to the App Mesh ADS endpoint (which happens automatically every 30 minutes).

Workaround: You can force the Envoy to update its configuration by restarting it (via the ECS task, EKS pod, or similar). Note that this will need to be handled carefully for Envoys serving production traffic (i.e. issue a rolling restart).

A fix has been proposed and will be rolled out over the coming days to address this issue.

Implement Fault Injection

Service call auditing

Send our customers' API call data to CloudTrail so customers can reliably audit their account activity.

Commit SigV4 auth addition to Envoy upstream

Clarify usage of ServiceNames

Per discussions in #49 and #71, the usage of ServiceNames in the mesh is not abundantly clear in the current APIs and documentation.

The current use of ServiceNames as described by @ivitjuk:

Linking together client and server VirtualNodes. This is done by client VirtualNodes including the ServiceNames they want to speak to in the spec.backends section of their config. Please note that client VirtualNodes do not reference server VirtualNodes directly by their name. They do it via ServiceNames. The way we link ServiceName entries from the client's backends section to the server VirtualNodes is by first finding a VirtualRouter that supports the specific ServiceName. We do that by looking at the spec.serviceNames of a VirtualRouter. Once we find the VirtualRouter we can also find Routes associated with it by looking at the virtualRouterName field of a Route. From the Route we can find server VirtualNodes in the spec.httpRoute.action.weightedTargets.virtualNode field.
Virtual host and route matching in the Envoy config. Once we have linked a client VirtualNode with its server VirtualNodes, we can configure Envoy on the client side to be able to speak with the servers. This is done by injecting clusters in the client's Envoy config and adjusting routes to point to those clusters. At the Envoy route_config level, we use ServiceName in the virtual_hosts.domains field, so that routes get applied to the correct ServiceNames.
Endpoint discovery. This is how we find out the endpoints of a server VirtualNode. We use the server VirtualNode's ServiceName to perform DNS discovery. In the future there will be more options aside from DNS, but currently this is the only way we perform service discovery.

This task is for tracking clarification work against the usage of ServiceNames in the APIs and documentation.
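
The linking described in the first item above can be sketched as a lookup over plain dictionaries. The field paths `spec.serviceNames`, `virtualRouterName`, and `spec.httpRoute.action.weightedTargets.virtualNode` come from the description; the function name, the `virtualRouterName` key on the router object, and the sample data are assumptions for illustration.

```python
def resolve_backend(service_name, virtual_routers, routes):
    """Return the server VirtualNode names a client reaches through service_name."""
    nodes = []
    for router in virtual_routers:
        # Step 1: find a VirtualRouter that supports the ServiceName.
        if service_name not in router["spec"]["serviceNames"]:
            continue
        for route in routes:
            # Step 2: find Routes associated with that VirtualRouter.
            if route["virtualRouterName"] != router["virtualRouterName"]:
                continue
            # Step 3: read the server VirtualNodes from the weighted targets.
            targets = route["spec"]["httpRoute"]["action"]["weightedTargets"]
            nodes.extend(target["virtualNode"] for target in targets)
    return nodes


routers = [{"virtualRouterName": "b-router",
            "spec": {"serviceNames": ["service-b.foo-mesh.local"]}}]
routes = [{"virtualRouterName": "b-router",
           "spec": {"httpRoute": {"action": {"weightedTargets": [
               {"virtualNode": "service-b-v1", "weight": 9},
               {"virtualNode": "service-b-v2", "weight": 1}]}}}}]
print(resolve_backend("service-b.foo-mesh.local", routers, routes))
```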

Resource-based authorization in IAM

App Mesh will enable authorization at the resource level, including resource prefixes. This will allow customers to create IAM policies and roles for specific resources or groups of resources in App Mesh. These roles can be assumed by multiple accounts, in order to enable multiple accounts to operate in the same mesh, with well-defined resource-level authorizations for each role.

App Ports should not be required

Describe the bug
Currently aws-appmesh-proxy-route-manager assumes that it is setting up iptables rules for a service and requires the environment variable APPMESH_APP_PORTS to be set. However, there is a use-case for using envoy as a side-car to an application that is client only and has no open ingress ports. In this case, the user would have to specify a fake port in order to launch the pod.

Platform
EKS, ECS

To Reproduce
Steps to reproduce the behavior:

Run aws-appmesh-proxy-route-manager without APPMESH_APP_PORTS set.

Expected behavior
APPMESH_APP_PORTS should be able to be unset, perhaps with a warning message printed.
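
The expected behavior could look roughly like the sketch below. It is written in Python for illustration only and is not the actual aws-appmesh-proxy-route-manager script; the function name and the assumption that the variable holds a comma-separated list of ports are hypothetical.

```python
import sys


def parse_app_ports(environ):
    """Return the ingress ports from APPMESH_APP_PORTS, or [] if it is unset.

    Assumes a comma-separated list of port numbers. An unset or empty value
    produces a warning instead of an error, so client-only applications
    do not need to invent a fake port.
    """
    raw = environ.get("APPMESH_APP_PORTS", "").strip()
    if not raw:
        print("warning: APPMESH_APP_PORTS is not set; skipping ingress rules",
              file=sys.stderr)
        return []
    return [int(port) for port in raw.split(",") if port.strip()]


print(parse_app_ports({"APPMESH_APP_PORTS": "8080,9080"}))
print(parse_app_ports({}))
```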

Implement Health checks

GRPC routing

AWS Cloud Map selectors

Enable a virtual node definition to include additional attributes beyond the service name when configuring the service registration details. This will enable routing to different ECS/Fargate task sets or k8s deployments under the same service name.

Setup iptables via CNI plugin

Create a CNI plugin that can be used to route network traffic instead of using the containerized proxymanager script, which requires extra privileges.

Support AWS X-Ray Tracing

App Mesh will allow customers to enable X-Ray tracing on a per-mesh basis. Once enabled, customers can view X-Ray segment data and configure sampling through the AWS X-Ray Console, API, or CLI. If X-Ray Tracing is enabled, App Mesh will emit X-Ray tracing segments to the X-Ray agent, running either as a sidecar in the task/pod or running elsewhere in the customer's account.

Access logging

Allow customers to enable HTTP/TCP access logging for a Virtual Node. Access logs will be written to a deterministic location in the Envoy container. This location can be shared with a log ingestion sidecar such as fluentd, or (for ECS and EKS) shared with the host and ingested by an agent running on the host (e.g. CloudWatch agent).

TCP routing

Egress Configuration for Sidecars

I have seen #76 and related issues discussing egress configuration, but it's still unclear to me how to properly set up egress to something completely outside of my cluster on the internet. My scenario is running a sidecar container, next to my application container and the Envoy/App Mesh container, that collects metrics and pushes them to an external service. Obviously this doesn't work by default because there's no backend defined to let that traffic leave the cluster. Is this a scenario that is covered in the docs, or in an issue that I missed?
