aws-app-mesh-roadmap's People

Contributors

abby-fuller, bcelenza, bigdefect, cgchinmay, dastbe, hyandell, jayntiraj, karim-z, rajal-amzn, rishijatia, sshver, y0username

aws-app-mesh-roadmap's Issues

Simplify external service egress traffic setup

As discussed in #74, the current way to model an external service that a service within the mesh can route to is to model the external service as a VirtualNode. For example, suppose you have two services named Service-A and Service-B, where Service-B is an external service (e.g. GitLab) hosted at the DNS name gitlab.my-intranet.com. For the VirtualNode representing Service-A to be able to egress traffic to Service-B, you would model your mesh configuration as:

Service-A:

{
    "meshName": "foo",
    "virtualNodeName": "service-a",
    "spec": {
        "listeners": [
            {
                "portMapping": {
                    "port": 8080,
                    "protocol": "http"
                }
            }
        ],
        "serviceDiscovery": {
            "dns": {
                "serviceName": "service-a.foo-mesh.local"
            }
        },
        "backends": [
            "gitlab.my-intranet.com"
        ]
    }
}

Service-B:

{
    "meshName": "foo",
    "virtualNodeName": "service-b",
    "spec": {
        "listeners": [
            {
                "portMapping": {
                    "port": 80,
                    "protocol": "tcp"
                },
                "healthCheck": {
                    "protocol": "tcp",
                    "healthyThreshold": 2,
                    "unhealthyThreshold": 2,
                    "timeoutMillis": 2000,
                    "intervalMillis": 5000
                }
            }
        ],
        "serviceDiscovery": {
            "dns": {
                "serviceName": "gitlab.my-intranet.com"
            }
        }
    }
}

The VirtualNode model contains many specifications which would not normally apply to an external service not within the control of the mesh (such as backends), while others still do (such as health checks).

This issue is to track the investigation of a general simplification of modeling external entities within the mesh.
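
As a rough illustration of what a simplified model might need to capture, the Python sketch below builds a Service-B style VirtualNode document from just a hostname and a port. The helper name `external_service_node` and its parameters are hypothetical and not part of any App Mesh API; the field names and health check values simply mirror the JSON shown above.

```python
import json


def external_service_node(mesh_name, node_name, hostname, port, protocol="tcp"):
    """Build a VirtualNode document for an external service (hypothetical helper).

    Only the parts that still apply to an external service are emitted:
    one listener with a health check, and DNS service discovery.
    No backends are included.
    """
    return {
        "meshName": mesh_name,
        "virtualNodeName": node_name,
        "spec": {
            "listeners": [
                {
                    "portMapping": {"port": port, "protocol": protocol},
                    "healthCheck": {
                        "protocol": protocol,
                        "healthyThreshold": 2,
                        "unhealthyThreshold": 2,
                        "timeoutMillis": 2000,
                        "intervalMillis": 5000,
                    },
                }
            ],
            "serviceDiscovery": {"dns": {"serviceName": hostname}},
        },
    }


if __name__ == "__main__":
    node = external_service_node("foo", "service-b", "gitlab.my-intranet.com", 80)
    print(json.dumps(node, indent=4))
```

The point of the sketch is that two inputs (hostname and port) are enough to derive everything else in the Service-B document, which is the kind of reduction this issue is asking for.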

Reusing a VirtualNode name results in stale Envoy configuration

Summary

When a VirtualNode is deleted and re-created with the same name (e.g. my-virtual-node under a Mesh named my-mesh), an Envoy identified by that VirtualNode name (e.g. mesh/my-mesh/virtualNode/my-virtual-node) that connects shortly after the VirtualNode is re-created may receive the previous VirtualNode's configuration. This commonly occurs if the Envoy connects less than 10 seconds after the VirtualNode was re-created.

This issue is closely related to #49.

We are working on a solution for this bug that will occasionally check for this state and request that the Envoy reconnect to receive updated configuration.

Steps to reproduce

  1. Create a VirtualNode
  2. Connect an Envoy identified as that VirtualNode name (e.g. mesh/my-mesh/virtualNode/my-virtual-node) and ensure it's working properly
  3. Disconnect the Envoy
  4. Delete the VirtualNode
  5. Create a new VirtualNode using the same name and connect an Envoy within a few seconds of creation (or at the same time)

Expected behavior: The Envoy receives the configuration for the new VirtualNode
Actual behavior: The Envoy receives the configuration for the previous VirtualNode by the same name

Work-arounds

  1. When creating the new VirtualNode, create it using a different name which has never been used.
  2. When identifying the Envoy, set APPMESH_VIRTUAL_NODE_NAME to the VirtualNode's UID as returned by the CreateVirtualNode API, instead of the ARN or truncated resource name.
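
The first work-around can be sketched in Python as follows. This is only an illustration under assumptions: `fresh_virtual_node_name` and `envoy_node_id` are hypothetical helpers, and the set of previously used names is assumed to be tracked by the caller. The identifier format follows the mesh/my-mesh/virtualNode/my-virtual-node example above.

```python
import uuid


def fresh_virtual_node_name(base_name, used_names):
    """Return a name derived from base_name that is not in used_names.

    Hypothetical helper: a short random suffix makes it unlikely that the
    name of a previously deleted VirtualNode is reused.
    """
    while True:
        candidate = f"{base_name}-{uuid.uuid4().hex[:8]}"
        if candidate not in used_names:
            return candidate


def envoy_node_id(mesh_name, virtual_node_name):
    """Format the resource name an Envoy uses to identify itself."""
    return f"mesh/{mesh_name}/virtualNode/{virtual_node_name}"


if __name__ == "__main__":
    name = fresh_virtual_node_name("my-virtual-node", {"my-virtual-node"})
    print(envoy_node_id("my-mesh", name))
```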

Region expansion

Expand to all AWS regions.

Current region list

  • US East (Ohio) - us-east-2
  • US East (N. Virginia) - us-east-1
  • US West (N. California) - us-west-1
  • US West (Oregon) - us-west-2
  • Asia Pacific (Hong Kong) - ap-east-1
  • Asia Pacific (Mumbai) - ap-south-1
  • Asia Pacific (Osaka) - ap-northeast-3
  • Asia Pacific (Seoul) - ap-northeast-2
  • Asia Pacific (Singapore) - ap-southeast-1
  • Asia Pacific (Sydney) - ap-southeast-2
  • Asia Pacific (Jakarta) - ap-southeast-3
  • Asia Pacific (Tokyo) - ap-northeast-1
  • Canada (Central) - ca-central-1
  • South America (São Paulo) - sa-east-1
  • China (Beijing) - cn-north-1
  • China (Ningxia) - cn-northwest-1
  • EU (Frankfurt) - eu-central-1
  • EU (Stockholm) - eu-north-1
  • EU (Ireland) - eu-west-1
  • EU (London) - eu-west-2
  • EU (Paris) - eu-west-3
  • EU (Milan) - eu-south-1
  • Middle East (Bahrain) - me-south-1
  • Israel (Tel Aviv) - il-central-1
  • Africa (Cape Town)
  • AWS GovCloud (US-East) - us-gov-east-1
  • AWS GovCloud (US) - us-gov-west-1

Integration with EKS

The integration will happen primarily with a controller running in the customer's cluster on the master instances, managed by EKS. The controller will watch the Kubernetes API of the customer's cluster and react to certain objects being created or modified. It will create the necessary components in AppMesh and CloudMap.

Initial support will be for a single AppMesh mesh and a single CloudMap namespace per cluster (though many clusters can share a mesh/namespace). Customers can provide an existing mesh/namespace as well.

An additional component that will be used on the customer's worker nodes is the App Mesh CNI (https://github.com/awslabs/aws-app-mesh-examples/issues/15). Its responsibility is to enter the network namespace of a new pod and set up iptables rules to route incoming and outgoing traffic through Envoy. This takes the place of an init container, and is preferred because it avoids running privileged containers altogether.

Optionally, a mutating admission webhook could be employed to inject Envoy as a sidecar container into pods that are launched in the cluster.

Accessing host instance

Hello, I was wondering what would be the correct way to access the host on which a service is running. It would be cumbersome to model each host as a TCP virtual node and virtual service and add them all as backends to each service to account for dynamic task placement.

My initial thought is to add them to the APPMESH_EGRESS_IGNORED_IP since that's how the metadata services are reached, but I wanted to know if there was a better approach.

Thanks!

Retry Policy

A Retry Policy in App Mesh enables clients to protect themselves from intermittent network failures, or intermittent server-side failures. A Retry Policy is an immutable entity in App Mesh that allows users to specify the conditions under which a retry is attempted, including HTTP status codes that will trigger a retry. A Retry Policy also has parameters specifying how many times to retry, and the timeout to use per retry.

Once a Retry Policy is created, it can be attached to one or more Virtual Nodes as part of the backends. Each backend in a Virtual Node can have its own retry policy.
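
To make the parameters concrete, here is a minimal Python sketch of the retry semantics described above. It is not the App Mesh API or Envoy's implementation: the `RetryPolicy` fields and the `call_with_retries` function are hypothetical names chosen for illustration, and the request is modeled as a callable that takes a timeout and returns an HTTP status code.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


@dataclass(frozen=True)  # frozen mirrors the "immutable entity" wording
class RetryPolicy:
    retry_on_status: FrozenSet[int]  # HTTP status codes that trigger a retry
    max_retries: int                 # retries attempted after the first try
    per_retry_timeout_s: float       # timeout handed to every attempt


def call_with_retries(send: Callable[[float], int], policy: RetryPolicy) -> Tuple[int, int]:
    """Call send(timeout) until it returns a non-retryable status.

    Returns (final_status, attempts_made). A TimeoutError raised by send is
    treated as retryable; if the last allowed attempt times out, it is re-raised.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            status = send(policy.per_retry_timeout_s)
        except TimeoutError:
            if attempts > policy.max_retries:
                raise
            continue
        if status not in policy.retry_on_status or attempts > policy.max_retries:
            return status, attempts


if __name__ == "__main__":
    policy = RetryPolicy(frozenset({503}), max_retries=2, per_retry_timeout_s=1.0)
    responses = iter([503, 503, 200])
    print(call_with_retries(lambda timeout: next(responses), policy))
```

Because the policy object is immutable, the same instance can be shared by several backends without one caller changing the behavior seen by another.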

Envoy configuration valid for 7 days after VirtualNode is deleted

Summary

When deleting a VirtualNode in a Mesh, the resulting Envoy configuration for that VirtualNode will remain available to an Envoy which identifies itself as that VirtualNode name (e.g. mesh/my-mesh/virtualNode/my-virtual-node). Envoys which are connected to the Envoy Management Service endpoint identified as that VirtualNode will remain connected and may receive improper configuration.

Note: Other Envoys identified as separate VirtualNodes, who may have previously relied on the deleted VirtualNode as part of a backend definition, will be updated with the correct configuration.

The period that this configuration is available after deleting a VirtualNode is approximately 7 days. We are working to reduce this time.

Steps to reproduce
Scenario 1: A connected Envoy remains connected after deletion of the VirtualNode

  1. Create a VirtualNode
  2. Connect an Envoy which is identified as that VirtualNode
  3. Delete the VirtualNode

Expected behavior: The Envoy no longer receives configuration updates and is disconnected from the Envoy Management Service endpoint.
Actual behavior: The Envoy remains connected and may receive improper configuration.

Scenario 2: An Envoy connects after deletion of the VirtualNode

  1. Create a VirtualNode
  2. Connect an Envoy which is identified as that VirtualNode and ensure it's working properly
  3. Disconnect the Envoy
  4. Delete the VirtualNode
  5. Reconnect the Envoy

Expected behavior: The Envoy is not allowed to connect to the Envoy Management Service, and receives an appropriate error code (e.g. NOT_FOUND)
Actual behavior: The Envoy remains connected and may receive improper configuration.

Work-around

Make sure your Envoys are disconnected, and the associated ECS tasks, EKS pods, or applications running on EC2 are not serving traffic, then delete the VirtualNode.

Hosted EDS implementation with AWS Cloud Map

Details: AWS Cloud Map will act as a cross-service registry for service endpoints and metadata. ECS already integrates with Cloud Map, and we plan to build an EKS connector to Cloud Map.

CloudFormation

Add App Mesh to CloudFormation so that customers can easily automate App Mesh setup.

Updates to routes have no effect on running envoys

What happened?

Updates to routes have no effect on running envoys

What you expected to happen?

Route changes should take effect on running Envoys without a restart. Currently, tasks must be restarted for route changes to take effect.

How to reproduce it (as minimally and precisely as possible)?

It is not easy to reproduce because it happens sporadically. It seems to be an issue with the xDS protocol.

  1. Create two virtual-nodes with a route between them (A→B).
  2. Verify the route is working by making requests to A.
  3. Create a new version for B (say B2).
  4. Update the route by changing weights.
  5. Check if A is now routing some traffic to B2 based on weights.

I observed that it does not, though you may need to change the routes a few times and leave A running for an extended period of time (more than 30 minutes).

The issue is resolved if I restart A, because it then receives the new configuration.
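
For step 5, it helps to know what split to expect for a given set of weights. The Python sketch below simulates weighted target selection and reports the observed share per target. It is an illustration only and not how Envoy selects clusters internally; the `virtualNode` key follows the weightedTargets field mentioned elsewhere in these issues, while the `weight` key and both function names are assumptions.

```python
import random
from collections import Counter


def pick_target(weighted_targets, rng=random):
    """Pick a virtualNode name with probability proportional to its weight."""
    names = [t["virtualNode"] for t in weighted_targets]
    weights = [t["weight"] for t in weighted_targets]
    return rng.choices(names, weights=weights, k=1)[0]


def observed_split(weighted_targets, requests=10_000, seed=0):
    """Simulate `requests` selections and return the share each target received."""
    rng = random.Random(seed)
    counts = Counter(pick_target(weighted_targets, rng) for _ in range(requests))
    return {name: counts[name] / requests for name in counts}


if __name__ == "__main__":
    targets = [
        {"virtualNode": "B", "weight": 90},
        {"virtualNode": "B2", "weight": 10},
    ]
    print(observed_split(targets))
```

If A keeps sending all traffic to B long after the weights were changed, while a simulation like this predicts a visible share for B2, that points to stale configuration rather than sampling noise.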

Traffic Mirroring (Shadowing)

Tell us about your request
This feature request is for implementing traffic mirroring (also referred to as shadowing). Traffic mirroring allows one service to send the same traffic to more than one upstream service while still only using a single upstream service for the authoritative response. Other services which are receiving mirrored traffic can be tested for bugs and performance regressions prior to serving real traffic and becoming the authoritative upstream.

Which integration(s) is this request for?
All

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When working with microservices, developers and infrastructure engineers often need to test their new versions against real traffic before shifting all live traffic over to the new version. This increases confidence in code changes and allows teams to find bugs during periods of change.

Are you currently working around this issue?
App Mesh does not currently support traffic mirroring, so teams may work around the issue by replaying traffic patterns from previously collected logs.

Additional context
Envoy Proxy supports traffic mirroring on routes.
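
The behavior requested here can be sketched in a few lines of Python. This is a simplified, synchronous illustration, not Envoy's implementation (a real proxy would send mirrored requests without waiting for them). Upstreams are modeled as plain callables, and `send_with_mirroring` is a hypothetical name.

```python
from typing import Callable, Iterable

Upstream = Callable[[str], str]


def send_with_mirroring(request: str, primary: Upstream, mirrors: Iterable[Upstream]) -> str:
    """Send request to the primary and to each mirror; return only the primary's response."""
    for mirror in mirrors:
        try:
            mirror(request)  # the mirrored response is discarded
        except Exception:
            pass  # a failing mirror must never affect the caller
    return primary(request)


if __name__ == "__main__":
    received_by_mirror = []
    response = send_with_mirroring(
        "GET /colors",
        primary=lambda req: "v1 handled " + req,
        mirrors=[received_by_mirror.append],
    )
    print(response, received_by_mirror)
```

The two properties that matter are visible in the sketch: only the primary upstream produces the authoritative response, and errors in a mirrored upstream are swallowed.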

Tag Based Resources

Implement tagging of App Mesh resources so that our customers can have a consistent management and authorization experience.

Emit DogStatsD-compatible metrics

App Mesh will allow customers to enable DogStatsD metrics to a local sidecar agent or remote agent. The agent must be capable of ingesting DogStatsD metrics; examples include the CloudWatch agent and the DataDog agent. The metrics will be tagged/dimensioned to allow for aggregating metrics across multiple endpoints as desired.
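
For readers unfamiliar with the wire format, the Python sketch below formats a single DogStatsD datagram with tags and sends it over UDP. The metric name and tag keys are made up for illustration, and the helper names are hypothetical; port 8125 is the conventional DogStatsD port, but the address the agent actually listens on depends on your deployment.

```python
import socket


def dogstatsd_line(name, value, metric_type="c", tags=None):
    """Format one metric in the DogStatsD datagram format.

    metric_type is e.g. "c" (counter), "g" (gauge), "ms" (timer) or "h" (histogram).
    tags is an optional dict rendered as "#key:value,key:value".
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        rendered = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{rendered}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram to a DogStatsD-compatible agent (address is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    print(dogstatsd_line(
        "envoy.upstream_rq_completed", 1, "c",
        {"mesh": "AppMeshExample", "virtual_node": "colorgateway-vn"},
    ))
```

Carrying the mesh and VirtualNode as tags rather than as part of the metric name is what allows the same metric to be aggregated across many endpoints.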

Custom Envoy

Will it be possible to deploy a custom-built Envoy binary to App Mesh, provided it is compatible with App Mesh (e.g. with SigV4 support)? Custom filters for Envoy require a custom Envoy build.

[BUG] Version numbers in Envoy resource names cause connection failures, changing metric names

Describe the bug
Resources vended via the Envoy Management Service contain unique version numbers that may change when new configuration is generated. This may cause TCP connections proxied by Envoy to fail, or HTTP connections to be prematurely drained due to resource replacement. It also causes some of Envoy's generated metrics to contain the version numbers, which makes it difficult to track a given statistic through Envoy configuration changes.

Platform
All

To Reproduce
Steps to reproduce the TCP connection failure behavior:

  1. Create a service mesh with a gateway VirtualNode and TCP and HTTP backend VirtualNodes
  2. Set up your gateway source code to call the TCP backend VirtualNode on a new request
  3. Make a request to your gateway VirtualNode and note that the call to the TCP backend fails

Steps to reproduce the metrics behavior:

  1. Create any service mesh on App Mesh (e.g. the colorapp example provided in this repository)
  2. Note that some metrics generated by Envoy are appended with unique version numbers
  3. Make any change to a VirtualNode or Route and note that some metric names will change

Expected behavior

  1. The gateway VirtualNode with a proxied TCP connection successfully establishes a connection.
  2. Metrics generated by Envoy do not contain unique version numbers that change when new configuration is vended to the Envoy Proxy.

Additional Context
Here are some examples of statistics generated by Envoy which have unique version numbers in them:

$ curl http://colorgateway.appmesh-example.local:9901/stats
...
cluster.cds|egress|AppMeshExample|colorgateway-vn|colorteller-black-vn|http|9080|22459664.external.upstream_rq_completed: 1
...
http.ingress.AppMeshExample.colorgateway-vn.rds.rds|ingress|AppMeshExample|colorgateway-vn|http|9080|31467114.config_reload: 1
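
Until the names are stable, one possible stopgap on the metrics side is to strip the version field before storing or graphing a statistic. The Python sketch below does this with a regular expression. It assumes, based only on the two samples above, that the version is the last pipe-delimited numeric field of the resource name, directly after the port and before the next dot; `stable_stat_name` is a hypothetical helper.

```python
import re

# Assumption: the version is the last pipe-delimited, all-digit field of a
# resource name, directly after the port and before the next "." separator.
_VERSION_FIELD = re.compile(r"(\|\d+)\|\d+(?=\.)")


def stable_stat_name(stat_name):
    """Return stat_name with the trailing version field removed."""
    return _VERSION_FIELD.sub(r"\1", stat_name)


if __name__ == "__main__":
    sample = (
        "cluster.cds|egress|AppMeshExample|colorgateway-vn|"
        "colorteller-black-vn|http|9080|22459664.external.upstream_rq_completed"
    )
    print(stable_stat_name(sample))
```

Names that do not match the assumed pattern are returned unchanged, so the function is safe to apply to every statistic.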

Updating mesh egress filter does not update running Envoys

Summary

When updating the value of the new mesh egress filter, any Envoy which is currently connected to the App Mesh ADS endpoint will not immediately receive the updated setting. The Envoy will receive the updated configuration after a maximum period of 30 minutes, or after the Envoy disconnects and reconnects to the ADS endpoint.

Steps to reproduce

  1. Set the mesh egress filter type to ALLOW_ALL to allow all external egress traffic from VirtualNodes.
  2. Connect an Envoy identified by a VirtualNode for that Mesh.
  3. Update the mesh egress filter type to DROP_ALL to disallow external egress traffic from VirtualNodes.

Expected behavior: The Envoy should receive the updated setting within a matter of seconds.
Actual behavior: The Envoy does not receive the updated setting until it disconnects and reconnects to the App Mesh ADS endpoint (which happens automatically every 30 minutes).

Workaround: You can force the Envoy to update its configuration by restarting it (via the ECS task, EKS pod, or similar). Note that this will need to be handled carefully for Envoys serving production traffic (i.e. issue a rolling restart).

A fix has been proposed and will be rolled out over the coming days to address this issue.

Service call auditing

Send our customers' API call data to CloudTrail so customers can reliably audit their account activity.

Clarify usage of ServiceNames

Per discussions in #49 and #71, the usage of ServiceNames in the mesh is not clear in the current APIs and documentation.

The current use of ServiceNames as described by @ivitjuk:

  1. Linking together client and server VirtualNodes. This is done by client VirtualNodes including the ServiceNames they want to speak to in the spec.backends section of their config. Please note that client VirtualNodes do not reference server VirtualNodes directly by their name. They do it via ServiceNames. The way we link ServiceName entries from the client's backends section to the server VirtualNodes is by first finding a VirtualRouter that supports the specific ServiceName. We do that by looking at the spec.serviceNames of a VirtualRouter. Once we find the VirtualRouter we can also find Routes associated with it by looking at the virtualRouterName field of a Route. From the Route we can find server VirtualNodes in the spec.httpRoute.action.weightedTargets.virtualNode field.
  2. Virtual host and route matching in the Envoy config. Once we have linked a client VirtualNode with its server VirtualNodes, we can configure Envoy on the client side to be able to speak with the servers. This is done by injecting clusters in the client's Envoy config and adjusting routes to point to those clusters. At the Envoy route_config level, we use ServiceName in the virtual_hosts.domains field, so that routes get applied to the correct ServiceNames.
  3. Endpoint discovery. This is how we find out the endpoints of a server VirtualNode. We use the server VirtualNode's ServiceName to perform DNS discovery. In the future there will be more options besides DNS, but currently this is the only way we perform service discovery.

This task is for tracking clarification work against the usage of ServiceNames in the APIs and documentation.
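
The linking described in item 1 can be sketched in Python as a lookup over plain dictionaries. This is a reading aid under assumptions, not App Mesh's actual implementation: the field paths (spec.backends, spec.serviceNames, virtualRouterName, spec.httpRoute.action.weightedTargets.virtualNode) are taken from the description above, while `resolve_backends`, the router's own `virtualRouterName` key, and the sample data are hypothetical.

```python
def resolve_backends(client_node, virtual_routers, routes):
    """Map each backend ServiceName of client_node to its server VirtualNode names."""
    resolved = {}
    for service_name in client_node["spec"]["backends"]:
        targets = []
        for router in virtual_routers:
            # Step 1: find a VirtualRouter that serves this ServiceName.
            if service_name not in router["spec"]["serviceNames"]:
                continue
            for route in routes:
                # Step 2: find the Routes attached to that VirtualRouter.
                if route["virtualRouterName"] != router["virtualRouterName"]:
                    continue
                # Step 3: collect the server VirtualNodes the Route targets.
                for target in route["spec"]["httpRoute"]["action"]["weightedTargets"]:
                    targets.append(target["virtualNode"])
        resolved[service_name] = targets
    return resolved


if __name__ == "__main__":
    client = {"spec": {"backends": ["service-b.foo-mesh.local"]}}
    routers = [{"virtualRouterName": "b-router",
                "spec": {"serviceNames": ["service-b.foo-mesh.local"]}}]
    routes = [{"virtualRouterName": "b-router",
               "spec": {"httpRoute": {"action": {"weightedTargets": [
                   {"virtualNode": "service-b-v1"},
                   {"virtualNode": "service-b-v2"}]}}}}]
    print(resolve_backends(client, routers, routes))
```

The sketch makes the indirection explicit: the client never names a server VirtualNode, and a backend with no matching VirtualRouter resolves to an empty list.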

Resource-based authorization in IAM

App Mesh will enable authorization at the resource level, including resource prefixes. This will allow customers to create IAM policies and roles for specific resources or groups of resources in App Mesh. These roles can be assumed by multiple accounts, enabling multiple accounts to operate in the same mesh with well-defined resource-level authorizations for each role.
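
As an illustration of what a prefix-scoped policy might look like, the Python sketch below builds an IAM policy document that allows two actions on every VirtualNode whose name starts with a given prefix. The ARN layout, the chosen actions, the account ID, and the helper name `virtual_node_prefix_policy` are assumptions made for this example; check the App Mesh IAM documentation for the exact resource formats and action names before using anything like it.

```python
import json


def virtual_node_prefix_policy(region, account_id, mesh_name, name_prefix):
    """Build an IAM policy scoped to VirtualNodes sharing a name prefix (illustrative)."""
    # Assumed ARN layout: arn:aws:appmesh:<region>:<account>:mesh/<mesh>/virtualNode/<name>
    resource = (
        f"arn:aws:appmesh:{region}:{account_id}:"
        f"mesh/{mesh_name}/virtualNode/{name_prefix}*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["appmesh:DescribeVirtualNode", "appmesh:UpdateVirtualNode"],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    policy = virtual_node_prefix_policy("us-west-2", "111122223333", "my-mesh", "team-a-")
    print(json.dumps(policy, indent=4))
```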

App Ports should not be required

Describe the bug
Currently aws-appmesh-proxy-route-manager assumes that it is setting up iptables rules for a service and requires the environment variable APPMESH_APP_PORTS to be set. However, there is a use case for running Envoy as a sidecar to an application that is client-only and has no open ingress ports. In this case, the user has to specify a fake port in order to launch the pod.

Platform
EKS, ECS

To Reproduce
Steps to reproduce the behavior:

  1. Run aws-appmesh-proxy-route-manager without APPMESH_APP_PORTS set.

Expected behavior
APPMESH_APP_PORTS should be able to be unset, perhaps with a warning message printed.
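
The expected behavior could look roughly like the following Python sketch. It is not the route manager's actual code; `parse_app_ports` is a hypothetical function, and a comma-separated port list is assumed as the variable's format.

```python
import os
import sys


def parse_app_ports(environ=os.environ):
    """Return the ports listed in APPMESH_APP_PORTS, or [] with a warning if unset."""
    raw = environ.get("APPMESH_APP_PORTS", "").strip()
    if not raw:
        # Client-only applications have no ingress ports to redirect.
        print(
            "warning: APPMESH_APP_PORTS is not set; no ingress ports will be redirected",
            file=sys.stderr,
        )
        return []
    ports = []
    for item in raw.split(","):
        port = int(item.strip())
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        ports.append(port)
    return ports


if __name__ == "__main__":
    print(parse_app_ports({"APPMESH_APP_PORTS": "8080,9080"}))
    print(parse_app_ports({}))
```

With this shape, an empty result simply means that no ingress redirection rules are created, while egress rules can still be installed.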

AWS Cloud Map selectors

Enable a virtual node definition to include additional attributes beyond the service name when configuring the service registration details. This will enable routing to different ECS/Fargate task sets or k8s deployments under the same service name.
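
Conceptually, a selector narrows the instances registered under one service name to those whose attributes match. The Python sketch below shows that filtering step on plain dictionaries. The data shape, the attribute names, and `select_instances` are hypothetical and are not the Cloud Map or App Mesh API.

```python
def select_instances(instances, selector):
    """Return the instances whose attributes contain every key/value pair in selector."""
    return [
        instance
        for instance in instances
        if all(
            instance.get("attributes", {}).get(key) == value
            for key, value in selector.items()
        )
    ]


if __name__ == "__main__":
    registered = [
        {"id": "task-1", "attributes": {"deployment": "blue", "stack": "prod"}},
        {"id": "task-2", "attributes": {"deployment": "green", "stack": "prod"}},
    ]
    print(select_instances(registered, {"deployment": "green"}))
```

An empty selector matches every instance, which corresponds to today's behavior of routing by service name alone.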

Setup iptables via CNI plugin

Create a CNI plugin that can be used to route network traffic, instead of using a containerized proxy manager script that requires extra privileges.

Support AWS X-Ray Tracing

App Mesh will allow customers to enable X-Ray tracing on a per-mesh basis. Once enabled, customers can view X-Ray segment data and configure sampling through the AWS X-Ray Console, API, or CLI. If X-Ray Tracing is enabled, App Mesh will emit X-Ray tracing segments to the X-Ray agent, running either as a sidecar in the task/pod or running elsewhere in the customer's account.

Access logging

Allow customers to enable HTTP/TCP access logging for a Virtual Node. Access logs will be written to a deterministic location in the Envoy container. This location can be shared with a log ingestion sidecar such as fluentd, or (for ECS and EKS) shared with the host and ingested by an agent running on the host (e.g. CloudWatch agent).

Egress Configuration for Sidecars

I have seen #76 and related issues discussing egress configuration, but it's still unclear to me how to properly set up egress to something completely outside of my cluster on the internet. My scenario is running a sidecar container, next to my application container and the Envoy/App Mesh container, that collects metrics and pushes them to an external service. This doesn't work by default because there's no backend defined to let that traffic leave the cluster. Is this a scenario that is covered in the docs, or an issue that I missed?
