expediagroup / styx Goto Github PK

Programmable, asynchronous, event-based reverse proxy for JVM.

License: Apache License 2.0

Makefile 0.12% Java 70.48% Groovy 0.01% Scala 8.95% JavaScript 0.60% HTML 0.22% CSS 0.18% ANTLR 0.03% Shell 0.32% Python 0.41% Lua 0.10% Dockerfile 0.03% Kotlin 18.55%

styx asynchronous java jvm netty http proxy nio load-balancer oss-portal-listed

styx's Introduction

Styx is a programmable reverse proxy for JVM (Java Virtual Machine). It can be used as a stand-alone application, or as a platform to build custom networking applications. It is non-blocking, fully asynchronous and event driven therefore very scalable.

Features

Styx Proxies HTTP requests to a configurable set of Backend Services typically a cluster of web servers (multiple origins) or load balancers (e.g. AWS ELB).

Requests are subjected to an interceptor chain (conceptualy a HTTP filter chain), which can respond or modify and pass through the request to the backend services. The interceptor chain can be easily extended by Plugins written in Java.

Styx additional capabilities:

Load balancing
Origin health checking
Retry mechanisms
Connection pooling
Admin dashboard (inspired and based on Hystrix Dashboard)
Performance and system metrics collection and reporting (Graphite compatible)

Most additional features can be extended via the Java Service Provider Interface (SPI) model.

The plugin and the SPI (Service Provider Interface) model enables developers to build custom HTTP applications easily on top of Styx, which takes care of common proxy server related functionality. Developers can then concentrate on the value-add business logic.

Use cases of Styx Plugins

Hotels.com - Built authentication, UI rendering, URL redirection and cookie cleaning plugins in front of backend services.
Expedia - Built routing and bot detection capabilities on top of Styx. More info here.
Homeaway - Built various plugins.

Useful Information

Quick Start

A quick-start guide can be found on our wiki.

User Manual

User guide explains how to run and operate Styx as a standalone application.

Developer Resources

Our Developer guide explains how to build applications on top of Styx. It also explains how to build and run Styx.

If you want to help to contribute to Styx project, please check Contributor guide to find out how to start.

Got a Question?

Dependencies

Oracle JDK 21
Apache Maven v3
Makefile (Optional)- There are fairly many ways of running Styx tests with different Maven profiles. Therefore, the shortcuts for most common usages are compiled into a separate (GNU) Makefile for developer's convenience. To take advantage of these shortcuts, a GNU Make build tool must be installed.

Notice

Not the Styx project that you were expecting to find?

Other open source projects called Styx on GitHub: Github Styx Projects

License

This project is licensed under the Apache License v2.0 - see the LICENSE.txt file for details.

Licencing terms for any derived work and dependant libraries are documented in NOTICE files.

styx's People

Contributors

Stargazers

Watchers

Forkers

iamchrisrice mikkokar alobodzki ggaeta1 caiodangelo kainee taer rickfast jinfei21 nersesam dios-david to-da xujin-incubator chakra-coder claudiocorridore noel-smith gezerk tlivings dvlato nitingupta183 hdisppod vivianlopes cole2295 eglizzy fantayeneh enterstudio vandinhnguyen22 keeeeeep dyb10101 mfrank2016 shafikhan zjpjohn xiuwyang salahqasem sd6292000 nikos912000 xydoublez herbertbhs02 mtyrrell yashvyas guptaabhinav206 pologood yuzhongzhu owenlindsell 1621740748 anuragbd girishpatel davepkennedy antimoroma manishsahni2000 chrisgresty tamast leharb vvishwakarma-hc wbautistajr thaeralajlouni ansyed97 oleksche autoappcreator tombart stillywud doytsujin olindsell misselvexu yejq ryanellison ahsansyed97 islombektoshev hejinxu kenluluuuluuuuu kvosper battaglia-v alseddnm rossilor95 bulksecuritygeneratorprojectv2 juliodeleon uttamtirunagari sullis richax

styx's Issues

HTTP messages should use CRLF as the line separator

The problem

HTTP messages should use CRLF as the end of line marker.

Detailed description

According to the HTTP spec, the CRLF sequence must be used as the line separator in all parts of a HTTP message except for the entity body. The entity body for "text types" is allowed to have CRLF, LF or CR, but the canonical form also has CRLF as a separator and it's the recommended one.
The safest and most cohesive approach for HTTP messages generated in STYX seems to be codifying the new line marker as CRLF in all the cases.

Relevant parts of the spec:

HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for
tolerant applications). The end-of-line marker within an entity-body
is defined by its associated media type, as described in section 3.7.

from RFC2616 https://tools.ietf.org/html/rfc2616#section-2.2

MIME's canonical form requires that media subtypes of the "text" type
use CRLF as the text line break. HTTP allows the transfer of text
media with plain CR or LF alone representing a line break, when such
line breaks are consistent for an entire representation. An HTTP
sender MAY generate, and a recipient MUST be able to parse, line
breaks in text media that consist of CRLF, bare CR, or bare LF. In
addition, text media in HTTP is not limited to charsets that use
octets 13 and 10 for CR and LF, respectively. This flexibility
regarding line breaks applies only to text within a representation
that has been assigned a "text" media type; it does not apply to
"multipart" types or HTTP elements outside the payload body (e.g.,
header fields).

from RFC 7231 https://tools.ietf.org/html/rfc7231#section-3.1.1.3

Improve FlowControllingHttpContentProducer state machine

The problem

Confusing warning messages from the FlowControllingHttpContentProducer when origin server closes the TCP connection:

    WARN  com.hotels.styx.client.netty.connectionpool.FlowControllingHttpContentProducer 2018-03-03 02:02:02,000 [Styx-Client-Worker-14-Thread] 
    - message="Inappropriate event=ChannelIdleEvent", prefix=myserver.com/1.1.5.1:22222 -> /2.1.1.2:55555, Request(method=POST, url=/some/page.html, i0=00000000-0000-0000-0000-000000000000), state=TERMINATED, receivedChunks=0, receivedBytes=0, emittedChunks=0, emittedBytes=0

This suggests the TCP connection is closed due to idleness. But in fact the warning also pops up when the channel has been subjected to a "delayed closure". This happens when the origin server closes the TCP connection to indicate response content length (See RFC 7230, Chapter 3.3.3. Message Body Length, item 7.) Tomcat apps have been seen to behave this way when they respond with HTTP error status code (4xx or 5xx).

A clarification regarding "delayed closure". It is the Flow Controller state machine associated to the TCP connection that is subjected to the delayed closure. The TCP connection itself is closed immediately. The Flow Controller state machine regulates the delivery of the HTTP message body to the rest of the styx honouring the back pressure and Rx.Java observable protocol. The purpose of the delayed closure is to allow styx server some time to consume the response in order to proxy it back.

Detailed description

Back in the days when we realised it was necessary to implement the delayed closure as described above, we hijacked the FlowController's "idle state event" signal to inform the state machine to tear down any resources. This is in NettyToStyxResponsePropagator.java:

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        ensureContentProducerIsCreated(ctx);

        ctx.channel().eventLoop().schedule(
                () -> contentProducer.ifPresent(producer -> producer.idleStateEvent(
                        new ResponseTimeoutException(
                                origin,
                                "channelInactive",
                                producer.receivedBytes(),
                                producer.receivedChunks(),
                                producer.emittedBytes(),
                                producer.emittedChunks()))),
                idleTimeoutMillis,
                MILLISECONDS);

        TransportLostException cause = new TransportLostException(ctx.channel().remoteAddress(), origin);
        contentProducer.ifPresent(producer -> producer.channelInactive(cause));
    }

The misleading error message is caused by this re-use of the idle state event, and the time has come to address this issue.

Acceptance criteria

Introduce a new delayed teardown (or something similar) event to replace the usage of idle state event to tear down the Flow Controller state machine.
The handling of delayed teardown in the Flow Controller state machine:
- The 'delayed teardown' must be handled in BUFFERING_COMPLETED and EMITTING_BUFFERED_CONTENT states.
- In BUFFERING_COMPLETED state: release all buffered content and transition to TERMINATED state.
- In EMITTING_BUFFERED_CONTENT state, release all buffered content, emit error on content observable, and transition to TERMINATED state.
- In COMPLTED state, swallow the event to prevent an unnecessary warning message.
- Leave it unhandled in all other states, so that an informational warning message is logged.
The 'idle state event' is only used to inform the state machine about prolonged inactivity on the TCP connection.

Allow connection pool to gradually drain down

The problem

When an origin is removed from configuration, the associated connection pool is closed. The closure is rather abrupt as it closes each borrowed connection straight away:

    @Override
    public void close() {
        removeEachAndProcess(waitingSubscribers, subscriber ->
                subscriber.onError(new RuntimeException("Connection pool closed")));
        removeEachAndProcess(availableConnections, Connection::close);
        borrowedConnections.forEach(Connection::close);
    }

Desired behaviour

Ideally, the connection pool should be drained down without affecting the ongoing connections.

When closed, the connection pool should go into a draining state where it stops accepting new connections, and waits until all borrowed connections are returned back to the pool. After which it then finally considered fully closed.

That is, the signature of the close method should probably change:

    public Future<Void> close() { ... }

Detailed description

See above.

Acceptance criteria

Send N requests to an origin. Remove origin from connection. The N requests should successfully complete.
The origin is removed from the load balancer as soon as the connection pool closure is triggered. That is, it should not get any traffic while it is in the draining state.

Mutual TLS Authentication support

The problem

We have clients that require mutual authentication https://en.wikipedia.org/wiki/Mutual_authentication while connecting over internet.

Detailed description

Styx does not support that feature at that time.

Acceptance criteria

We should be able to upload client certs / root + intermediate certs on the Styx server and should be able to force client certificate verification.
Ideally verification could be either required = client has to provide valid cert otherwise 400 http code will be returned back. Or optional = ask client for a cert and let the request to proceed if there is no cert provided. It would be an analogue to ON and OPTIONAL options http://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_verify_client
Be able to include client cert details on the request to a downstream service (http headers for example)

[1.0] Ensure that reference counting works well with the new API.

The problem

Styx uses Netty reference counted ByteBuf for content streams. Netty reference counts are exposed to Styx consumers.

Styx should protect its consumers from accidentally leaking byte buffers.

The dangerous areas are

Removing a body content stream
Transforming the content stream

Detailed Description

Consider this code snippet to transform an UTF8 HTTP content stream to UTF-16. It leaks both buffers buf1A and buf1B. The buffers leak because the content transformation function, implemented as lambda, never releases them.

    ByteBuf buf1A = copiedBuffer("buf-1a", UTF_8);
    ByteBuf buf1B = copiedBuffer("buf-1b", UTF_8);
    
    // The HTTP request body as UTF 8 encoded byte stream:
    StyxObservable<ByteBuf> utf8Body = StyxObservable.from(ImmutableList.of(buf1A, buf1B));
    
    HttpRequest req1 = HttpRequest.put("/")
            .body(utf8Body)
            .build();

    // The UTF 8 response body transformed to UTF-16:
    StyxObservable<ByteBuf> utf16Body = req1.body().map(buf -> copiedBuffer(buf.toString(UTF_8), UTF_16));
    
    FullHttpRequest req2 = req1.newBuilder()
            .body(utf16Body)
            .build()
            .toFullRequest(100)
            .asCompletableFuture()
            .get();
    
    System.out.println("result: " + Arrays.toString(r2.body()));
    System.out.println("buf1 refcount: " + buf1A.refCnt());
    System.out.println("buf1 refcount: " + buf1B.refCnt());

Acceptance criteria

Ideally, the Netty ByteBuf is wrapped with Styx facade to avoid leaking 3rd party dependencies in the API.
Styx to take care of, or provide some helper methods, to automatically release the reference counts.

Implement per host HTTP client

The problem

StyxHttpClient is a HTTP client for communicating to backend service apps. It load balances, retries, in addition to many other things. Some complexity has crept in over the time. To reduce this complexity it makes sense to refactor the act of sending a HTTP request out into a separate "per-host" HTTP client.

As an added benefit, having a separate "per-host" HTTP client allows us to open it up in the Styx routing object model. That is, the routing configuration would be able to route to individual hosts, instead of just applications (which are clusters of individual hosts). This would be a separate increment though.

Detailed description

Implement a new HostHttpClient with following functionality:

Connection pooling
Release content buffers on error
Run the HttpRequestOperation, which is to
- Add/remove relevant Netty handlers
- Write HTTP request to origin
- Request/response logging
- Finalise response processing - return borrowed connection back to pool, or close the connection in face of failure.

This is, the HostHttpClient will encapsulate Transport and HttpRequestOperation functionality.

The Styx HTTP Client, or the backend service client will then be responsible of:

Apply rewrites
Apply origins restriction feature
Sticky session handling
Run the load balancer
Retry handling
Metrics recording
Adds response ID

Acceptance criteria

HostHttpClient implements the HttpClient interface
sendRequest returns an observable of HttpResponse
A subscription to the returned observable triggers the request to be sent.
Unsubscribing from the observable triggers request cancellation.
When request is cancelled, the underlying TCP connection is terminated.
When response (and its body) is fully consumed, the client returns the underlying TCP connection back to the connection pool.
if errors occur, the underlying TCP connection is returned back to pool.
Client supports optional *flow-control to prevent hogging up memory.

Licence header in incorrect format

The problem

When using the code formatter in IntelliJ IDEA (and possibly other IDEs), undesirable modifications are made to the licence header.

Detailed description

The licence header in each source file is written as a Javadoc comment. As a result, when code-formatting IDE features are used, they often modify the header, e.g. adding HTML <p> tags. This in turn prevents the licence header plugin from recognising the existing header and causes it to add a new one on top.

The maven plugin used is com.mycila.license-maven-plugin which can be found in the root pom.xml file. Other relevant files are src/license/header-definitions.xml and src/license/templates/APACHE-2.txt.

Acceptance criteria

Licence header should not be modified by IntelliJ IDEA code formatter.

Move unique ID supplier class from styx-api into styx-server module

The problem

Move the UniqueIdSuppliers and FixedUniqueIdSupplierTest classes away from the api module into styx-server module.

These two classes are unnecessary in the API module and as such we don't want to support such feature for our API consumers.

Acceptance criteria

UniqueIdSuppliers no longer appears in the styx API.

Confusing message when placeholder tries to use value containing $

The problem

When PlaceholderResolver.replacePlaceholder is called with a placeholderValue that contains a $ (i.e. during configuration loading), it results in a confusing error message.

Detailed description

When PlaceholderResolver.replacePlaceholder is called with a placeholderValue that contains a $, the String.replaceAll method interprets it as referring to a matching group.

This results in a confusing error message:

for a non-numeric value after $ we get java.lang.IllegalArgumentException: Illegal group reference
for a numeric value after $ we get something akin to java.lang.IndexOutOfBoundsException: No group 1

Our options are to either escape the symbol or continue to fail, but with a better error message:

Pro of escaping: value is allowed through to Styx classes, which can decide whether to allow it or not on a case-by-case basis
Con of escaping: when it is injected into a value that should be a non-string (e.g.), Jackson will provide the error, which may be out-of-context
Con of escaping: multiple points of failure

Acceptance criteria

Determine whether values containing $ should be accepted
If we accept them, escape the symbol so that it does not cause an exception
If we do not, throw an exception with a message containing the placeholder and replacement

Config validator: Support for MAP types

The problem

Config validator is unable to validate plugins.all or services.factories attributes. This is because those attributes conceptually mappings of strings (or names) to values, but schema DSL only supports rigid record types (object) whose attribute names are all known a priori. In case of map types, the keys are not known a priori.

For this reason plugins.all and services.factories are declared as opaque types and are not validated.

Acceptance criteria

The plugins.all attribute is validated.
The services.factories attribute is validated.
The schema DSL shall have a new map type for named mappings.

An example DSL could look like:

schema(
    field("factories", mapOf(object(
            field("class", string()),
            field("config", object(opaque()))
        )))
)

Note that the "key" type is omitted because for styx configurations this is always String (for now).

Sensible defaults for Styx backend TLS settings.

The problem

Enabling TLS for backend services doesn't work out of the box. Styx insist that trust store is specified (tlsSetings.trustStorePath), even if it would not be used. This is quite burdensome as an empty trust store must be created just for the sake of establishing a TLS connection to backends.

Detailed description

Styx should be more in line with default Java behaviour as described in JSSE Reference Guide:

When trustStorePath is absent, Styx should use a trust store specified by javax.net.ssl.trustStore system property.
When trustStorePath is absent, and javax.net.ssl.trustStore is not set, then use the default system truststore. This would be, for JDK provider, $JAVA_HOME/lib/security/jssecacerts, or $JAVA_HOME/lib/security/cacerts.

Acceptance criteria

Styx will start up with the following tlsSettings configuration, and either use the truststore configured via javax.net.ssl.trustStore system property, or alternatively revert to default keystore.

    - id: "tls11"
      path: "/v11"
      tlsSettings:
        trustAllCerts:       true
        sslProvider:         OPENSSL     # Also supports JDK
      origins:
      - { id: "app1", host: "mybackend:443" }

Startup script does not fail if the order of the arguments is incorrect

The problem

When running the startup script with several arguments (for instance, adding -e) the passed in config filename is ignored silently if it is not the last argument.

Detailed description

The startup script expects the config filename to be the last argument for the command, but if it is not, instead of failing it will just ignore that config filename completely.
Example:

bin/startup -e conf/styx-env.sh conf2/environment/default.yaml starts the server correctly using the chosen config files.
bin/startup conf2/environment/default.yaml -e conf/styx-env.sh starts the server but it ignores the specified config filename and uses the default one (conf/default.yml

The script should be fixed so it will return an error when the order of the arguments is incorrect.

Acceptance criteria

When the config filename is not the last argument for the command, an error plus the usage instructions will be returned:
bin/startup conf2/environment/default.yaml -e conf/styx-env.sh will return an error immediately.

Investigate origins reload when backend services path changes.

Background

Styx doesn't reload backend service changes correctly when the change is limited to the path prefix attribute only.

This issue can be demonstrated by modifying the following e2e test case:

...2e-suite/src/test/scala/com/hotels/styx/admin/FileBasedOriginsFileChangeMonitorSpec.scala

Modify this test so that only the only change to the backend service is the path component. I would expect the test case to pass, but it fails.

Acceptance Criteria

I understand this issue is brief on the details. However raised it regardless for now with whatever information is available before it gets forgotten.

Investigate why path attribute change doesn't behave as expected.
Feel free to trash this issue if it turns out to be desired behaviour.

Unable to turn off health checks

The problem

Removing the healthCheck block from the Backend Service configuration (in the origins file) does not disable the health checks.

Detailed description

To reproduce:

Configure backend service as follows:

- id: "app"
  path: "/"
  connectionPool:
    maxConnectionsPerHost: 45
    maxPendingConnectionsPerHost: 15
    socketTimeoutMillis: 120000
    connectTimeoutMillis: 1000
    pendingConnectionTimeoutMillis: 8000
  responseTimeoutMillis: 60000
  origins:
  - { id: "app1", host: "localhost:9090" }
  - { id: "app2", host: "localhost:9091" }

Ensure nothing is listening on ports 9090 or 9091.
Start styx, and observe how health checker will mark the origins as inactive. But of course it shouldn't because health checking had been disabled.

Acceptance criteria

This was a regression, suggesting we were missing an e2e test case for this. Ensure there is an e2e test for this.
When healthCheck block is absent, styx doesn't perform health checks, and unreachable origins will show up as active.

Styx Logging Documentation

Opening this based on advice from @mikkokar

It would be nice if the logging documentation (Styx+ + Logging page, any other relevant documentation, for that matter) was migrated from the old Hotels.com internal Styx Wiki to GitHub. I have found it quite valuable, and have needed to reference it more than once.

Choose supported TLS protocol versions for styx client

Summary

As an infrastructure architect, I need to enforce all connections originating from Styx to use TLS 1.2 only, so that:

no component connected to styx accidentally reverts down to earlier insecure versions.
I can demonstrate that my software stack is compliant to possible regulatory requirements

Description

Implement client side support for "protocols" configuration. This allows Styx to enforce a specific TLS version to be used.

Acceptance Criteria

User can choose the TLS protocol versions used for each backend service.
The "protocols" attribute is part of the backend service TlsSettings configuration.
Usage of "protocols" attribute is documented in the Styx User Guide.

Sticky sessions remain pinned into to failed origins

The problem

Sticky session stays pinned down to an origin even after that origin goes down.

The problem was introduced by our refactoring to prepare opening up an SPI interface for extending various styx components.

Sticky session pins down a nominated origin for an ongoing HTTP session. But when that origin goes down, the session must be migrated over to some other origin. The styx retry feature took care of this.

Detailed description

Root cause: retry passes the control back to the sticky session handler, which always tries again the same origin. That's why outcome.nextOrigin() always returns None in RetryOnErrorHandler code:

         if (!outcome.shouldRetry() || !outcome.nextOrigin().isPresent()) {

Acceptance criteria

When an origin fails, any sticky sessions pinned to it must be migrated over to healthy origins

Out of direct memory exception from Netty is sometimes mapped to 502 Bad Gateway

Description

The io.netty.util.internal.OutOfDirectMemoryError from Styx Client is sometimes mapped to 502 Bad Gateway HTTP response code. Instead, it should always be mapped to 500 Internal Server Error.

This happens especially when the exception bubbles up to NettyToStyxResponsePropagator, which wraps all exceptions to BadHttpResponseExceptions. The OutOfDirectMemoryError should of course be mapped to a ResourceExhaustedException.

Acceptance Criteria

NettyToStyxResponsePropagator maps the io.netty.util.internal.OutOfDirectMemoryError to ResourceExhaustedException, with underlying cause set to OutOfDirectMemoryError.
The exception mapper in HttpPipelineHandler maps the ResourceExhaustedException(cause: OutOfDirectMemoryError) to 500 Internal Server Error HTTP status code.
All other instances of ResourceExhaustedException are still mapped to 503 Service Unavailable status code.

Config validator: Improve field type detection

The problem

(1) The config schema validator relies on fasterxml JsonNode class to determine data types of the deserialised JSON objects. Sadly, fasterxml doesn't always get it correct, resulting perfectly valid JSON documents failing the validation. This has been seen with boolean fields, which erroneously have been mistaken for a string:

Unexpected field type. Field 'request-logging.inbound.enabled' should be BOOLEAN, but it is STRING

(2) Styx configuration supports placeholders. The placeholder substitution may change the apparent data type. For example an integer value may be mistaken for a string, invalidating just like in the above example for booleans.

The solution

Whenever a String value is encountered, but a boolean is expected, the validator should attempt to parse the string value as a boolean. If successful, the field type is considered as boolean.

Whenever a STRING value is encountered, but an integer is expected, the validator should attempt to parse the string value as an integer. If successful, the field type is considered as integer.

Acceptance criteria

(/) Accepts YAML boolean values as per http://yaml.org/type/bool.html.

(/) Placeholder replacement can yield integer and boolean values.

(/) When incorrect type is detected, print the actual field value from the deserialised yaml object into the error message. For example:

Unexpected field type. Field 'request-logging.inbound.enabled' should be BOOLEAN, but it is STRING (enabled='true')

Add metric for server side TLS handshake failures.

The problem

There is no clear metric for failed TLS handshake attempts. At the moment SSL handshake failures go under styx.exception.io_netty_handler_codec_DecoderException metric which is a catch-all for all sorts of decoder problems, and also not specific enough for the TLS handshake issues.

As a devops engineer, I want to know how many failed TLS handshake attempts there has been, so that I can get a clear indication on the configuration problems, and to take a remedial action sooner.

Acceptance criteria

Add a metric, perhaps in connections scope, such as connections.tls.handshake.failures. Increment this metric on every failed TLS handshake.
Add the new metric in the styx user manual: https://github.com/HotelsDotCom/styx/blob/master/docs/user-guide/metrics.md
Add the new metric in the styx user manual under "Troubleshooting TLS connectivity issues".
https://github.com/HotelsDotCom/styx/blob/master/docs/user-guide/configure-tls.md

Restrict health checks by time since response

The problem

The scheduled health-checks in Styx are performed periodically in Styx, however, if we are receiving valid responses from an origin, we already know that the origin is healthy, so we don't really need to health-check it.

Detailed description

We know that an origin is healthy if it is sending back valid responses, so we do not need to perform an additional health-check unless the time since the last response has been longer than our health-check interval.

There may be some unusual use cases in which this logic does not apply, so we could add an opt-in feature to force explicit health-checks.

Acceptance criteria

Only perform health-check if the time elapsed since the last valid response exceeds the health-check interval.
Provide an optional feature, disabled by default, that forces health-checks to be performed regardless of responses.

Allow test-api StyxServer Builder to take PluginFactory

The problem

Currently, com.hotels.styx.testapi.StyxServer.Builder can take a Plugin through a builder method. However, this requires the plugin to have already been instantiated. This prevents the test-api from being able to test how plugin factory behaviour affects Styx, and also denies the plugin access to the Environment.

We need to add another method that allows PluginFactory to be set in the builder and encourage its use.

Acceptance criteria

New method in Builder takes PluginFactory argument
When starting Styx, the plugin factory is executed.

Add admin endpoint for querying specific metrics

The problem

We currently have an endpoint that returns the entire set of metrics, but it may be useful to have an endpoint that when queried can return an arbitrary metric specified in the HTTP request. This could allow more sophisticated health-checking options for those that need it.

Detailed description

The URL pattern will be /admin/metrics/<METRIC_NAME_OR_PREFIX>, so if the URL was:
/admin/metrics/foo.bar
it would return a metric named foo.bar as well as metrics beginning with this prefix such as foo.bar.baz.

These can be returned as a JSON object by simply filtering the map of metrics for those that match the pattern and converting it to JSON.

If the metric does not exist it can return 404.

Acceptance criteria

Add an admin endpoint that allows the user to drill down to a specific metric
Returns metric matching exact name if it exists
Returns metric whose name has specified prefix
If no metrics are found, returns 404

Document use of test-api for creating plugins

The problem

Styx has a good test-api to use when writing plugins, but there is no documentation showing how to test plugins with this API.

Acceptance criteria

Write this documentation

Remove fasterxml objects from the BackendService etc. classes

The problem

The BackendService API object exposes com.fasterxml annotations:

import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
import com.fasterxml.jackson.databind.annotation.JsonPOJOBuilder;

Remove them from the interface.

Detailed description

The yaml serialisation ought to move out of the API module. This can be implemented somewhere in styx-proxy module. We can use an intermediate form, perhaps a java bean based, to perform deserialisation (from yaml configuration) and serialisation (to admin interface).

Acceptance criteria

The com.fasterxml.* imports no longer appear in BackendService & related API classes.
The admin interface remains unaffected.

[1.0] Remove obsolete method from HttpRequest.Builder

The problem

HttpRequest.Builder.body(String) should be removed.

Detailed description

HttpRequest is to be our streaming class, but this method sets a full body.

Original TODO from @mikkokar :

This method is bad because it just takes string instead of stream of chunks. Also it implicitly assumes UTF_8 encoding.

It will also be necessary to update any code using it.

Acceptance criteria

Method is removed
It is still possible to accomplish the same goal, if required
Tests using this that should be using FullHttpRequest anyway updated

socketTimeoutMillis property in the connection pool configuration is not honoured.

The problem

The property socketTimeoutMillis in the connection settings is completely ignored in Styx code (it never tries to set a timeout at all).

Detailed description

com.hotels.styx.api.client.Connection.Settings.socketTimeoutMillis is not read in any part of Styx source code (except for some test cases that verify the config itself), and thus it's never honoured.

Acceptance criteria

com.hotels.styx.api.client.Connection.Settings.socketTimeoutMillis should be honoured, setting the timeout as expected.
Documentation should be updated to reflect this change.

Origin metrics for request cancellations count all requests

The problem

The metric <prefix>.origins.<origin-id>.requests.cancelled.count counts all requests, not just cancelled requests.

Detailed description

StyxHttpClient assumes that any unsubscription is a cancellation and updates the origin cancellation metrics accordingly.

In reality, RxJava calls unsubscribe whenever an observable terminates.

We need a way to allow StyxHttpClient to distinguish cancellations from other situations.

Acceptance criteria

Unsubscriptions that are part of the normal flow should not be recorded as cancellations
Real cancellations should still be recorded

Example plugin pom file should be self-containing

Summary

As a Styx plugin developer, I want the pom.xml file in the example plugin to contain all relevant definitions, so that I can use it as a working template for my own plugins.

Description

The pom.xml cannot be used independently, so any developer who copies it into their project will find that it doesn't work. Several important definitions are missing from the file because they are inherited from the parent pom.xml file(s). Therefore the pom file cannot be used as a readily available template for other Styx plugins.

Acceptance Criteria

Able to successfully build a plugin by using the POM file as a template, like so:

$ cp -r styx/system-tests/example-styx-plugin/ myPlugin
$ cd myPlugin
$ vi pom.xml
.. Remove styx parent pom
.. Change the artifactId, groupId, and version
$ mvn compile

Create unit tests for StyxCoreObservable in 1.0 branch

The problem

StyxCoreObservable does not have associated unit tests.

Detailed description

It is important to have unit tests for StyxCoreObservable. This class can be indirectly created through the API, using StyxObservable.of, so we cannot rely on our own system tests.

Acceptance criteria

Create unit tests for StyxCoreObservable

Config validator: Add schema for routing configuration.

The problem

Add schema for validating routing configuration as described in
https://github.com/HotelsDotCom/styx/blob/master/docs/user-guide/configure-advanced-routing.md

Acceptance criteria

All supported routing configurations are validated.

Styx fails to start with the default configuration file

The problem

Version styx-0.7.5 does not start when using the "default.yml" bundled in the archive.

Detailed description

Version of styx: styx-0.7.5 . This did not happen with the previous version styx-0.7.4
When trying to run styx as per the quick start guide:

<YOUR_INSTALL_DIR>/bin/startup

That means that the process will use the default configuration file, so it is equivalent (and causes the same error) to this command:
<YOUR_INSTALL_DIR>/bin/startup <YOUR_INSTALL_DIR>/conf/default.yml

The process fails to start with the following exception:

11:08:05.919 [main] INFO  c.h.s.i.c.yaml.YamlConfig - Loading configuration from file: path=/styx/conf/default.yml
Mar 16, 2018 11:08:06 AM com.hotels.styx.infrastructure.logging.LOGBackConfigurer initLogging
INFO: Initializing LOGBack from [/styx/conf/logback.xml]. If you are watching the console output, it may stop after this point, if configured to only write to file.
WARN  2018-03-16 11:08:06 [c.h.s.StyxServer] [build=release.label_IS_UNDEFINED] [main] - Could not acquire build number from release version: Version{releaseTag=STYX.0.7-SNAPSHOT}
INFO  2018-03-16 11:08:06 [c.h.s.s.ServiceFactoriesConfig] [build=release.label_IS_UNDEFINED] [main] - Service 'backendServiceRegistry' is ENABLED
ERROR 2018-03-16 11:08:06 [c.h.s.StyxServer] [build=release.label_IS_UNDEFINED] [main] - Error in Styx instance creation.
[exceptionClass=com.hotels.styx.serviceproviders.ServiceFactoryConfig, exceptionMethod=loadService, exceptionID=1efa334c]
com.hotels.styx.api.configuration.ConfigurationException: Error creating service
	at com.hotels.styx.serviceproviders.ServiceFactoryConfig.loadService(ServiceFactoryConfig.java:91) ~[styx-proxy-0.7-SNAPSHOT.jar:na]
.......	
Caused by: com.hotels.styx.api.configuration.ConfigurationException: empty [services.registry.factory.config.originsFile] config value for factory class FileBackedBackendServicesRegistry.Factory
	at com.hotels.styx.proxy.backends.file.FileBackedBackendServicesRegistry$Factory.requireNonEmpty(FileBackedBackendServicesRegistry.java:172) ~[styx-proxy-0.7-SNAPSHOT.jar:na]
	at java.util.Optional.map(Optional.java:215) ~[na:1.8.0_60]

Acceptance criteria

Styx should start up successfully when using the default yaml config file.

Add 4xx server response codes to styx documentation

The problem

The 400, 408, etc (if any others?) response code is not documented in https://github.com/HotelsDotCom/styx/blob/master/docs/user-guide/response-codes.md.

Acceptance criteria

The user guide response codes page contains the following information:

The circumstances under which styx responds with 400.
The circumstances under which styx responds with 408.

The relevant user manual page: https://github.com/HotelsDotCom/styx/blob/master/docs/user-guide/response-codes.md.

Refactor OriginsInventory, BackendServiceRouter for Styx 1.0

The problem

OriginsInventory is class that couples too many responsibilities in one place.

A consequence of this is that when the BackendServiceRouter is informed of changes to the back-end services, it entirely recreates the OriginsInventory for the modified service(s).

Creating and destroying these objects at runtime can cause further issues, as OriginsInventory is responsible for consuming and publishing events.

The risk is that an old OriginsInventory will publish an outdated snapshot after the new OriginsInventory, thus misinforming subscribers.

"Quick & dirty" solutions only introduce other problems (like having a gap in which no OriginsInventory exists to handle events).

We must refactor our code to both reduce complexity and prevent bugs.

Detailed description

We can use Styx 1.0's ConfigStore class to help us with the decoupling, by storing our back-end service config in there and allowing various other classes to watch it for change events.

For example a class to start and stop origins monitoring could watch the ConfigStore and then the monitors could publish the results of their health checks back to the ConfigStore.

Acceptance criteria

Eliminate the risk of multiple conflicting "sources of truth" regarding origins
Decouple the various responsibilities of OriginsInventory
Simplify the behaviour executed when back-end services are reconfigured

Race condition in e2e test: ExpiringConnectionSpec

The problem

ExpiringConnectionSpec fails intermittently due to race condition. The failure manifests especially in Travis builds. All other builds seem unaffected.

Root Cause

Below is a test code in verbatim up to the point of failure. It primes the connection pool with one active connection, which after the expiry timeout is automatically disconnected. The latter section of the test (not shown) asserts that the connection indeed got disconnected.

  it("Should expire connection after 1 second") {
    val request = get(styxServer.routerURL("/app1")).build()

    val response1 = waitForResponse(client.sendRequest(request))

    assertThat(response1.status(), is(OK))

    eventually(timeout(2.seconds)) {
      styxServer.metricsSnapshot.gauge(s"origins.appOne.generic-app-01.connectionspool.available-connections").get should be(1)
      styxServer.metricsSnapshot.gauge(s"origins.appOne.generic-app-01.connectionspool.connections-closed").get should be(0)
    }

...
    }
  }

Due to a race condition in the test code, the connection doesn't always get returned back to the connection pool. Instead the connection sometimes gets disconnected.

The race condition is caused by the client which disconnects the TCP connection after receiving the full response (response1). Sometimes (especially in Travis builds) Styx sees the client TCP connection being disconnected before it has had a chance to return the origin bound TCP connection back to pool. In this case Styx closes the origin bound TCP connection instead of putting it back to the pool. Here is a detailed description:

Test case sends request to Styx server using client object.
Styx establishes a TCP connection to the backend service origin, and proxies the request. The origin responds.
Styx proxies the response back to the test case client. In Styx HttpPipelineHandler class manages the delivery of the response back to client. It considers the response to have been fully sent only after every chunk of content has been successfully written to the underlying (client-bound) TCP socket (Netty channel future for each write operation has successfully completed).
However, the client in the meanwhile has received the request in full, and disconnected the (client/styx) TCP connection. However, sometimes the HttpPipelineHandler receives a channelDisconnected event from Netty before it gets a notification that all channel futures have been successfully completed.
Styx will therefore consider this as a proxy failure and cancels the ongoing request. This has an effect of disconnecting the styx/origin TCP connection which therefore disconnected instead of returned back to the connection pool.
Assertions in the eventually block fail.

Acceptance criteria

Fix the race condition by using a client that keeps the TCP connection alive.

Error in Dashboard generation

The problem

An exception is thrown when trying to generate downstream.origins.origin[].requests.latency.mean

This exception only occurs very rarely and may be the result of a race condition.

java.lang.ArrayIndexOutOfBoundsException: null
	at org.HdrHistogram.AbstractHistogramIterator.next(AbstractHistogramIterator.java:100) ~[HdrHistogram-2.1.4.jar:2.1.4]
	at org.HdrHistogram.RecordedValuesIterator.next(RecordedValuesIterator.java:18) ~[HdrHistogram-2.1.4.jar:2.1.4]
	at org.HdrHistogram.AbstractHistogram.getMean(AbstractHistogram.java:1121) ~[HdrHistogram-2.1.4.jar:2.1.4]
	at org.HdrHistogram.AbstractHistogram.getStdDeviation(AbstractHistogram.java:1134) ~[HdrHistogram-2.1.4.jar:2.1.4]
	at com.hotels.styx.api.metrics.codahale.SlidingWindowHistogramReservoir$HistogramSnapshot.getStdDev(SlidingWindowHistogramReservoir.java:139) ~[styx-api-0.7-916.jar:na]
	at com.hotels.styx.metrics.reporting.graphite.GraphiteReporter.reportTimer(GraphiteReporter.java:308) [styx-proxy-0.7-916.jar:na]
	at com.hotels.styx.metrics.reporting.graphite.GraphiteReporter.doReport(GraphiteReporter.java:252) [styx-proxy-0.7-916.jar:na]
	at com.hotels.styx.metrics.reporting.graphite.GraphiteReporter.lambda$report$4(GraphiteReporter.java:236) [styx-proxy-0.7-916.jar:na]
	at java.util.TreeMap.forEach(TreeMap.java:1001) ~[na:1.8.0_60]
	at java.util.Collections$UnmodifiableMap.forEach(Collections.java:1505) ~[na:1.8.0_60]
	at com.hotels.styx.metrics.reporting.graphite.GraphiteReporter.report(GraphiteReporter.java:235) [styx-proxy-0.7-916.jar:na]
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:243) ~[metrics-core-3.2.6.jar:3.2.6]
	at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:182) ~[metrics-core-3.2.6.jar:3.2.6]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[na:1.8.0_60]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_60]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[na:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_60]
	at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_60]

Detailed description

This appears to be a thread-safety issue. Calling the snapshot.getMean() from multiple threads repeatedly causes the exception to be thrown.

See: https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/AbstractHistogramIterator.java#L103

The impact of this error appears to be relatively small as it will only prevent a single dashboard update.

Acceptance criteria

This exception should no longer occur

Provide a metric for direct memory monitoring

Summary

As a maintainer of a running styx server instance I'd like to have an information on what is a current direct memory usage of an instance.

Description

Problem with monitoring occured after migrating styx to newer netty version with this change:
Allow to create Unsafe ByteBuf implementations that not use a Cleaner. The netty change stopped using official java sdk classes to allocate direct memory, it has it's own mechanism to do that. It provided a performance improvement, but also stopped publishing directe memory usage metrics via mbeans.
Currently com.hotels.styx.StyxServer is registering a gauge in:

    private static void registerJvmMetrics(MetricRegistry metricRegistry) {
        RuntimeMXBean runtimeMxBean = getRuntimeMXBean();

        MetricRegistry scoped = metricRegistry.scope("jvm");
        scoped.register("bufferpool", new BufferPoolMetricSet(getPlatformMBeanServer()));
...

but for reasons described before it doesn't work anymore.

The information about direct memory usage can be taken from io.netty.buffer.PooledByteBufAllocator#metric io.netty.buffer.PooledByteBufAllocatorMetric#usedDirectMemory property.

Acceptance Criteria

Provide a codahale gauge that would report a value of usedDirectMemory property.

Javadocs should be available for viewing online

Summary

As a Styx plugin developer, I want to be able to view API documentation for Styx in my browser.

Description

We have documentation in Javadoc-formatted comments that should be made available online. We need to determine how to generate Javadocs when we make a release, upload them to a suitable location and display them for anyone who wants to view them.

Acceptable criteria

Javadocs for styx-api module are available at javadoc.io
Styx Wiki links to the JavaDocs hosted at javadoc.io.

Graphite Reporter: The graphite IP address gets cached.

The problem

Graphite IP address is resolved when the GraphiteReporterService is created, and cached ever since then. Thus the IP address never re-resolves if it changes behind the DNS name.

Plugin cannot consist of multiple JAR files

The problem

It is not possible to load a plugin that consists of multiple JAR files.

Detailed description

The implementation is bugged because it loads each JAR (i.e. dependency) into a separate classloader (and the classloaders do not have visibility of each other). This forces the plugin developer to merge all of the JARs together in order to make it work.

Acceptance criteria

Load all JARs in the plugin JAR directory into a single classloader
A Plugin classloader should use the system classloader as its parent, not the Styx classloader (so that it uses the plugin's version of its dependencies)

Built-in generic plugin - feature request/discussion

The problem

Sometimes a plugin will be desired to do something very basic, such as adding some headers to the request or response with set values. It is not very efficient to create a whole new project/dependency for something so simple.

The following details are more of a concept than a finished feature request, and should be discussed and likely amended before implementation.

Detailed description

It would be easier if there were a built-in Plugin (and PluginFactory) implementation that could be used to configure simple plugins without needing to write any new code.
Furthermore, this intrinsically allows us to minimise the impact of API changes on Plugin users.

Acceptance criteria

Allow plugin to be configured without writing code
Only needs to support the addition of headers

Styx user guide: Overarching backends/origins configuration document is missing

Styx user manual needs a new chapter explaining how to configure backend services and origins for styx. Currently it doesn't appear to have any.

Acceptance criteria:

Explain how & where to specify an origins file.
Explain overall structure of the origins configuration file.

Do not allow backend services to load or reload when path prefix is duplicated

The problem

When Styx tries to load at start-up, or when a reload is executed, it should check that every backend service has its own unique path prefix and refuse to start otherwise.

Acceptance criteria

If there is a duplicate path prefix at start-up Styx does not start and logs an error
If there is a duplicate path prefix at reload, Styx:
- does not accept the new backend-service config
- continues using the old config
- sends an error response (assuming HTTP command)
- logs an error

A change of backend service name (id) fails

Description

A change of backend service ID, while the path prefix remains unchanged, causes a backend service to disappear. This is because additions and removals of the backend services, at service file reload, are processed in a wrong order.

Detailed Description

Consider the following backend configuration:

   myApp:
      id: myApp
      path: /app
      ...

The above configuration is reloaded with the following:

   myApp:
      id: justApp
      path: /app
      ...

Reloading this change results justApp to lose all of its origins. This is due to a bug in the BackendServiceRouter. It first applies added or modified backend services before removing the old services.

The bug is caused by BackendServicesRouter processing the additions/updates before the removals. In this example the BackendServicesRouter gets a change notification with "justApp" as an added backend, and "myApp" as a removed backend. Because additions and removals are processed in the wrong order, justApp is first added at the path prefix "/app", removing the existing "myApp". However, right after "myApp" is removed from the same path prefix "/app" removing the newly inserted "justApp". In effect this leaves it without origins.

Graphite Reporter: retries too often

The problem

Graphite reporter retries are too aggressive.

When the Styx GraphiteReporter fails to report the metrics due to an error (IO-error), it tries again resuming from the point it failed. This however can cause a storm of retries possibly overwhelming the Graphite Server.

The graphite reporter needs to limit the number of retries towards the Graphite server.
Additionally, it is worth nothing that keeping a permanent connection to the server would prevent us from distributing the load among different instances of the Carbon/graphite server using DNS round robin.

Acceptance criteria

Limit the number of retries
Logs an error message for failed attempts (it already does)
Allow to close the connection after every 'batch of metrics' so the load can be distributed as expected.

Deprecate/remove unnecessary admin endpoints

The problem

We have these two admin endpoints /admin/healthcheck and /admin/status. These do not really provide specific enough information to be useful and are a source of confusion for users and developers.

Detailed description

/admin/healthcheck simply judges "health" by the rate of HTTP 500 Internal Server Error status codes returned. It uses a hard-coded rate of 1/minute as the threshold and makes no distinction between errors caused by Styx/plugins and errors proxied from origins.

/admin/status simply delegates to /admin/healthcheck but replaces the response body with either OK or NOT_OK depending on whether the response status is 200 or not.

These handlers are a relic of very early development and do not really provide any useful function, so they should be removed.

Acceptance criteria

Remove /admin/healthcheck
Remove /admin/status
Remove any code that only exists for them

Evaluate the use of netty's NameResolver and DnsCache instead of InetAddress classes.

The problem

The DNS caching solution in Java is flawed as it does not respect the TTL settings. It could also benefit from alternating the IP addresses in a round robin fashion.

Netty added in 4.1 the interfaces NameResolver and DnsCache (which are marked as unstable ) to provide its own implementation for hostname resolution that should fix these issues.

It seems worth exploring migrating our DNS resolution code to use those classes so Styx doesn't have to rely on performing the actual DNS requests (which can be costly) each time when we need to connect to a pool of servers behind a single host name .

[1.0] Find a way to test decoding failure scenarios with new API

The problem

PluginErrorHandlingSpec has tests disabled because they can no longer be performed, due to the API changes in 1.0.

If possible, a new way to test them should be found.

Acceptance criteria

Find a new way to perform these tests.

Cookie names should be treated case-sensitively

The problem

Cookies names are treated in a case-insensitive manner inside Styx, but this doesn't seem to agree with RFC 6265 (https://tools.ietf.org/html/rfc6265) and most popular browsers do treat them as case sensitive (see https://stackoverflow.com/questions/11311893/is-the-name-of-a-cookie-case-sensitive for details)

Acceptance criteria

Cookie names should be compared in a case sensitive manner.
Other attributes of the cookie and headers should remain case insensitive.

Admin page shows health check configuration even when it is absent

The problem

The admin page for origins configuration (/admin/configuration/origins) shows the health check configuration block even when it is disabled:

  "healthCheck" : {
    "uri" : null,
    "intervalMillis" : 5000,
    "timeoutMillis" : 2000,
    "healthyThreshold" : 2,
    "unhealthyThreshold" : 2
  },

Well ok, it shows that uri is null, but really it shouldn't show this block at all.

Acceptance criteria

When health check configuration is absent, it shouldn't show up in the origins configuration page at all.

expediagroup / styx Goto Github PK

styx's Introduction

Features

Use cases of Styx Plugins

Useful Information

Quick Start

User Manual

Developer Resources

Got a Question?

Links

Dependencies

Notice

License

styx's People

Contributors

Stargazers

Watchers

Forkers

styx's Issues

The problem

Detailed description

The problem

Detailed description

Acceptance criteria

The problem

Desired behaviour

Detailed description

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem

Detailed Description

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem

Detailed description

Acceptance criteria

Background

Acceptance Criteria

The problem

Detailed description

Acceptance criteria

Summary

Description

Acceptance Criteria

The problem

Detailed description

Acceptance criteria

Description

Acceptance Criteria

The problem

The solution

Acceptance criteria

The problem

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem

Acceptance criteria

The problem

Detailed description

Acceptance criteria

The problem