ostrich's People

Contributors

bbeck, bivasdas, bvbuild, chriskramer, dump247, hguo0303, lpearson05, matthewbogner, mike-unitskyi, milosimpson, nbauernfeind, ohhatiya, olyhuta, philipflesher, robbytx, shawnsmith, tianx2

ostrich's Issues

Increase stability of TeamCity build

The TeamCity build randomly fails when no code has been changed (it's currently running hourly). We're seeing 1-2 random failures per day.

I believe this is an issue between Curator's TestingServer and Apache's ZooKeeper library. I've written a very simple test case that exposes this: https://gist.github.com/2891890. Given enough iterations the test eventually gets into an infinite loop. I believe this same problem is what is affecting our TeamCity build.

Support caching of Service instances

Currently in the ServicePool, every time we execute a callback we ask the ServiceFactory<S> to create a new instance of a service. Depending on the implementation of the ServiceFactory this could be an expensive operation (it may need to establish a new connection to the remote server, etc.). We should offer the ability to cache these instances so that they don't have to be recreated every time.

This functionality should probably be controlled by individual service providers, not by service consumers.
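
One way to do this without changing how the pool talks to the factory is a caching decorator that a service provider can wrap around their real factory. A minimal sketch, using simplified stand-ins for the Ostrich ServiceFactory and ServiceEndPoint interfaces (the real signatures may differ):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified stand-ins for the real Ostrich interfaces.
interface ServiceEndPoint {
    String getId();
}

interface ServiceFactory<S> {
    S create(ServiceEndPoint endPoint);
}

// Decorator that reuses one service instance per end point instead of creating a new
// one for every callback execution.
class CachingServiceFactory<S> implements ServiceFactory<S> {
    private final ServiceFactory<S> delegate;
    private final ConcurrentMap<String, S> cache = new ConcurrentHashMap<String, S>();

    CachingServiceFactory(ServiceFactory<S> delegate) {
        this.delegate = delegate;
    }

    @Override
    public S create(ServiceEndPoint endPoint) {
        S service = cache.get(endPoint.getId());
        if (service == null) {
            S created = delegate.create(endPoint);
            S existing = cache.putIfAbsent(endPoint.getId(), created);
            // Under a race the losing instance is simply discarded; fine for a sketch.
            service = (existing != null) ? existing : created;
        }
        return service;
    }
}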

Write documentation

We need several pieces of documentation:

  • A guide for service consumers
  • A guide for service providers
  • A guide for operators (describing how the system works, what connectivity is required, etc.)

Update documentation

Docs need to be updated for the new project dependencies (i.e., use of Curator directly instead of through ZooKeeperConnection).

Add a richer exception hierarchy

Right now ServiceException is thrown when most things go wrong. We should be more specific than that and have subclasses that represent different failures. The following cases are useful to users (a rough sketch of such a hierarchy follows the list):

  • Knowing when no hosts were available for the request (e.g. HostDiscovery reporting empty set)
  • Knowing when all retries were exhausted
  • Telling the ServicePool when the exception that happened should result in a retry
  • Telling the ServicePool (likely having it infer) when the exception that happened was a programming error.
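
A minimal sketch of what that hierarchy could look like; only ServiceException is an existing Ostrich name, and the subclasses, their names, and their constructors are invented here for discussion:

// Sketch only: the real ServiceException already exists in Ostrich and its base class
// and constructors may differ.
class ServiceException extends RuntimeException {
    ServiceException(String message) { super(message); }
    ServiceException(String message, Throwable cause) { super(message, cause); }
}

// HostDiscovery reported an empty set: there was nothing to even attempt.
class NoAvailableHostsException extends ServiceException {
    NoAvailableHostsException() { super("no hosts available"); }
}

// Every attempt permitted by the retry policy failed; carries the last failure.
class MaxRetriesException extends ServiceException {
    MaxRetriesException(Throwable lastFailure) { super("retries exhausted", lastFailure); }
}

// Thrown (or wrapped) by callbacks to tell the ServicePool the operation is retryable.
class RetryableServiceException extends ServiceException {
    RetryableServiceException(String message, Throwable cause) { super(message, cause); }
}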

Randomize the interval that health checks are polled on

We don't want to bombard a server that comes back up with a ton of health checks. It would be nice to space these out by waiting a random amount of time before the first check.

Maybe some type of backoff strategy should be employed.
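
A minimal sketch of a jittered exponential backoff schedule the health check poller could use; the class name and the base/max delays are illustrative, not taken from the Ostrich code base:

import java.util.Random;

// Sketch of a jittered exponential backoff schedule for polling an unhealthy end point.
class HealthCheckBackoff {
    private static final long BASE_DELAY_MILLIS = 1000;
    private static final long MAX_DELAY_MILLIS = 60000;
    private final Random random = new Random();

    // Delay before the attempt-th health check of an end point (attempt starts at 0).
    long nextDelayMillis(int attempt) {
        long ceiling = Math.min(MAX_DELAY_MILLIS, BASE_DELAY_MILLIS << Math.min(attempt, 16));
        // Full jitter: a uniform delay in [0, ceiling) keeps many clients from probing a
        // freshly recovered server at the same instant.
        return (long) (random.nextDouble() * ceiling);
    }
}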

Stability Testing - Havoc!!!!

  • Agent that simulates a rolling restart of ZK
  • Agent that always throws an exception
  • Service that registers and unregisters - system noise
  • Agent that never responds
  • Agent that creates an iptables network partition
  • Create DNS issues

Support partitioned services

It's highly likely that services will be partitioned (e.g. one node will only be able to service queries for a specific range of data). Ostrich needs to support people in authoring and consuming these types of services.

At the same time, a lot of services will only have a single partition. For these services, the complexity of using Ostrich shouldn't increase in a noticeable way.

At a high level this change will require the following (a rough consumer-facing sketch follows the list):

  • Updating the ServicePool to receive a partition key. This should probably be opaque from the perspective of Ostrich.
  • Updating the LoadBalanceAlgorithm to support receiving a partition key, so that the load balancer can choose a suitable service end point to use.
  • Maybe support letting the ServiceFactory know the partition key? I'm not sure this is useful.
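
A rough sketch of what the consumer-facing side could look like. Everything here (PartitionKey, the two-argument execute overload, the simplified ServiceCallback) is hypothetical and only meant to frame the discussion:

// Hypothetical API sketch; none of these types exist in Ostrich today.
interface PartitionKey {
    // Opaque to Ostrich; only the service's own LoadBalanceAlgorithm (and possibly its
    // ServiceFactory) knows how to interpret it.
    byte[] toBytes();
}

interface ServiceCallback<S, R> {
    R call(S service);
}

interface PartitionedServicePool<S> {
    // Unpartitioned execution, so single-partition services look exactly like today.
    <R> R execute(ServiceCallback<S, R> callback);

    // Partition-aware execution: the key is passed through to the load balancer so it can
    // pick an end point that owns the requested range of data.
    <R> R execute(PartitionKey key, ServiceCallback<S, R> callback);
}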

Make ServicePool creation simpler...

Currently when a user creates a ServicePool the code looks something like this:

ZooKeeperConnection connection = new ZooKeeperConfiguration()
            .setConnectString(connectString)
            .setRetryNTimes(new com.bazaarvoice.soa.zookeeper.RetryNTimes(3, 100))
            .connect();

ThreadFactory daemonThreadFactory = new ThreadFactoryBuilder()
            .setDaemon(true)
            .build();

ServicePool<CalculatorService> pool = new ServicePoolBuilder<CalculatorService>()
            .withHostDiscovery(new ZooKeeperHostDiscovery(connection, "calculator"))
            .withServiceFactory(new CalculatorServiceFactory())
            .withHealthCheckExecutor(Executors.newScheduledThreadPool(1, daemonThreadFactory))
            .build();

There are a few things that I consider wrong with this picture:

  1. When creating the ZooKeeperHostDiscovery instance, the user needs to know where in ZooKeeper the registration nodes are being stored (e.g. the "calculator" parameter). The CalculatorServiceFactory object actually has that knowledge inside of it, so we shouldn't bleed that information to the user of the service.
  2. The user is required to create a health check executor service without necessarily understanding why. Providing that should be completely optional for them.

I would like to see the above code be rewritten to something like:

ZooKeeperConnection connection = new ZooKeeperConfiguration()
            .setConnectString(connectString)
            .setRetryNTimes(new com.bazaarvoice.soa.zookeeper.RetryNTimes(3, 100))
            .connect();

ServicePool<CalculatorService> pool = new ServicePoolBuilder<CalculatorService>()
            .withZooKeeperHostDiscovery(connection)
            .withServiceFactory(new CalculatorServiceFactory())
            .build();

OnlyBadHostsExceptions should include underlying cause exceptions

Exceptions thrown during a service pool execute method can result in that endpoint being marked unhealthy. When no more endpoints are available an OnlyBadHostsException (OBHE) is thrown. It would be useful for debugging to include the underlying exceptions in the OBHE so that the root cause of the failing services can be determined.
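
One low-risk way to do this is to carry the per-end-point failures inside the exception and also attach them as suppressed exceptions so they show up in logged stack traces. A sketch, not the real OnlyBadHostsException class:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: the real OnlyBadHostsException lives in Ostrich and its constructors
// differ; the point is carrying the per-end-point failures that caused it.
class OnlyBadHostsException extends RuntimeException {
    private final List<Throwable> endPointFailures;

    OnlyBadHostsException(List<Throwable> endPointFailures) {
        super("all discovered end points are currently marked bad");
        this.endPointFailures = Collections.unmodifiableList(new ArrayList<Throwable>(endPointFailures));
        // Also attach the failures as suppressed exceptions (Java 7+) so they appear in
        // logged stack traces without any extra work by callers.
        for (Throwable failure : endPointFailures) {
            addSuppressed(failure);
        }
    }

    List<Throwable> getEndPointFailures() {
        return endPointFailures;
    }
}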

Integrate EmoDB Dropwizard helper classes.

EmoDB has a handful of classes that make it easier to use Ostrich and Dropwizard together. They are useful to other projects that, right now, pull these classes from EmoDB.

See the code here:

  • ConfiguredFixedHostDiscoverySource and ConfiguredPayload make it easier to configure fixed end points in YAML config files. The interface is a bit awkward, though, because you must create a service-specific subclass (example).
  • Payload and PayloadBuilder remove some of the tedious work required to create and parse ServiceEndPoint payloads.
  • ManagedRegistration ties host discovery registration and unregistration to Dropwizard lifecycle events (a rough sketch of the idea follows this list).
  • ResourceRegistry uses the ServiceName annotation and Jersey Path annotation to build ServiceEndPoint objects and register a resource with both Jersey and host discovery.

I don't expect you to take these classes as-is. Pick and choose and refactor as you see fit.
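
For reference, the ManagedRegistration idea boils down to something like the sketch below. The Managed, ServiceRegistry, and ServiceEndPoint types are simplified stand-ins, not the real Dropwizard, Ostrich, or EmoDB signatures:

// Simplified stand-ins for the real Dropwizard and Ostrich types.
interface Managed {
    void start() throws Exception;
    void stop() throws Exception;
}

interface ServiceEndPoint {
}

interface ServiceRegistry {
    void register(ServiceEndPoint endPoint);
    void unregister(ServiceEndPoint endPoint);
}

class ManagedRegistration implements Managed {
    private final ServiceRegistry registry;
    private final ServiceEndPoint endPoint;

    ManagedRegistration(ServiceRegistry registry, ServiceEndPoint endPoint) {
        this.registry = registry;
        this.endPoint = endPoint;
    }

    @Override
    public void start() {
        // Advertise the end point only once the service is actually up and serving.
        registry.register(endPoint);
    }

    @Override
    public void stop() {
        // Withdraw from host discovery before the service shuts down.
        registry.unregister(endPoint);
    }
}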

Support more advanced load balancing

We should probably support a load balancing strategy other than random. The most logical one would be something like least loaded. This could be based on the number of local or global connections to a service, or on something like the load reported by the remote server.
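
A minimal sketch of the "least locally loaded" variant, which needs no cooperation from the server: pick the end point with the fewest in-flight callbacks from this client. The ServiceEndPoint interface and the hook methods are simplified, hypothetical stand-ins for whatever the LoadBalanceAlgorithm contract ends up being:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the Ostrich ServiceEndPoint.
interface ServiceEndPoint {
    String getId();
}

class LeastLoadedBalancer {
    private final ConcurrentMap<String, AtomicInteger> inFlight =
            new ConcurrentHashMap<String, AtomicInteger>();

    ServiceEndPoint choose(List<ServiceEndPoint> endPoints) {
        ServiceEndPoint best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (ServiceEndPoint endPoint : endPoints) {
            int load = counter(endPoint).get();
            if (load < bestLoad) {
                bestLoad = load;
                best = endPoint;
            }
        }
        return best;
    }

    // The ServicePool would call these around each callback execution.
    void onExecuteStart(ServiceEndPoint endPoint) { counter(endPoint).incrementAndGet(); }
    void onExecuteEnd(ServiceEndPoint endPoint) { counter(endPoint).decrementAndGet(); }

    private AtomicInteger counter(ServiceEndPoint endPoint) {
        AtomicInteger counter = inFlight.get(endPoint.getId());
        if (counter == null) {
            AtomicInteger created = new AtomicInteger();
            AtomicInteger existing = inFlight.putIfAbsent(endPoint.getId(), created);
            counter = (existing != null) ? existing : created;
        }
        return counter;
    }
}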

Investigate potential connection loss

The data team had an issue earlier today where an instance seemingly lost its connection to ZooKeeper. There wasn't a good way to diagnose this at runtime, so it may be useful to add some metrics to Ostrich that expose the connection state and what is happening with the connection (see the sketch after the list below).

Action:

  • Show how long the server has been connected
  • Show the number of host connect attempts
  • Show the current connection state
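
A rough sketch of how those values could be tracked with a Curator ConnectionStateListener, registered via CuratorFramework.getConnectionStateListenable().addListener(...). Note that the counter below counts successful (re)connects rather than raw attempts, and that older Curator releases use the com.netflix.curator package instead of org.apache.curator:

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Plain fields so they can be wired into whatever metrics library the project settles on.
class ConnectionMetricsListener implements ConnectionStateListener {
    private final AtomicReference<ConnectionState> currentState =
            new AtomicReference<ConnectionState>();
    private final AtomicLong connectedSinceMillis = new AtomicLong(-1);
    private final AtomicLong reconnects = new AtomicLong();

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        currentState.set(newState);
        if (newState == ConnectionState.CONNECTED || newState == ConnectionState.RECONNECTED) {
            // Counts successful (re)connects; a true "attempt" counter would need hooks
            // deeper inside the ZooKeeper client.
            reconnects.incrementAndGet();
            connectedSinceMillis.set(System.currentTimeMillis());
        } else {
            connectedSinceMillis.set(-1);
        }
    }

    // How long the current session has been connected, or -1 if it is not connected.
    long connectedDurationMillis() {
        long since = connectedSinceMillis.get();
        return since < 0 ? -1 : System.currentTimeMillis() - since;
    }

    long reconnectCount() { return reconnects.get(); }

    ConnectionState currentState() { return currentState.get(); }
}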

Support for non-JVM languages as service providers

We have to make sure that this is needed.

If we wanted to support non-JVM language service providers it could probably be done really easily without having to write separate code for each language. We could write a simple dropwizard service that receives a POST with service endpoint info inside of it. The service would take the endpoint info and create an ephemeral node in ZooKeeper on behalf of the caller. After creating the node it would NOT close the HTTP connection. Instead it would monitor the connection, and if the caller ever closes it, the service would delete the ephemeral node. So in this model having an open connection to a webserver is a proxy for a service being alive. If that connection closes then the server is assumed to not be alive anymore. When HTTP timeouts happen the client will have to reestablish the connection if it still wants to be available.

Given that pretty much all modern languages can make an HTTP POST request, this should enable a service written in any language to be made available through Ostrich. Of course, the service provider would still have to write a client library for every language their users work in.
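
A rough sketch of the registration handler described above, using the JDK's built-in HTTP server for brevity rather than a full Dropwizard resource; the /ostrich/... ZooKeeper path layout and the five-second heartbeat interval are invented for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

class RegistrationHandler implements HttpHandler {
    private final CuratorFramework curator;

    RegistrationHandler(CuratorFramework curator) {
        this.curator = curator;
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        byte[] payload = readAll(exchange.getRequestBody());
        String path = "/ostrich" + exchange.getRequestURI().getPath();
        try {
            // Ephemeral: the node also disappears automatically if this proxy's own ZK session dies.
            curator.create().creatingParentsIfNeeded()
                    .withMode(CreateMode.EPHEMERAL).forPath(path, payload);

            exchange.sendResponseHeaders(200, 0);  // chunked response, held open
            OutputStream body = exchange.getResponseBody();
            while (true) {
                body.write('\n');   // heartbeat; throws IOException once the caller disconnects
                body.flush();
                Thread.sleep(5000);
            }
        } catch (Exception clientGoneOrZkError) {
            // Caller hung up (or something else failed): withdraw the registration.
            try {
                curator.delete().forPath(path);
            } catch (Exception ignored) {
                // the node may already be gone
            }
        } finally {
            exchange.close();
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        for (int n; (n = in.read(buffer)) != -1; ) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}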

Remove Service marker interface

The com.bazaarvoice.soa.Service interface serves no real purpose and requires modification of the service itself (assuming the consumer is following the Dropwizard project-api, project-client, project-service project structure).

ServiceCallback should have a "without result" sibling

com.bazaarvoice.soa.ServiceCallback requires some return type from a service method invocation.

In the case of a "void" return type on a service method (like Databus.subscribe), it would be nice to not have to "return null".

databusServicePool.execute(new RetryNTimes(3, 100, TimeUnit.MILLISECONDS), new ServiceCallback<Databus, Object>() {
    @Override
    public Object call(Databus service) throws ServiceException {
        service.subscribe(DATABUS_SUBSCRIPTION_NAME, 86400, 86400);
        return null;
    }
});

It would be nice to have the following instead:

databusServicePool.execute(new RetryNTimes(3, 100, TimeUnit.MILLISECONDS), new ServiceCallbackWithoutResult<Databus>() {
    @Override
    public void call(Databus service) throws ServiceException {
        service.subscribe(DATABUS_SUBSCRIPTION_NAME, 86400, 86400);
    }
});
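
One possible shape for this is a separate interface plus a small adapter that the ServicePool (or a new execute overload) could wrap it in. The imports assume both existing types live in com.bazaarvoice.soa, as the issues above suggest; everything else is illustrative:

import com.bazaarvoice.soa.ServiceCallback;
import com.bazaarvoice.soa.ServiceException;

// Illustrative sketch; the names and placement are suggestions only.
interface ServiceCallbackWithoutResult<S> {
    void call(S service) throws ServiceException;
}

final class ServiceCallbacks {
    private ServiceCallbacks() {}

    // Adapts a void callback to the existing ServiceCallback<S, R> contract.
    static <S> ServiceCallback<S, Void> noResult(final ServiceCallbackWithoutResult<S> callback) {
        return new ServiceCallback<S, Void>() {
            @Override
            public Void call(S service) throws ServiceException {
                callback.call(service);
                return null;
            }
        };
    }
}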

Stability testing

We should have a way to perform long-running stability testing (hours if not days) for Ostrich. We need to make sure that we correctly handle all sorts of error conditions, such as:

  • ZooKeeper nodes restarting (in a way that doesn't lose quorum)
  • Services being registered and unregistered
  • Services throwing exceptions and errors
  • Services being unhealthy for long periods of time

Refactor ServiceEndPoint

There are a few things wrong with ServiceEndpoint (a rough sketch of the proposed shape follows the list):

  1. It shouldn't require specific machine and port information in the name. It should be refactored to have an opaque name for the service that it's associated with, and an opaque id representing the node that it was created for (likely something like hostname:port). This will give operations the ability to look inside of ZooKeeper and make sense of what services are registered (mentally decoding the opaque id into hostname and port). At the same time this also ensures that people who are writing services and clients will include all necessary information for connecting to the service in the payload on the registration. This insulates service authors from changes to the way we represent and name the nodes.
  2. It is currently representing too many concerns. An endpoint is an identifier for a specific instance of a service. Mapping it to JSON and determining when it was registered shouldn't be part of its concerns. These are the concerns of the service registry.
  3. It should be named ServiceEndPoint with a capital P. This is consistent naming with other projects out there.
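
A rough sketch of the shape described in points 1 and 2; the method names are illustrative, and JSON (de)serialization plus registration timestamps would move into the service registry:

// Illustrative sketch of the refactored end point: opaque name, opaque id, opaque payload.
interface ServiceEndPoint {
    // Opaque name of the service this end point belongs to, e.g. "calculator".
    String getServiceName();

    // Opaque id for this instance; by convention something like "hostname:port".
    String getId();

    // Provider-defined payload carrying everything a client needs to connect.
    String getPayload();
}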

ServicePool should expose an isHealthy() method.

Client applications that use a ServicePool should integrate into their own health check a verification that their dependencies are also healthy.

The implementation may be as simple as checking that there's at least one end point that's not marked bad. It would be better to actually ping through to at least one end point as part of computing isHealthy().

We should find a way to expose this for proxies that wrap a ServicePool, too. For example, assuming Dropwizard:

MyService service = ServicePool.create(MyService.class)...buildProxy(retryPolicy);
environment.addHealthCheck(new HealthCheck("my-service") {
    @Override
    protected Result check() {
        // TODO: it's nice to include a string w/the name of the live endpoint + timing info
        // like "localhost 493us"
        return ServiceProxies.isHealthy(service) ? Result.healthy() : Result.unhealthy();
    }
});
environment.addManaged(new Managed() {
    @Override
    public void start() {}
    @Override
    public void stop() {
        ServiceProxies.closeQuietly(service);
    }
});
