ncherkas / hazelcast-jet-kinesis
Hazelcast Jet Connector for Amazon Kinesis Streams
License: Apache License 2.0
It's already recommended for production use: https://docs.aws.amazon.com/sdk-for-java/v2/developer-guide/examples-kinesis-stream.html
Don't flush data inside complete(); remove this from the Sink. We're OK with the default behavior.
Even though this is done by Jet internally, we may still need to sleep longer when there is no data, in order to avoid Kinesis rate limiting. Also, keep returning false until the Kinesis stream is closed.
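A minimal sketch of what such a backoff could look like. The class and the delay constants are made up for illustration; the motivating limit (at most 5 GetRecords calls per second per shard) is from the Kinesis docs.

```java
// Sketch of an empty-poll backoff policy (hypothetical helper, not part of the
// current connector). Kinesis allows at most 5 GetRecords calls/sec per shard,
// so after an empty read we wait progressively longer, up to a cap.
public class EmptyPollBackoff {
    private static final long BASE_DELAY_MS = 250;
    private static final long MAX_DELAY_MS = 5_000;

    private int emptyPolls;

    /** Call after each GetRecords response. Returns how long to sleep before the next poll. */
    public long onPoll(boolean gotRecords) {
        if (gotRecords) {
            emptyPolls = 0;
            return 0; // data is flowing, poll again right away
        }
        emptyPolls++;
        // exponential backoff: 250ms, 500ms, 1s, 2s, ... capped at 5s
        long delay = BASE_DELAY_MS << Math.min(emptyPolls - 1, 10);
        return Math.min(delay, MAX_DELAY_MS);
    }
}
```

Resetting the counter on the first non-empty response keeps latency low when data resumes, while the cap bounds how long a quiet shard stays unobserved.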
Let's improve our code!
Kinesis provides a record sequence number for making checkpoints while you're processing the stream. The checkpoints themselves are record sequence numbers which you periodically save into separate storage, so that you can go back to them in case of any kind of stream-processing failure. Jet supports this scenario out of the box by providing a corresponding Processor API to save/restore any processing state, such as checkpoints.
We need to implement this for the Amazon Kinesis Jet Source. It's suggested to learn how it's done in the Kafka connector.
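As a starting point, the checkpoint state could be as simple as a map of shard id to the last-processed sequence number. A hedged sketch follows; the class and method names are illustrative, and the real code would live in the Processor's snapshot save/restore callbacks.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the checkpoint state the Kinesis source could save into a Jet
// snapshot (names are illustrative, not from the actual codebase).
public class ShardCheckpoints {
    // shardId -> last sequence number successfully emitted downstream
    private final Map<String, String> lastSeqNo = new HashMap<>();

    /** Call after a record has been emitted downstream. */
    public void recordProcessed(String shardId, String sequenceNumber) {
        lastSeqNo.put(shardId, sequenceNumber);
    }

    /** What goes into the snapshot for this processor. */
    public Map<String, String> snapshot() {
        return new HashMap<>(lastSeqNo);
    }

    /** After a restart, merge the state read back from the snapshot. */
    public void restore(Map<String, String> saved) {
        lastSeqNo.putAll(saved);
    }

    /** Sequence number to resume after, or null to start from the trim horizon. */
    public String resumePosition(String shardId) {
        return lastSeqNo.get(shardId);
    }
}
```

On restore, a shard with a saved position would be resumed with an AFTER_SEQUENCE_NUMBER shard iterator, and a shard without one would start from TRIM_HORIZON.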
It would be good to shade (relocate) the AWS SDK and maybe other dependencies, so that it's safer and easier to use our connector in big projects with possible version conflicts.
When implementing the Jet connector we should keep in mind that all retriable faults and errors should be handled inside the Source or Sink. In our case, there are various situations where the Kinesis API may return an error which should be properly handled, e.g. a generic InternalFailure or something more specific like ThrottlingException. We should analyze the possible cases and handle them properly. We can consider using https://github.com/jhalterman/failsafe
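A rough sketch of how retriable error codes could be classified. The code list and the attempt cap are assumptions to be refined during the analysis; ProvisionedThroughputExceededException is added because GetRecords is documented to throw it.

```java
import java.util.Set;

// Sketch of classifying Kinesis errors as retriable (hypothetical helper).
// The actual list and retry budget need to come out of the error analysis.
public class KinesisRetryPolicy {
    private static final Set<String> RETRIABLE = Set.of(
            "InternalFailure",
            "ThrottlingException",
            "ProvisionedThroughputExceededException");

    private static final int MAX_ATTEMPTS = 5;

    /** attempt is zero-based: attempt 0 is the first retry decision. */
    public static boolean shouldRetry(String errorCode, int attempt) {
        return attempt < MAX_ATTEMPTS && RETRIABLE.contains(errorCode);
    }
}
```

A library like failsafe would replace this hand-rolled check with a declarative retry policy, combining the same classification with backoff and jitter.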
Review log messages and log levels.
Review how the Flink Kinesis connector is implemented overall: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html
And put labels on the corresponding issues.
We use the Hazelcast ISet to sync which shards are already processed and closed. This is very important since two parent shards can be taken by parallel processors, and we can proceed with the resulting shard only after both of them are handled. This is why we coordinate using ISet. The current ISet naming doesn't seem correct because it's bound only to the stream name, even though it's possible that multiple jobs process a single stream in parallel and each needs its own coordination context. It's suggested to consider using the vertex name of the Source/Sink (which should be unique), maybe adding the Job id to it (in newer versions it can be different after a restart). Let's double-check that.
Also, we need to check how this works if we store the ISet data into a snapshot for fault tolerance. Maybe this will also solve the naming problem. Suggested to work on this after we implement fault tolerance.
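For illustration, a possible naming scheme combining the suggested parts. The format and prefix are a guess, not a decision, and the caveat about the job id changing after a restart in newer versions still applies.

```java
// Sketch of a coordination-set name scoped to one job rather than to the
// stream alone, so two jobs reading the same stream don't share
// closed-shard state. Prefix and separator are illustrative choices.
public class CoordinationSetName {
    public static String of(String streamName, String vertexName, long jobId) {
        return "kinesis-closed-shards." + streamName + '.' + vertexName + '.'
                + Long.toHexString(jobId);
    }
}
```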
It should automatically trigger on PR creation or merge. We should look around for a solution integrated with GitHub.
Switch to the Kinesis Data Generator (https://aws.amazon.com/blogs/big-data/test-your-streaming-data-solution-with-the-new-amazon-kinesis-data-generator/) to generate the test data.
When processing a Kinesis stream, we may fall behind the retention period and in the end lose some data. The Kinesis API provides a response attribute, MillisBehindLatest, to track this, so that we can alert via an error log message about possibly falling behind. There is also a related CloudWatch metric, GetRecords.IteratorAgeMilliseconds, but I'm not sure we can integrate it somehow.
Let's print an error or warning message when MillisBehindLatest exceeds the stream retention period.
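The check itself is trivial; a sketch follows. The threshold is taken literally as the retention period, though in practice warning at some fraction of it would leave time to react before records actually expire.

```java
// Sketch of the falling-behind check driven by MillisBehindLatest from the
// GetRecords response (helper name and threshold choice are assumptions).
public class RetentionLagCheck {
    public static boolean isFallingBehind(long millisBehindLatest, long retentionHours) {
        long retentionMillis = retentionHours * 60 * 60 * 1000;
        return millisBehindLatest > retentionMillis;
    }
}
```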
We may consider using https://github.com/localstack/localstack or something like this.
There will be a new API for implementing custom Sources & Sinks. We need to explore it and evaluate the migration path.
We need to investigate and research the following edge cases:
So far, the downscale (when we merge shards) is not properly handled, or at least we're not sure whether it's handled somehow under the hood of Jet. The main problem here is with WatermarkSourceUtil, which is used to manage watermarks. Its API doesn't allow decreasing the partition count, so we're not able to explicitly propagate this down to the Jet internals. It was suggested that Jet itself internally decreases the partition count when there is no data for a partition for more than 60 sec. Here, by "no data" we mean there were no events reported via the WatermarkSourceUtil API for the given duration. We need to do the following:
- verify the behavior of WatermarkSourceUtil when there is no data for the partition

The current draft implementation is based on Jet ver. 0.7; we need to migrate to a recent one.
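To reason about the 60-sec rule, here is a sketch of an idle-partition detector as described above. It is purely illustrative; Jet's actual internal logic may differ, and the helper names are made up.

```java
import java.util.Arrays;

// Sketch of the "no data for more than 60 sec" rule: track the last time each
// partition reported an event so idle partitions can be flagged and excluded
// from watermark coalescing (illustrative only, not Jet's real mechanism).
public class IdlePartitionDetector {
    private static final long IDLE_TIMEOUT_MS = 60_000;
    private final long[] lastEventAt;

    public IdlePartitionDetector(int partitionCount, long nowMs) {
        lastEventAt = new long[partitionCount];
        Arrays.fill(lastEventAt, nowMs);
    }

    /** Call whenever an event is reported for the partition. */
    public void onEvent(int partition, long nowMs) {
        lastEventAt[partition] = nowMs;
    }

    /** True once the partition has been silent for strictly more than the timeout. */
    public boolean isIdle(int partition, long nowMs) {
        return nowMs - lastEventAt[partition] > IDLE_TIMEOUT_MS;
    }
}
```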
There was some issue with the order of the messages read from multiple shards in the same processor. It was an internal Jet problem; let's clarify that.