ncherkas / hazelcast-jet-kinesis
Hazelcast Jet Connector for Amazon Kinesis Streams
License: Apache License 2.0
It's already recommended for production use: https://docs.aws.amazon.com/sdk-for-java/v2/developer-guide/examples-kinesis-stream.html
Don't flush data inside complete(); remove this from the Sink. We're OK with the default behavior.
Even though this is done by Jet internally, we may still need to sleep longer when there is no data, in order to avoid Kinesis rate limiting. Also, keep returning false until the Kinesis stream is closed.
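A minimal sketch of what such a backoff could look like. The class and the delay constants are made up for illustration; the motivating limit (at most 5 GetRecords calls per second per shard) is from the Kinesis docs.

```java
// Sketch of an empty-poll backoff policy (hypothetical helper, not part of the
// current connector). Kinesis allows at most 5 GetRecords calls/sec per shard,
// so after an empty read we wait progressively longer, up to a cap.
public class EmptyPollBackoff {
    private static final long BASE_DELAY_MS = 250;
    private static final long MAX_DELAY_MS = 5_000;

    private int emptyPolls;

    /** Call after each GetRecords response. Returns how long to sleep before the next poll. */
    public long onPoll(boolean gotRecords) {
        if (gotRecords) {
            emptyPolls = 0;
            return 0; // data is flowing, poll again right away
        }
        emptyPolls++;
        // exponential backoff: 250ms, 500ms, 1s, 2s, ... capped at 5s
        long delay = BASE_DELAY_MS << Math.min(emptyPolls - 1, 10);
        return Math.min(delay, MAX_DELAY_MS);
    }
}
```

Resetting the counter on the first non-empty response keeps latency low when data resumes, while the cap bounds how long a quiet shard stays unobserved.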
Let's improve our code!
Kinesis provides a record sequence number for making checkpoints while you're processing the stream. The checkpoints themselves are record sequence numbers which you periodically save into separate storage, so that you can go back to them in case of any kind of stream-processing failure. Jet supports this scenario out of the box by providing a corresponding Processor API to save/restore any processing state, such as checkpoints.
We need to implement this for the Amazon Kinesis Jet Source. It's suggested to learn how it's done in the Kafka connector.
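As a starting point, the checkpoint state could be as simple as a map of shard id to the last-processed sequence number. A hedged sketch follows; the class and method names are illustrative, and the real code would live in the Processor's snapshot save/restore callbacks.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the checkpoint state the Kinesis source could save into a Jet
// snapshot (names are illustrative, not from the actual codebase).
public class ShardCheckpoints {
    // shardId -> last sequence number successfully emitted downstream
    private final Map<String, String> lastSeqNo = new HashMap<>();

    /** Call after a record has been emitted downstream. */
    public void recordProcessed(String shardId, String sequenceNumber) {
        lastSeqNo.put(shardId, sequenceNumber);
    }

    /** What goes into the snapshot for this processor. */
    public Map<String, String> snapshot() {
        return new HashMap<>(lastSeqNo);
    }

    /** After a restart, merge the state read back from the snapshot. */
    public void restore(Map<String, String> saved) {
        lastSeqNo.putAll(saved);
    }

    /** Sequence number to resume after, or null to start from the trim horizon. */
    public String resumePosition(String shardId) {
        return lastSeqNo.get(shardId);
    }
}
```

On restore, a shard with a saved position would be resumed with an AFTER_SEQUENCE_NUMBER shard iterator, and a shard without one would start from TRIM_HORIZON.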
It would be good to shade (relocate) the AWS SDK and maybe other dependencies, so that it's safer and easier to use our connector in big projects with possible version conflicts.
When implementing the Jet connector we should keep in mind that all retriable faults and errors should be handled inside the Source or Sink. In our case, there are various situations where the Kinesis API may return an error which should be properly handled, e.g. a generic InternalFailure or something more specific like ThrottlingException. We should analyze the possible cases and handle them properly. We can consider using https://github.com/jhalterman/failsafe
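A rough sketch of how retriable error codes could be classified. The code list and the attempt cap are assumptions to be refined during the analysis; ProvisionedThroughputExceededException is added because GetRecords is documented to throw it.

```java
import java.util.Set;

// Sketch of classifying Kinesis errors as retriable (hypothetical helper).
// The actual list and retry budget need to come out of the error analysis.
public class KinesisRetryPolicy {
    private static final Set<String> RETRIABLE = Set.of(
            "InternalFailure",
            "ThrottlingException",
            "ProvisionedThroughputExceededException");

    private static final int MAX_ATTEMPTS = 5;

    /** attempt is zero-based: attempt 0 is the first retry decision. */
    public static boolean shouldRetry(String errorCode, int attempt) {
        return attempt < MAX_ATTEMPTS && RETRIABLE.contains(errorCode);
    }
}
```

A library like failsafe would replace this hand-rolled check with a declarative retry policy, combining the same classification with backoff and jitter.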
Review log messages and log levels.
Review how the Flink Kinesis connector is implemented overall: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html
And put labels on the corresponding issues.
We use the Hazelcast ISet to sync which shards are already processed and closed. This is very important since two parent shards can be taken by parallel processors, and we can proceed with the resulting shard only after both of them are handled. This is why we coordinate using ISet. The current ISet naming doesn't seem correct because it's bound only to the stream name, even though it's possible that multiple jobs process a single stream in parallel and each needs its own coordination context. It's suggested to consider using the vertex name of the Source/Sink (which should be unique), maybe adding the Job id to it (in newer versions it can be different after a restart). Let's double-check that.
Also, we need to check how this works if we store the ISet data into a snapshot for fault tolerance. Maybe this will also solve the naming problem. Suggested to work on this after we implement fault tolerance.
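For illustration, a possible naming scheme combining the suggested parts. The format and prefix are a guess, not a decision, and the caveat about the job id changing after a restart in newer versions still applies.

```java
// Sketch of a coordination-set name scoped to one job rather than to the
// stream alone, so two jobs reading the same stream don't share
// closed-shard state. Prefix and separator are illustrative choices.
public class CoordinationSetName {
    public static String of(String streamName, String vertexName, long jobId) {
        return "kinesis-closed-shards." + streamName + '.' + vertexName + '.'
                + Long.toHexString(jobId);
    }
}
```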
It should automatically trigger on PR creation or merge. We should look around for a solution integrated with GitHub.
Switch to the Kinesis Data Generator (https://aws.amazon.com/blogs/big-data/test-your-streaming-data-solution-with-the-new-amazon-kinesis-data-generator/) to generate the test data.
When processing a Kinesis stream, we may fall behind the retention period and in the end lose some data. The Kinesis API provides a response attribute, MillisBehindLatest, to track this, so that we can alert via an error log message about possibly falling behind. There is also a related CloudWatch metric, GetRecords.IteratorAgeMilliseconds, but I'm not sure we can integrate it somehow.
Let's print an error or warning message when MillisBehindLatest exceeds the stream retention period.
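The check itself is trivial; a sketch follows. The threshold is taken literally as the retention period, though in practice warning at some fraction of it would leave time to react before records actually expire.

```java
// Sketch of the falling-behind check driven by MillisBehindLatest from the
// GetRecords response (helper name and threshold choice are assumptions).
public class RetentionLagCheck {
    public static boolean isFallingBehind(long millisBehindLatest, long retentionHours) {
        long retentionMillis = retentionHours * 60 * 60 * 1000;
        return millisBehindLatest > retentionMillis;
    }
}
```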
We may consider using https://github.com/localstack/localstack or something like this.
There will be a new API for implementing custom Sources & Sinks. We need to explore it and evaluate the migration path.
We need to investigate and research the following edge cases:
So far, the downscale (when we merge shards) is not properly handled, or at least we're not sure whether it's handled somehow under the hood of Jet. The main problem here is with WatermarkSourceUtil, which is used to manage watermarks. Its API doesn't allow decreasing the partition count, so we're not able to explicitly propagate this down to the Jet internals. It was suggested that Jet itself internally decreases the partition count when there is no data for a partition for more than 60 sec. Here, by "no data" we mean there were no events reported via the WatermarkSourceUtil API for the given duration. We need to do the following:
- verify the behavior of WatermarkSourceUtil when there is no data for the partition

The current draft implementation is based on Jet ver. 0.7; we need to migrate to a recent one.
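To reason about the 60-sec rule, here is a sketch of an idle-partition detector as described above. It is purely illustrative; Jet's actual internal logic may differ, and the helper names are made up.

```java
import java.util.Arrays;

// Sketch of the "no data for more than 60 sec" rule: track the last time each
// partition reported an event so idle partitions can be flagged and excluded
// from watermark coalescing (illustrative only, not Jet's real mechanism).
public class IdlePartitionDetector {
    private static final long IDLE_TIMEOUT_MS = 60_000;
    private final long[] lastEventAt;

    public IdlePartitionDetector(int partitionCount, long nowMs) {
        lastEventAt = new long[partitionCount];
        Arrays.fill(lastEventAt, nowMs);
    }

    /** Call whenever an event is reported for the partition. */
    public void onEvent(int partition, long nowMs) {
        lastEventAt[partition] = nowMs;
    }

    /** True once the partition has been silent for strictly more than the timeout. */
    public boolean isIdle(int partition, long nowMs) {
        return nowMs - lastEventAt[partition] > IDLE_TIMEOUT_MS;
    }
}
```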
There was some issue with the order of the messages read from multiple shards in the same processor. It was an internal Jet problem; let's clarify that.