Comments (5)

pnadolny13 commented on July 19, 2024

@aaronsteers thanks for your thoughts, that was super helpful!

For this reason, I lean towards your suggestion using an arbitrary constraint to exclude records from being emitted if they are within the defined cool-off period of 1-5 minutes.

Duplicate records are tolerated by the spec, of course, but they are expensive to manage in practical use cases. For this reason, I'd vote to not just keep a conservative bookmark (in which case the records would still come through), but actually to filter the records and keep the conservative bookmark limit.

I agree, I think this is a pretty reasonable solution and it should be relatively easy to implement. The way the CloudWatch API works is that I send a start and end timestamp for the batch of logs I want to retrieve, so I plan to set the end timestamp to current_time minus 5 minutes (the default, but configurable). This should give what you suggest: no records with timestamps after that cutoff will be emitted, and the bookmark will always be at least 5 minutes behind real time.
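As a rough sketch of that plan (not the tap's actual code): the end of the query range is fixed once at "now minus the cool-off", and every batch window is clamped to it so the windowing can't overshoot. The names `query_end_time`, `batch_windows`, and `cool_off_minutes` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def query_end_time(now: datetime, cool_off_minutes: int = 5) -> datetime:
    """Return the latest timestamp we are willing to query (hypothetical helper)."""
    return now - timedelta(minutes=cool_off_minutes)


def batch_windows(
    start: datetime, end: datetime, window: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (window_start, window_end) pairs that never extend past `end`."""
    cursor = start
    while cursor < end:
        # Clamp the final window so it stops at the cool-off cutoff.
        window_end = min(cursor + window, end)
        yield cursor, window_end
        cursor = window_end


now = datetime(2023, 2, 28, 6, 8, tzinfo=timezone.utc)
end = query_end_time(now)
start = datetime(2023, 2, 28, 4, 0, tzinfo=timezone.utc)
windows = list(batch_windows(start, end, timedelta(hours=1)))
```

With these inputs the last window is cut short at the cutoff instead of running a full hour past it.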

from tap-cloudwatch.

pnadolny13 commented on July 19, 2024

@aaronsteers do you have any thoughts on this? Does the issue make sense to you? I wasn't able to find anything in the CloudWatch docs that was helpful with this.


aaronsteers commented on July 19, 2024

@pnadolny13 - Yeah, this makes sense. Thanks for raising.

Your writeup is very well done here and very clear.

In general scenarios, we don't want to prematurely limit records to be exclusive of the last time interval (like the last 5 minutes), because when records get updated frequently, they can just always move into the latest bucket and therefore never be synced (or more realistically, they would be conspicuously absent for >1 sync operation where they otherwise should have been included).

In the above scenario though, updates are not a thing we need to worry about, since each line is immutable and will not be modified or be "moved" into another time window.

For this reason, I lean towards your suggestion using an arbitrary constraint to exclude records from being emitted if they are within the defined cool-off period of 1-5 minutes.

Duplicate records are tolerated by the spec, of course, but they are expensive to manage in practical use cases. For this reason, I'd vote to not just keep a conservative bookmark (in which case the records would still come through), but actually to filter the records and keep the conservative bookmark limit.


aaronsteers commented on July 19, 2024

Another implementation choice, if I understand the comment quoted below, would be to change the Stream behavior to use is_sorted=False. Essentially, that would turn on a host of other features that solve the fundamental problem of "the records I'm seeing in this order aren't all the records up to this time."

This makes streams non-resumable on interrupt, unless you do extra work to manage state finalization, and it uses the internal signpost feature to basically say that the bookmark is not allowed to advance past the moment in time that the sync operation began.

I can't say that this fully solves the issue, but it addresses the "overshooting the end time" you mention here:

We set the end_time of the query to end_time = datetime.now(timezone.utc).timestamp() right now then overshoot it because we use a batch windowing mechanism to avoid the 10k limit. This ends up submitting a query like shown above where the end timestamp was probably somewhere around 2023-02-28T06:03: UTC but the window ends in 2023-02-28T07:00:37 UTC. This aims to get real time logs and overshoots the end time to do so.

If it fixes the issue, this is the "easiest" solution because it uses all out-of-box capabilities in the SDK. But you'll still get significant record duplication if you don't block records from emitting, because that generic behavior is built for scenarios where updates might occur, and as such, the records themselves are allowed to be emitted, even past the point where the bookmark is allowed by the signpost to advance.
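To make the duplication point concrete, here is a simplified plain-Python illustration of the behavior described above. It is not the SDK's code; `sync_unsorted` and its arguments are hypothetical names. Every record is emitted, but the saved bookmark is capped at the signpost (the time the sync began).

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


def sync_unsorted(
    record_times: Iterable[datetime], signpost: datetime
) -> Tuple[List[datetime], Optional[datetime]]:
    """Emit every record, but never let the bookmark pass the signpost."""
    emitted: List[datetime] = []
    max_seen: Optional[datetime] = None
    for ts in record_times:
        emitted.append(ts)  # records are emitted regardless of the signpost
        if max_seen is None or ts > max_seen:
            max_seen = ts
    # The bookmark is only finalized at the end of the sync, capped at the signpost.
    bookmark = None if max_seen is None else min(max_seen, signpost)
    return emitted, bookmark


signpost = datetime(2023, 2, 28, 6, 3, tzinfo=timezone.utc)
times = [
    datetime(2023, 2, 28, 6, 1, tzinfo=timezone.utc),
    datetime(2023, 2, 28, 7, 0, tzinfo=timezone.utc),  # past the signpost
    datetime(2023, 2, 28, 5, 59, tzinfo=timezone.utc),
]
emitted, bookmark = sync_unsorted(times, signpost)
```

Here the 07:00 record is emitted even though the bookmark stays at the signpost, so the next sync would pick it up again unless such records are filtered out.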


aaronsteers commented on July 19, 2024

@pnadolny13 - Sounds great. There are several ways to implement, but if helpful, one option is to drop records by returning None from post_process() if the datetime is past the established limit. (Probably more efficient in a loop, but just wanted to mention this as a possibility if it's simpler.)
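A minimal sketch of that option, written as a standalone class so it runs without the SDK. In a real tap this would be a method on the stream class. The `CoolOffFilter` class, the `cutoff` attribute, and the ISO 8601 `timestamp` field are assumptions for illustration only.

```python
from datetime import datetime, timezone
from typing import Optional


class CoolOffFilter:
    """Stand-in for a stream class; only the post_process() logic matters here."""

    def __init__(self, cutoff: datetime) -> None:
        # Assumed to be computed once per sync, e.g. now minus the cool-off period.
        self.cutoff = cutoff

    def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
        """Return the row unchanged, or None to drop it when it is past the cutoff."""
        record_time = datetime.fromisoformat(row["timestamp"])  # hypothetical field
        if record_time > self.cutoff:
            return None
        return row


cutoff = datetime(2023, 2, 28, 6, 3, tzinfo=timezone.utc)
stream = CoolOffFilter(cutoff)
kept = stream.post_process({"timestamp": "2023-02-28T06:00:00+00:00", "message": "ok"})
dropped = stream.post_process({"timestamp": "2023-02-28T06:04:00+00:00", "message": "too new"})
```

The first record is inside the allowed range and passes through; the second falls within the cool-off period and comes back as None.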

