
tap-cloudwatch's Issues

Use threading and unordered stream to increase speed

This PR makes it explicit that we sort the stream records and that it's safe to resume after a failure. This could be much faster if we don't care about order. We could potentially add a config option to set is_sorted to false, stop using an ordered queue for keeping records in order, and instead use threads and yield records as they are received.
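A minimal sketch of the unordered approach, assuming a hypothetical per-batch query function: results are yielded as soon as any batch completes instead of being held in an ordered queue.

```python
# Sketch only: run_batch stands in for the tap's per-batch CloudWatch query.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(batch):
    # Hypothetical placeholder that returns the records for one batch window.
    return [{"batch": batch, "record": i} for i in range(3)]


def stream_unordered(batches, max_workers=4):
    """Yield records as soon as any batch finishes; no ordering guarantee."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_batch, b) for b in batches]
        for future in as_completed(futures):
            yield from future.result()


records = list(stream_unordered(range(5)))
```

Since `as_completed` returns futures in completion order, the ordered-queue overhead disappears, at the cost of records arriving out of timestamp order.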

bug: daily diffs in cloudwatch stats vs output are slightly off

I'm noticing that on some days with large log counts we have slight diffs in the overall counts for the day. They're <1% for my data set, so it's an edge case, but it still needs to be fixed. In my first batch using this tap I was able to tie them out, but it looks like on larger batches we're potentially skipping some logs.

I suspect there's a bug in the sub-batching logic that's leading to logs getting skipped at the edges of those ranges. I think we already use >=, but something isn't working as expected.
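To illustrate the boundary concern, here's a small sketch (not the tap's actual logic): with half-open windows `[start, end)`, every timestamp belongs to exactly one sub-batch, whereas closed windows `[start, end]` fetch boundary records twice and open starts `(start, end]` skip them.

```python
# Sketch of half-open window assignment: each timestamp lands in exactly
# one window, so nothing on a batch edge can be skipped or double-counted.
def assign_window(ts, windows):
    """Return the index of the half-open [start, end) window containing ts."""
    for i, (start, end) in enumerate(windows):
        if start <= ts < end:
            return i
    return None


windows = [(0, 1000), (1000, 2000), (2000, 3000)]
assign_window(999, windows)   # last value inside window 0
assign_window(1000, windows)  # the boundary belongs only to window 1
```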

Batch limit exceeded queries better

Right now, if the 10k limit is exceeded, we split the request and make subsequent requests serially. This works, but it's slow, especially if the limit was exceeded by a lot. We should batch automatically into smaller chunks and start all of them concurrently so there's less wait time.
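A sketch of that idea, assuming a hypothetical `query_chunk` in place of the real StartQuery call: the exceeded range is split into equal sub-windows and all of them are submitted at once.

```python
# Sketch only: split an over-limit time range and query the chunks
# concurrently instead of serially.
from concurrent.futures import ThreadPoolExecutor


def split_range(start, end, parts):
    """Split [start, end) into `parts` equal sub-windows."""
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def query_chunk(window):
    # Hypothetical stand-in for start_query/get_query_results on one window.
    return {"window": window, "records": []}


def requery_exceeded(start, end, parts=4, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_chunk, split_range(start, end, parts)))


results = requery_exceeded(0, 100)
```

`pool.map` preserves chunk order in the results, so the re-queried sub-batches can still be merged back in sequence if ordering matters.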

Support relative end_date syntax

Initially discussed in #26 (comment) and meltano/sdk#922.

TL;DR:

What do you think of us defining end_date to be either an ISO 8601 date or datetime value (as is expected for start_date), or an ISO 8601 time interval? We could then use the following interpretation logic:
......
So, from the above, our proposed default of "5 minutes ago" would be "end_date": "-PT5M". To include exactly 10 days of data from the provided start_date, the end date could use an interval: "end_date": "P10D".
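A sketch of that interpretation logic, using a toy duration parser (days and minutes only) rather than a full ISO 8601 library: a leading `-P` is an offset back from now, a plain `P` is an interval after start_date, and anything else is parsed as an ISO 8601 datetime.

```python
# Sketch of the proposed end_date interpretation; parse_duration handles
# only a tiny subset of ISO 8601 durations for illustration.
import re
from datetime import datetime, timedelta, timezone


def parse_duration(value):
    """Parse a toy subset of ISO 8601 durations, e.g. 'P10D' or 'PT5M'."""
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)M)?)?", value)
    if not m:
        raise ValueError(f"unsupported duration: {value}")
    days, minutes = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, minutes=minutes)


def resolve_end_date(end_date, start_date, now):
    if end_date.startswith("-P"):            # duration back from now
        return now - parse_duration(end_date[1:])
    if end_date.startswith("P"):             # interval after start_date
        return start_date + parse_duration(end_date)
    return datetime.fromisoformat(end_date)  # plain ISO 8601 datetime


now = datetime(2023, 7, 19, tzinfo=timezone.utc)
start = datetime(2023, 7, 1, tzinfo=timezone.utc)
resolve_end_date("-PT5M", start, now)  # 5 minutes before now
resolve_end_date("P10D", start, now)   # 10 days after start_date
```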

bug: `End time cannot be less than Start time`

My pipeline had been running with no problems for ~5 months. Today it failed because the query batch window was broken: the start date was after the end date. It looks like it was the last batch, which must use some sort of logic based on the current time that hit an edge case. We should fix the root cause by capping the batch start date at the end date/current time. Another option is to assert that the start date is before the end date and skip the batch if not, but this is a little heavy-handed and could be risky if other bugs come up and it skips batches mid-sync rather than just the last batch up to the current time, so we should probably avoid it if possible.

Execution ID: 21c875d9c0854236a004a70b73202cf9

tap_cloudwatch.subquery | Submitting query for batch from: `2023-07-18T10:01:05 UTC` - `2023-07-18T10:01:13 UTC`
....
tap_cloudwatch.subquery | Submitting query for batch from: `2023-07-19T06:01:14 UTC` - `2023-07-19T06:01:13 UTC`
....
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartQuery operation: End time cannot be less than Start time (Service: AWSLogs; Status Code: 400; Error Code: InvalidParameterException)
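The fix suggested above (capping each batch's start at the overall end time) can be sketched as a windowing generator; this is illustrative, not the tap's actual batching code:

```python
# Sketch: generate batch windows whose start is never past the overall end,
# avoiding the inverted final window that triggered InvalidParameterException.
def batch_windows(start, end, window):
    """Yield (batch_start, batch_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        # Clamp the final window to `end`; the loop condition guarantees
        # no window is emitted with start >= end.
        yield cursor, min(cursor + window, end)
        cursor += window


windows = list(batch_windows(0, 25, 10))  # final window is (20, 25)
```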

feat: support aggregate functions

Currently, if stats is used in the query, we throw an exception. Ideally we'd allow stats by disabling some of the other features that don't work with it.

  • Allow stats in the query config; probably set a flag so other methods know stats is in use.
  • If stats is in the query, we might not need to force "@timestamp" to be there too, because the aggregate might be messed up. So we'll need to deactivate incremental syncs for this case.
  • Figure out how to handle schema generation. We hard-code the log ID as the PK, but we could make that configurable.
  • I don't think we can paginate in batches, so if we detect that the 10k batch limit is reached after aggregating, then we need to throw an error.
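The checks above could be sketched roughly like this (hypothetical helper, not existing tap code): detect stats in the configured query, flag it, disable incremental sync, and error out if the batch limit is hit on an aggregate query.

```python
# Sketch of the proposed stats handling; analyze_query is hypothetical.
import re


def analyze_query(query, batch_limit_hit=False):
    uses_stats = bool(re.search(r"\bstats\b", query))
    if uses_stats and batch_limit_hit:
        # Aggregated results can't be paginated in batches, so fail loudly.
        raise RuntimeError("10k batch limit reached on an aggregate query")
    return {
        "uses_stats": uses_stats,
        # Aggregates may not include @timestamp, so incremental sync is unsafe.
        "incremental": not uses_stats,
    }


analyze_query("fields @timestamp, @message")
analyze_query("stats count(*) by bin(1h)")
```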

https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html

Handle late arriving events

I'm noticing that even when everything looks to function as expected on the tap side, I'm still getting very slight diffs between the source and the warehouse. It looks like we're querying near-real-time logs at the end of a sync and then bookmarking the latest timestamp we receive, and I suspect some logs with earlier timestamps are arriving after the query is run. Subsequent queries then assume all prior logs have been synced, so those late-arriving logs fall outside the next filter range and never get replicated. I've only seen this affect records within a few minutes (<5 mins).

For my sync I see the following in my logs:

  1. 2023-02-28, 06:02:00 UTC - Submit Query (with filter range 2023-02-28T06:00:38 UTC - 2023-02-28T07:00:37 UTC)
  2. 2023-02-28, 06:06:20 - Get Results (679 Received)
  3. State emitted {"bookmarks": {"log": {"replication_key": "timestamp", "replication_key_value": "2023-02-28 06:01:09.040"}}}
  4. Next Run - 2023-03-01, 06:00:47 UTC "Submitting query for batch from: 2023-02-28T06:01:09 UTC - 2023-02-28T07:01:09 UTC"

This indicates to me that the replication key is being properly tracked and passed to the next run.

When I diff at the minute level right around the replication key (2023-02-28T06:01:09 UTC) range I see:

  • 2023-02-27 5:59:00 - 33 missing (58% of that minute)
  • 2023-02-27 6:00:00 - 1933 missing (47% of that minute)

Potential solutions:

  1. Don't sync up to real time. We set the end time of the query to end_time = datetime.now(timezone.utc).timestamp() right now, then overshoot it because we use a batch windowing mechanism to avoid the 10k limit. This ends up submitting a query like the one shown above, where the end timestamp was probably somewhere around 2023-02-28T06:03: UTC but the window ends at 2023-02-28T07:00:37 UTC. This aims to get real-time logs and overshoots the end time to do so. We might not be able to rely on the logs being "complete" in real time.
  2. Add a lookback grace period, ~5 mins by default. On the next sync we start the query window at the replication key value minus 5 mins.

I prefer solution 1, but it has two potential implementations: either don't overshoot the end timestamp, which could solve the problem but I'd guess probably not; or make the tap not query up to real-time data by limiting the end time to now minus 5 mins (or 10 mins). Personally I'd be fine with giving CloudWatch a buffer of a few minutes to allow late-arriving logs to show up, so if you were trying to use this tap to sync in real time it would be on a 5-minute delay. We could make this a configurable parameter.
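The buffered end time from solution 1 could be sketched as follows (hypothetical helper name, buffer size illustrative): the query end time is never allowed past "now minus the buffer", giving late-arriving logs a few minutes to land before the bookmark moves past them.

```python
# Sketch: cap the query end time at now - buffer so late-arriving logs
# still fall inside a future query window.
from datetime import datetime, timedelta, timezone


def capped_end_time(requested_end=None, buffer_minutes=5, now=None):
    """Return the query end time, never later than now minus the buffer."""
    now = now or datetime.now(timezone.utc)
    cap = now - timedelta(minutes=buffer_minutes)
    if requested_end is None or requested_end > cap:
        return cap
    return requested_end


now = datetime(2023, 2, 28, 6, 2, tzinfo=timezone.utc)
capped_end_time(now=now)  # capped at 2023-02-28 05:57 UTC
```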
