meltanolabs / tap-cloudwatch
License: Apache License 2.0
This PR makes it explicit that we sort the stream records and that it's safe to resume after a failure. Syncs could be much faster if we don't care about order. Potentially we could add a config option to set is_sorted to false and stop using an ordered queue for keeping records in order, instead using threads and yielding records as they are received.
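A minimal sketch of that idea, assuming a hypothetical `run_batch` per-window query function and a config-gated `is_sorted` flag (neither is the tap's actual API): serial iteration preserves order, while the unsorted path yields each batch's records as soon as its thread completes.

```python
# Illustrative sketch only: `run_batch` stands in for a CloudWatch Logs
# Insights query over one batch window.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(batch):
    # Placeholder for querying one window's records.
    return [f"record-from-{batch}"]

def iter_records(batches, is_sorted=True):
    if is_sorted:
        # Preserve order: process windows serially (or drain an ordered queue).
        for batch in batches:
            yield from run_batch(batch)
    else:
        # Order not guaranteed: yield each batch's records on completion.
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(run_batch, b) for b in batches]
            for fut in as_completed(futures):
                yield from fut.result()
```

The unsorted path trades ordering for throughput, which is only safe when downstream consumers don't rely on record order.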
I'm noticing that on some days with large log counts we have slight diffs in the overall counts for a day. They're <1% for my data set, so it's an edge case, but it still needs to be fixed. In my first batch using this tap I was able to tie the counts out, but it looks like on larger batches we're potentially skipping some logs.
I suspect there's a bug in the sub-batching logic that's leading to logs getting skipped at the edges of those ranges. I think we already use >=, but something isn't working as expected.
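One way to rule out boundary bugs like this is to make every sub-batch a half-open `[start, end)` window, so adjacent ranges can neither skip nor double-count a boundary timestamp. This is a sketch of that invariant, not the tap's actual windowing code:

```python
# Illustrative: split [start, end) into contiguous half-open windows.
# Each window's end is the next window's start, so no timestamp on a
# boundary can be skipped or counted twice.
def split_batches(start, end, window):
    batches = []
    cursor = start
    while cursor < end:
        batches.append((cursor, min(cursor + window, end)))
        cursor += window
    return batches
```

Asserting that each window's end equals the next window's start would catch the kind of edge gap described above.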
Right now, if the 10k limit is exceeded we split the request and make the subsequent requests serially. This works, but it's slow, especially if the limit was exceeded by a lot. We should instead split automatically into smaller chunks and start all of them concurrently so there's less wait time.
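A hedged sketch of the concurrent approach: split the oversized window into N smaller windows and fan them out to a thread pool, flattening results back in window order. `query_window` is a hypothetical stand-in for the boto3 `start_query`/`get_query_results` round trip, not the tap's real function.

```python
# Illustrative only: `query_window` is a placeholder for one
# CloudWatch Logs Insights query over a sub-window.
from concurrent.futures import ThreadPoolExecutor

def query_window(start, end):
    return [(start, end)]  # placeholder for the query's records

def query_with_split(start, end, splits=4):
    step = (end - start) / splits
    windows = [(start + i * step, start + (i + 1) * step) for i in range(splits)]
    with ThreadPoolExecutor(max_workers=splits) as pool:
        results = pool.map(lambda w: query_window(*w), windows)
    # pool.map preserves submission order, so downstream ordering holds.
    return [rec for chunk in results for rec in chunk]
```

Since `ThreadPoolExecutor.map` returns results in submission order, this keeps the sorted-output guarantee while removing the serial wait.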
Initially discussed in #26 (comment) and meltano/sdk#922.
TL;DR:
What do you think of defining end_date to be either an ISO 8601 date or datetime value (as is expected for start_date) or an ISO 8601 time interval? We could then use the following interpretation logic:
......
So, from the above, our proposed default of "5 minutes ago" would be `"end_date": "-PT5M"`. To include exactly 10 days of data from the provided start_date, the end date could use an interval: `"end_date": "P10D"`.
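The proposed interpretation could be sketched roughly like this. The duration parser below is a deliberately minimal illustration (days/hours/minutes only, not full ISO 8601), and `resolve_end_date` is a hypothetical name, not existing tap code:

```python
# Sketch: interpret end_date as an ISO 8601 datetime, a negative
# duration anchored to "now" (e.g. "-PT5M"), or a positive duration
# anchored to start_date (e.g. "P10D").
import re
from datetime import datetime, timedelta, timezone

_DURATION = re.compile(
    r"^(?P<sign>-)?P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?)?$"
)

def resolve_end_date(value, start_date, now=None):
    now = now or datetime.now(timezone.utc)
    match = _DURATION.match(value)
    if not match:
        return datetime.fromisoformat(value)  # plain date/datetime
    delta = timedelta(
        days=int(match["days"] or 0),
        hours=int(match["hours"] or 0),
        minutes=int(match["minutes"] or 0),
    )
    # Negative durations anchor to now; positive ones to start_date.
    return now - delta if match["sign"] else start_date + delta
```

In practice a real implementation would probably lean on a tested duration parser rather than a hand-rolled regex.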
My pipeline had been running with no problems for ~5 months. Today it failed because a query batch window was invalid: the start date was after the end date. It looks like it was the last batch, which must use some logic involving the current time that hit an edge case. We should fix this at the core by clamping the start date so its maximum is less than or equal to the end date/current time. Another option is to assert that the start date is before the end date and skip the batch if not, but this is a little heavy handed and could be risky if other bugs come up and it silently skips batches mid-sync rather than just the last batch up to the current time, so we should probably avoid that if possible.
Execution ID: 21c875d9c0854236a004a70b73202cf9
tap_cloudwatch.subquery | Submitting query for batch from: `2023-07-18T10:01:05 UTC` - `2023-07-18T10:01:13 UTC`
....
tap_cloudwatch.subquery | Submitting query for batch from: `2023-07-19T06:01:14 UTC` - `2023-07-19T06:01:13 UTC`
....
botocore.errorfactory.InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartQuery operation: End time cannot be less than Start time (Service: AWSLogs; Status Code: 400; Error Code: InvalidParameterException)
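The clamping fix described above could look roughly like this. Names like `batch_window` and its parameters are illustrative assumptions, not the tap's actual code:

```python
# Sketch: cap each batch window's end at the overall end time and drop
# the window entirely when nothing is left, so the final batch can
# never invert into start > end (the failure in the log above).
from datetime import datetime, timezone

def batch_window(batch_start, window_seconds, overall_end=None):
    overall_end = overall_end or datetime.now(timezone.utc).timestamp()
    end = min(batch_start + window_seconds, overall_end)
    if batch_start >= end:
        return None  # nothing left to query; skip this (final) batch
    return batch_start, end
```

Returning `None` only ever trims the tail of the sync, so it avoids the risk of silently skipping a mid-sync batch.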
Currently, if `stats` is used in the query, we throw an exception. Ideally we'd allow `stats` by disabling the other features that don't work with it. See https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
I'm noticing that even when everything appears to function as expected on the tap side, I'm still getting very slight diffs between the source and the warehouse. It looks like we query near-real-time logs at the end of a sync and then bookmark the latest timestamp we receive, and I suspect some logs with earlier timestamps arrive after the query is run. Subsequent queries then assume all prior logs have been synced, so those late-arriving logs fall outside the next filter range and are never replicated. I've only seen this affect records within a few minutes (<5 mins).
For my sync I see the following in my logs:
2023-02-28T06:01:09 UTC - 2023-02-28T07:01:09 UTC
This indicates to me that the replication key is being properly tracked and passed to the next run.
When I diff at the minute level right around the replication key (2023-02-28T06:01:09 UTC) range, I see:
Potential solutions:
We set `end_time = datetime.now(timezone.utc).timestamp()` right now, then overshoot it because we use a batch windowing mechanism to avoid the 10k limit. This ends up submitting a query like the one shown above, where the end timestamp was probably somewhere around 2023-02-28T06:03 UTC but the window ends at 2023-02-28T07:00:37 UTC. This aims to get real-time logs and overshoots the end time to do so; we might not be able to rely on the logs being "complete" in real time.
I prefer solution 1, but it has 2 potential implementations: either don't overshoot the end timestamp, which could solve the problem (though I'd guess probably not), or make the tap not query up to real-time data by limiting the end time to now minus 5 mins (or 10 mins). Personally I'd be fine with giving CloudWatch a buffer of a few minutes to allow late-arriving logs to show up, so if you were trying to use this tap to sync in real time it would be on a 5 minute delay. We could make this a configurable parameter.
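The buffered-end-time variant could be sketched as below. Treating the lag as a config option (`lookback_seconds` here) is a proposal, not an existing tap setting:

```python
# Sketch: cap the sync's end time at now minus a configurable lag so
# late-arriving logs have time to land before we bookmark past them.
from datetime import datetime, timedelta, timezone

def capped_end_time(lookback_seconds=300, now=None):
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(seconds=lookback_seconds)).timestamp()
```

With a 300-second default, a "real-time" sync runs on a 5-minute delay, which matches the <5 min window of late arrivals observed above.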