Faust appears to support a sliding window. Any chance this package currently supports

Yeah, there's a sliding window class, and I have <a href="https://github.com/faust-str

Support for SlidingWindow about kafka-aggregator HOT 2 OPEN

lsst-sqre commented on June 3, 2024

Support for SlidingWindow

from kafka-aggregator.

Comments (2)

afausti commented on June 3, 2024

That would be an interesting addition to enable this package to compute moving averages, for example. Did you see any documentation on Sliding widow in Faust? the only options I found are Hopping and Tumbling windows.

As for the performance, Faust leverages Kafka's parallelism model. In principle, if you need a large sliding window or process large messages you can increase the number of topic partitions and run more Faust workers.

from kafka-aggregator.

fonty422 commented on June 3, 2024

Yeah, there's a sliding window class, and I have an open issue on the faust-streaming repo which provides links to several classes and modules that might provide access. I just can't seem to get any of them to work as I would expect them to.
I think in general people would like some basic aggregates - counts, mean, min, max, sum, in a continuously moving window.
In my mind the basic way would be to perform a calculation/function when something enters the window, then perform the opposite function to remove that when one exits the window. Normal windows know when something leaves the window.

With my particular use case the issue is that some keyed messages might sporadically produce thousands of messages more than usual or than others, so a particular partition might become overloaded without warning and randomly. Considering the window is 3 months long I wondered whether there's a topic size limit to consider - say 1 message per second for a misbehaving/troubled input, that's ~8M records over the window period.
My understanding (and correct me if I'm wrong) is that a stream processor (aka agent) can be run by multiple workers, but those workers can only read from specific partitions, so if you need to process by key then you need to ensure that the keys always end up in the same partition. that makes it painful when performing aggregations if a single partition by chance ends up with a a few million more records than all others. If those aggregations are held in a table, then each worker can only see it's own table and not those of others, meaning you can't split the workload across multiple workers.

If there's another way to do it, please let me know.

from kafka-aggregator.

Support for SlidingWindow about kafka-aggregator HOT 2 OPEN

Comments (2)

Related Issues (3)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent