
More lazy eager_db (eager_db, open, 4 comments)

zmaril commented on August 20, 2024


Comments (4)

wangjohn commented on August 20, 2024

Haha, guess what I'm currently implementing =).

Yeah, you raise valid points, and that's the next step in the project (it was actually the plan all along). The interesting thing is that, because of the DSL I've defined, the implementation of the prediction part can be kept completely separate from the infrastructure.

My implementation of the prediction was actually going to be almost exactly what you said, since it parallelizes quite easily: breaking the logs into k chunks loses very little information except at the edges of the break points, and the law of large numbers gives you pretty good guarantees that you don't lose too much.
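That chunked counting scheme could look something like this sketch (the function names are made up, and a real version would farm `count_transitions` out to a process pool): adjacent query pairs are counted per chunk, and the only information lost is the pairs that straddle chunk boundaries.

```python
from collections import Counter

def count_transitions(chunk):
    """Count adjacent query pairs within one chunk of the log."""
    counts = Counter()
    for a, b in zip(chunk, chunk[1:]):
        counts[(a, b)] += 1
    return counts

def parallel_transition_counts(log, k=4):
    """Split the log into k chunks and merge the per-chunk counts.

    Pairs that straddle a chunk boundary are dropped; that is the
    information lost 'at the edges of the break points'. With a long
    enough log, those dropped pairs are a vanishing fraction.
    """
    size = max(1, len(log) // k)
    chunks = [log[i:i + size] for i in range(0, len(log), size)]
    total = Counter()
    # Swap `map` for ProcessPoolExecutor.map to actually parallelize.
    for partial in map(count_transitions, chunks):
        total.update(partial)
    return total
```

Normalizing each row of `total` then gives the empirical transition probabilities of the query Markov chain.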


zmaril commented on August 20, 2024

Another comment on the above eagerness:

If you have only lots of reads, then the above could work pretty well. At any point, the cache will always be up to date and you won't have to worry about it too much.

If you start throwing writes into the application, then the cache can easily become invalidated while still holding onto old results. For the above user application of "It'll be GitHub, but for, uh, recipes!", if grandma forks mom's recipe after looking at it, then the cache is invalidated and you're SOL.

Generally, cache invalidation is, uh, hard to say the least. I see at least two ways towards mitigating this complexity, the smart way and the dumb way.

The smart way would be to create a dependency graph of queries: derive the fact that every time grandma forks mom's recipe, the cached results for all of grandma's recipes go out of date and need to be refreshed. This could be inferable, and might not be too hard if you are into doing logic on SQL ASTs.
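A minimal sketch of that dependency graph, with hypothetical table names and query labels (none of this is from EagerDB itself): map each cached query to the tables it reads, map each write pattern to the tables it touches, and invalidate on overlap.

```python
# Hypothetical dependency maps. In the smart way, these would be derived
# automatically by analyzing the SQL ASTs of the logged queries.
READS = {
    # cached query label -> tables its result depends on
    "grandmas_recipes": {"recipes", "forks"},
    "moms_recipe_page": {"recipes"},
}
WRITES = {
    # write pattern label -> tables it modifies
    "fork_recipe": {"forks"},
}

def invalidated_by(write_name):
    """Return the cached queries whose results a given write may stale."""
    touched = WRITES[write_name]
    return {query for query, tables in READS.items() if tables & touched}
```

When a `fork_recipe` write comes through, only the queries reading the `forks` table need a refresh; the rest of the cache can be left alone.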

The dumb way would be to just bang queries out every N milliseconds. The smarter dumb way is to estimate the distributions and refresh queries slightly before the user issues them. Thanks to the logs, you have good estimates of the time between edges. As in, you could probably infer that it takes grandma, on average, a minute to read mom's recipe, rage because she is using too much butter, and then decide to fork it. You could then refresh the query a few seconds before you expect grandma to issue it and impress her with your forethought. In the same vein, you could expire the cache once you are reasonably confident that grandma has logged off. In between, you could refresh based on some supplied parameter or on information from the smart way above.
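The "smarter dumb way" boils down to a one-liner once the gaps are in hand; a sketch, where the function name and the five-second lead are made-up parameters:

```python
import statistics

def refresh_delay(gaps_seconds, lead_seconds=5.0):
    """How long to wait before refreshing a predicted query.

    `gaps_seconds` holds the observed gaps between a user reading a
    recipe and issuing the follow-up query (e.g. the fork), taken from
    the logs. We schedule the refresh `lead_seconds` before the
    expected follow-up so the result is warm when it is asked for.
    """
    expected = statistics.mean(gaps_seconds)
    return max(0.0, expected - lead_seconds)
```

A fuller version would use the whole distribution (say, a low quantile rather than the mean) so the refresh lands early enough even for fast users.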

I guess it would be predicting the working set that's needed, based on the previous time series of queries, and trying to keep it as up to date as possible with statistics and logic.


wangjohn commented on August 20, 2024

Right, so cache invalidation is hard, and EagerDB doesn't actually deal with it. That's because the "cache" of EagerDB is essentially your original database's cache. This way, you don't have to worry about cache invalidation or any of the methods you listed above.

Moreover, this way, if someone comes along with a better implementation of the cache on top of Postgres, you don't have to change anything in EagerDB. Another advantage is that if EagerDB had its own cache, you'd essentially be creating a second cache on top of the database's pre-existing cache. Two caches aren't that nice, especially if they hold replicated data, which might not be completely replicated due to the cache invalidation issues you mention.

EagerDB is just going to send these queries to your database and "warm" the cache by hitting the database with these preload queries (it's not even going to store the results). That way, the next time you issue your preloaded query, it will already be in the database's cache.

Another reason for this design is that you don't have to worry about cache expiration or tweaking parameters like size of memory allocated to the cache -- it's already in your database.
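A toy illustration of the warming idea, using sqlite3 as a stand-in for the real database (this is just the shape of it, not EagerDB code): run the preload queries and throw the results away, so the only cache involved is the database's own.

```python
import sqlite3

def warm_cache(db_path, preload_queries):
    """Fire the predicted queries at the database and discard the results.

    The point is only to pull the relevant pages into the database's
    own cache, so there is never a second copy of the data that could
    go stale. Returns how many preload queries were issued.
    """
    conn = sqlite3.connect(db_path)
    issued = 0
    for sql, params in preload_queries:
        conn.execute(sql, params).fetchall()  # results intentionally thrown away
        issued += 1
    conn.close()
    return issued
```

In practice this would be fired from a background thread or worker so the request path never blocks on the warming queries.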


zmaril commented on August 20, 2024

Nice. So writes aren't an issue for EagerDB, or rather, they are only as big an issue as they are for the original database. I blanked on the design for a second and blurred the lines between the prewarming layer (the heater?) and the actual db itself.

I'll be curious to see how this turns out. Creating the Markov chain from the stream of events doesn't seem too hard. Simple random sampling would probably let you sample a small fraction of the logs and get a pretty accurate approximation of the graph without messing with parallel code, since there are only so many workflows in the application that a user can follow. Then you throw some async calls on every query to tell the heater to warm up the db and you're done.
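The sampling idea might look like this sketch (a hypothetical helper, not EagerDB code): estimate next-query probabilities from a random sample of adjacent pairs in the log, rather than scanning the whole thing.

```python
import random
from collections import Counter, defaultdict

def sample_transition_probs(log, sample_rate=0.1, seed=0):
    """Estimate the query Markov chain from a random sample of the log.

    Each adjacent pair (query, next_query) is kept with probability
    `sample_rate`; with a long log, the sampled counts approximate the
    true transition frequencies well. Returns a nested dict:
    probs[query][next_query] -> estimated probability.
    """
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for a, b in zip(log, log[1:]):
        if rng.random() < sample_rate:
            counts[a][b] += 1
    return {
        a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
        for a, nxt in counts.items()
    }
```

With the chain in hand, the heater just looks up the likely successors of the query that just ran and preloads the ones above some probability threshold.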

On another note, I was curious to see what previous work has been done on this. "Predictive caching" turns up some interesting papers (if you avoid the ones on CPU caches). This one led me to this one. The latter is a variation on this idea, in the sense that it develops a Markov chain of sorts over the links in a document database and prefetches based on that. I've yet to find a paper on warming the db cache based on previous queries, though.

