Sorry this isn't complete - I haven't had as much time as I would have liked. I think the big-picture points are the most important ones. Mainly, the analogies between the needs/strategies of runtimes and LLMs could be clearer and more front-loaded.
set up the problem of bottlenecks generally, and in terms of both runtimes and LLMs, so the reader is prepared to take in this strategy
or maybe even before this section, set up the juxtaposition - the reader needs to grok the analogy/intersections between the two structures before they can appreciate the parallels in strategies/applications.
also, i feel like this strategy is even more general than kernels/runtimes... in liam's group we used this strategy all the time: i would often make more approximate models for online inference and design, then use slower inference with a more accurate or structured model offline. it's not 1:1, but it's not clear to me that this is unique enough to demonstrate that LLMs are a building block of a computer system... it seems general to any computational system (i.e. a program that does an analysis vs. a program that is a runtime).
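To make the pattern i'm gesturing at concrete, here is a minimal sketch (the models are toy stand-ins, not anyone's actual code): a cheap approximate model serves online queries with low latency, while a slower, more faithful model refines the same answers offline.

```python
# Toy illustration of the online/offline split described above.
# Both "models" are hypothetical stand-ins for real inference code.

def fast_model(x):
    """Cheap approximation, good enough for interactive use."""
    return round(x * x)  # crude: integer-rounded square

def accurate_model(x):
    """Slow but faithful model, run offline in batches."""
    return x * x

def online_inference(x):
    # Low-latency path: answer now, tolerate approximation error.
    return fast_model(x)

def offline_refinement(batch):
    # High-accuracy path: recompute carefully when latency doesn't matter.
    return [accurate_model(x) for x in batch]

quick = online_inference(1.4)          # approximate answer, immediately
exact = offline_refinement([1.4])[0]   # refined answer, later
```

The point being: the "fast path now, slow path later" trade-off shows up in any computational system, not just kernels.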
Registers store small amounts of intermediate data
are they intermediates? i sort of see them as the representations/instantiations of the data that are actually used for computation - the registers are the things that actually go into and out of the CPU. they are the real thing.
The vLLM inference server that uses this technique
is it crucial here that this LLM is handling two requests at once? it would be nice to know how this type of system is set up, since that informs what's at issue here - if these were two LLMs on different machines, they would not have a problem. in this case, is it one LLM process or multiple? curious to get a little context on that to help me see the parallel to the runtime case.
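For what i mean by "one process or multiple": a sketch of the single-process shape i'm imagining (this is illustrative only, not vLLM's actual API) - one engine object interleaves decode steps for several requests that share the same memory pool, which is exactly where the contention question arises.

```python
# Hypothetical sketch: one engine process multiplexing two requests.
# If each request lived in its own process on its own machine, there
# would be nothing to schedule between them.
from collections import deque

class Engine:
    def __init__(self):
        # All running requests share this one engine's memory.
        self.running = deque()

    def add(self, request_id, max_new):
        self.running.append({"id": request_id, "done": 0, "max": max_new})

    def step(self):
        # One "forward pass" advances every running request by one token.
        finished = []
        for req in self.running:
            req["done"] += 1  # pretend we decoded one token
            if req["done"] >= req["max"]:
                finished.append(req)
        for req in finished:
            self.running.remove(req)
        return [req["id"] for req in finished]

engine = Engine()
engine.add("A", max_new=2)
engine.add("B", max_new=3)
order = []
while engine.running:
    order += engine.step()  # requests finish as their budgets run out
```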
The MemGPT pattern uses prompting and careful design of tools
for me, this section is by far the one that actually talks about using an LLM like a runtime... somehow this one needs either to go first or, probably best, last... it could be used as an opportunity to anchor the whole thing?
boom. this is key. again - this section feels so much more relevant... if you are going to compare and contrast, this section is the compare; the others are almost more contrast?
Interruptibility and event-driven-ness are key features of biological agents as well.
To quote Copilot's suggested completion of this paragraph:
"If you poke a cat, it will stop what it's doing and respond to the poke."
:)
The event of that suggestion's appearance while drafting this post
interrupted my drafting task, triggering me to reflect on it
and then resume my drafting task with a new idea for what to say next.
I like this, but it's worded awkwardly atm - maybe something like: "The suggestion's appearance interrupted my drafting, triggered a moment of reflection, and then I resumed drafting with a new idea for what to say next."
It is clear that if LLMs are to become the cognitive kernels
of agents in any way worthy of the name,
they will need a similar architecture.
weeeeeeee
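Since this is the part i find most compelling, here's the interrupt-then-resume shape in miniature: an agent works through a task, but pending events preempt it, get handled, and the task resumes carrying whatever the handler produced. Names are illustrative, not any particular framework's API.

```python
# Sketch of an interruptible, event-driven agent loop: events preempt
# the current task, are handled, and then the task resumes.
import queue

events = queue.Queue()

def handle(event):
    """Interrupt handler: react to the event, maybe yielding a new idea."""
    return f"idea from {event}"

def run_task(steps):
    log = []
    for step in steps:
        # Check for pending events before each unit of work - cooperative
        # interruption, like a runtime checking for signals at safepoints.
        while not events.empty():
            log.append(handle(events.get()))  # handle, then resume
        log.append(f"did {step}")
    return log

events.put("copilot-suggestion")  # the "poke the cat" moment
trace = run_task(["draft-para-1", "draft-para-2"])
```

The handler's output lands in the same log the task is building, which is the "resume with a new idea" part of the quote above.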
i think move this section to the top. give us the goods up front... i'm worried people on the fringes of this convo will not have a hook soon enough. for me, as you clearly know, this part is the hook.
cognitive kernels
❤️
Whether your background is in systems or ML,
you'll find a lot of value in thinking
"across the divide" and applying your knowledge to the other field.