Motivation At the moment there are 3 layers that can be used to al

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Shaping new JIT allocator API about asmjit HOT 11 CLOSED

kobalicek commented on June 12, 2024 2

Shaping new JIT allocator API

from asmjit.

Comments (11)

kobalicek commented on June 12, 2024 2

@jhogberg I will be back to this soon.

I think yeah, theoretically CPU can prefetch pages so I guess it would not be safe to do that, which is a pity as with block flushing it would be theoretically pretty fast even if there is a need to do a syscall to propagate the flushing to all threads.

But still, I'm not 100% sure, maybe CPUs would only prefetch i-cache lines that are reachable from code - but this is not something I would want to do if there is uncertainty as I really don't like random hard to repro crashes due to this.

from asmjit.

jhogberg commented on June 12, 2024 1

Flush instruction cache after the code has been written (done by ProtectJitReadWriteScope destructor).

To add another wrinkle to this, it's not enough to do this on the thread that modified the code in a multi-threaded environment. For it to work properly one needs to issue an instruction synchronization barrier on all threads that will execute the code (either by using a serializing instruction like cpuid on x86 or isb on aarch64).

We ran into this recently on aarch64, doing it wrong worked for an embarrassingly long time because the way we synchronized other code-related things made us hit implicit barriers all the time (mainly syscalls and being scheduled out).

Is the current API sufficient? How to support environments where it's not sufficient?

We find it sufficient, but if JitAllocator gave us less freedom to cross-modify running code we'd either start using VirtMem directly or roll that on our own too. We're using it out of convenience and wouldn't lose much if you started making drastic changes in this area, it would be a bit of extra work but we've been thinking about doing that anyway to get more control over the memory layout.

Is the new API design usable from user's perspective?

I think so, nearly all uses of JitAllocator are followed by a copy and instruction cache flush anyway so wrapping that in a write() function is just more convenient.

Which new techniques could be added to the new API to make it fit more use cases?

To go back to the instruction cache stuff, I think people would appreciate something to make multi-threaded instruction synchronization easier to handle. You don't have to do anything as funky as cross-modifying running code to run into issues, it's enough to free code and then generate code at the same address.

from asmjit.

jhogberg commented on June 12, 2024 1

@jhogberg That's the point of this discussion to actually design the new API in a way so you would not need to create your own abstraction on top of VirtMem - although it's already great that you would do it on top of VirtMem and not creating one from scratch :)

Sure, I just wanted you to know that if our requirements place too many strange constraints on the new design, it wouldn't be too much effort to change things on our end.

I think the allocator should track removed code in each block and if adding code to a block in which the code has been removed we should issue this barrier

The tricky bit is issuing it on all (other) threads that might execute code. MacOS and Windows settle this internally behind their instruction cache flushing APIs, on Linux you can use sys_membarrier(2), but other platforms require help from the user.

from asmjit.

kobalicek commented on June 12, 2024 1

@jhogberg I think one solution to have a cross-platform barrier could be having something like a counter that would be incremented every time something is removed from executable memory and a place where each thread would call something like VirtMem::maybeSynchronizeBarrier() (or something nicer) that would compare a global counter with a thread local one and issue that barrier if there is a mismatch (and then update the thread local counter) - however, it's true that this would be on the user to call, not sure if it's user friendly though, however, if there is no way to ask the OS to do that I'm unsure how else this could potentially be solved.

Additionally, if this cannot be solved, we can just forbid removing already allocated code by just marking that place as a waste and releasing the whole mapped block once all functions in that block are removed (but this could be a waste in cases that code is removed, but also stays there).

I think that reviewing the jit_alloc branch would be nice, as the only missing thing there is a multiple write scope.

from asmjit.

jhogberg commented on June 12, 2024 1

however, it's true that this would be on the user to call, not sure if it's user friendly though, however, if there is no way to ask the OS to do that I'm unsure how else this could potentially be solved.

There's no better alternative that I know of, so it's probably best just to bite the bullet. I think having the write functions return an error code to flag this condition would be enough. Thankfully the most popular platforms have syscalls to handle it without manual intervention so most people won't ever see it.

I think that reviewing the jit_alloc branch would be nice, as the only missing thing there is a multiple write scope.

It looks good, though people might want a cheaper write variant that skips the instruction barrier when it's known that it isn't necessary, e.g. changing a b or b.cond branch target on aarch64, which is one of a few alterations that the architecture explicitly allows without barriers (see section B2.2.5 Concurrent modification ... in the reference manual if you're curious).

from asmjit.

kobalicek commented on June 12, 2024 1

I have updated the https://github.com/asmjit/asmjit/tree/jit_alloc branch.

Still not sure whether the API is okay, basically it's now possible to control the cache flush via a parameter, and there is now a WriteScope, which is not really doing anything extra - just about designing a sane API at the moment.

I'm still thinking how to model better the flushInstructionCache() and CachePolicy - in the end this should be configurable whether to flush caches in all threads or whether local flush is okay.

I also thought about an idea of flushing cache before writing to executable memory, in batches. For example, JitAllocator allocates a new block, marks all segments as dirty, and then when allocating memory it would flush in batches, like 4kB, 8kB, 16kB - and then mark these regions as flushed. Then when you again use JitAllocator::alloc and it gives you a span which region was already flushed, it would be marked as NoFlush (or something smarter) and the writer would know to not flush that one. In theory, if a region of executable memory was flushed, no thread ever executed it afterwards, then it should be safe to write code into it without flushing, right?

from asmjit.

kobalicek commented on June 12, 2024

@jhogberg That's the point of this discussion to actually design the new API in a way so you would not need to create your own abstraction on top of VirtMem - although it's already great that you would do it on top of VirtMem and not creating one from scratch :)

I don't plan to remove features from JitAllocator actually, I just think that a bit different design could solve a lot of issues while offering a very similar set of features. If we settle on spans and offer a write() API and write scopes, then JitAllocator can be refactored at any time to offer something new without having to touch any code that uses it.

For example I have lately found a function pthread_jit_write_with_callback_np() - if a new API is added to JitAllocator that would allow the callback (as already drafted in the issue) then we would be able to support pthread_jit_write_with_callback_np() out of box and the application would be able to even use PTHREAD_JIT_WRITE_ALLOW_CALLBACKS_NP() to even tighten the security, etc...

BTW I have already implemented some bits yesterday, you can check out this branch - impact on user code is minimal:

https://github.com/asmjit/asmjit/tree/jit_alloc

BTW that "instruction synchronization barrier" is a good point. I think the allocator should track removed code in each block and if adding code to a block in which the code has been removed we should issue this barrier. I will look into this, but most likely I will start adding this feature to jit_alloc branch. BTW testing this would be fun as well.

from asmjit.

jhogberg commented on June 12, 2024

Still not sure whether the API is okay, basically it's now possible to control the cache flush via a parameter, and there is now a WriteScope, which is not really doing anything extra - just about designing a sane API at the moment.

I think it looks good, I can't think of anything we'd need beyond that.

In theory, if a region of executable memory was flushed, no thread ever executed it afterwards, then it should be safe to write code into it without flushing, right?

I'm not so sure, couldn't the processor speculatively pre-fetch some of those cache lines in between flushing and reading (for whatever reason)?

from asmjit.

jhogberg commented on June 12, 2024

But still, I'm not 100% sure, maybe CPUs would only prefetch i-cache lines that are reachable from code - but this is not something I would want to do if there is uncertainty as I really don't like random hard to repro crashes due to this.

I don't think it's a good idea either, even if the processor would restrict instruction prefetching to code that's reachable, there's often data alongside the code (e.g. dispatch/lookup tables) which could potentially cause it to pre-fetch out of bounds. If some important data is then present in the pre-fetched region, we'd go off the rails just the same.

from asmjit.

kobalicek commented on June 12, 2024

I'm ready to merge jit_alloc branch. It's an ABI break, but not source break, so we should slowly move in this direction and see whether this would be a sufficient design or not.

from asmjit.

kobalicek commented on June 12, 2024

Jit allocator changes were already merged: 8e2f4de

from asmjit.

Shaping new JIT allocator API about asmjit HOT 11 CLOSED

Comments (11)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent