By allocating JIT memory in large chunks we can reduce the overhead of compilation and possibly speed up the compiled code a bit.
Currently, when we need to allocate memory for the JIT, we ask the OS for a sufficiently large chunk of virtual memory. This is simple and reasonably efficient. It does have a few flaws:
1. We need to create DWARF debug info and register it for each trace
2. We need to make a syscall to get memory for each trace
3. We have no control over the location of the jitted code, meaning that calls in the executable may need trampolines
By allocating large chunks we can reduce the overhead of (1) and (2) to once per chunk, not once per trace.
By allocating large chunks we can also afford the additional overhead of requesting memory near the executable.
How it would work:
- When we need memory for jitted code, we request it from our special allocator.
- When the allocator needs memory, it requests it from the OS, making several attempts to get memory near the executable before accepting a location
- The allocator itself will be a standard obmalloc/jemalloc style block allocator
Size classes and fragmentation:
All jitted code will need to page aligned, so blocks will need to a multiple of the page size.
With 4 size classes per power-of-2 increase in size (like jemalloc) and assuming 2 MiB chunks, we get relatively little internal fragmentation, but potentially quite a lot of external fragmentation, as there would be around 20 size classes.
The external fragmentation can be mitigated by allowing the OS to lazily allocate the pages on demand.
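To make the size-class scheme concrete, here is a small sketch with assumed parameters (4 KiB pages, 4 classes per doubling; none of this is actual CPython code) of how a request would be rounded up to a jemalloc-style size class and then to a whole number of pages:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Illustrative sketch (assumed parameters, not CPython code): round a
 * request up to a size class with 4 classes per power of two, then up
 * again to a page multiple, since jitted code must be page aligned. */
static size_t
round_to_size_class(size_t size)
{
    if (size <= PAGE_SIZE) {
        return PAGE_SIZE;
    }
    /* Find the power of two at the start of the interval containing
     * `size`, i.e. the largest power with power < size <= 2*power. */
    size_t power = PAGE_SIZE;
    while (power * 2 < size) {
        power *= 2;
    }
    /* Classes within [power, 2*power) are spaced power/4 apart,
     * bounding internal fragmentation to under 25%. */
    size_t spacing = power / 4;
    size_t rounded = (size + spacing - 1) / spacing * spacing;
    /* Round up to a whole number of pages. */
    return (rounded + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
}
```

For example, a 100000-byte request falls in the 64 KiB–128 KiB interval, where classes are spaced 16 KiB apart, so it rounds up to 114688 bytes (7 × 16 KiB), which is already page aligned.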
API
The _PyObject_VirtualAlloc function will need extending (or a new function added) to allow the desired address to be passed, so the allocator can get blocks near the executable.
The jit_alloc function's API will be unchanged.