### Subject of the issue
wamr_llvm_jit is dramatically slower than peer runtimes on tight loops that repeatedly perform mutable-global read-modify-write updates when the updated value depends on a loop-varying integer value.
### Test case
The clearest minimized reproducer is a hot loop with repeated mutable-global accumulation and no memory-load target at all:
```wat
(module
  (type (func (param i32)))
  (type (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))
  (func (type 1)
    (local $i i64)
    (local.set $i (i64.const 4294967296))
    (loop $body
      (global.get $g0)
      (local.get $i)
      (i32.wrap_i64)
      (i32.add)
      (global.set $g0)
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )
    (call 0 (i32.const 0))
  )
  (memory $m0 1)
  (global $g0 (mut i32) (i32.const 0))
  (export "_start" (func 1))
  (export "memory" (memory 0))
)
```
Related reduced variants also show a consistent pattern:
- `global_acc_i64.wat`: `global.get (mut i64)` + `local.get $i` + `i64.add` + `global.set`
- `global_acc_i32_xorlike.wat`: `global.get (mut i32)` + `local.get $i` narrowed to `i32` + `i32.xor` + `global.set`
- `global_acc_i32_const_add.wat`: `global.get (mut i32)` + `i32.const 1` + `i32.add` + `global.set`
### Environment
All runtimes are release builds and run in JIT mode.
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wasmtime: 41.0.0 (4898322a4 2025-12-18)
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
### Steps to reproduce
- Compile the testcase with `wat2wasm`.
- Run it under `wamr_llvm_jit` and compare the wall-clock / task-clock time with other runtimes.
- Repeat with the small structural variants above.
A representative comparison uses the same host and iteration count (4294967296) across runtimes.
```sh
wat2wasm test_case.wat -o test_case.wasm
# Execute the wasm file and collect data
perf stat -r 5 -e 'task-clock' /path/to/wasmer run -l test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/wasmedge --enable-jit test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/build_llvm_jit/iwasm test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/build_fast_jit/iwasm test_case.wasm
```
### Expected and actual behavior

#### Expected behavior
For such a small and simple repeated mutable-global update loop, I would expect wamr_llvm_jit to be in the same rough performance range as the other major JIT/AOT backends, or at least not to be a dramatic outlier.
#### Actual behavior
wamr_llvm_jit is a large slowdown outlier on the loop-varying mutable-global update pattern.
Representative task-clock timings:
| variant | wasmer_llvm | wasmedge_jit | wamr_llvm_jit | wamr_fast_jit |
| --- | --- | --- | --- | --- |
| global_acc_only | ~1.00s | ~0.015s | ~7.03s | ~2.95s |
| global_acc_i64 | ~0.98s | ~0.016s | ~7.06s | ~2.92s |
| global_acc_i32_xorlike | ~0.99s | ~0.62s | ~7.01s | ~2.92s |
| global_acc_i32_const_add | ~1.01s | ~0.016s | ~0.99s | ~2.92s |
Important observations:
- The slowdown is reproducible without any memory-load instruction.
- The slowdown is not limited to `i32.add`; it also appears with `i64.add` and `i32.xor` in the same mutable-global read-modify-write structure.
- The slowdown disappears when the global is still updated every iteration but the update uses a constant increment (`i32.const 1`) instead of a loop-varying value.
This suggests that the trigger condition is specifically a repeated mutable-global read-modify-write update whose new value depends on a loop-varying integer value.
### Extra Info
This testcase family was discovered while analyzing a separate `i32.load8_s` / `i32.load8_u` microbenchmark family. After several reduction steps, the narrow-load instruction itself no longer appears necessary to reproduce the dominant wamr_llvm_jit slowdown.

From inspecting the lowered output of other runtimes (Wasmer LLVM / Cranelift / Wasmtime), the corresponding hot loop structure remains present after lowering. I have not yet confirmed the exact internal cause inside the WAMR LLVM JIT, so the description above is intentionally limited to the observed trigger pattern and timing behavior.