wamr_llvm_jit is dramatically slower than peer runtimes on tight loops with loop-varying mutable-global read-modify-write updates #4921

@gaaraw

Subject of the issue

wamr_llvm_jit is dramatically slower than peer runtimes on tight loops that repeatedly perform mutable-global read-modify-write updates when the updated value depends on a loop-varying integer value.

Test case

The clearest minimized reproducer is a hot loop that repeatedly accumulates into a mutable global and contains no memory load at all:

(module
  (type (func (param i32)))
  (type (func))

  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))

  (func (type 1)
    (local $i i64)

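    (local.set $i (i64.const 4294967296))  ;; iteration count: 2^32
    (loop $body
      ;; $g0 = $g0 + i32.wrap_i64($i): read-modify-write of the mutable global
      (global.get $g0)
      (local.get $i)
      (i32.wrap_i64)
      (i32.add)
      (global.set $g0)
      ;; $i -= 1; repeat while $i != 0
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )

    (call 0 (i32.const 0))  ;; proc_exit(0)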
  )

  (memory $m0 1)
  (global $g0 (mut i32) (i32.const 0))
  (export "_start" (func 1))
  (export "memory" (memory 0))
)
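
For readers who want to check what the loop computes, here is a minimal Python model of the reproducer. It is not part of the original testcase; the function names `run_loop` and `closed_form` are made up for illustration, and the model assumes an iteration count of at least 1. It also shows that the final value of `$g0` depends only on the iteration count, so it can be written in closed form.

```python
MASK32 = 0xFFFFFFFF  # i32 arithmetic wraps modulo 2^32


def run_loop(n):
    """Step-by-step model of the WAT loop for n >= 1 iterations.

    Returns the final value of $g0 as an unsigned 32-bit integer.
    """
    g0 = 0
    i = n
    while True:
        # global.get $g0, local.get $i, i32.wrap_i64, i32.add, global.set $g0
        g0 = (g0 + (i & MASK32)) & MASK32
        # $i -= 1, then br_if $body while $i != 0
        i -= 1
        if i == 0:
            break
    return g0


def closed_form(n):
    """Same result without looping: sum(1..n) modulo 2^32."""
    return (n * (n + 1) // 2) & MASK32


if __name__ == "__main__":
    # The step-by-step model is only practical for small counts.
    print(run_loop(100_000) == closed_form(100_000))
    # Value of $g0 at the reported iteration count of 2^32.
    print(hex(closed_form(2**32)))
```

The model says nothing about how any runtime actually compiles the loop; it only pins down the expected result.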

Related reduced variants also show a consistent pattern:

  • global_acc_i64.wat: global.get (mut i64) + local.get $i + i64.add + global.set
  • global_acc_i32_xorlike.wat: global.get (mut i32) + local.get $i narrowed to i32 + i32.xor + global.set
  • global_acc_i32_const_add.wat: global.get (mut i32) + i32.const 1 + i32.add + global.set
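
The variant files themselves are not shown here, so the following Python sketch is reconstructed only from the one-line descriptions above and may differ from the real `.wat` files in detail. `VARIANTS` and `run_variant` are hypothetical names. The point is that all four variants share the same read-modify-write structure and differ only in the operator and in whether the operand depends on `$i`.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

# name -> (wrap mask of the global's type, update applied each iteration)
VARIANTS = {
    # mut i32 global, operand is $i narrowed to i32, i32.add
    "global_acc_only": (MASK32, lambda g, i: g + (i & MASK32)),
    # mut i64 global, operand is $i, i64.add
    "global_acc_i64": (MASK64, lambda g, i: g + i),
    # mut i32 global, operand is $i narrowed to i32, i32.xor
    "global_acc_i32_xorlike": (MASK32, lambda g, i: g ^ (i & MASK32)),
    # mut i32 global, operand is the constant 1 (loop-invariant), i32.add
    "global_acc_i32_const_add": (MASK32, lambda g, i: g + 1),
}


def run_variant(name, n):
    """Model n iterations of a variant, with $i counting down from n to 1."""
    mask, update = VARIANTS[name]
    g = 0
    for i in range(n, 0, -1):
        g = update(g, i) & mask  # global.get ... global.set
    return g


if __name__ == "__main__":
    for name in VARIANTS:
        print(name, run_variant(name, 1000))
```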

Environment

All runtimes are release builds and run in JIT mode.

  • wasmer: 6.1.0
  • WAMR: iwasm 2.4.4
  • wasmedge: 0.16.1-18-gc457fe30
  • wasmtime: 41.0.0 (4898322a4 2025-12-18)
  • llvm: 21.1.5
  • Host OS: Ubuntu 22.04.5 LTS x64
  • CPU: 12th Gen Intel® Core™ i7-12700 × 20

Steps to reproduce

  1. Compile the testcase with wat2wasm.
  2. Run it under wamr_llvm_jit and compare the wall-clock / task-clock time with other runtimes.
  3. Repeat with the small structural variants above.

A representative comparison uses the same host and iteration count (4294967296) across runtimes.

wat2wasm test_case.wat -o test_case.wasm

# Execute the wasm file and collect data
perf stat -r 5 -e 'task-clock' /path/to/wasmer run -l test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/wasmedge --enable-jit test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/build_llvm_jit/iwasm test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/build_fast_jit/iwasm test_case.wasm
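
To repeat the measurement over several variants, the commands above can be generated from a small driver. This is a sketch only: the runtime labels are assumed to correspond to the four commands in the order listed, the `/path/to/...` placeholders are kept as they appear above, and the function names are hypothetical. By default it only prints the commands (dry run).

```python
import shlex
import subprocess

# Label -> command prefix, copied from the commands above.
# The label-to-command pairing is an assumption.
RUNTIME_COMMANDS = {
    "wasmer_llvm": ["/path/to/wasmer", "run", "-l"],
    "wasmedge_jit": ["/path/to/wasmedge", "--enable-jit"],
    "wamr_llvm_jit": ["/path/to/build_llvm_jit/iwasm"],
    "wamr_fast_jit": ["/path/to/build_fast_jit/iwasm"],
}


def perf_command(runtime, wasm_file, repeats=5):
    """Build the perf stat argv for one runtime and one wasm file."""
    return [
        "perf", "stat", "-r", str(repeats), "-e", "task-clock",
        *RUNTIME_COMMANDS[runtime], wasm_file,
    ]


def run_all(wasm_files, dry_run=True):
    """Print every command; execute it as well when dry_run is False."""
    for wasm_file in wasm_files:
        for runtime in RUNTIME_COMMANDS:
            cmd = perf_command(runtime, wasm_file)
            print(shlex.join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=False)


if __name__ == "__main__":
    run_all(["test_case.wasm"])
```

Passing `dry_run=False` runs each command through `perf stat`, which reports its statistics on stderr.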

Expected and actual behavior

Expected behavior

For such a small and simple repeated mutable-global update loop, I would expect wamr_llvm_jit to be in the same rough performance range as the other major JIT/AOT backends, or at least not to be a dramatic outlier.

Actual behavior

wamr_llvm_jit is a large outlier on the loop-varying mutable-global update pattern: about 7x slower than wasmer_llvm and about 2.4x slower than wamr_fast_jit.

Representative task-clock timings:

variant                    wasmer_llvm   wasmedge_jit   wamr_llvm_jit   wamr_fast_jit
global_acc_only            ~1.00s        ~0.015s        ~7.03s          ~2.95s
global_acc_i64             ~0.98s        ~0.016s        ~7.06s          ~2.92s
global_acc_i32_xorlike     ~0.99s        ~0.62s         ~7.01s          ~2.92s
global_acc_i32_const_add   ~1.01s        ~0.016s        ~0.99s          ~2.92s
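
To put these timings on a per-iteration scale, the following sketch converts them to nanoseconds per loop iteration and to a ratio against wasmer_llvm. It assumes every variant runs the same 4294967296 iterations as the main testcase; the numbers are the approximate values from the table, and the helper names are hypothetical.

```python
ITERATIONS = 4294967296  # 2^32, assumed identical for every variant

# Approximate task-clock seconds, copied from the table above.
TIMINGS = {
    "global_acc_only": {
        "wasmer_llvm": 1.00, "wasmedge_jit": 0.015,
        "wamr_llvm_jit": 7.03, "wamr_fast_jit": 2.95,
    },
    "global_acc_i64": {
        "wasmer_llvm": 0.98, "wasmedge_jit": 0.016,
        "wamr_llvm_jit": 7.06, "wamr_fast_jit": 2.92,
    },
    "global_acc_i32_xorlike": {
        "wasmer_llvm": 0.99, "wasmedge_jit": 0.62,
        "wamr_llvm_jit": 7.01, "wamr_fast_jit": 2.92,
    },
    "global_acc_i32_const_add": {
        "wasmer_llvm": 1.01, "wasmedge_jit": 0.016,
        "wamr_llvm_jit": 0.99, "wamr_fast_jit": 2.92,
    },
}


def ns_per_iteration(seconds):
    """Average time per loop iteration in nanoseconds."""
    return seconds * 1e9 / ITERATIONS


def slowdown(variant, runtime="wamr_llvm_jit", baseline="wasmer_llvm"):
    """Ratio of a runtime's time to the baseline's time for one variant."""
    row = TIMINGS[variant]
    return row[runtime] / row[baseline]


if __name__ == "__main__":
    for variant, row in TIMINGS.items():
        ns = ns_per_iteration(row["wamr_llvm_jit"])
        print(f"{variant}: {ns:.2f} ns/iter, {slowdown(variant):.2f}x vs wasmer_llvm")
```

On these figures the three loop-varying variants cost wamr_llvm_jit roughly 1.6 ns per iteration, against roughly 0.23 ns for the constant-increment variant.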

Important observations:

  • The slowdown is reproducible without any memory-load instruction.
  • The slowdown is not limited to i32.add; it also appears with i64.add and i32.xor in the same mutable-global read-modify-write structure.
  • The slowdown disappears when the global is still updated every iteration but the update uses a constant increment (i32.const 1) instead of a loop-varying value.

This suggests that the trigger condition is specifically a repeated mutable-global read-modify-write update whose new value depends on a loop-varying integer value.

Extra Info

This testcase family was discovered while analyzing a separate i32.load8_s / i32.load8_u microbenchmark family. After several reduction steps, the narrow-load instruction itself no longer appears necessary to reproduce the dominant wamr_llvm_jit slowdown.

I have low-level evidence from other runtimes (Wasmer LLVM / Cranelift / Wasmtime) showing that the corresponding hot loop structure survives lowering, that is, the loop is still present in the generated code. I have not yet confirmed the exact internal cause inside WAMR LLVM JIT, so the description above is intentionally limited to the observed trigger pattern and timing behavior.
