### Subject of the issue
wamr_llvm_jit is dramatically slower than peer runtimes on tight loops that repeatedly perform mutable-global read-modify-write updates when the updated value depends on a loop-varying integer value.
### Test case
The clearest minimized reproducer is a hot loop with repeated mutable-global accumulation and no memory-load target at all:
```wat
(module
  (type (func (param i32)))
  (type (func))
  (import "wasi_snapshot_preview1" "proc_exit" (func (type 0)))
  (func (type 1)
    (local $i i64)
    (local.set $i (i64.const 4294967296))
    (loop $body
      (global.get $g0)
      (local.get $i)
      (i32.wrap_i64)
      (i32.add)
      (global.set $g0)
      (local.set $i (i64.sub (local.get $i) (i64.const 1)))
      (br_if $body (i64.ne (local.get $i) (i64.const 0)))
    )
    (call 0 (i32.const 0))
  )
  (memory $m0 1)
  (global $g0 (mut i32) (i32.const 0))
  (export "_start" (func 1))
  (export "memory" (memory 0))
)
```
Related reduced variants also show a consistent pattern:
- `global_acc_i64.wat`: `global.get (mut i64)` + `local.get $i` + `i64.add` + `global.set`
- `global_acc_i32_xorlike.wat`: `global.get (mut i32)` + `local.get $i` narrowed to `i32` + `i32.xor` + `global.set`
- `global_acc_i32_const_add.wat`: `global.get (mut i32)` + `i32.const 1` + `i32.add` + `global.set`
### Environment
All runtimes are release builds and run in JIT mode.
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wasmtime: 41.0.0 (4898322a4 2025-12-18)
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
### Steps to reproduce
- Compile the testcase with `wat2wasm`.
- Run it under `wamr_llvm_jit` and compare the wall-clock / task-clock time with other runtimes.
- Repeat with the small structural variants above.
A representative comparison uses the same host and iteration count (4294967296) across runtimes.
```sh
wat2wasm test_case.wat -o test_case.wasm
# Execute the wasm file and collect data
perf stat -r 5 -e 'task-clock' /path/to/wasmer run -l test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/wasmedge --enable-jit test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/build_llvm_jit/iwasm test_case.wasm
perf stat -r 5 -e 'task-clock' /path/to/build_fast_jit/iwasm test_case.wasm
```
### Expected and actual behavior

#### Expected behavior
For such a small and simple repeated mutable-global update loop, I would expect wamr_llvm_jit to be in the same rough performance range as the other major JIT/AOT backends, or at least not to be a dramatic outlier.
#### Actual behavior
wamr_llvm_jit is a large slowdown outlier on the loop-varying mutable-global update pattern.
Representative task-clock timings:
| variant | wasmer_llvm | wasmedge_jit | wamr_llvm_jit | wamr_fast_jit |
| --- | --- | --- | --- | --- |
| global_acc_only | ~1.00s | ~0.015s | ~7.03s | ~2.95s |
| global_acc_i64 | ~0.98s | ~0.016s | ~7.06s | ~2.92s |
| global_acc_i32_xorlike | ~0.99s | ~0.62s | ~7.01s | ~2.92s |
| global_acc_i32_const_add | ~1.01s | ~0.016s | ~0.99s | ~2.92s |
Important observations:
- The slowdown is reproducible without any memory-load instruction.
- The slowdown is not limited to `i32.add`; it also appears with `i64.add` and `i32.xor` in the same mutable-global read-modify-write structure.
- The slowdown disappears when the global is still updated every iteration but the update uses a constant increment (`i32.const 1`) instead of a loop-varying value.
This suggests that the trigger condition is specifically a repeated mutable-global read-modify-write update whose new value depends on a loop-varying integer value.
### Extra Info
This testcase family was discovered while analyzing a separate `i32.load8_s` / `i32.load8_u` microbenchmark family. After several reduction steps, the narrow-load instruction itself no longer appears necessary to reproduce the dominant wamr_llvm_jit slowdown.

From inspecting the lowered output of other runtimes (Wasmer LLVM / Cranelift / Wasmtime), the corresponding hot loop structure remains present after lowering. I have not yet confirmed the exact internal cause inside the WAMR LLVM JIT, so the description above is intentionally limited to the observed trigger pattern and timing behavior.