
Commit 486f497

fix(webapp): eliminate SSE abort-signal memory leak (#3430)
## Summary

Fixes a server-side memory leak in the webapp's SSE helper. Every aborted SSE connection (client tab close, navigation, timeout) was pinning its full request/response graph indefinitely on Node 20, so any long-running webapp process accumulated retained memory proportional to streaming-request churn.

## Root cause

`apps/webapp/app/utils/sse.ts` combined three abort signals via `AbortSignal.any([requestAbortSignal, timeoutSignal, internalController.signal])`. The composite signal tracks its source signals in an internal `Set<WeakRef>` registered against a `FinalizationRegistry`; under sustained traffic those entries accumulate faster than they're cleaned up, pinning every source signal (and its listeners, and anything those listeners close over) until the parent signal itself is GC'd or aborts. This is a long-standing Node issue with multiple open reports:

- [nodejs/node#54614](nodejs/node#54614) — original report, still open. A [follow-up from ChainSafe](nodejs/node#54614 (comment)) describes exactly this shape in a Lodestar production workload (request and timeout signals composed per request, accumulating in a long-running worker) and the same mitigation: drop `AbortSignal.any` and compose manually.
- [nodejs/node#55351](nodejs/node#55351) — mechanism confirmed by Node member @jasnell: *"the set of dependent signals known to the AbortSignal are kept in an internal Set using WeakRefs. The AbortSignals are being properly gc'd but the Set is never cleaned out of the WeakRefs making those leak."* Partially fixed by [PR #55354](nodejs/node#55354), shipped in Node 22.12.0 — but that fix only covers the tight-loop case, not long-lived parent signals.
- [nodejs/node#57584](nodejs/node#57584) — circular-dependency variant, still open.
- [nodejs/node#62363](nodejs/node#62363) — regression in Node 24/25 from an unrelated V8 change ("Don't pretenure WeakCells"). Different root cause, same symptom.
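The "compose manually" mitigation can be sketched in isolation. This is an illustrative reduction under assumed names (`composeAbort` is hypothetical, not the actual `sse.ts` code):

```typescript
// Hypothetical reduction of the "compose manually" mitigation: one
// AbortController owns the lifetime. The timeout is a plain setTimeout and
// the request-abort listener is removed on abort, so no composite signal
// ever retains the source signals.
function composeAbort(requestSignal: AbortSignal, timeoutMs: number): AbortSignal {
  const internal = new AbortController();
  const timer = setTimeout(() => internal.abort("timeout"), timeoutMs);
  const onRequestAbort = () => internal.abort("request_aborted");
  internal.signal.addEventListener(
    "abort",
    () => {
      clearTimeout(timer); // whichever path aborts first cancels the others
      requestSignal.removeEventListener("abort", onRequestAbort);
    },
    { once: true }
  );
  // Listeners added to an already-aborted signal never fire, so check first.
  if (requestSignal.aborted) onRequestAbort();
  else requestSignal.addEventListener("abort", onRequestAbort, { once: true });
  return internal.signal;
}
```

Aborting the request synchronously aborts the composed signal and clears the timer, so nothing outlives the connection.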
A separate issue in `apps/webapp/app/entry.server.tsx` — `setTimeout(abort, ABORT_DELAY)` with no `clearTimeout` on the success paths — kept the React render tree and `remixContext` alive for 30s per successful HTML request. The same pattern was fixed upstream in the React Router templates ([react-router#14200](remix-run/react-router#14200)) but never backported to Remix v2.

## What changed

- **`apps/webapp/app/utils/sse.ts`** — single-signal abort chain. `AbortSignal.any` is removed; `AbortSignal.timeout` is replaced by a plain `setTimeout` cleared when the controller aborts; named sentinel constants are used as stackless abort reasons; the request-abort handler is explicitly removed on cleanup.
- **`apps/webapp/app/entry.server.tsx`** — clears the `setTimeout(abort, ABORT_DELAY)` timer in `onShellReady` / `onAllReady` / `onShellError`.
- **`apps/webapp/app/v3/tracer.server.ts` + `env.server.ts`** — gates the OpenTelemetry `HttpInstrumentation` and `ExpressInstrumentation` behind `DISABLE_HTTP_INSTRUMENTATION=true` as an escape hatch for future OTel-listener retention patterns. Defaults to enabled.
- **`apps/webapp/app/presenters/v3/RunStreamPresenter.server.ts`** — uses the shared `ABORT_REASON_SEND_ERROR` sentinel.
## Verification

### Full-app reproduction (memlab)

Isolated local harness: 500 abrupt SSE disconnects against a dev-presence route, GC between passes, heap snapshot diff with [memlab](https://facebook.github.io/memlab/):

| Run | Heap delta after 500 conns + GC | memlab retained leaks |
| --- | --- | --- |
| Before | +16.0 MB (linear with request count) | 158 clusters; 250 `ServerResponse`, 1000 `AbortController`, 250 `SpanImpl` retained |
| After | **+3.3 MB (noise)** | **0 app-code leaks** |

### Standalone mechanism isolation

To confirm *which* axis of the change is load-bearing, a separate standalone Node script (`/tmp/abort-leak-test.mjs`) ran 2000 requests × 200 KB payload per variant:

| Variant | Heap delta after GC |
| --- | --- |
| baseline (no signal machinery) | 0 MB |
| V1: `AbortSignal.any` + string abort reason | **+9.1 MB** |
| V2: `AbortSignal.any` only (no reason) | **+10.8 MB** |
| V3: string reason only (no `AbortSignal.any`) | 0 MB |
| V4: neither (the fix) | 0 MB |
| V5: `AbortSignal.any` with no listener on the composite | **+10.2 MB** |

This isolates `AbortSignal.any` as the sole leak mechanism. The reason type (`.abort()` vs `.abort("string")`) is irrelevant for retention: V3 is clean, and V5 leaks even with no listener on the composite.

## Risk

- `sse.ts` is used by the dev-presence routes. Behaviour is equivalent — timeouts and client disconnects still abort the stream. `signal.reason` is now a named string sentinel (`"timeout"`, `"request_aborted"`, etc.) instead of the previous string arg or default `AbortError`. No in-tree reader of `signal.reason` exists.
- The `entry.server.tsx` change is a standard cleanup of an abort timer and matches upstream React Router guidance.
- The `tracer.server.ts` change is env-gated and defaults to current behaviour.
- Three other webapp `AbortSignal.timeout()` call sites (alert delivery, remote-build status) pass fire-and-forget signals directly to `fetch` — nothing long-lived is composed with them, so there is no retention risk; they are untouched.
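The standalone script described in the Verification section above is not in-tree; a hedged reconstruction of its per-request shape (names, payload size, and structure are assumptions, not the script's actual code) looks like this, with `useAny` toggling between the leak-prone composite and the fixed single-controller variant:

```typescript
// Hypothetical reconstruction of one churn iteration from the standalone
// isolation script: each "request" gets a request signal, a timeout, and a
// payload captured by the abort listener (standing in for the req/res graph).
// With useAny=true the composite from AbortSignal.any retains the sources;
// with useAny=false the single controller releases everything on abort.
function runVariant(useAny: boolean, iterations: number): number {
  let aborted = 0;
  for (let i = 0; i < iterations; i++) {
    const req = new AbortController();
    const payload = Buffer.alloc(200 * 1024); // ~200 KB per request
    const signal = useAny
      ? AbortSignal.any([req.signal, AbortSignal.timeout(30_000)]) // leak-prone shape
      : req.signal; // fixed shape
    signal.addEventListener(
      "abort",
      () => {
        void payload; // the listener closes over the payload
        aborted++;
      },
      { once: true }
    );
    req.abort(); // simulate an abrupt client disconnect
  }
  return aborted;
}
```

Both variants behave identically (every request observes its abort); only what remains reachable after GC differs, which is what the heap deltas above measure.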
## Test plan

- [ ] Existing SSE integration tests pass
- [ ] Dev-presence SSE behaves normally across tab open/close cycles
- [ ] No heap growth under sustained aborted-connection traffic (heap snapshot diff)

## Follow-up

The same `AbortSignal.any([userSignal, internalSignal])` pattern exists in several SDK/core call sites that ship to customers (`packages/core/src/v3/realtimeStreams/manager.ts`, `packages/trigger-sdk/src/v3/{ai,chat,chat-client,sessions}.ts`, `packages/core/src/v3/workers/warmStartClient.ts`). Whether those leak in practice depends on the user passing a long-lived signal. Tracked separately.
1 parent 87b6716 commit 486f497

6 files changed

Lines changed: 92 additions & 43 deletions


Lines changed: 6 additions & 0 deletions
```diff
@@ -0,0 +1,6 @@
+---
+area: webapp
+type: fix
+---
+
+Fix memory leak where every aborted SSE connection pinned the full request/response graph on Node 20, caused by `AbortSignal.any()` in `sse.ts` retaining its source signals indefinitely (see nodejs/node#54614, nodejs/node#55351). Also clear the `setTimeout(abort)` timer in `entry.server.tsx` so successful HTML renders don't pin the React tree for 30s per request.
```

apps/webapp/app/entry.server.tsx

Lines changed: 14 additions & 2 deletions
```diff
@@ -83,6 +83,10 @@ function handleBotRequest(
 ) {
   return new Promise((resolve, reject) => {
     let shellRendered = false;
+    // Timer handle is cleared in every terminal callback so the abort closure
+    // (which captures the full React render tree + remixContext) doesn't pin
+    // memory for 30s per successful request. See react-router PR #14200.
+    let abortTimer: NodeJS.Timeout | undefined;
     const { pipe, abort } = renderToPipeableStream(
       <OperatingSystemContextProvider platform={platform}>
         <LocaleContextProvider locales={locales}>
@@ -105,8 +109,10 @@ function handleBotRequest(
           );

           pipe(body);
+          clearTimeout(abortTimer);
         },
         onShellError(error: unknown) {
+          clearTimeout(abortTimer);
           reject(error);
         },
         onError(error: unknown) {
@@ -121,7 +127,7 @@ function handleBotRequest(
       }
     );

-    setTimeout(abort, ABORT_DELAY);
+    abortTimer = setTimeout(abort, ABORT_DELAY);
   });
 }

@@ -135,6 +141,10 @@ function handleBrowserRequest(
 ) {
   return new Promise((resolve, reject) => {
     let shellRendered = false;
+    // Timer handle is cleared in every terminal callback so the abort closure
+    // (which captures the full React render tree + remixContext) doesn't pin
+    // memory for 30s per successful request. See react-router PR #14200.
+    let abortTimer: NodeJS.Timeout | undefined;
     const { pipe, abort } = renderToPipeableStream(
       <OperatingSystemContextProvider platform={platform}>
         <LocaleContextProvider locales={locales}>
@@ -157,8 +167,10 @@ function handleBrowserRequest(
           );

           pipe(body);
+          clearTimeout(abortTimer);
         },
         onShellError(error: unknown) {
+          clearTimeout(abortTimer);
           reject(error);
         },
         onError(error: unknown) {
@@ -173,7 +185,7 @@ function handleBrowserRequest(
       }
     );

-    setTimeout(abort, ABORT_DELAY);
+    abortTimer = setTimeout(abort, ABORT_DELAY);
   });
 }
```
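Boiled down, the pattern in the diff above is "arm an abort deadline, clear it on every terminal path", since the timer's closure keeps everything it captures alive until the delay elapses. A hypothetical standalone reduction (not the Remix code, which threads the timer through the `renderToPipeableStream` callbacks):

```typescript
// Hypothetical reduction of the entry.server.tsx fix: the abort timer is
// cleared as soon as the work settles, so its closure (here `controller`,
// and in the real code the React render tree + remixContext) is released
// immediately instead of surviving until the delay elapses.
async function withAbortDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  delayMs: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort("deadline"), delayMs);
  try {
    return await work(controller.signal);
  } finally {
    clearTimeout(timer); // success and failure paths both clear the timer
  }
}
```

If the timer were not cleared, a successful call would still hold the closure (and keep the process timer pending) for the full `delayMs`.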

apps/webapp/app/env.server.ts

Lines changed: 1 addition & 0 deletions
```diff
@@ -438,6 +438,7 @@ const EnvironmentSchema = z
     INTERNAL_OTEL_TRACE_SAMPLING_RATE: z.string().default("20"),
     INTERNAL_OTEL_TRACE_INSTRUMENT_PRISMA_ENABLED: z.string().default("0"),
     INTERNAL_OTEL_TRACE_DISABLED: z.string().default("0"),
+    DISABLE_HTTP_INSTRUMENTATION: BoolEnv.default(false),

     INTERNAL_OTEL_LOG_EXPORTER_URL: z.string().optional(),
     INTERNAL_OTEL_METRIC_EXPORTER_URL: z.string().optional(),
```

apps/webapp/app/presenters/v3/RunStreamPresenter.server.ts

Lines changed: 5 additions & 3 deletions
```diff
@@ -1,7 +1,7 @@
 import { type PrismaClient, prisma } from "~/db.server";
 import { logger } from "~/services/logger.server";
 import { singleton } from "~/utils/singleton";
-import { createSSELoader, SendFunction } from "~/utils/sse";
+import { ABORT_REASON_SEND_ERROR, createSSELoader, SendFunction } from "~/utils/sse";
 import { throttle } from "~/utils/throttle";
 import { tracePubSub } from "~/v3/services/tracePubSub.server";

@@ -66,8 +66,10 @@ export class RunStreamPresenter {
           });
         }
       }
-      // Abort the stream on send error
-      context.controller.abort("Send error");
+      // Abort the stream on send error. Uses a stackless string sentinel
+      // from sse.ts — a no-arg abort() would create a DOMException with a
+      // stack trace, which is unnecessary retention on the signal.reason.
+      context.controller.abort(ABORT_REASON_SEND_ERROR);
     }
   },
   1000
```
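The trade-off described in the comment above is directly observable in plain Node (this snippet is illustrative, not project code): a no-argument `abort()` synthesizes an `AbortError` `DOMException` as `signal.reason`, while a string reason is stored verbatim.

```typescript
// abort() with no argument: Node synthesizes an "AbortError" DOMException
// as signal.reason, and Error-like reasons carry a captured stack trace.
const noArg = new AbortController();
noArg.abort();
console.log(noArg.signal.reason.name); // "AbortError"

// abort("send_error"): the string is stored verbatim as signal.reason, with
// no Error object and no stack capture.
const sentinel = new AbortController();
sentinel.abort("send_error");
console.log(sentinel.signal.reason); // "send_error"
```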

apps/webapp/app/utils/sse.ts

Lines changed: 62 additions & 36 deletions
```diff
@@ -38,14 +38,27 @@ type SSEOptions = {
 // This is used to track the open connections, for debugging
 const connections: Set<string> = new Set();

+// Stackless sentinel reasons passed to AbortController#abort. Calling .abort()
+// with no argument produces a DOMException that captures a ~500-byte stack
+// trace; a string reason is stored verbatim with no stack. The choice of
+// reason type does not cause the retention we saw in prod (that was the
+// AbortSignal.any composite — see comment near the timeoutTimer below for the
+// Node issue refs), but naming the sentinels keeps call sites readable and
+// lets future signal.reason consumers branch on the cause.
+export const ABORT_REASON_REQUEST = "request_aborted";
+export const ABORT_REASON_TIMEOUT = "timeout";
+export const ABORT_REASON_SEND_ERROR = "send_error";
+export const ABORT_REASON_INIT_STOP = "init_requested_stop";
+export const ABORT_REASON_ITERATOR_STOP = "iterator_requested_stop";
+export const ABORT_REASON_ITERATOR_ERROR = "iterator_error";
+
 export function createSSELoader(options: SSEOptions) {
   const { timeout, interval = 500, debug = false, handler } = options;

   return async function loader({ request, params }: LoaderFunctionArgs) {
     const id = request.headers.get("x-request-id") || Math.random().toString(36).slice(2, 8);

     const internalController = new AbortController();
-    const timeoutSignal = AbortSignal.timeout(timeout);

     const log = (message: string) => {
       if (debug)
@@ -60,16 +73,20 @@ export function createSSELoader(options: SSEOptions) {
         if (!internalController.signal.aborted) {
           originalSend(event);
         }
-        // If controller is aborted, silently ignore the send attempt
       } catch (error) {
         if (error instanceof Error) {
           if (error.message?.includes("Controller is already closed")) {
-            // Silently handle controller closed errors
             return;
           }
           log(`Error sending event: ${error.message}`);
         }
-        throw error; // Re-throw other errors
+        // Abort before rethrowing so timer + request-abort listener are cleaned
+        // up immediately. Otherwise a send-failure in initStream leaves them
+        // alive until `timeout` fires.
+        if (!internalController.signal.aborted) {
+          internalController.abort(ABORT_REASON_SEND_ERROR);
+        }
+        throw error;
       }
     };
   };
@@ -92,51 +109,57 @@ export function createSSELoader(options: SSEOptions) {

     const requestAbortSignal = getRequestAbortSignal();

-    const combinedSignal = AbortSignal.any([
-      requestAbortSignal,
-      timeoutSignal,
-      internalController.signal,
-    ]);
-
     log("Start");

-    requestAbortSignal.addEventListener(
-      "abort",
-      () => {
-        log(`request signal aborted`);
-        internalController.abort("Request aborted");
-      },
-      { once: true, signal: internalController.signal }
-    );
+    // Single-signal abort chain: everything rolls up into internalController.
+    // Timeout is a plain setTimeout cleared on abort rather than an
+    // AbortSignal.timeout() combined via AbortSignal.any() — AbortSignal.any
+    // keeps its source signals in an internal Set<WeakRef> managed by a
+    // FinalizationRegistry, and under sustained request traffic those entries
+    // accumulate faster than they get cleaned up, pinning every source signal
+    // (and its listeners, and anything those listeners close over) until the
+    // parent signal is GC'd or aborts. Reproduced locally in isolation; shape
+    // matches the ChainSafe Lodestar production case described in
+    // nodejs/node#54614. See also nodejs/node#55351 (mechanism confirmed by
+    // @jasnell, narrow fix in 22.12.0 via #55354) and nodejs/node#57584
+    // (circular-dep variant, still open).
+    const timeoutTimer = setTimeout(() => {
+      if (!internalController.signal.aborted) internalController.abort(ABORT_REASON_TIMEOUT);
+    }, timeout);
+
+    const onRequestAbort = () => {
+      log("request signal aborted");
+      if (!internalController.signal.aborted) internalController.abort(ABORT_REASON_REQUEST);
+    };

-    combinedSignal.addEventListener(
+    internalController.signal.addEventListener(
       "abort",
       () => {
-        log(`combinedSignal aborted: ${combinedSignal.reason}`);
+        clearTimeout(timeoutTimer);
+        requestAbortSignal.removeEventListener("abort", onRequestAbort);
       },
-      { once: true, signal: internalController.signal }
+      { once: true }
     );

-    timeoutSignal.addEventListener(
-      "abort",
-      () => {
-        if (internalController.signal.aborted) return;
-        log(`timeoutSignal aborted: ${timeoutSignal.reason}`);
-        internalController.abort("Timeout");
-      },
-      { once: true, signal: internalController.signal }
-    );
+    // The request could have been aborted during `await handler(context)` above.
+    // AbortSignal listeners added after the signal is already aborted never fire,
+    // so invoke cleanup synchronously in that case instead of waiting for `timeout`.
+    if (requestAbortSignal.aborted) {
+      onRequestAbort();
+    } else {
+      requestAbortSignal.addEventListener("abort", onRequestAbort, { once: true });
+    }

     if (handlers.beforeStream) {
       const shouldContinue = await handlers.beforeStream();
       if (shouldContinue === false) {
         log("beforeStream returned false, so we'll exit before creating the stream");
-        internalController.abort("Init requested stop");
+        internalController.abort(ABORT_REASON_INIT_STOP);
         return;
       }
     }

-    return eventStream(combinedSignal, function setup(send) {
+    return eventStream(internalController.signal, function setup(send) {
       connections.add(id);
       const safeSend = createSafeSend(send);

@@ -147,14 +170,14 @@ export function createSSELoader(options: SSEOptions) {
       const shouldContinue = await handlers.initStream({ send: safeSend });
       if (shouldContinue === false) {
         log("initStream returned false, so we'll stop the stream");
-        internalController.abort("Init requested stop");
+        internalController.abort(ABORT_REASON_INIT_STOP);
         return;
       }
     }

     log("Starting interval");
     for await (const _ of setInterval(interval, null, {
-      signal: combinedSignal,
+      signal: internalController.signal,
     })) {
       log("PING");

@@ -165,13 +188,16 @@ export function createSSELoader(options: SSEOptions) {
       const shouldContinue = await handlers.iterator({ date, send: safeSend });
       if (shouldContinue === false) {
         log("iterator return false, so we'll stop the stream");
-        internalController.abort("Iterator requested stop");
+        internalController.abort(ABORT_REASON_ITERATOR_STOP);
         break;
       }
     } catch (error) {
       log("iterator threw an error, aborting stream");
       // Immediately abort to trigger cleanup
-      internalController.abort(error instanceof Error ? error.message : "Iterator error");
+      if (error instanceof Error && error.name !== "AbortError") {
+        log(`iterator error: ${error.message}`);
+      }
+      internalController.abort(ABORT_REASON_ITERATOR_ERROR);
       // No need to re-throw as we're handling it by aborting
       return; // Exit the run function immediately
     }
```
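One subtlety in the diff above deserves isolation: abort listeners registered after a signal has already aborted never fire, which is why the loader branches on `requestAbortSignal.aborted` before attaching its handler. A plain-Node illustration (illustrative names, not project code):

```typescript
// An AbortSignal fires its "abort" event once, at abort time. A listener
// attached afterwards is never invoked, so code that may observe an
// already-aborted signal (e.g. after an await) must branch on
// signal.aborted and run its cleanup synchronously, as the loader does.
const controller = new AbortController();
controller.abort("too_late");

let listenerFired = false;
controller.signal.addEventListener(
  "abort",
  () => {
    listenerFired = true; // never runs: the event fired in the past
  },
  { once: true }
);

let cleanedUp = false;
if (controller.signal.aborted) {
  cleanedUp = true; // synchronous fallback path
}
console.log(listenerFired, cleanedUp); // false true
```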

apps/webapp/app/v3/tracer.server.ts

Lines changed: 4 additions & 2 deletions
```diff
@@ -302,13 +302,15 @@ function setupTelemetry() {
   provider.register();

   let instrumentations: Instrumentation[] = [
-    new HttpInstrumentation(),
-    new ExpressInstrumentation(),
     new AwsSdkInstrumentation({
       suppressInternalInstrumentation: true,
     }),
   ];

+  if (!env.DISABLE_HTTP_INSTRUMENTATION) {
+    instrumentations.unshift(new HttpInstrumentation(), new ExpressInstrumentation());
+  }
+
   if (env.INTERNAL_OTEL_TRACE_INSTRUMENT_PRISMA_ENABLED === "1") {
     instrumentations.push(new PrismaInstrumentation());
   }
```
