LogicalReplicationClient stuck in infinite extend loop after Redlock expires; never re-acquires #3428

@eni9889

Description

Summary

RunsReplicationService / LogicalReplicationClient uses Redlock2 (Redis-based) for single-leader election on the task_runs_to_clickhouse_v1 slot. When the leader loses its lock — typically a brief event-loop stall or Redis blip pushes lock renewal past RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS — the renewer never recovers. It just logs Cannot extend an already-expired lock every RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS forever.

The pod stays "healthy" (no health check failure, no crash), but it's no longer functionally the leader. Another pod does eventually pick up the lock and resume replication, so user-visible CH lag is bounded — but the original pod permanently spams two paired error log lines per renewal interval.

Environment

  • Self-hosted Trigger.dev v4 (helm chart, webapp v4.3.0)
  • 8 webapp replicas
  • RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS=60000
  • RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS=20000
  • RUN_REPLICATION_LEADER_LOCK_ADDITIONAL_TIME_MS=30000
  • Redis: bitnami chart, single master in same K8s cluster

Observed log signature

Every 20 seconds, identical timestamp delta, from one specific pod:

{
  "name":"LogicalReplicationClient",
  "slotName":"task_runs_to_clickhouse_v1",
  "publicationName":"task_runs_to_clickhouse_v1_publication",
  "error":{
    "message":"Cannot extend an already-expired lock.",
    "stack":"ExecutionError: Cannot extend an already-expired lock.\n    at Redlock2.extend (apps/webapp/build/index.js:120923:13)\n    at Lock.extend (apps/webapp/build/index.js:120834:25)\n    at Timeout._onTimeout (apps/webapp/build/index.js:121616:49)\n    at listOnTimeout (node:internal/timers:573:17)\n    at process.processTimers (node:internal/timers:514:7)"
  },
  "lockTimeoutMs":60000,
  "lockExtendIntervalMs":20000,
  "message":"Failed to extend leader lock",
  "level":"error"
}

Followed in the same millisecond by a sibling RunsReplicationService log:

{
  "name":"RunsReplicationService",
  "error":{ "message":"Cannot extend an already-expired lock.", ... },
  "message":"Replication client error",
  "level":"error"
}

Sample timestamps from one production pod, no breaks for hours:

2026-04-23T03:14:12.639Z
2026-04-23T03:14:32.639Z
2026-04-23T03:14:52.639Z
2026-04-23T03:15:12.640Z
2026-04-23T03:15:32.640Z
…

Killing the pod (kubectl delete pod) immediately stops the spam. The newly-spawned replica re-elects cleanly.

Reproduction

  1. Self-host with multiple webapp replicas + Redis. Run task_runs_to_clickhouse_v1 replication.
  2. Force the current leader pod's event loop to stall for > RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS. Easy ways:
    • kubectl exec into the pod and send SIGSTOP to the node process, then SIGCONT after ~70 seconds, or
    • network-partition the pod from Redis for ~70 seconds via a NetworkPolicy.
  3. After the partition heals, the renewer's next extend() call rejects with Cannot extend an already-expired lock.
  4. Observe the log line repeating every RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS indefinitely. The pod is no longer the leader (another pod has acquired the slot) but doesn't know it.
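To see why the stall in step 2 is sufficient, here is a scaled-down model (20ms / 60ms standing in for the real 20s / 60s settings; nothing here touches Redis or the actual service):

```typescript
// Scaled-down model of step 2: a synchronous stall delays the extend timer
// past the lock TTL, so the renewal attempt is doomed before it runs.
const LOCK_TTL_MS = 60;
const EXTEND_INTERVAL_MS = 20;

const lockExpiresAt = Date.now() + LOCK_TTL_MS;
let sawExpiredLock = false;

setTimeout(() => {
  // This callback was due at +20ms but can only run once the stall ends,
  // by which point the lock TTL has already elapsed.
  sawExpiredLock = Date.now() > lockExpiresAt;
}, EXTEND_INTERVAL_MS);

// Simulate the SIGSTOP / event-loop stall: block synchronously for > TTL.
const stallEnd = Date.now() + 70;
while (Date.now() < stallEnd) {
  // busy-wait: no timers fire, no Redis renewals go out
}
```

The timer fires as soon as the event loop is unblocked, but by then Redis has long since expired the key, which is exactly the state extend() then reports.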

Root cause (best guess from stack)

LogicalReplicationClient.extend() is invoked from a setTimeout chain with no error-handler that resets state. When extend() rejects:

  • The error is caught and logged.
  • The setTimeout for the next extend is re-armed unconditionally.
  • The client never calls acquire() to re-elect, never disposes the slot listener, and never marks itself as not-leader.

So the renewal loop continues forever against a lock the pod no longer holds, while the actual leadership has silently moved to another pod.
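A minimal sketch of the renewal-loop shape the stack trace implies (names like startExtendLoop and the Lock type are illustrative guesses, not the actual Trigger.dev source):

```typescript
// Hypothetical reconstruction of the renewal loop: extend() failures are
// caught and logged, but the next tick is re-armed unconditionally.
type Lock = { extend(ttlMs: number): Promise<Lock> };

function startExtendLoop(
  lock: Lock,
  ttlMs: number,
  intervalMs: number,
  onError: (err: unknown) => void,
): () => void {
  let timer: ReturnType<typeof setTimeout>;
  const tick = async () => {
    try {
      lock = await lock.extend(ttlMs);
    } catch (err) {
      // Surfaces as "Failed to extend leader lock" / "Replication client error"...
      onError(err);
    }
    // ...but the timer is re-armed unconditionally: an expired lock keeps
    // being "extended" (and keeps failing) every interval, and nothing ever
    // calls acquire() again, marks the pod not-leader, or closes the slot.
    timer = setTimeout(tick, intervalMs);
  };
  timer = setTimeout(tick, intervalMs);
  return () => clearTimeout(timer);
}
```

Against a lock whose extend() always rejects, this loop produces exactly the observed signature: one paired error per interval, forever, with no state change.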

Suggested fix

In RunsReplicationService / LogicalReplicationClient extend handler, when extend() rejects with the expired-lock error:

  1. Stop the renewal loop (clearTimeout).
  2. Tear down the open replication slot subscription on this pod.
  3. Either:
    • (a) Attempt a fresh acquire(). It succeeds if no other pod holds the lock and fails otherwise; failure is fine, just don't loop. OR
    • (b) Throw / exit the process so K8s restarts it and the deployment self-heals via a fresh re-election.

Option (b) is simpler and more robust against further state drift. The pod is already useless for replication at this point, so a restart costs nothing.
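A hedged sketch of what that handler could look like, with the exit injectable so the same shape covers option (a) too (all names here are illustrative, not proposed API):

```typescript
// On an expired-lock rejection: stop renewing, tear down the slot
// subscription, and hand off to the orchestrator. Other (transient)
// errors keep today's retry-on-next-interval behaviour.
const EXPIRED_MESSAGE = "Cannot extend an already-expired lock.";

async function onExtendError(
  err: unknown,
  opts: {
    stopRenewal: () => void; // 1. clearTimeout on the extend loop
    teardown: () => Promise<void>; // 2. close the replication slot subscription
    exit: (code: number) => void; // 3. e.g. process.exit in production (option b)
  },
): Promise<boolean> {
  if (err instanceof Error && err.message === EXPIRED_MESSAGE) {
    opts.stopRenewal();
    await opts.teardown();
    opts.exit(1); // let K8s restart the pod; a fresh election follows
    return true; // handled: this pod is no longer the leader
  }
  return false; // transient error: the next interval can retry as today
}
```

Matching on the error message is only for illustration; the real check would presumably use whatever typed error the lock library exposes.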

Workaround for self-hosters

Restart the noisy pod. We've added a Datadog monitor on Failed to extend leader lock count > N/min that pages and includes a runbook step to kubectl delete pod <leader>.

Related context

We're already running this CH-load reduction work that surfaced the issue: #3426
