Summary
RunsReplicationService / LogicalReplicationClient uses Redlock2 (Redis-based) for single-leader election on the task_runs_to_clickhouse_v1 slot. When the leader loses its lock — typically because a brief event-loop stall or Redis blip pushes lock renewal past RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS — the renewer never recovers: it just logs Cannot extend an already-expired lock every RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS, forever.
The pod stays "healthy" (no health check failure, no crash), but it's no longer functionally the leader. Another pod does eventually pick up the lock and resume replication, so user-visible CH lag is bounded — but the original pod permanently spams two paired error log lines per renewal interval.
Environment
- Self-hosted Trigger.dev v4 (Helm chart, webapp v4.3.0)
- 8 webapp replicas
- RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS=60000
- RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS=20000
- RUN_REPLICATION_LEADER_LOCK_ADDITIONAL_TIME_MS=30000
- Redis: bitnami chart, single master in same K8s cluster
Observed log signature
Every 20 seconds, identical timestamp delta, from one specific pod:
{
"name":"LogicalReplicationClient",
"slotName":"task_runs_to_clickhouse_v1",
"publicationName":"task_runs_to_clickhouse_v1_publication",
"error":{
"message":"Cannot extend an already-expired lock.",
"stack":"ExecutionError: Cannot extend an already-expired lock.\n at Redlock2.extend (apps/webapp/build/index.js:120923:13)\n at Lock.extend (apps/webapp/build/index.js:120834:25)\n at Timeout._onTimeout (apps/webapp/build/index.js:121616:49)\n at listOnTimeout (node:internal/timers:573:17)\n at process.processTimers (node:internal/timers:514:7)"
},
"lockTimeoutMs":60000,
"lockExtendIntervalMs":20000,
"message":"Failed to extend leader lock",
"level":"error"
}
Followed in the same millisecond by a sibling RunsReplicationService log:
{
"name":"RunsReplicationService",
"error":{ "message":"Cannot extend an already-expired lock.", ... },
"message":"Replication client error",
"level":"error"
}
Sample timestamps from one production pod, no breaks for hours:
2026-04-23T03:14:12.639Z
2026-04-23T03:14:32.639Z
2026-04-23T03:14:52.639Z
2026-04-23T03:15:12.640Z
2026-04-23T03:15:32.640Z
…
Killing the pod (kubectl delete pod) immediately stops the spam. The newly-spawned replica re-elects cleanly.
Reproduction
- Self-host with multiple webapp replicas + Redis. Run task_runs_to_clickhouse_v1 replication.
- Force the current leader pod's event loop to stall for > RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS. Easy ways: kubectl exec and send SIGSTOP for 70 seconds, then SIGCONT; or network-partition the pod from Redis for 70 seconds via NetworkPolicy.
- After the stall/partition heals, the renewer's next extend() call rejects with Cannot extend an already-expired lock.
- Observe the log line repeating every RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS indefinitely. The pod is no longer the leader (another pod has acquired the slot) but doesn't know it.
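The event-loop stall in step 2 can also be simulated in-process, without kubectl, by busy-waiting on the main thread (hypothetical helper for reproduction only, not part of the codebase):

```typescript
// Block the Node.js event loop for `ms` milliseconds with a synchronous
// busy-wait. While this runs, no timers fire, so the scheduled Redlock
// extend cannot execute and the lock TTL expires out from under the pod.
function blockEventLoop(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else on this thread can run
  }
}

// e.g. blockEventLoop(70_000) inside the leader pod stalls it past the
// 60s RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS.
```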
Root cause (best guess from stack)
LogicalReplicationClient.extend() is invoked from a setTimeout chain with no error handler that resets state. When extend() rejects:
- The error is caught and logged.
- The setTimeout for the next extend is re-armed unconditionally.
- The client never calls acquire() to re-elect, never disposes the slot listener, and never marks itself as not-leader.
So the renewal loop continues forever against a lock the pod no longer holds, while leadership has silently moved to another pod.
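A minimal sketch of the suspected buggy pattern (illustrative names, not the actual Trigger.dev source): the next extend is scheduled even when extend() rejects, so the loop never stops.

```typescript
type Lock = { extend(ttlMs: number): Promise<void> };

// Start a renewal loop that re-arms itself after every tick, success or
// failure. On a permanently-expired lock this logs forever -- matching
// the observed 20s error cadence.
function startBuggyRenewalLoop(
  lock: Lock,
  intervalMs: number,
  ttlMs: number,
  onError: (err: unknown) => void,
): () => void {
  let timer: ReturnType<typeof setTimeout>;
  const tick = () => {
    lock
      .extend(ttlMs)
      .catch(onError) // error is logged...
      .finally(() => {
        // ...but the loop is re-armed unconditionally: no re-election,
        // no teardown, no not-leader state transition.
        timer = setTimeout(tick, intervalMs);
      });
  };
  timer = setTimeout(tick, intervalMs);
  return () => clearTimeout(timer);
}
```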
Suggested fix
In the RunsReplicationService / LogicalReplicationClient extend handler, when extend() rejects with the expired-lock error:
- Stop the renewal loop (clearTimeout).
- Tear down the open replication slot subscription on this pod.
- Then either:
  (a) attempt a fresh acquire() — it will succeed if no other pod holds the lock and fail otherwise; failing is fine, just don't loop — or
  (b) throw / exit the process so K8s restarts the pod and the deployment self-heals via a fresh re-election.
Option (b) is simpler and more robust against further state drift. The pod is already useless for replication at this point, so a restart costs nothing.
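A sketch of the option (b) handler; isExpiredLockError, the teardown hooks, and the injectable exit are assumptions for illustration, not the actual Trigger.dev API:

```typescript
// Classify the rejection: only the expired-lock case means "step down".
function isExpiredLockError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("already-expired lock");
}

interface StepDownHooks {
  stopRenewal(): void;      // clearTimeout on the extend timer
  teardownSlot(): void;     // close the replication slot subscription
  exit(code: number): void; // process.exit in production, injectable for tests
}

function handleExtendFailure(err: unknown, hooks: StepDownHooks): void {
  if (!isExpiredLockError(err)) {
    // Unrelated failure (e.g. a transient Redis error): surface it so
    // any existing retry/backoff path can deal with it.
    throw err;
  }
  // The lock is gone: this pod is no longer the leader. Stop pretending.
  hooks.stopRenewal();
  hooks.teardownSlot();
  hooks.exit(1); // let K8s restart the pod; a fresh replica re-elects
}
```

Wiring `exit` to `process.exit` gives the restart-based self-heal; swapping it for a fresh acquire() attempt gives option (a) without changing the classification logic.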
Workaround for self-hosters
Restart the noisy pod. We've added a Datadog monitor on Failed to extend leader lock count > N/min that pages and includes a runbook step to kubectl delete pod <leader>.
Related context
We're already running this CH-load reduction work that surfaced the issue: #3426