LogicalReplicationClient stuck in infinite extend loop after Redlock expires; never re-acquires #3428

@eni9889

Description

Summary

RunsReplicationService / LogicalReplicationClient uses Redlock2 (Redis-based) for single-leader election on the task_runs_to_clickhouse_v1 slot. When the leader loses its lock — typically a brief event-loop stall or Redis blip pushes lock renewal past RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS — the renewer never recovers. It just logs Cannot extend an already-expired lock every RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS forever.

The pod stays "healthy" (no health check failure, no crash), but it's no longer functionally the leader. Another pod does eventually pick up the lock and resume replication, so user-visible CH lag is bounded — but the original pod permanently spams two paired error log lines per renewal interval.

Environment

  • Self-hosted Trigger.dev v4 (helm chart, webapp v4.3.0)
  • 8 webapp replicas
  • RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS=60000
  • RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS=20000
  • RUN_REPLICATION_LEADER_LOCK_ADDITIONAL_TIME_MS=30000
  • Redis: bitnami chart, single master in same K8s cluster

Observed log signature

Every 20 seconds, identical timestamp delta, from one specific pod:

{
  "name":"LogicalReplicationClient",
  "slotName":"task_runs_to_clickhouse_v1",
  "publicationName":"task_runs_to_clickhouse_v1_publication",
  "error":{
    "message":"Cannot extend an already-expired lock.",
    "stack":"ExecutionError: Cannot extend an already-expired lock.\n    at Redlock2.extend (apps/webapp/build/index.js:120923:13)\n    at Lock.extend (apps/webapp/build/index.js:120834:25)\n    at Timeout._onTimeout (apps/webapp/build/index.js:121616:49)\n    at listOnTimeout (node:internal/timers:573:17)\n    at process.processTimers (node:internal/timers:514:7)"
  },
  "lockTimeoutMs":60000,
  "lockExtendIntervalMs":20000,
  "message":"Failed to extend leader lock",
  "level":"error"
}

Followed in the same millisecond by a sibling RunsReplicationService log:

{
  "name":"RunsReplicationService",
  "error":{ "message":"Cannot extend an already-expired lock.", ... },
  "message":"Replication client error",
  "level":"error"
}

Sample timestamps from one production pod, no breaks for hours:

2026-04-23T03:14:12.639Z
2026-04-23T03:14:32.639Z
2026-04-23T03:14:52.639Z
2026-04-23T03:15:12.640Z
2026-04-23T03:15:32.640Z
…

Killing the pod (kubectl delete pod) immediately stops the spam. The newly-spawned replica re-elects cleanly.

Reproduction

  1. Self-host with multiple webapp replicas + Redis. Run task_runs_to_clickhouse_v1 replication.
  2. Force the current leader pod's event loop to stall for > RUN_REPLICATION_LEADER_LOCK_TIMEOUT_MS. Easy ways:
    • kubectl exec into the pod and send SIGSTOP to the node process, then SIGCONT after ~70 seconds, or
    • network-partition the pod from Redis for ~70 seconds via a NetworkPolicy.
  3. After the partition heals, the renewer's next extend() call rejects with Cannot extend an already-expired lock.
  4. Observe the log line repeating every RUN_REPLICATION_LEADER_LOCK_EXTEND_INTERVAL_MS indefinitely. The pod is no longer the leader (another pod has acquired the slot) but doesn't know it.
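To see why the stall in step 2 is sufficient, here is a scaled-down model (20ms / 60ms standing in for the real 20s / 60s settings; nothing here touches Redis or the actual service):

```typescript
// Scaled-down model of step 2: a synchronous stall delays the extend timer
// past the lock TTL, so the renewal attempt is doomed before it runs.
const LOCK_TTL_MS = 60;
const EXTEND_INTERVAL_MS = 20;

const lockExpiresAt = Date.now() + LOCK_TTL_MS;
let sawExpiredLock = false;

setTimeout(() => {
  // This callback was due at +20ms but can only run once the stall ends,
  // by which point the lock TTL has already elapsed.
  sawExpiredLock = Date.now() > lockExpiresAt;
}, EXTEND_INTERVAL_MS);

// Simulate the SIGSTOP / event-loop stall: block synchronously for > TTL.
const stallEnd = Date.now() + 70;
while (Date.now() < stallEnd) {
  // busy-wait: no timers fire, no Redis renewals go out
}
```

The timer fires as soon as the event loop is unblocked, but by then Redis has long since expired the key, which is exactly the state extend() then reports.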

Root cause (best guess from stack)

LogicalReplicationClient.extend() is invoked from a setTimeout chain with no error-handler that resets state. When extend() rejects:

  • The error is caught and logged.
  • The setTimeout for the next extend is re-armed unconditionally.
  • The client never calls acquire() to re-elect, never disposes the slot listener, and never marks itself as not-leader.

So the renewal loop continues forever against a lock the pod no longer holds, while the actual leadership has silently moved to another pod.
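A minimal sketch of the renewal-loop shape the stack trace implies (names like startExtendLoop and the Lock type are illustrative guesses, not the actual Trigger.dev source):

```typescript
// Hypothetical reconstruction of the renewal loop: extend() failures are
// caught and logged, but the next tick is re-armed unconditionally.
type Lock = { extend(ttlMs: number): Promise<Lock> };

function startExtendLoop(
  lock: Lock,
  ttlMs: number,
  intervalMs: number,
  onError: (err: unknown) => void,
): () => void {
  let timer: ReturnType<typeof setTimeout>;
  const tick = async () => {
    try {
      lock = await lock.extend(ttlMs);
    } catch (err) {
      // Surfaces as "Failed to extend leader lock" / "Replication client error"...
      onError(err);
    }
    // ...but the timer is re-armed unconditionally: an expired lock keeps
    // being "extended" (and keeps failing) every interval, and nothing ever
    // calls acquire() again, marks the pod not-leader, or closes the slot.
    timer = setTimeout(tick, intervalMs);
  };
  timer = setTimeout(tick, intervalMs);
  return () => clearTimeout(timer);
}
```

Against a lock whose extend() always rejects, this loop produces exactly the observed signature: one paired error per interval, forever, with no state change.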

Suggested fix

In RunsReplicationService / LogicalReplicationClient extend handler, when extend() rejects with the expired-lock error:

  1. Stop the renewal loop (clearTimeout).
  2. Tear down the open replication slot subscription on this pod.
  3. Either:
    • (a) Attempt a fresh acquire(). It succeeds if no other pod holds the lock and fails otherwise; failure is fine, just don't loop. OR
    • (b) Throw / exit the process so K8s restarts it and the deployment self-heals via a fresh re-election.

Option (b) is simpler and more robust against further state drift. The pod is already useless for replication at this point, so a restart costs nothing.
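A hedged sketch of what that handler could look like, with the exit injectable so the same shape covers option (a) too (all names here are illustrative, not proposed API):

```typescript
// On an expired-lock rejection: stop renewing, tear down the slot
// subscription, and hand off to the orchestrator. Other (transient)
// errors keep today's retry-on-next-interval behaviour.
const EXPIRED_MESSAGE = "Cannot extend an already-expired lock.";

async function onExtendError(
  err: unknown,
  opts: {
    stopRenewal: () => void; // 1. clearTimeout on the extend loop
    teardown: () => Promise<void>; // 2. close the replication slot subscription
    exit: (code: number) => void; // 3. e.g. process.exit in production (option b)
  },
): Promise<boolean> {
  if (err instanceof Error && err.message === EXPIRED_MESSAGE) {
    opts.stopRenewal();
    await opts.teardown();
    opts.exit(1); // let K8s restart the pod; a fresh election follows
    return true; // handled: this pod is no longer the leader
  }
  return false; // transient error: the next interval can retry as today
}
```

Matching on the error message is only for illustration; the real check would presumably use whatever typed error the lock library exposes.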

Workaround for self-hosters

Restart the noisy pod. We've added a Datadog monitor on Failed to extend leader lock count > N/min that pages and includes a runbook step to kubectl delete pod <leader>.

Related context

We're already running this CH-load reduction work that surfaced the issue: #3426
