Skip to content

feat(config): configurable agent watchdog + bash command timeouts#65

Open
Magentron wants to merge 6 commits intoMiniCodeMonkey:mainfrom
Magentron:feature/agent-watchdog-timeout
Open

feat(config): configurable agent watchdog + bash command timeouts#65
Magentron wants to merge 6 commits intoMiniCodeMonkey:mainfrom
Magentron:feature/agent-watchdog-timeout

Conversation

@Magentron
Copy link
Copy Markdown

Summary

Two related timeout knobs become configurable, so users can match Chief's safety nets to how their projects actually run:

  • agent.watchdogTimeout — silence threshold before Chief kills the agent process as hung. Default unchanged (5 minutes), now configurable. Bump this when acceptance
    criteria run long, quiet commands (integration test suites that produce no stdout for several minutes). "0s" disables the watchdog. Unparseable / negative values fall back to
    the default.
  • bash.timeout — maximum runtime for external bash commands invoked by Chief (currently worktree.setup). Empty / unset = no timeout (unchanged behaviour). Unparseable or
    negative values fall back to "no timeout" but surface a one-line warning in the worktree spinner so a typo isn't silently masked.

Both fields are Go duration strings ("30s", "5m", "1h"). Both are validated inline in the Settings TUI — invalid input is rejected with a clear error and the editor stays
open so the user doesn't lose their value. Hand-edited config files with invalid values fall back to the documented default at runtime.

Also includes a small loop package cleanup: the test that exercises watchdog disable reads the live Loop via the internal map instead of GetInstance's defensive copy, so
the assertion sees the actual runtime state. The GetInstance runtime-field omission contract is now documented.

Test plan

  • go test ./... passes
  • With agent.watchdogTimeout: "30m", an agent that pauses for ~10 minutes is not killed (was killed at 5m before)
  • With agent.watchdogTimeout: "0s", the watchdog is disabled and a hung agent is not killed
  • With bash.timeout: "10s" and a worktree.setup: "sleep 60", the setup is killed at ~10s
  • With bash.timeout: "" (default), worktree.setup runs to completion regardless of duration
  • In the Settings TUI, editing agent.watchdogTimeout to "not-a-duration" shows invalid duration "..." inline; editor stays open; valid value "20m" is accepted and
    persisted to .chief/config.yaml
  • Same flow for bash.timeout (incl. negative-rejection: "-10s" shows the negative-value error)
  • Hand-editing .chief/config.yaml with an invalid duration produces the documented warning in the worktree spinner (for bash.timeout) and falls back to the 5-minute
    default (for agent.watchdogTimeout)

Magentron and others added 6 commits April 29, 2026 06:56
Introduces a new `bash.timeout` setting in `.chief/config.yaml` (and the
Settings TUI under "Bash → Command timeout") that caps the runtime of
external bash commands invoked by Chief — currently `worktree.setup`.

Worktree setup now defaults to a 5 minute timeout and is killed via
process-group SIGKILL on Unix so child processes spawned by the script
do not leak. `cmd.WaitDelay` keeps the spinner from hanging on stuck
grandchildren. Set `bash.timeout: "0s"` to restore the previous
"no timeout" behaviour.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `agent.watchdogTimeout` to `.chief/config.yaml` (and the Settings
TUI under Agent → Watchdog timeout) so projects whose acceptance
criteria run long, quiet commands (e.g. integration test suites that
produce no stdout for several minutes) can raise the silence threshold
that was previously hardcoded at 5 minutes in internal/loop/loop.go.

The value is plumbed through Manager.Start into Loop.SetWatchdogTimeout.
"0s" disables the watchdog; empty/unparseable values keep the historical
5 minute default so behaviour is unchanged for users who don't set it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restores the original PR intent of `bash.timeout` being opt-in: setup
commands run unbounded unless the user sets an explicit duration.
Empty, unparseable, or negative values all return 0 (no timeout);
unparseable/negative values still surface a one-line warning in the
worktree spinner so a typo isn't silently masked.

Other review fixes:
- Delete `AgentWatchdogTimeoutWarning` (and its tests). It was added
  for symmetry with `BashTimeoutWarning` but never wired to a UI
  surface — YAGNI. Wiring a watchdog warning into the loop event
  stream is left for a follow-up.
- Extract `parseDurationOrDefault` so `BashTimeout` and
  `AgentWatchdogTimeout` share their parsing path. Annotate each call
  site with the chosen default's meaning so the asymmetry (bash → 0,
  watchdog → 5m) is self-evident at the call site.
- Cross-reference `loop.DefaultWatchdogTimeout` and
  `config.DefaultAgentWatchdogTimeout`, and add
  `TestWatchdogDefaultsAreInSync` to guard against drift.
- Reorder Settings TUI items to Agent → Worktree → Bash → On Complete
  to match the YAML example in docs.
- Add `TestSettingsOverlay_AgentWatchdogTimeoutNegativeRejected` for
  symmetry with the bash.timeout negative test.
- Add `TestManagerStartDisablesWatchdogWhenConfigured` covering the
  end-to-end "0s" propagation through Manager.Start.
- Add Agent/Bash section assertions to `TestSettingsOverlay_Render`.
- Update CHANGELOG and docs to drop the "5m default" promise for
  bash.timeout and reword the migration note accordingly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The watchdog-timeout manager tests called Manager.GetInstance to reach
the Loop, but GetInstance intentionally returns a snapshot copy that
omits the Loop pointer (manager.go:463). The copy's Loop field was
nil, which surfaced as either a t.Fatal or a SIGSEGV on the
WatchdogTimeout call.

Add a small `liveLoopFor` helper that reaches into m.instances under
the manager lock to obtain the live Loop, and use it across the three
existing watchdog manager tests. Same-package access only — no API
change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents on Manager.GetInstance that the returned snapshot
intentionally omits Loop, ctx, and cancel — so future readers don't
chase a dead end like the watchdog tests did before 58fa912.

Adds TestGetInstanceOmitsRuntimeFields to lock the contract: if
someone "fixes" GetInstance to expose the live Loop, the test fails
and the doc + liveLoopFor helper get revisited together rather than
silently going out of sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant