feat(config): configurable agent watchdog + bash command timeouts#65
Open
Magentron wants to merge 6 commits intoMiniCodeMonkey:mainfrom
Open
feat(config): configurable agent watchdog + bash command timeouts#65Magentron wants to merge 6 commits intoMiniCodeMonkey:mainfrom
Magentron wants to merge 6 commits intoMiniCodeMonkey:mainfrom
Conversation
Introduces a new `bash.timeout` setting in `.chief/config.yaml` (and the Settings TUI under "Bash → Command timeout") that caps the runtime of external bash commands invoked by Chief — currently `worktree.setup`. Worktree setup now defaults to a 5 minute timeout and is killed via process-group SIGKILL on Unix so child processes spawned by the script do not leak. `cmd.WaitDelay` keeps the spinner from hanging on stuck grandchildren. Set `bash.timeout: "0s"` to restore the previous "no timeout" behaviour. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds `agent.watchdogTimeout` to `.chief/config.yaml` (and the Settings TUI under Agent → Watchdog timeout) so projects whose acceptance criteria run long, quiet commands (e.g. integration test suites that produce no stdout for several minutes) can raise the silence threshold that was previously hardcoded at 5 minutes in internal/loop/loop.go. The value is plumbed through Manager.Start into Loop.SetWatchdogTimeout. "0s" disables the watchdog; empty/unparseable values keep the historical 5 minute default so behaviour is unchanged for users who don't set it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restores the original PR intent of `bash.timeout` being opt-in: setup commands run unbounded unless the user sets an explicit duration. Empty, unparseable, or negative values all return 0 (no timeout); unparseable/negative values still surface a one-line warning in the worktree spinner so a typo isn't silently masked. Other review fixes: - Delete `AgentWatchdogTimeoutWarning` (and its tests). It was added for symmetry with `BashTimeoutWarning` but never wired to a UI surface — YAGNI. Wiring a watchdog warning into the loop event stream is left for a follow-up. - Extract `parseDurationOrDefault` so `BashTimeout` and `AgentWatchdogTimeout` share their parsing path. Annotate each call site with the chosen default's meaning so the asymmetry (bash → 0, watchdog → 5m) is self-evident at the call site. - Cross-reference `loop.DefaultWatchdogTimeout` and `config.DefaultAgentWatchdogTimeout`, and add `TestWatchdogDefaultsAreInSync` to guard against drift. - Reorder Settings TUI items to Agent → Worktree → Bash → On Complete to match the YAML example in docs. - Add `TestSettingsOverlay_AgentWatchdogTimeoutNegativeRejected` for symmetry with the bash.timeout negative test. - Add `TestManagerStartDisablesWatchdogWhenConfigured` covering the end-to-end "0s" propagation through Manager.Start. - Add Agent/Bash section assertions to `TestSettingsOverlay_Render`. - Update CHANGELOG and docs to drop the "5m default" promise for bash.timeout and reword the migration note accordingly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The watchdog-timeout manager tests called Manager.GetInstance to reach the Loop, but GetInstance intentionally returns a snapshot copy that omits the Loop pointer (manager.go:463). The copy's Loop field was nil, which surfaced as either a t.Fatal or a SIGSEGV on the WatchdogTimeout call. Add a small `liveLoopFor` helper that reaches into m.instances under the manager lock to obtain the live Loop, and use it across the three existing watchdog manager tests. Same-package access only — no API change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents on Manager.GetInstance that the returned snapshot intentionally omits Loop, ctx, and cancel — so future readers don't chase a dead end like the watchdog tests did before 58fa912. Adds TestGetInstanceOmitsRuntimeFields to lock the contract: if someone "fixes" GetInstance to expose the live Loop, the test fails and the doc + liveLoopFor helper get revisited together rather than silently going out of sync. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related timeout knobs become configurable, so users can match Chief's safety nets to how their projects actually run:
agent.watchdogTimeout— silence threshold before Chief kills the agent process as hung. Default unchanged (5 minutes), now configurable. Bump this when acceptancecriteria run long, quiet commands (integration test suites that produce no stdout for several minutes).
"0s"disables the watchdog. Unparseable / negative values fall back tothe default.
bash.timeout— maximum runtime for external bash commands invoked by Chief (currentlyworktree.setup). Empty / unset = no timeout (unchanged behaviour). Unparseable ornegative values fall back to "no timeout" but surface a one-line warning in the worktree spinner so a typo isn't silently masked.
Both fields are Go duration strings (
"30s","5m","1h"). Both are validated inline in the Settings TUI — invalid input is rejected with a clear error and the editor staysopen so the user doesn't lose their value. Hand-edited config files with invalid values fall back to the documented default at runtime.
Also includes a small
looppackage cleanup: the test that exercises watchdog disable reads the liveLoopvia the internal map instead ofGetInstance's defensive copy, sothe assertion sees the actual runtime state. The
GetInstanceruntime-field omission contract is now documented.Test plan
go test ./...passesagent.watchdogTimeout: "30m", an agent that pauses for ~10 minutes is not killed (was killed at 5m before)agent.watchdogTimeout: "0s", the watchdog is disabled and a hung agent is not killedbash.timeout: "10s"and aworktree.setup: "sleep 60", the setup is killed at ~10sbash.timeout: ""(default),worktree.setupruns to completion regardless of durationagent.watchdogTimeoutto"not-a-duration"showsinvalid duration "..."inline; editor stays open; valid value"20m"is accepted andpersisted to
.chief/config.yamlbash.timeout(incl. negative-rejection:"-10s"shows the negative-value error).chief/config.yamlwith an invalid duration produces the documented warning in the worktree spinner (forbash.timeout) and falls back to the 5-minutedefault (for
agent.watchdogTimeout)