Top Failure Modes
Operator-facing diagnosis and fix guides for the most common Colony failure modes. Each section covers symptoms, root cause, diagnosis steps, fix, and prevention.
Postgres is the authority for pipeline state; GitHub labels are a projection for human visibility. Always check Postgres first when diagnosing pipeline issues.
For the underlying engineering patterns that explain many of the specific symptoms here (retry-without-terminal-classification, silent-fallback at trust boundaries, operator-surface inconsistency), see docs/production-learnings.md § Cross-Cutting Reliability Patterns. When a symptom doesn’t match any single entry below, it’s often an instance of one of those patterns.
Quick Decision Tree
    Setup issues? → colony check --fix → confirm prompts → done
    Issue not picked up? → colony status → colony check --stage runtime → check intake_mode
    Agent won't start? → colony check --fix → check .colony/logs/<agent>.log
    Issue stuck in a state? → colony why <N> → colony tasks --issue <N>
    Worker stuck / one worker pinned? → colony workers → colony workers reclaim <workerId>
    Repeated comments / merger loop? → pause issue or request changes on PR to break the loop
    Branch CI red on files diff doesn't touch? → branch is stale; merge `main` in (or open a fresh branch)
    Config confusion? → colony check --stage config → review effective config

1. Stale is_blocked Flag
Symptoms
- Issue stuck in a state with no worker activity
- work_tasks table has no pending tasks for the issue
- Dashboard shows the issue as blocked, but there is no obvious reason
Root Cause
is_blocked was set by a transient failure (e.g., push conflict, OOM, subprocess timeout) and never cleared. The monitor’s auto_unblock_transient may be disabled, or the issue may have exceeded max_auto_unblocks_per_issue (default: 3).
Diagnosis Steps
Check which issues are blocked:
    colony issues --blocked
SQL fallback (no CLI access)
    SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, r.owner, r.name
    FROM pipeline_issues pi
    JOIN repos r ON r.id = pi.repo_id
    WHERE pi.is_blocked = true;
Confirm worker liveness by verifying that a worker is registered, healthy, and not pinned to a different issue:
    colony workers
Check whether there are any pending tasks for the issue:
    colony tasks --issue <N>
SQL fallback (no CLI access)
    SELECT id, task_type, status, created_at
    FROM work_tasks
    WHERE issue_number = <N>
      AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
    ORDER BY created_at DESC
    LIMIT 10;
Check the monitor logs for auto-unblock activity. The self-healing module logs when it unblocks issues and when it skips issues that have exceeded the unblock cap.
Fix
Run the unblock CLI command:
    colony unblock <N>
This clears the is_blocked flag, removes the colony:blocked label, and re-queues the issue in one step.
Alternatively, comment /colony:retry on the GitHub issue to retry with current settings.
If you do not have CLI access, clear the flag manually with SQL:
    UPDATE pipeline_issues
    SET is_blocked = false
    WHERE issue_number = <N>
      AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');
Prevention
Enable automatic unblocking of transient failures in your config:
    agents:
      monitor:
        self_healing:
          auto_unblock_transient: true
          max_auto_unblocks_per_issue: 5
The monitor classifies blocking reasons by analyzing the most recent Colony bot comment on the issue. If the reason is transient and the responsible agent is healthy, it automatically clears the is_blocked flag and transitions the issue to a retry state.
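The auto-unblock decision described here can be sketched as a small pure function. This is an illustration only, not the code in self-healing.ts: the type names, the reason strings, and the assumption that reaching the cap (rather than strictly exceeding it) stops further unblocks are all hypothetical.

```typescript
// Hypothetical types for illustration; these are not Colony's actual definitions.
type BlockReason = "push_conflict" | "oom" | "subprocess_timeout" | "needs_human";

interface BlockedIssue {
  reason: BlockReason; // assumed to be parsed from the latest Colony bot comment
  agentHealthy: boolean; // health of the agent responsible for the failed step
  autoUnblockCount: number; // auto-unblocks already applied to this issue
}

interface SelfHealingConfig {
  autoUnblockTransient: boolean; // mirrors auto_unblock_transient
  maxAutoUnblocksPerIssue: number; // mirrors max_auto_unblocks_per_issue
}

type Decision =
  | "unblock"
  | "skip_disabled"
  | "skip_not_transient"
  | "skip_agent_unhealthy"
  | "skip_cap_reached";

const TRANSIENT_REASONS: BlockReason[] = ["push_conflict", "oom", "subprocess_timeout"];

function decideAutoUnblock(issue: BlockedIssue, config: SelfHealingConfig): Decision {
  if (!config.autoUnblockTransient) return "skip_disabled";
  if (TRANSIENT_REASONS.indexOf(issue.reason) === -1) return "skip_not_transient";
  if (!issue.agentHealthy) return "skip_agent_unhealthy";
  // Assumption: the cap counts completed auto-unblocks, so reaching it stops further ones.
  if (issue.autoUnblockCount >= config.maxAutoUnblocksPerIssue) return "skip_cap_reached";
  return "unblock";
}
```

Returning a named skip reason instead of a boolean keeps the monitor log explicit about why an issue stayed blocked, which is the information an operator needs when the cap has been hit.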
Relevant code: packages/pipeline-store/src/pipeline-store.ts (is_blocked column), packages/monitor/src/self-healing.ts (auto-unblock logic)
2. Head Branch Out of Date on Merge
Symptoms
- Merge fails; merger logs show rebase failures or drift assessment
- Multiple PRs targeting main at the same time
- work_tasks table shows failed merge tasks
Root Cause
Concurrent PRs: when PR A merges into main, PR B’s branch becomes stale. The merger attempts to rebase PR B onto the updated main branch. If the rebase produces conflicts, the merger runs a drift assessment to determine whether the conflicts are resolvable.
Diagnosis Steps
Check for failed merge tasks:
    colony tasks --type merge --status failed
SQL fallback (no CLI access)
    SELECT id, issue_number, status, created_at, updated_at
    FROM work_tasks
    WHERE task_type = 'merge'
      AND status = 'failed'
      AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
    ORDER BY created_at DESC
    LIMIT 10;
Check merger logs for rebase failure indicators:
- "Rebase failed": initial rebase attempt failed
- "Rebase failed, running drift assessment": merger is evaluating conflict severity
- "Unable to rebase automatically": escalation marker indicating conflicts require intervention
The merger computes a “drift overlap” metric (percentage of line overlap between conflicting code) to assess whether auto-resolution is feasible.
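To make the idea of a line-overlap percentage concrete, here is a minimal sketch. The exact definition the merger uses is not stated here, so this assumes one plausible reading: the share of lines changed on the branch that were also changed on main. FileChange and driftOverlapPercent are hypothetical names.

```typescript
// Hypothetical input shape: the line numbers one side changed in a given file.
interface FileChange {
  path: string;
  lines: number[];
}

// Percentage of branch-changed lines that main also changed (0 when the branch changed nothing).
function driftOverlapPercent(branch: FileChange[], main: FileChange[]): number {
  const mainByPath = new Map<string, Set<number>>();
  for (const change of main) {
    mainByPath.set(change.path, new Set(change.lines));
  }

  let total = 0;
  let overlapping = 0;
  for (const change of branch) {
    const mainLines = mainByPath.get(change.path);
    for (const line of change.lines) {
      total += 1;
      if (mainLines !== undefined && mainLines.has(line)) overlapping += 1;
    }
  }
  return total === 0 ? 0 : (overlapping / total) * 100;
}
```

Under this reading, a low percentage suggests the two sides mostly touched different lines and a retry is likely to succeed, while a high percentage points toward manual rebase or re-implementation.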
Fix
The sprint-master automatically re-enqueues merge tasks, and the merger retries the rebase after the conflicting PR has merged. In most cases, the retry succeeds without intervention.
If the issue is stuck:
- Comment /colony:retry on the GitHub issue to re-enqueue the merge task
- For complex conflicts, the merger may escalate with a comment containing conflict details, drift estimates, and a recommendation (manual rebase vs. re-implementation)
- As a last resort, manually rebase the branch with git rebase origin/main and force-push
Prevention
- review.rebase_before_check: true (default) ensures the branch is rebased before review checks run, reducing stale-branch scenarios at merge time
- The merger has built-in retry logic with drift assessment; most concurrent-PR conflicts resolve automatically on the next attempt
- For repos with high PR throughput, ensure the sprint-master poll interval is short enough to quickly re-enqueue failed merge tasks
Relevant code: packages/merger/src/executor.ts (rebase and drift assessment logic), packages/sprint-master/src/sprint-master.ts (task re-enqueue)
3. Label Projection Failures
Symptoms
- GitHub issue shows a wrong or missing colony: label
- Pipeline is actually progressing; dashboard or Postgres shows the correct state
- Sprint-master logs show label sync errors (rate limit, network timeout)
Root Cause
GitHub API rate limit or transient network error during label sync. Labels are a write-only projection of Postgres state, not the source of truth. A failed label update does not affect pipeline processing.
Important: Manually adding or removing colony: labels on GitHub does NOT affect pipeline state. The pipeline reads state exclusively from Postgres. The only exceptions are two supported label commands: colony:enqueue (seeds a new issue into the pipeline) and colony:paused (pauses or resumes a running issue). For all other pipeline control, use slash commands (e.g., /colony:retry, /colony:state <target>) rather than label manipulation.
Diagnosis Steps
Check for issues where Postgres state and stored labels have diverged:
    colony issues --label-drift
SQL fallback (no CLI access)
    SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, pi.labels
    FROM pipeline_issues pi
    WHERE pi.issue_number = <N>
      AND pi.repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');
Compare the state column (authoritative) with the labels array (projection). If they diverge, the label sync will auto-correct on the next poll cycle.
Check sprint-master logs for label sync errors — the syncLabelsFromPostgres() function runs on every poll cycle and reconciles managed labels (state labels, colony:blocked, colony:paused) against GitHub.
Fix
Wait one poll cycle. syncLabelsFromPostgres() auto-corrects within the sprint-master’s poll interval (default 30s). The function compares expected labels (derived from Postgres state and is_blocked/is_paused flags) against actual GitHub labels and issues the necessary add/remove calls.
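The expected-versus-actual comparison can be sketched as a label diff. This is not the syncLabelsFromPostgres() implementation: IssueRow, expectedLabels, and diffLabels are hypothetical, and the `colony:<state>` naming for state labels is an assumption.

```typescript
// Hypothetical row shape; only the fields the diff needs.
interface IssueRow {
  state: string;
  isBlocked: boolean;
  isPaused: boolean;
}

// Assumption: state labels are named `colony:<state>`; the real naming may differ.
function expectedLabels(row: IssueRow): string[] {
  const labels = [`colony:${row.state}`];
  if (row.isBlocked) labels.push("colony:blocked");
  if (row.isPaused) labels.push("colony:paused");
  return labels;
}

// Computes which managed labels to add or remove; unmanaged labels are never touched.
function diffLabels(
  row: IssueRow,
  actual: string[],
  managed: string[],
): { add: string[]; remove: string[] } {
  const expected = expectedLabels(row);
  const add = expected.filter((label) => actual.indexOf(label) === -1);
  const remove = actual.filter(
    (label) => managed.indexOf(label) !== -1 && expected.indexOf(label) === -1,
  );
  return { add, remove };
}
```

Restricting removals to a managed set is what keeps a sync like this from deleting labels that humans added for their own purposes.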
For immediate correction, use the CLI:
    npx colony issue transition <issue-number> --state <state>
Prevention
This is cosmetic; no action is needed. Postgres state is authoritative. The sprint-master’s label sync is self-healing by design and catches up automatically. The label sync processes up to 25 issues per cycle (configurable via label_sync_limit).
Relevant code: packages/sprint-master/src/label-sync.ts (syncLabelsFromPostgres()), packages/core/src/state-transition.ts
4. Worker OOM During npm install
Symptoms
- Worker container crashes or is killed during workspace setup
- Container logs show Killed or OOMKilled
- The issue gets blocked after repeated setup failures
Root Cause
The default container memory limit is too low for large node_modules trees. npm install can spike memory significantly for repos with many dependencies.
Diagnosis Steps
Check if the container was OOM-killed:
    docker inspect <container> | grep OOMKilled
Or check container logs:
    docker compose logs worker | grep -i killed
Check worker health to confirm whether the container restarted or is still running:
    colony workers
A dead or stale freshness reading for the worker that was processing the issue confirms it was killed or lost its heartbeat around the time of the failure.
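When checking several containers at once, the grep on docker inspect output can be replaced by parsing the JSON, since docker inspect emits a JSON array with a State.OOMKilled boolean per container. A minimal sketch, where findOomKilled and InspectEntry are hypothetical names and only the fields used are typed:

```typescript
// Subset of the fields that `docker inspect` emits for each container.
interface InspectEntry {
  Name?: string;
  State?: { OOMKilled?: boolean };
}

// Returns the names of containers whose State.OOMKilled flag is true.
function findOomKilled(inspectJson: string): string[] {
  const entries = JSON.parse(inspectJson) as InspectEntry[];
  return entries
    .filter((entry) => entry.State !== undefined && entry.State.OOMKilled === true)
    .map((entry) => entry.Name ?? "(unnamed)");
}
```

The input is the raw text of `docker inspect <container> [<container>...]`; an empty result means none of the inspected containers were flagged as OOM-killed.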
Check the work task failure history for the affected issue:
    colony tasks --issue <N> --status failed
SQL fallback (no CLI access)
    SELECT id, task_type, status, created_at
    FROM work_tasks
    WHERE issue_number = <N>
      AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
      AND status = 'failed'
    ORDER BY created_at DESC
    LIMIT 10;
Fix
Increase the worker memory limit in your config:
    repos:
      - owner: my-org
        name: my-repo
        workers:
          memory: '6g'
For very large repos (thousands of dependencies), use '8g' or higher. After updating the config, rebuild and restart the worker containers.
Prevention
- Set repos[].workers.memory based on the target repo’s dependency tree size
- Monitor container memory usage during initial workspace setup to establish a baseline
- If using a custom workspace.setup_command (e.g., bundle install for Ruby), the same memory considerations apply
Relevant code: docs/user-guide/configuration.md (repos[].workers.memory field)
5. Planning Timeout Loops
Symptoms
- Issue stuck in in-development with a high turn count
- Multiple failed develop tasks in work_tasks; sprint-master keeps re-enqueueing
- Developer logs show max turns being hit (look for maxTurns in structured log output)
Root Cause
The issue is too complex or ambiguous for the configured turn limit. The developer exhausts developer_max_turns without completing the task, gets blocked, and the sprint-master retries, creating a loop.
Diagnosis Steps
Use colony why for a quick diagnosis of why the issue is stuck:
    colony why <N>
Confirm workers are running and not stuck on a different issue before examining task history:
    colony workers
Check for repeated develop task failures:
    colony tasks --issue <N>
SQL fallback (no CLI access)
    SELECT id, task_type, status, created_at, updated_at
    FROM work_tasks
    WHERE issue_number = <N>
      AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
    ORDER BY created_at DESC
    LIMIT 10;
Check developer logs for max turns indicators:
- The developer executor logs maxTurns in its structured output when starting development
- When the turn limit is hit, the result includes isMaxTurns: true
- The failure tracker (packages/core/src/failure-tracker.ts) counts consecutive failures per issue
Fix
Stop the retry loop:
    /colony:cancel
Comment this on the GitHub issue to close it and stop re-enqueue.
Alternatively, decompose the issue into smaller sub-issues:
    /colony:decompose
This sends the issue to the planner, which breaks it into smaller, more tractable sub-issues.
If the issue is close to completion and just needs more turns, bump the limits:
    claude:
      scaling:
        large:
          developer_max_turns: 50
Complexity tiers (small, medium, large) each have their own developer_max_turns setting.
Prevention
- Use the planner for large issues: comment /colony:decompose before the issue enters development
- Configure progress detection windows to catch stalls early:
    claude:
      scaling:
        large:
          no_progress_window: 75
- Write well-scoped issues with clear acceptance criteria; ambiguous issues are the primary driver of timeout loops
- The failure tracker counts consecutive failures per issue key; after the threshold is exceeded, the issue is blocked to prevent unbounded retries
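The consecutive-failure behavior in the last bullet can be sketched as follows. This is an illustration of the pattern, not the contents of failure-tracker.ts: the class name, method names, and the reset-on-success behavior are assumptions.

```typescript
// Hypothetical tracker: counts consecutive failures per issue key.
class ConsecutiveFailureTracker {
  private readonly counts = new Map<string, number>();
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Records a failure and returns true once the threshold has been exceeded,
  // meaning the caller should block the issue instead of retrying.
  recordFailure(issueKey: string): boolean {
    const next = (this.counts.get(issueKey) ?? 0) + 1;
    this.counts.set(issueKey, next);
    return next > this.threshold;
  }

  // Assumption: any success resets the streak, so only consecutive failures count.
  recordSuccess(issueKey: string): void {
    this.counts.delete(issueKey);
  }
}
```

With a threshold of 2, the first two failures for a key return false and the third returns true, which is the point where unbounded retries would be cut off.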
Relevant code: packages/developer/src/executor.ts (turn limit logic), packages/core/src/state-transition.ts (slash commands), packages/core/src/failure-tracker.ts (failure counting), docs/user-guide/configuration.md (claude.scaling)
6. Agent Looping on an Approved-but-CI-Red PR
Symptoms
- Issue stuck in merge-pending or in-review for hours with repeating bot comments every 1–2 minutes
- Same comment posted dozens of times, typically the merger’s “Branch is already up-to-date — skipped force-push. PR already approved — moving to merge-pending.” with a CI-failure suffix
- One worker permanently occupied processing back-to-back tasks against this one issue, blocking real work for other issues
- PR is approved, mergeable, but has a FAILURE CI check that won’t recover without code or infra change
Root Cause
The merger (and similarly positioned agents) doesn’t distinguish IN_PROGRESS / QUEUED CI checks (truly transient, retry-friendly) from FAILURE checks (terminal, code or infra fix required). When a definitive CI failure exists, the agent keeps re-enqueuing merge tasks every cycle, expecting the CI state to change. It never does without operator action. The comment-post path also lacks dedup, so each loop iteration writes a fresh copy of the status comment to the issue thread, which is what makes the loop so visible.
This is the operator-visible surface of the cross-cutting pattern “every retry path needs an explicit terminal classification” (see production-learnings § Cross-Cutting Reliability Patterns).
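A terminal classification for CI checks could look roughly like this. It is a sketch of the pattern, not the merger's code: the MergeAction names are hypothetical, and the status set assumes SUCCESS alongside the FAILURE, IN_PROGRESS, QUEUED, and PENDING values mentioned in this section.

```typescript
type CheckStatus = "SUCCESS" | "FAILURE" | "IN_PROGRESS" | "QUEUED" | "PENDING";
type MergeAction = "merge" | "retry_later" | "block";

// FAILURE is terminal and wins over everything else; unfinished checks are
// transient and worth a retry; only an all-green (or empty) set may merge.
function classifyChecks(statuses: CheckStatus[]): MergeAction {
  if (statuses.some((status) => status === "FAILURE")) return "block";
  const unfinished = statuses.some(
    (status) => status === "IN_PROGRESS" || status === "QUEUED" || status === "PENDING",
  );
  return unfinished ? "retry_later" : "merge";
}
```

Checking for FAILURE first matters: a PR with one failed check and one still running should stop retrying immediately rather than wait for the running check to finish.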
Diagnosis Steps
Check worker liveness and whether one worker is pinned to this issue:
    colony workers
A loop signature: one worker showing task: builtin:merge #<N> (or builtin:review #<N>) that never changes across multiple colony workers invocations, with a continuously refreshing fresh heartbeat. The worker is alive but not advancing the issue state. The dashboard Worker Pool Status panel shows the same data and updates live via SSE.
Check the issue state and recent task history to confirm the loop:
    colony tasks --issue <N>
SQL fallback (no CLI access)
    SELECT id, task_type, status, claimed_by, created_at, completed_at
    FROM work_tasks
    WHERE issue_number = <N>
      AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
    ORDER BY created_at DESC
    LIMIT 15;
A loop signature in the task history: same task_type (merge or review), same claimed_by worker, new row every 60–120 seconds for hours, all complete with no progress on the issue’s state.
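The task-history signature can be checked mechanically once the rows are exported. The sketch below is a hypothetical helper, not part of the Colony CLI; the TaskRow shape mirrors the queried columns, and the minRows and maxGapSeconds defaults are illustrative thresholds.

```typescript
// Mirrors the task_type, claimed_by, and created_at columns from the query above.
interface TaskRow {
  taskType: string;
  claimedBy: string | null;
  createdAt: Date;
}

// True when every row has the same type and worker and rows arrive at short, steady intervals.
function looksLikeRetryLoop(rows: TaskRow[], minRows = 10, maxGapSeconds = 120): boolean {
  if (rows.length < minRows) return false;
  const sorted = rows.slice().sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime());
  const first = sorted[0];
  for (let i = 1; i < sorted.length; i++) {
    const previous = sorted[i - 1];
    const current = sorted[i];
    if (current.taskType !== first.taskType || current.claimedBy !== first.claimedBy) return false;
    const gapSeconds = (current.createdAt.getTime() - previous.createdAt.getTime()) / 1000;
    if (gapSeconds > maxGapSeconds) return false;
  }
  return true;
}
```

This only flags the timing pattern. Confirming that the issue state is not advancing still requires looking at the issue itself.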
Check which CI checks are failing on the PR:
    gh pr checks <pr-number> --repo <owner>/<repo>
If any required check is FAILURE (not PENDING), the loop will not resolve itself.
Fix
First, determine whether the worker is still live. colony workers shows heartbeat freshness: a fresh worker with a continuously refreshing heartbeat is actively processing; stale or dead means the claim record is orphaned and no worker is acting on it.
Scenario A: stale or dead worker (orphaned claim)
If colony workers shows the worker’s freshness as stale or dead, the work_tasks claim is still held but no worker is processing it. Free the lease immediately:
    colony workers reclaim <workerId>
This forcibly returns the worker’s claimed task to pending so any available worker can pick it up. On success:
    Reclaimed <taskType> task for issue #<N> (task <taskId>) from worker <workerId>
Scenario B: fresh worker actively looping on a CI-red PR
If the worker is fresh and actively looping, the issue is a definitive CI failure the merger cannot self-resolve. Break the loop immediately by transitioning the issue to a state the agent won’t keep retrying. Two equally good options:
Option A: request changes on the PR (cleanest if the failure needs developer rework):
    gh pr review <pr-number> --repo <owner>/<repo> --request-changes \
      --body "CI blocked by <specific failing check>. Returning to changes-requested while <root cause> is resolved."
This moves the issue to changes-requested and stops the loop.
Option B: pause the issue (cleanest if the failure is infra-side and you’ll come back later):
    gh issue edit <N> --repo <owner>/<repo> --add-label colony:paused
After the underlying CI failure is fixed (push a code fix, repair the CI workflow, rotate a broken token), comment /colony:retry on the issue to re-enter the pipeline.
Prevention
- Track and prioritize merger loop and comment-dedup work (#3853, #3854 at time of writing). When these land, hard-failed CI auto-transitions to a block state and the loop class is eliminated.
- Triage flaky tests on main aggressively (#3851); a flake produces the same loop-prone state until either fixed or worked around.
- For repos with custom CI jobs (Cloudflare Pages deploys, content validation, security scans), keep review.checks config aligned with the CI workflow’s check-run names so the reviewer catches breakage locally before the loop class can fire.
Relevant code: packages/merger/src/ (retry logic), packages/sprint-master/src/slash-commands.ts (isDuplicateBotComment reference for the dedup pattern that needs to be shared).
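For readers unfamiliar with the dedup pattern mentioned above, a minimal version compares a new comment against the most recent bot comment and skips the post when they match. This sketch does not reproduce isDuplicateBotComment; shouldPostComment, IssueComment, and the whitespace normalization are hypothetical.

```typescript
// Hypothetical comment shape; `author` is the login that posted the comment.
interface IssueComment {
  author: string;
  body: string;
}

// Collapses whitespace so trivially reformatted copies still compare equal.
function normalizeBody(body: string): string {
  return body.trim().replace(/\s+/g, " ");
}

// Returns false when the latest bot comment already says the same thing.
function shouldPostComment(existing: IssueComment[], botLogin: string, newBody: string): boolean {
  for (let i = existing.length - 1; i >= 0; i--) {
    if (existing[i].author === botLogin) {
      return normalizeBody(existing[i].body) !== normalizeBody(newBody);
    }
  }
  return true; // no earlier bot comment, so nothing to duplicate
}
```

Comparing only against the latest bot comment, rather than the whole thread, still lets the same status be posted again after a different status has intervened.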
7. Stale Branch Failing CI After main Moved
Symptoms
- A PR’s CI was passing yesterday and is failing today, with no changes pushed in between
- The failing check points at a file that isn’t in the branch’s diff (e.g., .gitmodules, package-lock.json, a CI workflow YAML, an SSH/secrets setup step)
- Multiple unrelated PRs all fail the same way at roughly the same time
- The error often mentions submodules, missing dependencies, secret/auth setup, or workflow steps the branch never touched
Root Cause
A structural change landed on main (submodule schema change, lockfile bump, CI workflow change, secret rotation) that every open branch must absorb before its CI can pass again. Branches forked before that commit don’t yet have the new shape; their CI re-runs against a checkout that is internally inconsistent (e.g., a gitlink with no matching .gitmodules entry, or a workflow expecting a secret only the new main knows how to fetch).
This is the operator-visible surface of the cross-cutting pattern “trunk drift has O(open branches) blast radius” (see production-learnings § Cross-Cutting Reliability Patterns). The pattern shows up across pipeline-initiated work because there are typically 10+ branches open at any time, all individually exposed to the same main advance.
Diagnosis Steps
Confirm the failing check references a file the branch doesn’t touch:
    gh pr diff <pr-number> --repo <owner>/<repo> --name-only
    gh run view <run-id> --repo <owner>/<repo> --log-failed | head -100
If the failed step references a file not in the diff (commonly .gitmodules, .github/workflows/*.yml, package-lock.json, ssh config, secret-fetch steps), the breakage is on main, not in the branch.
Cross-check by looking at recent landings on main for structural changes:
    git log --oneline origin/main -20 -- .gitmodules .github/workflows/ package-lock.json
A recent commit touching any of these is almost certainly the trigger.
Fix
Merge main into the branch and resolve any conflicts that surface:
    git checkout <branch>
    git fetch origin main
    git merge origin/main
    # resolve conflicts (commonly tests with expanded mock chains, submodule pointers)
    git push origin <branch>
If the branch is one you opened manually, this is straightforward. If the branch is owned by Colony (a feature branch or epic branch), merging main in is still safe: push the merge commit and the next pipeline cycle will re-run CI with the absorbed change.
If the branch is far behind and conflicts are unrelated to the feature, consider abandoning the branch and re-creating from the current main (the issue may transition back to ready-for-dev to regenerate the work).
Prevention
- After landing a structural change on main (submodule add/remove, lockfile bump, CI workflow rewrite, secret rotation), expect every open PR to need a main merge; plan a sweep, not one-by-one repair.
- Automatically merging main into open branches when main advances is the structural fix (tracked as engineering work; until then, treat post-structural-change as an “all branches need touchups” event).
- Distinguish “branch CI red on files not in diff” (stale-vs-main) from “branch CI red on files in diff” (feature bug) in operator triage. They need opposite responses (merge-main vs. fix-the-code).
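The triage rule in the last bullet reduces to a set-membership check. In this sketch, diffFiles would come from gh pr diff --name-only, and failingFiles is the list of paths an operator reads out of the failed step's log; the function name and the "unknown" fallback are hypothetical.

```typescript
type CiTriage = "stale-vs-main" | "feature-bug" | "unknown";

// Classifies a red CI run by whether the failing step references files the branch changed.
function triageCiFailure(diffFiles: string[], failingFiles: string[]): CiTriage {
  if (failingFiles.length === 0) return "unknown"; // nothing identified in the log yet
  const inDiff = new Set(diffFiles);
  const touchesDiff = failingFiles.some((file) => inDiff.has(file));
  return touchesDiff ? "feature-bug" : "stale-vs-main";
}
```

A "stale-vs-main" result points to merging main into the branch, while "feature-bug" points to fixing the code on the branch.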
Relevant code: packages/merger/src/ (where merge-main automation would live), .github/workflows/ (the workflow files most often involved in trunk-drift cascades).