Top Failure Modes

Operator-facing diagnosis and fix guides for the most common Colony failure modes. Each section covers symptoms, root cause, diagnosis steps, fix, and prevention.

Postgres is the authority for pipeline state; GitHub labels are a projection for human visibility. Always check Postgres first when diagnosing pipeline issues.

For the underlying engineering patterns that explain many of the specific symptoms here (retry-without-terminal-classification, silent-fallback at trust boundaries, operator-surface inconsistency), see docs/production-learnings.md § Cross-Cutting Reliability Patterns. When a symptom doesn’t match any single entry below, it’s often an instance of one of those patterns.

Quick Decision Tree

Setup issues?                       → colony check --fix → confirm prompts → done
Issue not picked up?                → colony status → colony check --stage runtime → check intake_mode
Agent won't start?                  → colony check --fix → check .colony/logs/<agent>.log
Issue stuck in a state?             → colony why <N> → colony tasks --issue <N>
Worker stuck / one worker pinned?   → colony workers → colony workers reclaim <workerId>
Repeated comments / merger loop?    → pause issue or request changes on PR to break the loop
CI red on files outside the diff?   → branch is stale; merge `main` in (or open a fresh branch)
Config confusion?                   → colony check --stage config → review effective config

1. Stale `is_blocked` Flag

Symptoms

Issue stuck in a state with no worker activity
work_tasks table has no pending tasks for the issue
Dashboard shows the issue as blocked, but there is no obvious reason

Root Cause

is_blocked was set by a transient failure (e.g., push conflict, OOM, subprocess timeout) and never cleared. Either the monitor’s auto_unblock_transient is disabled, or the issue has exceeded max_auto_unblocks_per_issue (default: 3).

Diagnosis Steps

Check which issues are blocked:

colony issues --blocked

SQL fallback (no CLI access)

SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, r.owner, r.name
FROM pipeline_issues pi
JOIN repos r ON r.id = pi.repo_id
WHERE pi.is_blocked = true;

Confirm worker liveness — verify a worker is registered, healthy, and not pinned to a different issue:

colony workers

Check whether there are any pending tasks for the issue:

colony tasks --issue <N>

SQL fallback (no CLI access)

SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check the monitor logs for auto-unblock activity — the self-healing module logs when it unblocks issues and when it skips issues that have exceeded the unblock cap.

Fix

Run the unblock CLI command:

colony unblock <N>

This clears the is_blocked flag, removes the colony:blocked label, and re-queues the issue in one step.

Alternatively, comment /colony:retry on the GitHub issue to retry with current settings.

If you do not have CLI access, clear the flag manually with SQL:

UPDATE pipeline_issues
SET is_blocked = false
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');

Prevention

Enable automatic unblocking of transient failures in your config:

agents:
  monitor:
    self_healing:
      auto_unblock_transient: true
      max_auto_unblocks_per_issue: 5

The monitor classifies blocking reasons by analyzing the most recent Colony bot comment on the issue. If the reason is transient and the responsible agent is healthy, it automatically clears the is_blocked flag and transitions the issue to a retry state.
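The decision described above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in packages/monitor/src/self-healing.ts: the types, the TRANSIENT_REASONS list, and shouldAutoUnblock are hypothetical names, and the cap is assumed to mean "stop once the count reaches max_auto_unblocks_per_issue". Only the two config keys come from the config shown above.

```typescript
// Hypothetical sketch of the auto-unblock decision; names are illustrative only.
type BlockReason = "push_conflict" | "oom" | "subprocess_timeout" | "needs_human_input";

interface SelfHealingConfig {
  auto_unblock_transient: boolean;
  max_auto_unblocks_per_issue: number;
}

interface BlockedIssue {
  issueNumber: number;
  reason: BlockReason;      // assumed to be derived from the latest Colony bot comment
  autoUnblockCount: number; // how many times this issue was already auto-unblocked
  agentHealthy: boolean;    // whether the responsible agent is currently healthy
}

// Assumption: these reasons are transient (they mirror the examples under Root Cause).
const TRANSIENT_REASONS: ReadonlyArray<BlockReason> = [
  "push_conflict",
  "oom",
  "subprocess_timeout",
];

function shouldAutoUnblock(issue: BlockedIssue, config: SelfHealingConfig): boolean {
  if (!config.auto_unblock_transient) return false;               // feature disabled
  if (TRANSIENT_REASONS.indexOf(issue.reason) === -1) return false; // not transient: stay blocked
  if (!issue.agentHealthy) return false;                          // a retry would likely fail again
  return issue.autoUnblockCount < config.max_auto_unblocks_per_issue; // respect the cap
}
```

Under this reading, an issue that has already been auto-unblocked three times with the default cap of 3 stays blocked until an operator runs colony unblock <N>.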

Relevant code: packages/pipeline-store/src/pipeline-store.ts (is_blocked column), packages/monitor/src/self-healing.ts (auto-unblock logic)

2. Head Branch Out of Date on Merge

Symptoms

Merge fails — merger logs show rebase failures or drift assessment
Multiple PRs targeting main at the same time
work_tasks table shows failed merge tasks

Root Cause

Concurrent PRs: when PR A merges into main, PR B’s branch becomes stale. The merger attempts to rebase PR B onto the updated main branch. If the rebase produces conflicts, the merger runs a drift assessment to determine whether the conflicts are resolvable.

Diagnosis Steps

Check for failed merge tasks:

colony tasks --type merge --status failed

SQL fallback (no CLI access)

SELECT id, issue_number, status, created_at, updated_at
FROM work_tasks
WHERE task_type = 'merge'
  AND status = 'failed'
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check merger logs for rebase failure indicators:

"Rebase failed" — initial rebase attempt failed
"Rebase failed, running drift assessment" — merger is evaluating conflict severity
"Unable to rebase automatically" — escalation marker indicating conflicts require intervention

The merger computes a “drift overlap” metric (percentage of line overlap between conflicting code) to assess whether auto-resolution is feasible.
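To make the idea of a line-overlap percentage concrete, here is one way such a metric could be computed. It is a sketch only: the source does not specify the merger's formula, so LineRange, driftOverlapPercent, and the choice of "lines changed on the branch" as the denominator are all assumptions. Ranges within each list are assumed not to overlap each other, as with diff hunks.

```typescript
// Hypothetical sketch; the merger's real drift metric may be defined differently.
interface LineRange {
  start: number; // first changed line, 1-based, inclusive
  end: number;   // last changed line, inclusive
}

// Percentage (0 to 100) of branch-changed lines that were also changed on main.
function driftOverlapPercent(branchChanges: LineRange[], mainChanges: LineRange[]): number {
  let branchLines = 0;
  let sharedLines = 0;
  for (const b of branchChanges) {
    branchLines += b.end - b.start + 1;
    for (const m of mainChanges) {
      // Size of the intersection of the two inclusive ranges (non-positive if disjoint).
      const overlap = Math.min(b.end, m.end) - Math.max(b.start, m.start) + 1;
      if (overlap > 0) sharedLines += overlap;
    }
  }
  return branchLines === 0 ? 0 : (sharedLines / branchLines) * 100;
}
```

For example, a branch hunk covering lines 10 to 19 and a main hunk covering lines 15 to 24 share five of the branch's ten lines, which this sketch reports as 50.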

Fix

The sprint-master automatically re-enqueues merge tasks, and the merger retries the rebase after the conflicting PR has merged. In most cases, the retry succeeds without intervention.

If the issue is stuck:

Comment /colony:retry on the GitHub issue to re-enqueue the merge task
For complex conflicts, the merger may escalate with a comment containing conflict details, drift estimates, and a recommendation (manual rebase vs. re-implementation)
As a last resort, manually rebase the branch: run git fetch origin, then git rebase origin/main, resolve any conflicts, and force-push with git push --force-with-lease

Prevention

review.rebase_before_check: true (default) ensures the branch is rebased before review checks run, reducing stale-branch scenarios at merge time
The merger has built-in retry logic with drift assessment — most concurrent-PR conflicts resolve automatically on the next attempt
For repos with high PR throughput, ensure the sprint-master poll interval is short enough to quickly re-enqueue failed merge tasks

Relevant code: packages/merger/src/executor.ts (rebase and drift assessment logic), packages/sprint-master/src/sprint-master.ts (task re-enqueue)

3. Label Projection Failures

Symptoms

GitHub issue shows the wrong or missing colony: label
Pipeline is actually progressing — dashboard or Postgres shows the correct state
Sprint-master logs show label sync errors (rate limit, network timeout)

Root Cause

GitHub API rate limit or transient network error during label sync. Labels are a write-only projection of Postgres state, not the source of truth. A failed label update does not affect pipeline processing.

Important: Manually adding or removing colony: labels on GitHub does NOT affect pipeline state. The pipeline reads state exclusively from Postgres. The only exceptions are two supported label commands: colony:enqueue (seeds a new issue into the pipeline) and colony:paused (pauses or resumes a running issue). For all other pipeline control, use slash commands (e.g., /colony:retry, /colony:state <target>) rather than label manipulation.

Diagnosis Steps

Check for issues where Postgres state and stored labels have diverged:

colony issues --label-drift

SQL fallback (no CLI access)

SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, pi.labels
FROM pipeline_issues pi
WHERE pi.issue_number = <N>
  AND pi.repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');

Compare the state column (authoritative) with the labels array (projection). If they diverge, the label sync will auto-correct on the next poll cycle.

Check sprint-master logs for label sync errors — the syncLabelsFromPostgres() function runs on every poll cycle and reconciles managed labels (state labels, colony:blocked, colony:paused) against GitHub.

Fix

Wait one poll cycle — syncLabelsFromPostgres() auto-corrects within the sprint-master’s poll interval (default 30s). The function compares expected labels (derived from Postgres state and is_blocked/is_paused flags) against actual GitHub labels and issues the necessary add/remove calls.
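The expected-versus-actual comparison can be sketched as a small diff function. This is illustrative and not the implementation of syncLabelsFromPostgres(): IssueRow, expectedLabels, and diffLabels are hypothetical names, the state label is assumed to be "colony:<state>", and every colony:-prefixed label is treated as managed, which is a simplification.

```typescript
// Hypothetical sketch of the reconciliation step; not the actual label-sync code.
interface IssueRow {
  state: string;
  is_blocked: boolean;
  is_paused: boolean;
}

const LABEL_PREFIX = "colony:";

// Labels expected on the issue, derived only from Postgres columns.
function expectedLabels(row: IssueRow): string[] {
  const labels = [LABEL_PREFIX + row.state]; // assumption: state label is "colony:<state>"
  if (row.is_blocked) labels.push("colony:blocked");
  if (row.is_paused) labels.push("colony:paused");
  return labels;
}

// Compute the add/remove calls needed; labels without the prefix are left untouched.
function diffLabels(row: IssueRow, actual: string[]): { add: string[]; remove: string[] } {
  const expected = expectedLabels(row);
  const managedActual = actual.filter((label) => label.indexOf(LABEL_PREFIX) === 0);
  return {
    add: expected.filter((label) => actual.indexOf(label) === -1),
    remove: managedActual.filter((label) => expected.indexOf(label) === -1),
  };
}
```

Because the diff is computed from Postgres on every cycle, a label update that failed once is simply recomputed and retried on the next poll.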

For immediate correction, use the CLI:

npx colony issue transition <issue-number> --state <state>

Prevention

This is cosmetic — no action needed. Postgres state is authoritative. The sprint-master’s label sync is self-healing by design and catches up automatically. The label sync processes up to 25 issues per cycle (configurable via label_sync_limit).

Relevant code: packages/sprint-master/src/label-sync.ts (syncLabelsFromPostgres()), packages/core/src/state-transition.ts

4. Worker OOM During `npm install`

Symptoms

Worker container crashes or is killed during workspace setup
Container logs show Killed or OOMKilled
The issue gets blocked after repeated setup failures

Root Cause

Default container memory limit is too low for large node_modules trees. npm install can spike memory significantly for repos with many dependencies.

Diagnosis Steps

Check if the container was OOM-killed:

docker inspect <container> | grep OOMKilled

Or check container logs:

docker compose logs worker | grep -i killed

Check worker health to confirm whether the container restarted or is still running:

colony workers

A dead or stale freshness reading for the worker that was processing the issue confirms it was killed or lost its heartbeat around the time of the failure.

Check the work task failure history for the affected issue:

colony tasks --issue <N> --status failed

SQL fallback (no CLI access)

SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
  AND status = 'failed'
ORDER BY created_at DESC
LIMIT 10;

Fix

Increase the worker memory limit in your config:

repos:
  - owner: my-org
    name: my-repo
    workers:
      memory: '6g'

For very large repos (thousands of dependencies), use '8g' or higher. After updating the config, rebuild and restart the worker containers.

Prevention

Set repos[].workers.memory based on the target repo’s dependency tree size
Monitor container memory usage during initial workspace setup to establish a baseline
If using a custom workspace.setup_command (e.g., bundle install for Ruby), the same memory considerations apply

Relevant code: docs/user-guide/configuration.md (repos[].workers.memory field)

5. Planning Timeout Loops

Symptoms

Issue stuck in in-development with a high turn count
Multiple failed develop tasks in work_tasks — sprint-master keeps re-enqueueing
Developer logs show max turns being hit (look for maxTurns in structured log output)

Root Cause

The issue is too complex or ambiguous for the configured turn limit. The developer exhausts developer_max_turns without completing the task, gets blocked, and the sprint-master retries — creating a loop.

Diagnosis Steps

Use colony why for a quick diagnosis of why the issue is stuck:

colony why <N>

Confirm workers are running and not stuck on a different issue before examining task history:

colony workers

Check for repeated develop task failures:

colony tasks --issue <N>

SQL fallback (no CLI access)

SELECT id, task_type, status, created_at, updated_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check developer logs for max turns indicators:

The developer executor logs maxTurns in its structured output when starting development
When the turn limit is hit, the result includes isMaxTurns: true
The failure tracker (packages/core/src/failure-tracker.ts) counts consecutive failures per issue

Fix

Stop the retry loop:

/colony:cancel

Comment this on the GitHub issue to close it and stop further re-enqueueing.

Alternatively, decompose the issue into smaller sub-issues:

/colony:decompose

This sends the issue to the planner, which breaks it into smaller, more tractable sub-issues.

If the issue is close to completion and just needs more turns, bump the limits:

claude:
  scaling:
    large:
      developer_max_turns: 50

Complexity tiers (small, medium, large) each have their own developer_max_turns setting.

Prevention

Use the planner for large issues — comment /colony:decompose before the issue enters development

Configure progress detection windows to catch stalls early:

claude:
  scaling:
    large:
      no_progress_window: 75

Write well-scoped issues with clear acceptance criteria — ambiguous issues are the primary driver of timeout loops
The failure tracker counts consecutive failures per issue key; after the threshold is exceeded, the issue is blocked to prevent unbounded retries
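The consecutive-failure counting mentioned in the last item can be sketched as follows. This is a hypothetical illustration, not the contents of packages/core/src/failure-tracker.ts: the class name, its methods, and the reading of "threshold exceeded" as "strictly more failures than the threshold" are assumptions.

```typescript
// Hypothetical sketch of consecutive-failure counting per issue key.
class ConsecutiveFailureTracker {
  private counts: Record<string, number> = {};
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Record a failure; returns true when the issue should now be blocked.
  recordFailure(issueKey: string): boolean {
    const next = (this.counts[issueKey] || 0) + 1;
    this.counts[issueKey] = next;
    return next > this.threshold; // assumption: block only once the threshold is exceeded
  }

  // A success breaks the streak, so counting starts over for that issue.
  recordSuccess(issueKey: string): void {
    delete this.counts[issueKey];
  }
}
```

The point of the pattern is that the retry path has a bounded exit: without the count, a task that always hits the turn limit would be re-enqueued indefinitely.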

Relevant code: packages/developer/src/executor.ts (turn limit logic), packages/core/src/state-transition.ts (slash commands), packages/core/src/failure-tracker.ts (failure counting), docs/user-guide/configuration.md (claude.scaling)

6. Agent Looping on an Approved-but-CI-Red PR

Symptoms

Issue stuck in merge-pending or in-review for hours with repeating bot comments every 1–2 minutes
Same comment posted dozens of times — typically the merger’s “Branch is already up-to-date — skipped force-push. PR already approved — moving to merge-pending.” with a CI-failure suffix
One worker permanently occupied processing back-to-back tasks against this one issue, blocking real work for other issues
PR is approved, mergeable, but has a FAILURE CI check that won’t recover without code or infra change

Root Cause

The merger (and similarly positioned agents) doesn’t distinguish IN_PROGRESS / QUEUED CI checks (truly transient, retry-friendly) from FAILURE checks (terminal, code or infra fix required). When a definitive CI failure exists, the agent keeps re-enqueueing merge tasks every cycle, expecting the CI state to change, which it never does without operator action. A second gap makes the loop more visible: the comment-post path lacks dedup, so each loop iteration writes a fresh copy of the status comment to the issue thread.

This is the operator-visible surface of the cross-cutting pattern “every retry path needs an explicit terminal classification” (see production-learnings § Cross-Cutting Reliability Patterns).
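A terminal classification for this retry path could look like the sketch below. It shows the missing distinction rather than the merger's current or planned code: CheckRun, MergeDecision, and decideMerge are hypothetical names, and the status strings are the ones used in this section.

```typescript
// Hypothetical sketch of terminal-vs-transient classification for a merge retry path.
type CheckStatus = "SUCCESS" | "FAILURE" | "IN_PROGRESS" | "QUEUED" | "PENDING";
type MergeDecision = "merge" | "retry_later" | "block";

interface CheckRun {
  name: string;
  status: CheckStatus;
  required: boolean;
}

function decideMerge(checks: CheckRun[]): MergeDecision {
  const required = checks.filter((check) => check.required);
  // Terminal: a definitive failure will not change on its own, so stop retrying.
  if (required.some((check) => check.status === "FAILURE")) return "block";
  // Transient: checks that are still running or queued may yet pass, so retry later.
  if (required.some((check) => check.status !== "SUCCESS")) return "retry_later";
  return "merge";
}
```

With a rule of this shape, a FAILURE on a required check would end the retry cycle after one pass instead of producing a new merge task every poll.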

Diagnosis Steps

Check worker liveness and whether one worker is pinned to this issue:

colony workers

A loop signature: one worker showing task: builtin:merge #<N> (or builtin:review #<N>) that never changes across multiple colony workers invocations, with a continuously refreshing fresh heartbeat — the worker is alive but not advancing the issue state. The dashboard Worker Pool Status panel shows the same data and updates live via SSE.

Check the issue state and recent task history to confirm the loop:

colony tasks --issue <N>

SQL fallback (no CLI access)

SELECT id, task_type, status, claimed_by, created_at, completed_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 15;

A loop signature in the task history: same task_type (merge or review), same claimed_by worker, new row every 60–120 seconds for hours, all complete with no progress on the issue’s state.
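The loop signature can also be checked mechanically against the rows returned by the query above. The following is a sketch under assumptions: TaskRow mirrors only the queried columns, looksLikeLoop is a hypothetical helper, and the row count and gap threshold are parameters you would choose yourself.

```typescript
// Hypothetical sketch: detect the "same task, same worker, steady cadence" signature.
interface TaskRow {
  task_type: string;
  claimed_by: string | null;
  created_at: string; // ISO 8601 timestamp
}

// rows must be sorted newest first, as in the ORDER BY created_at DESC query.
function looksLikeLoop(rows: TaskRow[], minRows: number, maxGapSeconds: number): boolean {
  if (minRows < 2 || rows.length < minRows) return false;
  const recent = rows.slice(0, minRows);
  const newest = recent[0];
  for (let i = 1; i < recent.length; i++) {
    const newer = recent[i - 1];
    const older = recent[i];
    // Every row must repeat the same task type on the same worker.
    if (older.task_type !== newest.task_type || older.claimed_by !== newest.claimed_by) return false;
    // Consecutive rows must be created close together in time.
    const gapSeconds = (Date.parse(newer.created_at) - Date.parse(older.created_at)) / 1000;
    if (gapSeconds < 0 || gapSeconds > maxGapSeconds) return false;
  }
  return true;
}
```

Calling looksLikeLoop(rows, 10, 120) on the 15 rows from the query would flag ten consecutive tasks created no more than two minutes apart.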

Check which CI checks are failing on the PR:

gh pr checks <pr-number> --repo <owner>/<repo>

If any required check is FAILURE (not PENDING), the loop will not resolve itself.

Fix

First, determine whether the worker is still live. colony workers shows heartbeat freshness — a fresh worker with a continuously refreshing heartbeat is actively processing; stale or dead means the claim record is orphaned and no worker is acting on it.
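The branching between the two scenarios below can be summarized in a few lines. This is a sketch: the fresh / stale / dead names come from this guide, but the second-based thresholds are parameters because the source does not state Colony's actual cutoffs, and classifyFreshness and nextAction are hypothetical helpers.

```typescript
// Hypothetical sketch; Colony's real freshness thresholds are not given in this guide.
type Freshness = "fresh" | "stale" | "dead";
type OperatorAction = "reclaim_worker" | "break_loop";

function classifyFreshness(
  secondsSinceHeartbeat: number,
  staleAfterSeconds: number, // assumed cutoff between fresh and stale
  deadAfterSeconds: number,  // assumed cutoff between stale and dead
): Freshness {
  if (secondsSinceHeartbeat >= deadAfterSeconds) return "dead";
  if (secondsSinceHeartbeat >= staleAfterSeconds) return "stale";
  return "fresh";
}

// Stale or dead: the claim is orphaned, so free it. Fresh: the worker is looping, so break the loop.
function nextAction(freshness: Freshness): OperatorAction {
  return freshness === "fresh" ? "break_loop" : "reclaim_worker";
}
```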

Scenario A — stale or dead worker (orphaned claim):

If colony workers shows the worker’s freshness as stale or dead, the work_tasks claim is still held but no worker is processing it. Free the lease immediately:

colony workers reclaim <workerId>

This forcibly returns the worker’s claimed task to pending so any available worker can pick it up. On success:

Reclaimed <taskType> task for issue #<N> (task <taskId>) from worker <workerId>

Scenario B — fresh worker actively looping on a CI-red PR:

If the worker is fresh and actively looping, the issue is a definitive CI failure the merger cannot self-resolve. Break the loop immediately by transitioning the issue to a state the agent won’t keep retrying. Two equally good options:

Option 1: request changes on the PR (cleanest if the failure needs developer rework):

gh pr review <pr-number> --repo <owner>/<repo> --request-changes \
  --body "CI blocked by <specific failing check>. Returning to changes-requested while <root cause> is resolved."

This moves the issue to changes-requested and stops the loop.

Option 2: pause the issue (cleanest if the failure is infra-side and you’ll come back later):

gh issue edit <N> --repo <owner>/<repo> --add-label colony:paused

After the underlying CI failure is fixed (push a code fix, repair the CI workflow, rotate a broken token), comment /colony:retry on the issue to re-enter the pipeline.

Prevention

Track and prioritize merger loop and comment-dedup work (#3853, #3854 at time of writing). When these land, hard-failed CI auto-transitions to a block state and the loop class is eliminated.
Triage flaky tests on main aggressively (#3851): a flake produces the same loop-prone state until it is either fixed or worked around.
For repos with custom CI jobs (Cloudflare Pages deploys, content validation, security scans), keep review.checks config aligned with the CI workflow’s check-run names so the reviewer catches breakage locally before the loop class can fire.

Relevant code: packages/merger/src/ (retry logic), packages/sprint-master/src/slash-commands.ts (isDuplicateBotComment reference for the dedup pattern that needs to be shared).
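For orientation, a comment-dedup guard of the kind the comment-post path lacks could be as small as the sketch below. It is not the isDuplicateBotComment function referenced above, whose signature is not shown in this guide: IssueComment and shouldPostComment are hypothetical, and "duplicate" is assumed to mean "same body as the bot's most recent comment, ignoring surrounding whitespace".

```typescript
// Hypothetical sketch of a dedup guard for bot status comments.
interface IssueComment {
  author: string;
  body: string;
}

// thread is ordered oldest first; returns false when posting would repeat the bot's last comment.
function shouldPostComment(thread: IssueComment[], botLogin: string, body: string): boolean {
  for (let i = thread.length - 1; i >= 0; i--) {
    const comment = thread[i];
    if (comment.author === botLogin) {
      // Only the most recent bot comment is compared.
      return comment.body.trim() !== body.trim();
    }
  }
  return true; // the bot has not commented yet
}
```

A guard like this limits the visible damage of a loop to one comment, although it does not stop the loop itself.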

7. Stale Branch Failing CI After `main` Moved

Symptoms

A PR’s CI was passing yesterday and is failing today, with no changes pushed in between
The failing check points at a file that isn’t in the branch’s diff (e.g., .gitmodules, package-lock.json, a CI workflow YAML, an SSH/secrets setup step)
Multiple unrelated PRs all fail the same way at roughly the same time
The error often mentions submodules, missing dependencies, secret/auth setup, or workflow steps the branch never touched

Root Cause

A structural change landed on main — submodule schema change, lockfile bump, CI workflow change, secret rotation — that every open branch must absorb before its CI can pass again. Branches forked before that commit don’t yet have the new shape; their CI re-runs against a checkout that is internally inconsistent (e.g., a gitlink with no matching .gitmodules entry, or a workflow expecting a secret only the new main knows how to fetch).

This is the operator-visible surface of the cross-cutting pattern “trunk drift has O(open branches) blast radius” (see production-learnings § Cross-Cutting Reliability Patterns). The pattern shows up across pipeline-initiated work because there are typically 10+ branches open at any time, all individually exposed to the same main advance.

Diagnosis Steps

Confirm the failing check references a file the branch doesn’t touch:

gh pr diff <pr-number> --repo <owner>/<repo> --name-only
gh run view <run-id> --repo <owner>/<repo> --log-failed | head -100

If the failed step references a file not in the diff (commonly .gitmodules, .github/workflows/*.yml, package-lock.json, SSH config, secret-fetch steps), the breakage is on main, not in the branch.

Cross-check by looking at recent landings on main for structural changes:

git log --oneline origin/main -20 -- .gitmodules .github/workflows/ package-lock.json

A recent commit touching any of these is almost certainly the trigger.
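The triage rule from the first diagnosis step reduces to a set comparison between the PR's changed files and the files named by the failing step. The sketch below assumes you have already collected both lists (for example from gh pr diff --name-only and from reading the failed log); Triage and triageCiFailure are hypothetical names.

```typescript
// Hypothetical sketch of the "is the failing file in the diff?" triage rule.
type Triage = "stale_vs_main" | "feature_bug" | "unknown";

// diffFiles: paths changed by the PR. failingFiles: paths named by the failed CI step.
function triageCiFailure(diffFiles: string[], failingFiles: string[]): Triage {
  if (failingFiles.length === 0) return "unknown"; // nothing to compare against
  const touchedByBranch = failingFiles.filter((file) => diffFiles.indexOf(file) !== -1);
  // No failing file is part of the diff: the breakage most likely came from main.
  return touchedByBranch.length === 0 ? "stale_vs_main" : "feature_bug";
}
```

A stale_vs_main result points to merging main into the branch, while feature_bug points to fixing the code in the PR.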

Fix

Merge main into the branch and resolve any conflicts that surface:

git checkout <branch>
git fetch origin main
git merge origin/main
# resolve conflicts (commonly tests with expanded mock chains, submodule pointers)
git push origin <branch>

If the branch is one you opened manually, this is straightforward. If the branch is owned by Colony (a feature branch or epic branch), merging main in is still safe — push the merge commit and the next pipeline cycle will re-run CI with the absorbed change.

If the branch is far behind and the conflicts are unrelated to the feature, consider abandoning the branch and re-creating it from the current main (the issue may transition back to ready-for-dev to regenerate the work).

Prevention

After landing a structural change on main (submodule add/remove, lockfile bump, CI workflow rewrite, secret rotation), expect every open PR to need a main merge — plan a sweep, not one-by-one repair.
Automatically merging main into open branches whenever main advances is the structural fix (tracked as engineering work). Until then, treat every structural change as an “all branches need touch-ups” event.
Distinguish “branch CI red on files not in diff” (stale-vs-main) from “branch CI red on files in diff” (feature bug) in operator triage — they need opposite responses (merge-main vs. fix-the-code).

Relevant code: packages/merger/src/ (where merge-main automation would live), .github/workflows/ (the workflow files most often involved in trunk-drift cascades).
