Top Failure Modes

Operator-facing diagnosis and fix guides for the most common Colony failure modes. Each section covers symptoms, root cause, diagnosis steps, fix, and prevention.

Postgres is the authority for pipeline state; GitHub labels are a projection for human visibility. Always check Postgres first when diagnosing pipeline issues.

For the underlying engineering patterns that explain many of the specific symptoms here (retry-without-terminal-classification, silent-fallback at trust boundaries, operator-surface inconsistency), see docs/production-learnings.md § Cross-Cutting Reliability Patterns. When a symptom doesn’t match any single entry below, it’s often an instance of one of those patterns.

Setup issues? → colony check --fix → confirm prompts → done
Issue not picked up? → colony status → colony check --stage runtime → check intake_mode
Agent won't start? → colony check --fix → check .colony/logs/<agent>.log
Issue stuck in a state? → colony why <N> → colony tasks --issue <N>
Worker stuck / one worker pinned? → colony workers → colony workers reclaim <workerId>
Repeated comments / merger loop? → pause issue or request changes on PR to break the loop
Branch CI red on files the diff doesn't touch? → branch is stale; merge `main` in (or open a fresh branch)
Config confusion? → colony check --stage config → review effective config

  • Issue stuck in a state with no worker activity
  • work_tasks table has no pending tasks for the issue
  • Dashboard shows the issue as blocked, but there is no obvious reason

is_blocked was set by a transient failure (e.g., push conflict, OOM, subprocess timeout) and never cleared. Either the monitor’s auto_unblock_transient is disabled, or the issue has exceeded max_auto_unblocks_per_issue (default: 3).

Check which issues are blocked:

colony issues --blocked
SQL fallback (no CLI access)
SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, r.owner, r.name
FROM pipeline_issues pi
JOIN repos r ON r.id = pi.repo_id
WHERE pi.is_blocked = true;

Confirm worker liveness — verify a worker is registered, healthy, and not pinned to a different issue:

colony workers

Check whether there are any pending tasks for the issue:

colony tasks --issue <N>
SQL fallback (no CLI access)
SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check the monitor logs for auto-unblock activity — the self-healing module logs when it unblocks issues and when it skips issues that have exceeded the unblock cap.

Run the unblock CLI command:

colony unblock <N>

This clears the is_blocked flag, removes the colony:blocked label, and re-queues the issue in one step.

Alternatively, comment /colony:retry on the GitHub issue to retry with current settings.

If you do not have CLI access, clear the flag manually with SQL:

UPDATE pipeline_issues
SET is_blocked = false
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');

Enable automatic unblocking of transient failures in your config. The example below also raises the per-issue cap from the default of 3 to 5:

agents:
  monitor:
    self_healing:
      auto_unblock_transient: true
      max_auto_unblocks_per_issue: 5

The monitor classifies blocking reasons by analyzing the most recent Colony bot comment on the issue. If the reason is transient and the responsible agent is healthy, it automatically clears the is_blocked flag and transitions the issue to a retry state.
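The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in packages/monitor/src/self-healing.ts: the function name, the reason tags, and the argument names are hypothetical. Only the two config keys and the default cap of 3 come from this guide.

```python
from dataclasses import dataclass

# Hypothetical reason tags. The real classifier derives the reason from the
# most recent Colony bot comment on the issue.
TRANSIENT_REASONS = {"push_conflict", "oom", "subprocess_timeout"}


@dataclass
class SelfHealingConfig:
    auto_unblock_transient: bool
    max_auto_unblocks_per_issue: int = 3  # default cap stated in this guide


def should_auto_unblock(reason: str, agent_healthy: bool,
                        unblocks_so_far: int, cfg: SelfHealingConfig) -> bool:
    """Return True only when every documented precondition holds."""
    if not cfg.auto_unblock_transient:
        return False  # feature disabled in config
    if unblocks_so_far >= cfg.max_auto_unblocks_per_issue:
        return False  # cap reached; an operator has to unblock manually
    return reason in TRANSIENT_REASONS and agent_healthy
```

When this kind of check returns False for an issue that is still blocked, the manual path (colony unblock <N> or /colony:retry) is the way forward.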

Relevant code: packages/pipeline-store/src/pipeline-store.ts (is_blocked column), packages/monitor/src/self-healing.ts (auto-unblock logic)


  • Merge fails — merger logs show rebase failures or drift assessment
  • Multiple PRs targeting main at the same time
  • work_tasks table shows failed merge tasks

Concurrent PRs: when PR A merges into main, PR B’s branch becomes stale. The merger attempts to rebase PR B onto the updated main branch. If the rebase produces conflicts, the merger runs a drift assessment to determine whether the conflicts are resolvable.

Check for failed merge tasks:

colony tasks --type merge --status failed
SQL fallback (no CLI access)
SELECT id, issue_number, status, created_at, updated_at
FROM work_tasks
WHERE task_type = 'merge'
AND status = 'failed'
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check merger logs for rebase failure indicators:

  • "Rebase failed" — initial rebase attempt failed
  • "Rebase failed, running drift assessment" — merger is evaluating conflict severity
  • "Unable to rebase automatically" — escalation marker indicating conflicts require intervention

The merger computes a “drift overlap” metric (percentage of line overlap between conflicting code) to assess whether auto-resolution is feasible.
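This guide does not give the formula for the metric, so the following is only one plausible reading of "percentage of line overlap", written to build intuition. The function name, its inputs (sets of changed line numbers in one conflicting file), and the choice of denominator are all assumptions, not the logic in packages/merger/src/executor.ts.

```python
from __future__ import annotations


def drift_overlap_percent(branch_lines: set[int], main_lines: set[int]) -> float:
    """Share of the branch's changed lines that main also changed, in percent.

    Assumption: overlap is measured per file, relative to the branch's edits.
    """
    if not branch_lines:
        return 0.0  # nothing changed on the branch, so nothing can overlap
    shared = branch_lines & main_lines
    return 100.0 * len(shared) / len(branch_lines)
```

Under this reading, a low percentage means the two sides mostly edited different lines, which is the case where an automatic retry is most likely to succeed.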

The sprint-master automatically re-enqueues merge tasks, and the merger retries the rebase after the conflicting PR has merged. In most cases, the retry succeeds without intervention.

If the issue is stuck:

  • Comment /colony:retry on the GitHub issue to re-enqueue the merge task
  • For complex conflicts, the merger may escalate with a comment containing conflict details, drift estimates, and a recommendation (manual rebase vs. re-implementation)
  • As a last resort, manually rebase the branch: git rebase origin/main and force-push

Prevention:

  • review.rebase_before_check: true (default) ensures the branch is rebased before review checks run, reducing stale-branch scenarios at merge time
  • The merger has built-in retry logic with drift assessment — most concurrent-PR conflicts resolve automatically on the next attempt
  • For repos with high PR throughput, ensure the sprint-master poll interval is short enough to quickly re-enqueue failed merge tasks

Relevant code: packages/merger/src/executor.ts (rebase and drift assessment logic), packages/sprint-master/src/sprint-master.ts (task re-enqueue)


  • GitHub issue shows the wrong or missing colony: label
  • Pipeline is actually progressing — dashboard or Postgres shows the correct state
  • Sprint-master logs show label sync errors (rate limit, network timeout)

GitHub API rate limit or transient network error during label sync. Labels are a write-only projection of Postgres state, not the source of truth. A failed label update does not affect pipeline processing.

Important: Manually adding or removing colony: labels on GitHub does NOT affect pipeline state. The pipeline reads state exclusively from Postgres. The only exceptions are two supported label commands: colony:enqueue (seeds a new issue into the pipeline) and colony:paused (pauses or resumes a running issue). For all other pipeline control, use slash commands (e.g., /colony:retry, /colony:state <target>) rather than label manipulation.

Check for issues where Postgres state and stored labels have diverged:

colony issues --label-drift
SQL fallback (no CLI access)
SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, pi.labels
FROM pipeline_issues pi
WHERE pi.issue_number = <N>
AND pi.repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');

Compare the state column (authoritative) with the labels array (projection). If they diverge, the label sync will auto-correct on the next poll cycle.

Check sprint-master logs for label sync errors — the syncLabelsFromPostgres() function runs on every poll cycle and reconciles managed labels (state labels, colony:blocked, colony:paused) against GitHub.

Wait one poll cycle — syncLabelsFromPostgres() auto-corrects within the sprint-master’s poll interval (default 30s). The function compares expected labels (derived from Postgres state and is_blocked/is_paused flags) against actual GitHub labels and issues the necessary add/remove calls.
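The comparison described above amounts to a set difference restricted to the labels Colony manages. The sketch below illustrates that idea only. The helper names are hypothetical, and the colony:<state> naming for state labels is an assumption based on the colony:blocked and colony:paused labels mentioned in this guide.

```python
from __future__ import annotations


def managed_labels(known_states: list[str]) -> set[str]:
    """Labels the sync is allowed to add or remove (assumed naming)."""
    return {f"colony:{s}" for s in known_states} | {"colony:blocked", "colony:paused"}


def expected_labels(state: str, is_blocked: bool, is_paused: bool) -> set[str]:
    """Derive the label set from the authoritative Postgres columns."""
    labels = {f"colony:{state}"}
    if is_blocked:
        labels.add("colony:blocked")
    if is_paused:
        labels.add("colony:paused")
    return labels


def label_diff(expected: set[str], actual: set[str],
               managed: set[str]) -> tuple[set[str], set[str]]:
    """Return (to_add, to_remove); unmanaged labels such as 'bug' are ignored."""
    managed_actual = actual & managed
    return expected - managed_actual, managed_actual - expected
```

Because the expected set is recomputed from Postgres every cycle, a label that was changed by hand on GitHub is simply overwritten on the next pass.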

For immediate correction, use the CLI:

npx colony issue transition <issue-number> --state <state>

Label drift is cosmetic and needs no preventive action: Postgres state is authoritative, and the sprint-master’s label sync is self-healing by design. The sync processes up to 25 issues per cycle (configurable via label_sync_limit), so a large backlog of drifted labels can take several cycles to clear.
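As a rough illustration of what the per-cycle limit means for catch-up time, the arithmetic below combines the 25-issue limit with the 30-second default poll interval. It assumes each cycle works through a distinct batch and that every GitHub call succeeds. Both are simplifications, and the function name is hypothetical.

```python
import math


def worst_case_catch_up_seconds(drifted_issues: int,
                                label_sync_limit: int = 25,
                                poll_interval_s: int = 30) -> int:
    """Estimate how long the sync needs to correct all drifted issues."""
    if drifted_issues <= 0:
        return 0
    cycles = math.ceil(drifted_issues / label_sync_limit)
    return cycles * poll_interval_s
```

For example, 60 drifted issues need three cycles under these assumptions, or about 90 seconds.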

Relevant code: packages/sprint-master/src/label-sync.ts (syncLabelsFromPostgres()), packages/core/src/state-transition.ts


  • Worker container crashes or is killed during workspace setup
  • Container logs show Killed or OOMKilled
  • The issue gets blocked after repeated setup failures

Default container memory limit is too low for large node_modules trees. npm install can spike memory significantly for repos with many dependencies.

Check if the container was OOM-killed:

docker inspect <container> | grep OOMKilled

Or check container logs:

docker compose logs worker | grep -i killed

Check worker health to confirm whether the container restarted or is still running:

colony workers

A dead or stale freshness reading for the worker that was processing the issue confirms it was killed or lost its heartbeat around the time of the failure.

Check the work task failure history for the affected issue:

colony tasks --issue <N> --status failed
SQL fallback (no CLI access)
SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
AND status = 'failed'
ORDER BY created_at DESC
LIMIT 10;

Increase the worker memory limit in your config:

repos:
  - owner: my-org
    name: my-repo
    workers:
      memory: '6g'

For very large repos (thousands of dependencies), use '8g' or higher. After updating the config, rebuild and restart the worker containers.

  • Set repos[].workers.memory based on the target repo’s dependency tree size
  • Monitor container memory usage during initial workspace setup to establish a baseline
  • If using a custom workspace.setup_command (e.g., bundle install for Ruby), the same memory considerations apply

Relevant code: docs/user-guide/configuration.md (repos[].workers.memory field)


  • Issue stuck in in-development with a high turn count
  • Multiple failed develop tasks in work_tasks — sprint-master keeps re-enqueueing
  • Developer logs show max turns being hit (look for maxTurns in structured log output)

The issue is too complex or ambiguous for the configured turn limit. The developer exhausts developer_max_turns without completing the task, gets blocked, and the sprint-master retries — creating a loop.

Use colony why for a quick diagnosis of why the issue is stuck:

colony why <N>

Confirm workers are running and not stuck on a different issue before examining task history:

colony workers

Check for repeated develop task failures:

colony tasks --issue <N>
SQL fallback (no CLI access)
SELECT id, task_type, status, created_at, updated_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check developer logs for max turns indicators:

  • The developer executor logs maxTurns in its structured output when starting development
  • When the turn limit is hit, the result includes isMaxTurns: true
  • The failure tracker (packages/core/src/failure-tracker.ts) counts consecutive failures per issue

Stop the retry loop:

/colony:cancel

Comment this on the GitHub issue to close it and stop re-enqueue.

Alternatively, decompose the issue into smaller sub-issues:

/colony:decompose

This sends the issue to the planner, which breaks it into smaller, more tractable sub-issues.

If the issue is close to completion and just needs more turns, bump the limits:

claude:
  scaling:
    large:
      developer_max_turns: 50

Complexity tiers (small, medium, large) each have their own developer_max_turns setting.

  • Use the planner for large issues — comment /colony:decompose before the issue enters development
  • Configure progress detection windows to catch stalls early:
    claude:
      scaling:
        large:
          no_progress_window: 75
  • Write well-scoped issues with clear acceptance criteria — ambiguous issues are the primary driver of timeout loops
  • The failure tracker counts consecutive failures per issue key; after the threshold is exceeded, the issue is blocked to prevent unbounded retries
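The consecutive-failure behavior in the last bullet can be pictured with a small counter keyed by issue. This is a sketch of the general technique, not the contents of packages/core/src/failure-tracker.ts: the class name, method names, and the reset-on-success rule are assumptions, and the threshold value is left to the caller because this guide does not state it.

```python
from collections import defaultdict


class ConsecutiveFailureTracker:
    """Counts back-to-back failures per issue key (hypothetical API)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self._counts = defaultdict(int)

    def record_failure(self, issue_key: str) -> bool:
        """Record one failure; return True once the threshold is exceeded."""
        self._counts[issue_key] += 1
        return self._counts[issue_key] > self.threshold

    def record_success(self, issue_key: str) -> None:
        """A success breaks the streak, so the count starts over."""
        self._counts.pop(issue_key, None)
```

Counting only consecutive failures means an issue that occasionally fails but keeps making progress is not blocked, while one that fails every attempt is stopped after a bounded number of retries.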

Relevant code: packages/developer/src/executor.ts (turn limit logic), packages/core/src/state-transition.ts (slash commands), packages/core/src/failure-tracker.ts (failure counting), docs/user-guide/configuration.md (claude.scaling)


6. Agent Looping on an Approved-but-CI-Red PR

  • Issue stuck in merge-pending or in-review for hours with repeating bot comments every 1–2 minutes
  • Same comment posted dozens of times — typically the merger’s “Branch is already up-to-date — skipped force-push. PR already approved — moving to merge-pending.” with a CI-failure suffix
  • One worker permanently occupied processing back-to-back tasks against this one issue, blocking real work for other issues
  • PR is approved, mergeable, but has a FAILURE CI check that won’t recover without code or infra change

The merger (and similarly-positioned agents) doesn’t distinguish IN_PROGRESS / QUEUED CI checks (truly transient, retry-friendly) from FAILURE checks (terminal, code or infra fix required). When a definitive CI failure exists, the agent keeps re-enqueuing merge tasks every cycle, expecting the CI state to change, which it never does without operator action. The loop is also noisy: the comment-post path lacks deduplication, so each iteration writes a fresh copy of the status comment to the issue thread.

This is the operator-visible surface of the cross-cutting pattern “every retry path needs an explicit terminal classification” (see production-learnings § Cross-Cutting Reliability Patterns).
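The missing terminal classification can be sketched as a three-way decision over the check states. This illustrates the pattern and is not the merger's code: the function name and the returned action strings are hypothetical, and treating any unrecognized state as blocking is a deliberately conservative assumption.

```python
from __future__ import annotations

TRANSIENT = {"IN_PROGRESS", "QUEUED", "PENDING"}  # may still change on their own
TERMINAL = {"FAILURE"}                            # needs a code or infra fix
PASSING = {"SUCCESS"}


def classify_checks(states: list[str]) -> str:
    """Map a PR's check states to 'block', 'retry', or 'merge'."""
    if any(s in TERMINAL for s in states):
        return "block"   # retrying cannot help; surface to the operator
    if any(s in TRANSIENT for s in states):
        return "retry"   # wait for CI to finish, then re-evaluate
    if all(s in PASSING for s in states):
        return "merge"
    return "block"       # unknown state: stop rather than loop
```

Only the "retry" outcome justifies re-enqueuing a merge task. A single FAILURE takes precedence even when other checks are still running.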

Check worker liveness and whether one worker is pinned to this issue:

colony workers

A loop signature: one worker showing task: builtin:merge #<N> (or builtin:review #<N>) that never changes across multiple colony workers invocations, with a continuously refreshing fresh heartbeat — the worker is alive but not advancing the issue state. The dashboard Worker Pool Status panel shows the same data and updates live via SSE.

Check the issue state and recent task history to confirm the loop:

colony tasks --issue <N>
SQL fallback (no CLI access)
SELECT id, task_type, status, claimed_by, created_at, completed_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 15;

A loop signature in the task history: the same task_type (merge or review), the same claimed_by worker, and a new row every 60–120 seconds for hours, each completing without any change to the issue’s state.
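If you export the query result, the signature above can be checked mechanically. The sketch below is a hypothetical helper, not part of the colony CLI. It expects rows as (task_type, claimed_by, created_at) tuples ordered newest first, matching the ORDER BY in the SQL fallback, and the 10-row window is an arbitrary choice.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def looks_like_loop(rows: list[tuple[str, str, datetime]],
                    max_gap: timedelta = timedelta(seconds=120),
                    min_rows: int = 10) -> bool:
    """True when the newest rows share a type and worker and arrive steadily."""
    if len(rows) < min_rows:
        return False  # not enough history to call it a loop
    recent = rows[:min_rows]
    same_type = len({r[0] for r in recent}) == 1
    same_worker = len({r[1] for r in recent}) == 1
    # Rows are newest first, so each gap is row[i] minus the older row[i + 1].
    gaps = [recent[i][2] - recent[i + 1][2] for i in range(len(recent) - 1)]
    steady = all(timedelta(0) < g <= max_gap for g in gaps)
    return same_type and same_worker and steady
```

A True result is a hint, not proof: confirm with the CI check states before breaking the loop.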

Check which CI checks are failing on the PR:

gh pr checks <pr-number> --repo <owner>/<repo>

If any required check is FAILURE (not PENDING), the loop will not resolve itself.

First, determine whether the worker is still live. colony workers shows heartbeat freshness — a fresh worker with a continuously refreshing heartbeat is actively processing; stale or dead means the claim record is orphaned and no worker is acting on it.

Scenario A — stale or dead worker (orphaned claim):

If colony workers shows the worker’s freshness as stale or dead, the work_tasks claim is still held but no worker is processing it. Free the lease immediately:

colony workers reclaim <workerId>

This forcibly returns the worker’s claimed task to pending so any available worker can pick it up. On success:

Reclaimed <taskType> task for issue #<N> (task <taskId>) from worker <workerId>

Scenario B — fresh worker actively looping on a CI-red PR:

If the worker is fresh and actively looping, the issue is a definitive CI failure the merger cannot self-resolve. Break the loop immediately by transitioning the issue to a state the agent won’t keep retrying. Two equally good options:

Option A — request changes on the PR (cleanest if the failure needs developer rework):

gh pr review <pr-number> --repo <owner>/<repo> --request-changes \
  --body "CI blocked by <specific failing check>. Returning to changes-requested while <root cause> is resolved."

This moves the issue to changes-requested and stops the loop.

Option B — pause the issue (cleanest if the failure is infra-side and you’ll come back later):

gh issue edit <N> --repo <owner>/<repo> --add-label colony:paused

After the underlying CI failure is fixed (push a code fix, repair the CI workflow, rotate a broken token), comment /colony:retry on the issue to re-enter the pipeline.
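The two scenarios above reduce to a small decision table keyed on worker freshness and CI state. The sketch below only restates that table. The function name is hypothetical, and the returned strings are reminders of the documented actions rather than commands to run verbatim.

```python
def choose_remediation(freshness: str, has_failed_check: bool) -> str:
    """Pick the documented fix from `colony workers` freshness and CI state."""
    if freshness in {"stale", "dead"}:
        # Scenario A: the claim is orphaned, so free the lease.
        return "reclaim the worker's task"
    if freshness == "fresh" and has_failed_check:
        # Scenario B: a live worker is looping on a definitive CI failure.
        return "request changes on the PR or pause the issue"
    return "keep observing"  # checks may still be running
```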

  • Track and prioritize merger loop and comment-dedup work (#3853, #3854 at time of writing). When these land, hard-failed CI auto-transitions to a block state and the loop class is eliminated.
  • Triage flaky tests on main aggressively (#3851) — a flake produces the same loop-prone state until either fixed or worked around.
  • For repos with custom CI jobs (Cloudflare Pages deploys, content validation, security scans), keep review.checks config aligned with the CI workflow’s check-run names so the reviewer catches breakage locally before the loop class can fire.

Relevant code: packages/merger/src/ (retry logic), packages/sprint-master/src/slash-commands.ts (isDuplicateBotComment reference for the dedup pattern that needs to be shared).


7. Stale Branch Failing CI After main Moved

  • A PR’s CI was passing yesterday and is failing today, with no changes pushed in between
  • The failing check points at a file that isn’t in the branch’s diff (e.g., .gitmodules, package-lock.json, a CI workflow YAML, an SSH/secrets setup step)
  • Multiple unrelated PRs all fail the same way at roughly the same time
  • The error often mentions submodules, missing dependencies, secret/auth setup, or workflow steps the branch never touched

A structural change landed on main — submodule schema change, lockfile bump, CI workflow change, secret rotation — that every open branch must absorb before its CI can pass again. Branches forked before that commit don’t yet have the new shape; their CI re-runs against a checkout that is internally inconsistent (e.g., a gitlink with no matching .gitmodules entry, or a workflow expecting a secret only the new main knows how to fetch).

This is the operator-visible surface of the cross-cutting pattern “trunk drift has O(open branches) blast radius” (see production-learnings § Cross-Cutting Reliability Patterns). The pattern shows up across pipeline-initiated work because there are typically 10+ branches open at any time, all individually exposed to the same main advance.

Confirm the failing check references a file the branch doesn’t touch:

gh pr diff <pr-number> --repo <owner>/<repo> --name-only
gh run view <run-id> --repo <owner>/<repo> --log-failed | head -100

If the failed step references a file not in the diff (commonly .gitmodules, .github/workflows/*.yml, package-lock.json, ssh config, secret-fetch steps), the breakage is on main, not in the branch.
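The comparison above can be written as a small triage helper. It takes the file list from gh pr diff --name-only and the files you identified in the failed log output. The function name and the returned labels are hypothetical, and extracting file paths from CI logs is left to the operator because log formats vary.

```python
from __future__ import annotations


def triage_red_ci(diff_files: list[str], failing_files: list[str]) -> str:
    """Classify a red CI run as stale-vs-main, a feature bug, or unclear."""
    if not failing_files:
        return "inspect"       # nothing to compare; read the log manually
    diff = set(diff_files)
    outside = [f for f in failing_files if f not in diff]
    if len(outside) == len(failing_files):
        return "merge-main"    # every failing file is untouched by the branch
    if not outside:
        return "fix-the-code"  # failures are all inside the branch's diff
    return "inspect"           # mixed signal; could be both problems at once
```

For instance, a branch that only touches src/a.ts but fails on .gitmodules lands in the "merge-main" case.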

Cross-check by looking at recent landings on main for structural changes:

git log --oneline origin/main -20 -- .gitmodules .github/workflows/ package-lock.json

A recent commit touching any of these is almost certainly the trigger.

Merge main into the branch and resolve any conflicts that surface:

git checkout <branch>
git fetch origin main
git merge origin/main
# resolve conflicts (commonly tests with expanded mock chains, submodule pointers)
git push origin <branch>

If the branch is one you opened manually, this is straightforward. If the branch is owned by Colony (a feature branch or epic branch), merging main in is still safe — push the merge commit and the next pipeline cycle will re-run CI with the absorbed change.

If the branch is far behind and conflicts are unrelated to the feature, consider abandoning the branch and re-creating from the current main (the issue may transition back to ready-for-dev to regenerate the work).

  • After landing a structural change on main (submodule add/remove, lockfile bump, CI workflow rewrite, secret rotation), expect every open PR to need a main merge — plan a sweep, not one-by-one repair.
  • Automatically merging main into open branches whenever main advances is the structural fix (tracked as engineering work). Until then, treat every structural change as an “all branches need touchups” event.
  • Distinguish “branch CI red on files not in diff” (stale-vs-main) from “branch CI red on files in diff” (feature bug) in operator triage — they need opposite responses (merge-main vs. fix-the-code).

Relevant code: packages/merger/src/ (where merge-main automation would live), .github/workflows/ (the workflow files most often involved in trunk-drift cascades).