# Top 5 Failure Modes

Operator-facing diagnosis and fix guides for the five most common Colony failure modes. Each section covers symptoms, root cause, diagnosis steps, fix, and prevention.
Postgres is the authority for pipeline state; GitHub labels are a projection for human visibility. Always check Postgres first when diagnosing pipeline issues.
## Quick Decision Tree

- Issue not picked up? → `colony status` → `colony preflight` → check `intake_mode`
- Agent won’t start? → `colony doctor` → check `.colony/logs/<agent>.log`
- Issue stuck in a state? → `colony status --db` → check the `work_tasks` table
- Config confusion? → `colony validate-config` → review the effective config

## 1. Stale is_blocked Flag
### Symptoms
Section titled “Symptoms”- Issue stuck in a state with no worker activity
work_taskstable has no pending tasks for the issue- Dashboard shows the issue as blocked, but there is no obvious reason
### Root Cause

`is_blocked` was set by a transient failure (e.g., push conflict, OOM, subprocess timeout) and never cleared. The monitor’s `auto_unblock_transient` setting may be disabled, or the issue has exceeded `max_auto_unblocks_per_issue` (default: 3).
### Diagnosis Steps

Check which issues are blocked:
```sql
SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, r.owner, r.name
FROM pipeline_issues pi
JOIN repos r ON r.id = pi.repo_id
WHERE pi.is_blocked = true;
```

Check whether there are any pending tasks for the issue:
```sql
SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;
```

Check the monitor logs for auto-unblock activity — the self-healing module logs when it unblocks issues and when it skips issues that have exceeded the unblock cap.
Clear the `is_blocked` flag manually:

```sql
UPDATE pipeline_issues
SET is_blocked = false
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');
```

Alternatively, comment `/colony:retry` on the GitHub issue to retry with current settings.
### Prevention

Enable automatic unblocking of transient failures in your config:

```yaml
agents:
  monitor:
    self_healing:
      auto_unblock_transient: true
      max_auto_unblocks_per_issue: 5
```

The monitor classifies blocking reasons by analyzing the most recent Colony bot comment on the issue. If the reason is transient and the responsible agent is healthy, it automatically clears the `is_blocked` flag and transitions the issue to a retry state.
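As a sketch of how this kind of classification can work (illustrative only, not the actual `self-healing.ts` implementation; the patterns, names, and shapes here are assumptions):

```typescript
// Hypothetical transient-failure patterns matched against the latest bot comment.
const TRANSIENT_PATTERNS = [/push conflict/i, /OOM/i, /timed? out/i];

interface BlockedIssue {
  issueNumber: number;
  lastBotComment: string;   // most recent Colony bot comment on the issue
  autoUnblockCount: number; // how many times this issue was already auto-unblocked
}

function shouldAutoUnblock(issue: BlockedIssue, maxAutoUnblocks: number): boolean {
  // Skip issues that have exhausted the per-issue unblock cap.
  if (issue.autoUnblockCount >= maxAutoUnblocks) return false;
  // Classify the blocking reason: only transient failures qualify.
  return TRANSIENT_PATTERNS.some((p) => p.test(issue.lastBotComment));
}

console.log(shouldAutoUnblock(
  { issueNumber: 7, lastBotComment: 'push conflict on retry', autoUnblockCount: 1 }, 3)); // true
console.log(shouldAutoUnblock(
  { issueNumber: 8, lastBotComment: 'push conflict on retry', autoUnblockCount: 3 }, 3)); // false (cap hit)
```

The cap is what prevents a permanently broken issue from being unblocked forever; once it is hit, the issue stays blocked until an operator intervenes.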
Relevant code: `packages/pipeline-store/src/pipeline-store.ts` (`is_blocked` column), `packages/monitor/src/self-healing.ts` (auto-unblock logic)
## 2. Head Branch Out of Date on Merge

### Symptoms
Section titled “Symptoms”- Merge fails — merger logs show rebase failures or drift assessment
- Multiple PRs targeting main at the same time
work_taskstable shows failedmergetasks
### Root Cause

Concurrent PRs — when PR A merges into main, PR B’s branch is stale. The merger attempts to rebase PR B onto the updated main branch. If the rebase produces conflicts, the merger runs a drift assessment to determine whether the conflicts are resolvable.
### Diagnosis Steps

Check for failed merge tasks:
```sql
SELECT id, issue_number, status, created_at, updated_at
FROM work_tasks
WHERE task_type = 'merge'
  AND status = 'failed'
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;
```

Check merger logs for rebase failure indicators:
"Rebase failed"— initial rebase attempt failed"Rebase failed, running drift assessment"— merger is evaluating conflict severity"Unable to rebase automatically"— escalation marker indicating conflicts require intervention
The merger computes a “drift overlap” metric (percentage of line overlap between conflicting code) to assess whether auto-resolution is feasible.
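One simple way to picture a line-overlap metric (purely illustrative; the actual drift computation in `packages/merger/src/executor.ts` may differ):

```typescript
// Percentage of lines shared between two conflicting hunks: a rough proxy
// for how much the two sides of a conflict actually touch the same code.
function driftOverlapPercent(oursLines: string[], theirsLines: string[]): number {
  const ours = new Set(oursLines);
  const overlapping = theirsLines.filter((line) => ours.has(line)).length;
  const total = Math.max(oursLines.length, theirsLines.length);
  return total === 0 ? 0 : Math.round((overlapping / total) * 100);
}

// Two conflicting hunks that share half their lines:
console.log(driftOverlapPercent(['a = 1;', 'b = 2;'], ['a = 1;', 'b = 3;'])); // 50
```

A low overlap suggests the conflicts are mechanical and auto-resolvable; a high overlap suggests both sides rewrote the same code and a human should decide.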
The sprint-master automatically re-enqueues merge tasks, and the merger retries the rebase after the conflicting PR has merged. In most cases, the retry succeeds without intervention.
If the issue is stuck:

- Comment `/colony:retry` on the GitHub issue to re-enqueue the merge task
- For complex conflicts, the merger may escalate with a comment containing conflict details, drift estimates, and a recommendation (manual rebase vs. re-implementation)
- As a last resort, manually rebase the branch with `git rebase origin/main` and force-push
### Prevention

- `review.rebase_before_check: true` (the default) ensures the branch is rebased before review checks run, reducing stale-branch scenarios at merge time
- The merger has built-in retry logic with drift assessment — most concurrent-PR conflicts resolve automatically on the next attempt
- For repos with high PR throughput, ensure the sprint-master poll interval is short enough to quickly re-enqueue failed merge tasks
Relevant code: `packages/merger/src/executor.ts` (rebase and drift assessment logic), `packages/sprint-master/src/sprint-master.ts` (task re-enqueue)
## 3. Label Projection Failures

### Symptoms
Section titled “Symptoms”- GitHub issue shows the wrong or missing
colony:label - Pipeline is actually progressing — dashboard or Postgres shows the correct state
- Sprint-master logs show label sync errors (rate limit, network timeout)
### Root Cause

GitHub API rate limit or a transient network error during label sync. Labels are a projection of Postgres state, not the source of truth. A failed label update does not affect pipeline processing.
### Diagnosis Steps

Check the actual pipeline state in Postgres:
```sql
SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, pi.labels
FROM pipeline_issues pi
WHERE pi.issue_number = <N>
  AND pi.repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');
```

Compare the `state` column (authoritative) with the `labels` array (projection). If they diverge, the label sync will auto-correct on the next poll cycle.
Check sprint-master logs for label sync errors — the `syncLabelsFromPostgres()` function runs on every poll cycle and reconciles managed labels (state labels, `colony:blocked`, `colony:paused`) against GitHub.

Wait one poll cycle — `syncLabelsFromPostgres()` auto-corrects within the sprint-master’s poll interval (default 30s). The function compares expected labels (derived from Postgres state and the `is_blocked`/`is_paused` flags) against actual GitHub labels and issues the necessary add/remove calls.
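The reconciliation step can be pictured as a set difference over managed labels. A minimal sketch, assuming labels prefixed with `colony:` are the managed ones (the real `label-sync.ts` logic may differ):

```typescript
// Compute which managed labels to add to and remove from a GitHub issue,
// leaving non-managed labels (e.g. "bug") untouched.
function diffManagedLabels(expected: string[], actual: string[]) {
  const isManaged = (label: string) => label.startsWith('colony:');
  const actualManaged = actual.filter(isManaged);
  return {
    add: expected.filter((l) => !actualManaged.includes(l)),
    remove: actualManaged.filter((l) => !expected.includes(l)),
  };
}

console.log(diffManagedLabels(['colony:in-review'], ['colony:in-development', 'bug']));
// { add: [ 'colony:in-review' ], remove: [ 'colony:in-development' ] }
```

Because the diff is recomputed from Postgres state on every cycle, a failed add/remove call simply gets retried on the next pass, which is what makes the sync self-healing.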
For immediate correction, use the CLI:
```sh
npx colony issue transition <issue-number> --state <state>
```

### Prevention

This is cosmetic — no action needed. Postgres state is authoritative. The sprint-master’s label sync is self-healing by design and catches up automatically. The label sync processes up to 25 issues per cycle (configurable via `label_sync_limit`).
Relevant code: `packages/sprint-master/src/label-sync.ts` (`syncLabelsFromPostgres()`), `packages/core/src/state-transition.ts`
## 4. Worker OOM During npm install

### Symptoms
Section titled “Symptoms”- Worker container crashes or is killed during workspace setup
- Container logs show
KilledorOOMKilled - The issue gets blocked after repeated setup failures
### Root Cause

The default container memory limit is too low for large node_modules trees. `npm install` can spike memory significantly for repos with many dependencies.
### Diagnosis Steps

Check if the container was OOM-killed:
```sh
docker inspect <container> | grep OOMKilled
```

Or check container logs:
```sh
docker compose logs worker | grep -i killed
```

Check the work task failure history:
```sql
SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
  AND status = 'failed'
ORDER BY created_at DESC
LIMIT 10;
```

Increase the worker memory limit in your config:
```yaml
repos:
  - owner: my-org
    name: my-repo
    workers:
      memory: '6g'
```

For very large repos (thousands of dependencies), use `'8g'` or higher. After updating the config, rebuild and restart the worker containers.
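When sizing the limit against observed peak usage (e.g. from `docker stats`), it helps to normalize the memory string to bytes. A hypothetical helper, not part of Colony:

```typescript
// Convert a Docker-style memory string like '6g' or '512m' into bytes.
function parseMemoryLimit(limit: string): number {
  const match = /^(\d+)([kmg])$/i.exec(limit.trim());
  if (!match) throw new Error(`unrecognized memory limit: ${limit}`);
  const units: Record<string, number> = { k: 1024, m: 1024 ** 2, g: 1024 ** 3 };
  return Number(match[1]) * units[match[2].toLowerCase()];
}

console.log(parseMemoryLimit('6g')); // 6442450944
```

If the observed peak during `npm install` is within ~20% of the configured limit, bump the limit a tier rather than running right at the edge.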
### Prevention

- Set `repos[].workers.memory` based on the target repo’s dependency tree size
- Monitor container memory usage during initial workspace setup to establish a baseline
- If using a custom `workspace.setup_command` (e.g., `bundle install` for Ruby), the same memory considerations apply
Relevant code: `docs/user-guide/configuration.md` (`repos[].workers.memory` field)
## 5. Planning Timeout Loops

### Symptoms
Section titled “Symptoms”- Issue stuck in
in-developmentwith a high turn count - Multiple failed
developtasks inwork_tasks— sprint-master keeps re-enqueueing - Developer logs show max turns being hit (look for
maxTurnsin structured log output)
### Root Cause

The issue is too complex or ambiguous for the configured turn limit. The developer exhausts `developer_max_turns` without completing the task, gets blocked, and the sprint-master retries — creating a loop.
### Diagnosis Steps

Check for repeated develop task failures:
```sql
SELECT id, task_type, status, created_at, updated_at
FROM work_tasks
WHERE issue_number = <N>
  AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;
```

Check developer logs for max turns indicators:
- The developer executor logs `maxTurns` in its structured output when starting development
- When the turn limit is hit, the result includes `isMaxTurns: true`
- The failure tracker (`packages/core/src/failure-tracker.ts`) counts consecutive failures per issue
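The consecutive-failure counting described above can be sketched as follows (an assumed shape for illustration, not the actual `failure-tracker.ts` code):

```typescript
// Track consecutive failures per issue key; exceeding the threshold
// signals that the issue should be blocked to stop unbounded retries.
class FailureTracker {
  private counts = new Map<string, number>();
  constructor(private threshold: number) {}

  recordFailure(issueKey: string): boolean {
    const next = (this.counts.get(issueKey) ?? 0) + 1;
    this.counts.set(issueKey, next);
    return next > this.threshold; // true → block the issue
  }

  recordSuccess(issueKey: string): void {
    this.counts.delete(issueKey); // any success resets the consecutive count
  }
}

const tracker = new FailureTracker(2);
console.log(tracker.recordFailure('my-org/my-repo#42')); // false (1 failure)
console.log(tracker.recordFailure('my-org/my-repo#42')); // false (2 failures)
console.log(tracker.recordFailure('my-org/my-repo#42')); // true  (threshold exceeded)
```

Keying on consecutive failures (rather than total) means a flaky-but-progressing issue is not blocked, while a genuinely stuck one is caught quickly.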
Stop the retry loop:

```
/colony:cancel
```

Comment this on the GitHub issue to close it and stop re-enqueueing.
Alternatively, decompose the issue into smaller sub-issues:

```
/colony:decompose
```

This sends the issue to the planner, which breaks it into smaller, more tractable sub-issues.
If the issue is close to completion and just needs more turns, bump the limits:

```yaml
claude:
  scaling:
    large:
      developer_max_turns: 50
```

Complexity tiers (`small`, `medium`, `large`) each have their own `developer_max_turns` setting.
### Prevention

- Use the planner for large issues — comment `/colony:decompose` before the issue enters development
- Configure progress detection windows to catch stalls early:

  ```yaml
  claude:
    scaling:
      large:
        no_progress_window: 75
  ```

- Write well-scoped issues with clear acceptance criteria — ambiguous issues are the primary driver of timeout loops
- The failure tracker counts consecutive failures per issue key; after the threshold is exceeded, the issue is blocked to prevent unbounded retries
Relevant code: `packages/developer/src/executor.ts` (turn limit logic), `packages/core/src/state-transition.ts` (slash commands), `packages/core/src/failure-tracker.ts` (failure counting), `docs/user-guide/configuration.md` (`claude.scaling`)