
Top 5 Failure Modes

Operator-facing diagnosis and fix guides for the five most common Colony failure modes. Each section covers symptoms, root cause, diagnosis steps, fix, and prevention.

Postgres is the authority for pipeline state; GitHub labels are a projection for human visibility. Always check Postgres first when diagnosing pipeline issues.

Quick reference:

  • Issue not picked up? → colony status → colony preflight → check intake_mode
  • Agent won't start? → colony doctor → check .colony/logs/<agent>.log
  • Issue stuck in a state? → colony status --db → check work_tasks table
  • Config confusion? → colony validate-config → review effective config

Failure Mode 1: Issue stuck with is_blocked set

Symptoms:

  • Issue stuck in a state with no worker activity
  • work_tasks table has no pending tasks for the issue
  • Dashboard shows the issue as blocked, but there is no obvious reason

Root cause:

The is_blocked flag was set by a transient failure (e.g., push conflict, OOM, subprocess timeout) and never cleared. The monitor’s auto_unblock_transient setting may be disabled, or the issue may have exceeded max_auto_unblocks_per_issue (default: 3).

Diagnosis:

Check which issues are blocked:

SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, r.owner, r.name
FROM pipeline_issues pi
JOIN repos r ON r.id = pi.repo_id
WHERE pi.is_blocked = true;

Check whether there are any pending tasks for the issue:

SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check the monitor logs for auto-unblock activity — the self-healing module logs when it unblocks issues and when it skips issues that have exceeded the unblock cap.

Fix:

Clear the is_blocked flag manually:

UPDATE pipeline_issues
SET is_blocked = false
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');

Alternatively, comment /colony:retry on the GitHub issue to retry with current settings.

Prevention:

Enable automatic unblocking of transient failures in your config:

agents:
  monitor:
    self_healing:
      auto_unblock_transient: true
      max_auto_unblocks_per_issue: 5

The monitor classifies blocking reasons by analyzing the most recent Colony bot comment on the issue. If the reason is transient and the responsible agent is healthy, it automatically clears the is_blocked flag and transitions the issue to a retry state.
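The classification step can be sketched roughly as follows. The pattern list, function name, and return shape here are illustrative assumptions, not the actual API of packages/monitor/src/self-healing.ts:

```typescript
// Illustrative sketch of transient-failure classification.
// Pattern list and names are assumptions, not the real module's API.
const TRANSIENT_PATTERNS: RegExp[] = [
  /push conflict/i,
  /out of memory|OOMKilled/i,
  /subprocess timed? ?out/i,
];

interface UnblockDecision {
  unblock: boolean;
  reason: string;
}

function shouldAutoUnblock(
  latestBotComment: string,   // most recent Colony bot comment on the issue
  unblockCount: number,       // how many times this issue was already auto-unblocked
  maxAutoUnblocks: number,    // max_auto_unblocks_per_issue from config
): UnblockDecision {
  // Respect the per-issue cap before looking at the failure reason.
  if (unblockCount >= maxAutoUnblocks) {
    return { unblock: false, reason: "unblock cap exceeded" };
  }
  const transient = TRANSIENT_PATTERNS.some((p) => p.test(latestBotComment));
  return transient
    ? { unblock: true, reason: "transient failure" }
    : { unblock: false, reason: "non-transient failure" };
}
```

In practice the agent-health check described above would gate the `unblock: true` branch as well.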

Relevant code: packages/pipeline-store/src/pipeline-store.ts (is_blocked column), packages/monitor/src/self-healing.ts (auto-unblock logic)


Failure Mode 2: Merge failures from concurrent PRs

Symptoms:

  • Merge fails — merger logs show rebase failures or drift assessment
  • Multiple PRs targeting main at the same time
  • work_tasks table shows failed merge tasks

Root cause:

Concurrent PRs — when PR A merges into main, PR B’s branch is stale. The merger attempts to rebase PR B onto the updated main branch. If the rebase produces conflicts, the merger runs a drift assessment to determine whether the conflicts are resolvable.

Diagnosis:

Check for failed merge tasks:

SELECT id, issue_number, status, created_at, updated_at
FROM work_tasks
WHERE task_type = 'merge'
AND status = 'failed'
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check merger logs for rebase failure indicators:

  • "Rebase failed" — initial rebase attempt failed
  • "Rebase failed, running drift assessment" — merger is evaluating conflict severity
  • "Unable to rebase automatically" — escalation marker indicating conflicts require intervention

The merger computes a “drift overlap” metric (percentage of line overlap between conflicting code) to assess whether auto-resolution is feasible.
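One way such an overlap metric could be computed is sketched below. The line-range inputs, function names, and the exact formula are assumptions; the real logic lives in packages/merger/src/executor.ts and may differ:

```typescript
// Illustrative drift-overlap sketch: what fraction of the PR branch's
// changed lines collide with lines also changed on main.
interface LineRange {
  start: number; // inclusive
  end: number;   // inclusive
}

function overlapLines(a: LineRange, b: LineRange): number {
  // Number of lines two ranges have in common (0 if disjoint).
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start) + 1);
}

function driftOverlapPercent(
  branchChanges: LineRange[], // lines changed on the PR branch
  mainChanges: LineRange[],   // lines changed on main since the branch point
): number {
  const changed = branchChanges.reduce((n, r) => n + (r.end - r.start + 1), 0);
  if (changed === 0) return 0;
  let overlapping = 0;
  for (const b of branchChanges) {
    for (const m of mainChanges) {
      overlapping += overlapLines(b, m);
    }
  }
  return Math.min(100, (overlapping / changed) * 100);
}
```

A low percentage suggests the conflicts are mechanical and auto-resolvable; a high percentage suggests the two changes rewrote the same code and need human judgment.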

Fix:

The sprint-master automatically re-enqueues merge tasks, and the merger retries the rebase after the conflicting PR has merged. In most cases, the retry succeeds without intervention.

If the issue is stuck:

  • Comment /colony:retry on the GitHub issue to re-enqueue the merge task
  • For complex conflicts, the merger may escalate with a comment containing conflict details, drift estimates, and a recommendation (manual rebase vs. re-implementation)
  • As a last resort, manually rebase the branch: git rebase origin/main and force-push

Prevention:

  • review.rebase_before_check: true (default) ensures the branch is rebased before review checks run, reducing stale-branch scenarios at merge time
  • The merger has built-in retry logic with drift assessment — most concurrent-PR conflicts resolve automatically on the next attempt
  • For repos with high PR throughput, ensure the sprint-master poll interval is short enough to quickly re-enqueue failed merge tasks

Relevant code: packages/merger/src/executor.ts (rebase and drift assessment logic), packages/sprint-master/src/sprint-master.ts (task re-enqueue)


Failure Mode 3: GitHub labels out of sync with pipeline state

Symptoms:

  • GitHub issue shows the wrong or missing colony: label
  • Pipeline is actually progressing — dashboard or Postgres shows the correct state
  • Sprint-master logs show label sync errors (rate limit, network timeout)

Root cause:

A GitHub API rate limit or a transient network error interrupted label sync. Labels are a projection of Postgres state, not the source of truth, so a failed label update does not affect pipeline processing.

Diagnosis:

Check the actual pipeline state in Postgres:

SELECT pi.issue_number, pi.state, pi.is_blocked, pi.is_paused, pi.labels
FROM pipeline_issues pi
WHERE pi.issue_number = <N>
AND pi.repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>');

Compare the state column (authoritative) with the labels array (projection). If they diverge, the label sync will auto-correct on the next poll cycle.

Check sprint-master logs for label sync errors — the syncLabelsFromPostgres() function runs on every poll cycle and reconciles managed labels (state labels, colony:blocked, colony:paused) against GitHub.
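The reconciliation can be pictured as a diff between expected and actual label sets. The diffManagedLabels helper below is illustrative; only syncLabelsFromPostgres() itself is named in the codebase:

```typescript
// Illustrative sketch: compute which managed labels to add/remove on GitHub.
// Only labels under the colony: prefix are reconciled; user labels are untouched.
interface LabelDiff {
  add: string[];
  remove: string[];
}

const MANAGED_PREFIX = "colony:";

function diffManagedLabels(
  expected: string[],        // derived from Postgres state + is_blocked/is_paused
  actualOnGitHub: string[],  // labels currently on the GitHub issue
): LabelDiff {
  const actualManaged = actualOnGitHub.filter((l) => l.startsWith(MANAGED_PREFIX));
  return {
    add: expected.filter((l) => !actualManaged.includes(l)),
    remove: actualManaged.filter((l) => !expected.includes(l)),
  };
}
```

For example, expected ["colony:in-review"] against actual ["colony:in-development", "bug"] yields one add and one remove, and leaves "bug" alone.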

Fix:

Wait one poll cycle — syncLabelsFromPostgres() auto-corrects within the sprint-master’s poll interval (default 30s). The function compares expected labels (derived from Postgres state and is_blocked/is_paused flags) against actual GitHub labels and issues the necessary add/remove calls.

For immediate correction, use the CLI:

npx colony issue transition <issue-number> --state <state>

This is cosmetic — no action needed. Postgres state is authoritative. The sprint-master’s label sync is self-healing by design and catches up automatically. The label sync processes up to 25 issues per cycle (configurable via label_sync_limit).
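If you need a different throughput, label_sync_limit can be raised in the config. The exact placement under the sprint-master agent block is an assumption; verify with colony validate-config:

```yaml
agents:
  sprint_master:
    label_sync_limit: 50   # issues reconciled per poll cycle (default: 25)
```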

Relevant code: packages/sprint-master/src/label-sync.ts (syncLabelsFromPostgres()), packages/core/src/state-transition.ts


Failure Mode 4: Worker OOM during workspace setup

Symptoms:

  • Worker container crashes or is killed during workspace setup
  • Container logs show Killed or OOMKilled
  • The issue gets blocked after repeated setup failures

Root cause:

The default container memory limit is too low for large node_modules trees. npm install can spike memory usage significantly for repos with many dependencies.

Diagnosis:

Check if the container was OOM-killed:

docker inspect <container> | grep OOMKilled

Or check container logs:

docker compose logs worker | grep -i killed

Check the work task failure history:

SELECT id, task_type, status, created_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
AND status = 'failed'
ORDER BY created_at DESC
LIMIT 10;

Fix:

Increase the worker memory limit in your config:

repos:
  - owner: my-org
    name: my-repo
    workers:
      memory: '6g'

For very large repos (thousands of dependencies), use '8g' or higher. After updating the config, rebuild and restart the worker containers.

Prevention:

  • Set repos[].workers.memory based on the target repo’s dependency tree size
  • Monitor container memory usage during initial workspace setup to establish a baseline
  • If using a custom workspace.setup_command (e.g., bundle install for Ruby), the same memory considerations apply

Relevant code: docs/user-guide/configuration.md (repos[].workers.memory field)


Failure Mode 5: Developer max-turns retry loop

Symptoms:

  • Issue stuck in in-development with a high turn count
  • Multiple failed develop tasks in work_tasks — sprint-master keeps re-enqueueing
  • Developer logs show max turns being hit (look for maxTurns in structured log output)

Root cause:

The issue is too complex or ambiguous for the configured turn limit. The developer exhausts developer_max_turns without completing the task, gets blocked, and the sprint-master retries — creating a loop.

Diagnosis:

Check for repeated develop task failures:

SELECT id, task_type, status, created_at, updated_at
FROM work_tasks
WHERE issue_number = <N>
AND repo_id = (SELECT id FROM repos WHERE owner = '<owner>' AND name = '<repo>')
ORDER BY created_at DESC
LIMIT 10;

Check developer logs for max turns indicators:

  • The developer executor logs maxTurns in its structured output when starting development
  • When the turn limit is hit, the result includes isMaxTurns: true
  • The failure tracker (packages/core/src/failure-tracker.ts) counts consecutive failures per issue
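The consecutive-failure counting in the last bullet can be sketched as follows. The FailureTracker class shape and method names are illustrative; only the file path and the per-issue-key counting behavior come from this guide:

```typescript
// Illustrative sketch of consecutive-failure counting per issue key.
// The real packages/core/src/failure-tracker.ts may be shaped differently.
class FailureTracker {
  private counts = new Map<string, number>();

  constructor(private threshold: number) {}

  /** Record a failure; returns true when the issue should be blocked. */
  recordFailure(issueKey: string): boolean {
    const n = (this.counts.get(issueKey) ?? 0) + 1;
    this.counts.set(issueKey, n);
    return n > this.threshold; // block once the threshold is exceeded
  }

  /** Any success resets the consecutive-failure count for that issue. */
  recordSuccess(issueKey: string): void {
    this.counts.delete(issueKey);
  }
}
```

Blocking after the threshold is what converts an unbounded retry loop into a bounded one that escalates to an operator.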

Fix:

Stop the retry loop:

/colony:cancel

Comment this on the GitHub issue to close it and stop further re-enqueues.

Alternatively, decompose the issue into smaller sub-issues:

/colony:decompose

This sends the issue to the planner, which breaks it into smaller, more tractable sub-issues.

If the issue is close to completion and just needs more turns, bump the limits:

claude:
  scaling:
    large:
      developer_max_turns: 50

Complexity tiers (small, medium, large) each have their own developer_max_turns setting.
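A full tier configuration might look like the fragment below. The small and medium values are illustrative placeholders; only the keys mirror the large example above:

```yaml
claude:
  scaling:
    small:
      developer_max_turns: 15
    medium:
      developer_max_turns: 30
    large:
      developer_max_turns: 50
```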

Prevention:

  • Use the planner for large issues — comment /colony:decompose before the issue enters development
  • Configure progress detection windows to catch stalls early:

    claude:
      scaling:
        large:
          no_progress_window: 75
  • Write well-scoped issues with clear acceptance criteria — ambiguous issues are the primary driver of timeout loops
  • The failure tracker counts consecutive failures per issue key; after the threshold is exceeded, the issue is blocked to prevent unbounded retries

Relevant code: packages/developer/src/executor.ts (turn limit logic), packages/core/src/state-transition.ts (slash commands), packages/core/src/failure-tracker.ts (failure counting), docs/user-guide/configuration.md (claude.scaling)