Operator

You tend the colony. You don’t write the code Colony writes, you don’t review every PR, you don’t define the work. You make sure the colony has the conventions, capacity, and tools it needs to do its job — and you step in when something goes wrong.

If this is you: platform engineers, tech leads, the person who owns the deployment, the person on-call when “Colony is broken” surfaces.

  1. Configure — set the colony’s conventions, repo policies, and worker behavior
  2. Allocate capacity — size the worker pool for current and projected workload
  3. Intervene — fix work that’s stuck, stalled, or oscillating
  4. Recover — handle the failure cases that escape the colony’s automatic recovery

Per-repo configuration lives in .colony/conventions.md (and optional adjacent files). This is the most important thing you’ll write — it shapes every analyzer, developer, and reviewer interaction.

Key sections to fill in:

  • Tech stack — what languages, frameworks, build tools the repo uses
  • Coding conventions — file layout, naming, import styles, testing approach
  • Forbidden patterns — what not to do (often more useful than what to do)
  • Review priorities — what the human Reviewer cares about most
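
A quick way to keep those four sections from going missing is to check for them automatically. The sketch below is only an illustration: it assumes each section is written as a Markdown heading whose title matches the list above, which this page does not specify, and `missing_sections` and `SAMPLE` are hypothetical names, not part of Colony.

```python
import re

# Section names come from the list above. Treating them as Markdown
# heading titles is an assumption, not a documented Colony requirement.
REQUIRED_SECTIONS = (
    "Tech stack",
    "Coding conventions",
    "Forbidden patterns",
    "Review priorities",
)

# Hypothetical file content used only to exercise the check.
SAMPLE = """# Conventions

## Tech stack
Languages, frameworks, and build tools go here.

## Forbidden patterns
Things workers must not do go here.
"""


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    # Collect heading titles at any level ("# Title", "## Title", ...).
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown_text, re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


if __name__ == "__main__":
    print(missing_sections(SAMPLE))  # ['Coding conventions', 'Review priorities']
```

Run against the real file, a check like this could sit in CI so that a conventions file never ships with a section left out.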

Tenant-level configuration (worker pool sizing, label policy, automerge behavior) lives in the dashboard.

Worker pool size controls how many issues can run in parallel. Sizing tradeoffs:

  • Too small: dispatch backs up; issues sit at ready-for-dev with no free worker to pick them up
  • Too large: worker churn, wasted compute, harder to read the dashboard
  • Just right: typical queue depth ≤ pool size; transient bursts drain in minutes, not hours

Watch the dashboard’s queue-depth and worker-utilization graphs for a week before adjusting. Resizing in response to a single bad day usually overcorrects.
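
The sizing rule above can be written down as a small check. This is a sketch under stated assumptions: the samples are typed in by hand from the dashboard graphs (no Colony API is assumed), "typical" is taken to mean the median, and the 0.5 utilization floor is an illustrative threshold rather than a documented one. `recommend_resize` is a hypothetical helper.

```python
from statistics import median


def recommend_resize(
    queue_depths: list[int],
    utilizations: list[float],
    pool_size: int,
    low_utilization: float = 0.5,  # assumed threshold, tune to taste
) -> str:
    """Return "grow", "shrink", or "hold" for a week of samples."""
    if not queue_depths or not utilizations:
        raise ValueError("need at least one sample of each kind")
    # The median ignores a single bad day, which is the point of
    # watching for a week before resizing.
    typical_depth = median(queue_depths)
    typical_utilization = median(utilizations)
    if typical_depth > pool_size:
        return "grow"  # queue is typically deeper than the pool
    if typical_utilization < low_utilization:
        return "shrink"  # workers sit idle most of the time
    return "hold"


if __name__ == "__main__":
    # One sample per day; day five is a burst that should not trigger a resize.
    depths = [3, 4, 2, 5, 12, 4, 3]
    utilization = [0.7, 0.8, 0.6, 0.75, 1.0, 0.7, 0.65]
    print(recommend_resize(depths, utilization, pool_size=6))  # hold
```

Using the median rather than the maximum is what keeps the single burst on day five from forcing a resize.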

When work stalls, the dashboard’s “needs operator” view shows what’s stuck and why. Common interventions:

  • Stalled worker — the mayor’s recovery protocol exhausted retries. Click “retry” or run /colony:retry on the issue. If it stalls again, the underlying problem is in .colony/conventions.md or the issue itself.
  • Cycling reviewer — the same PR has been rejected 3+ times. Read the rejections; usually the convention or the issue is ambiguous. Fix the source, not the symptom.
  • Dependency lock — Colony paused an issue because another PR is in flight. If the lock is spurious, override it on the dashboard. If the lock is real, wait or reorder.
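
The three interventions above amount to a small decision table. The sketch below encodes it so the branching is explicit; `StuckItem`, its fields, and the `kind` strings are hypothetical names chosen for illustration, and nothing here reads from the dashboard.

```python
from dataclasses import dataclass

CYCLING_THRESHOLD = 3  # "rejected 3+ times" in the list above


@dataclass
class StuckItem:
    kind: str  # "stalled_worker", "cycling_reviewer", or "dependency_lock"
    operator_retries: int = 0  # times the operator has already retried
    rejections: int = 0  # reviewer rejections on the same PR
    lock_is_spurious: bool = False  # operator's judgment after checking


def suggest_intervention(item: StuckItem) -> str:
    """Map a stuck item to the intervention described in the text."""
    if item.kind == "stalled_worker":
        if item.operator_retries == 0:
            return "run /colony:retry on the issue"
        # A second stall points at the conventions or the issue itself.
        return "fix .colony/conventions.md or the issue"
    if item.kind == "cycling_reviewer":
        if item.rejections >= CYCLING_THRESHOLD:
            return "read the rejections and fix the ambiguous convention or issue"
        return "no intervention yet"
    if item.kind == "dependency_lock":
        if item.lock_is_spurious:
            return "override the lock on the dashboard"
        return "wait or reorder"
    raise ValueError(f"unknown kind: {item.kind!r}")


if __name__ == "__main__":
    print(suggest_intervention(StuckItem(kind="stalled_worker", operator_retries=1)))
```

Whatever form the triage takes, the aim is to fix the source of the stall rather than retry indefinitely.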

Some failures escape automatic recovery. The handful you’ll see:

  • Orphan PR — PR exists, issue doesn’t (or vice versa). Run /colony:cancel on the issue, or close the PR; Colony reconciles on its next polling cycle.
  • Runaway cost — an issue burns more budget than expected. Run /colony:pause to stop dispatch; investigate the issue and the worker logs; resume with /colony:resume or cancel.
  • Conflict-resolution loop — the merger keeps producing conflicts. Manually rebase the PR or close it and let the developer worker re-attempt against the current main.
  • Outage — workers can’t reach the executor (Claude API, etc.). Pause dispatch; investigate the executor; resume.
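
As one worked case, the runaway-cost procedure above (pause, investigate, then resume or cancel) can be sketched as a function that returns the next slash command to run. The overrun factor of 2.0, the cost units, and the `verdict` values are assumptions made for this example; Colony's own budget tracking is not described on this page.

```python
from typing import Optional


def runaway_cost_step(
    spent: float,
    expected: float,
    paused: bool,
    verdict: Optional[str] = None,  # "keep" or "drop", set after investigating
    overrun_factor: float = 2.0,  # assumed definition of "runaway"
) -> Optional[str]:
    """Return the next slash command for the issue, or None if none is due."""
    if expected <= 0:
        raise ValueError("expected budget must be positive")
    if not paused:
        # Step 1: stop dispatch as soon as spend is clearly out of line.
        return "/colony:pause" if spent > expected * overrun_factor else None
    # Step 2 happens off-line: read the issue and the worker logs.
    # Step 3: act on the operator's verdict.
    if verdict == "keep":
        return "/colony:resume"
    if verdict == "drop":
        return "/colony:cancel"
    return None  # still investigating


if __name__ == "__main__":
    print(runaway_cost_step(spent=50.0, expected=10.0, paused=False))  # /colony:pause
    print(runaway_cost_step(spent=50.0, expected=10.0, paused=True, verdict="keep"))  # /colony:resume
```

Keeping the verdict as an explicit input reflects the text: the pause is mechanical, but the decision to resume or cancel belongs to the operator after reading the logs.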

Slash commands available on any issue or PR:

Command               What it does
/colony:retry         Re-run the worker for the current state
/colony:cancel        Abandon the issue; close any open PR
/colony:pause         Stop dispatch for this issue (resumable)
/colony:resume        Resume after pause
/colony:decompose     Force epic decomposition on the issue
/colony:reimplement   Discard the current PR and re-develop
/colony:review        Re-run the reviewer (also works on external PRs)

Tools and references:

  • .colony/conventions.md — per-repo configuration; lives in the repo
  • Slash commands — see table above
  • Dashboard worker view — Cloud-only; live worker state, queue depth, utilization
  • Recovery runbooks — see Reference when published

Common mistakes to avoid:

  • Resizing the worker pool to “fix” a stuck issue. Capacity isn’t the problem; that issue is. Capacity changes mask the real failure. Run /colony:retry and read the worker logs first.
  • Editing .colony/conventions.md for every one-off rejection. Conventions are for repeating problems. A single rejection might just be a bad issue. Wait for the second occurrence.
  • Running /colony:cancel to “clean up” a backlog. Cancellation breaks the audit trail and may cancel work the Author still needs. Triage first.
  • Pausing dispatch globally because one repo is misbehaving. Pause the issue or the repo, not the colony. Other tenants and repos shouldn’t suffer.