Operator
You tend the colony. You don’t write the code Colony writes, you don’t review every PR, you don’t define the work. You make sure the colony has the conventions, capacity, and tools it needs to do its job — and you step in when something goes wrong.
If this is you: platform engineers, tech leads, the person who owns the deployment, the person on-call when “Colony is broken” surfaces.
What this role does
- Configure — set the colony’s conventions, repo policies, and worker behavior
- Allocate capacity — size the worker pool for current and projected workload
- Intervene — fix work that’s stuck, stalled, or oscillating
- Recover — handle the failure cases that escape the colony’s automatic recovery
Configure
Per-repo configuration lives in `.colony/conventions.md` (and optional adjacent files). This is the most important thing you’ll write — it shapes every analyzer, developer, and reviewer interaction.
Key sections to fill in:
- Tech stack — what languages, frameworks, build tools the repo uses
- Coding conventions — file layout, naming, import styles, testing approach
- Forbidden patterns — what not to do (often more useful than what to do)
- Review priorities — what the human Reviewer cares about most
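
The four key sections above can be checked mechanically before a repo is onboarded. The following is a minimal sketch, not part of Colony: it assumes each section appears in `.colony/conventions.md` as a markdown heading with the exact title used in the list above, which is an assumption rather than a documented schema. The function name `missing_sections` is hypothetical.

```python
import re
from pathlib import Path

# Titles taken from the list above. Whether Colony expects these exact
# headings (or any headings at all) is an assumption of this sketch.
REQUIRED_SECTIONS = [
    "Tech stack",
    "Coding conventions",
    "Forbidden patterns",
    "Review priorities",
]

# Matches ATX-style markdown headings such as "## Tech stack".
_HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section titles that have no matching heading."""
    headings = {m.group(1).strip().lower() for m in _HEADING_RE.finditer(markdown_text)}
    return [title for title in REQUIRED_SECTIONS if title.lower() not in headings]


if __name__ == "__main__":
    path = Path(".colony/conventions.md")
    if path.exists():
        gaps = missing_sections(path.read_text(encoding="utf-8"))
        print("missing sections:", gaps or "none")
    else:
        print(f"{path} not found")
```

A check like this could run in CI so that a conventions file with an empty or missing section is caught before workers start reading it.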
Tenant-level configuration (worker pool sizing, label policy, automerge behavior) lives in the dashboard.
Allocate capacity
Worker pool size controls how many issues can run in parallel. Sizing tradeoffs:
- Too small: dispatch backs up; issues wait at `ready-for-dev` even when ready
- Too large: worker churn, wasted compute, harder to read the dashboard
- Just right: typical queue depth ≤ pool size; transient bursts drain in minutes, not hours
Watch the dashboard’s queue-depth and worker-utilization graphs for a week before adjusting. Resizing in response to a single bad day usually overcorrects.
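
The "just right" rule above (typical queue depth at or below pool size, judged over a week) can be written down as a small check. This is a sketch under stated assumptions: the queue-depth samples are copied by hand from the dashboard graph, "typical" is taken to mean the median, and the threshold for "possibly too large" (peak below half the pool) is an illustrative choice that does not come from Colony.

```python
from statistics import median

# Illustrative threshold only: if even the weekly peak stays under this
# fraction of the pool, the pool may be oversized. Not a Colony setting.
IDLE_PEAK_FRACTION = 0.5


def assess_pool_size(queue_depth_samples: list[int], pool_size: int) -> str:
    """Classify a pool size against a week of queue-depth samples."""
    if not queue_depth_samples:
        raise ValueError("need at least one queue-depth sample")
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")

    typical = median(queue_depth_samples)
    if typical > pool_size:
        # Typical backlog exceeds capacity: dispatch is backing up.
        return "too small"
    if max(queue_depth_samples) < pool_size * IDLE_PEAK_FRACTION:
        # Even the busiest sample leaves most workers idle.
        return "possibly too large"
    return "about right"


# One sample per day for a week, with a single burst on day four.
print(assess_pool_size([2, 3, 4, 12, 3, 2, 3], pool_size=4))
```

Using the median rather than the mean keeps a single bad day from driving the result, which matches the advice against resizing after one outlier.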
Intervene
When work stalls, the dashboard’s “needs operator” view shows what’s stuck and why. Common interventions:
- Stalled worker — the mayor’s recovery protocol exhausted retries. Click “retry” or run `/colony:retry` on the issue. If it stalls again, the underlying problem is in `.colony/conventions.md` or the issue itself.
- Cycling reviewer — the same PR has been rejected 3+ times. Read the rejections; usually the convention or the issue is ambiguous. Fix the source, not the symptom.
- Dependency lock — Colony paused an issue because another PR is in flight. If the lock is spurious, override it on the dashboard. If the lock is real, wait or reorder.
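
The three interventions above amount to a small decision table, which some teams may want in a runbook script. The sketch below only restates the guidance in code. The enum `StuckReason`, the function `next_step`, and its flags are hypothetical names, and nothing here calls Colony.

```python
from enum import Enum


class StuckReason(Enum):
    # Hypothetical labels mirroring the three cases described above.
    STALLED_WORKER = "stalled worker"
    CYCLING_REVIEWER = "cycling reviewer"
    DEPENDENCY_LOCK = "dependency lock"


def next_step(reason: StuckReason, *, already_retried: bool = False,
              lock_is_spurious: bool = False) -> str:
    """Return the suggested operator action for a stuck item."""
    if reason is StuckReason.STALLED_WORKER:
        if already_retried:
            # A second stall points at the inputs, not the worker.
            return "inspect .colony/conventions.md and the issue text"
        return "run /colony:retry on the issue"
    if reason is StuckReason.CYCLING_REVIEWER:
        return "read the rejections, then fix the ambiguous convention or issue"
    if reason is StuckReason.DEPENDENCY_LOCK:
        if lock_is_spurious:
            return "override the lock on the dashboard"
        return "wait for the in-flight PR or reorder the work"
    raise ValueError(f"unknown reason: {reason!r}")


print(next_step(StuckReason.STALLED_WORKER, already_retried=True))
```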
Recover
Some failures escape automatic recovery. The handful you’ll see:
- Orphan PR — PR exists, issue doesn’t (or vice versa). Run `/colony:cancel` on the issue, or close the PR; Colony reconciles on its next polling cycle.
- Runaway cost — an issue burns more budget than expected. Run `/colony:pause` to stop dispatch; investigate the issue and the worker logs; resume with `/colony:resume` or cancel.
- Conflict-resolution loop — the merger keeps producing conflicts. Manually rebase the PR or close it and let the developer worker re-attempt against the current `main`.
- Outage — workers can’t reach the executor (Claude API, etc.). Pause dispatch; investigate the executor; resume.
Slash commands available on any issue or PR:
| Command | What it does |
|---|---|
| `/colony:retry` | Re-run the worker for the current state |
| `/colony:cancel` | Abandon the issue; close any open PR |
| `/colony:pause` | Stop dispatch for this issue (resumable) |
| `/colony:resume` | Resume after pause |
| `/colony:decompose` | Force epic decomposition on the issue |
| `/colony:reimplement` | Discard the current PR and re-develop |
| `/colony:review` | Re-run the reviewer (also works on external PRs) |
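
For operators who audit comment history or build their own tooling around these commands, the table can be mirrored in a small lookup. This is a sketch only: it assumes a command sits on its own line in a comment, which may not match how Colony actually parses comments, and `find_commands` is a hypothetical helper.

```python
import re

# Command names and descriptions copied from the table above.
COMMANDS = {
    "retry": "Re-run the worker for the current state",
    "cancel": "Abandon the issue; close any open PR",
    "pause": "Stop dispatch for this issue (resumable)",
    "resume": "Resume after pause",
    "decompose": "Force epic decomposition on the issue",
    "reimplement": "Discard the current PR and re-develop",
    "review": "Re-run the reviewer (also works on external PRs)",
}

# Assumption: a command occupies a whole line, e.g. "/colony:retry".
_COMMAND_RE = re.compile(r"^/colony:([a-z]+)[ \t\r]*$", re.MULTILINE)


def find_commands(comment_body: str) -> list[str]:
    """Return the known command names found in a comment, in order."""
    return [name for name in _COMMAND_RE.findall(comment_body) if name in COMMANDS]


for name in find_commands("Worker looks stuck.\n/colony:retry\n"):
    print(f"/colony:{name}: {COMMANDS[name]}")
```

Unknown names such as `/colony:explode` are dropped rather than raising, so a typo in a comment shows up as "no command found" instead of an error.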
Where you engage in the Workflow
- Phase 1: Intake — monitor the analyzer queue
- Phase 2: Planning & Dispatch — capacity allocation
- Phase 3: Development — intervene on stalls
- Phase 4: Review — fix the configuration issues behind repeated review loops
- Phase 5: Merge & Close — resolve non-trivial merge conflicts
Your tools
- `.colony/conventions.md` — per-repo configuration; lives in the repo
- Slash commands — see table above
- Dashboard worker view — Cloud-only; live worker state, queue depth, utilization
- Recovery runbooks — see Reference when published
Anti-patterns
- Resizing the worker pool to “fix” a stuck issue. Capacity isn’t the problem; that issue is. Capacity changes mask the real failure. Run `/colony:retry` and read the worker logs first.
- Editing `.colony/conventions.md` for every one-off rejection. Conventions are for repeating problems. A single rejection might just be a bad issue. Wait for the second occurrence.
- Running `/colony:cancel` to “clean up” a backlog. Cancellation breaks the audit trail and may cancel work the Author still needs. Triage first.
- Pausing dispatch globally because one repo is misbehaving. Pause the issue or the repo, not the colony. Other tenants and repos shouldn’t suffer.
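
The "wait for the second occurrence" rule can be applied with a simple tally. The sketch below assumes the operator has noted a short reason string for each rejection by hand; Colony is not assumed to export such labels, and `recurring_reasons` is a hypothetical helper.

```python
from collections import Counter


def recurring_reasons(rejection_reasons: list[str], min_occurrences: int = 2) -> list[str]:
    """Return reasons seen at least `min_occurrences` times, sorted by name."""
    # Normalize case and surrounding whitespace so near-duplicates group together.
    counts = Counter(reason.strip().lower() for reason in rejection_reasons)
    return sorted(reason for reason, n in counts.items() if n >= min_occurrences)


# Hand-written notes from three rejections; only the repeated one qualifies.
notes = ["missing tests", "Missing tests ", "wrong import style"]
print(recurring_reasons(notes))
```

Only the reasons this returns are candidates for a conventions edit; a reason that appears once is more likely a problem with that single issue.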
Going deeper
- Reviewer role — when reviews loop, this is the role that surfaces the symptom you’ll fix
- Team Patterns — multi-repo conventions and freeze windows
- Reference: configuration schema (when published) — every field in `.colony/conventions.md`