Multi-Agent Orchestration: Lessons from Running 31 Repos with AI

7,413 commits. 31 repos. Multiple AI agents running in parallel across infrastructure, applications, and tooling.

This isn't a demo. It's what happens when you try to scale AI-assisted development to an actual workload: DoD and Intel environments, GPU clusters, Kubernetes platforms, and the tooling that holds it all together.

Here's what I learned.

The Architecture

The workspace is called Gas Town. Each repository is a rig. Each rig has its own issue tracker (beads), its own context, its own agents. A central registry tracks which rigs exist, which agents are assigned, and what work is available.

Gas Town (Mayor)
  ├── Rigs (31 repositories)
  │   ├── agentops
  │   ├── personal_site
  │   ├── gastown_operator
  │   ├── ocpeast, ocphpc, ocppoc...
  │   └── ...27 more
  ├── Beads (per-rig issue tracking)
  ├── Agents (polecat workers)
  └── Knowledge (flywheel across all rigs)

The key design decision: agents are stateless workers. They pick up work from a hook, execute it, and hand back the results. No agent owns a rig permanently. No agent carries state between sessions. All state lives in the rig: in git, in beads, in .agents/ directories.

This is the cattle-not-pets pattern from infrastructure. Agents are cattle. Rigs are the infrastructure.

Lesson 1: Workers Write, Leads Commit

The first time I ran multiple agents against the same repo, the git index was corrupted within 20 minutes.

Two agents writing to the same git index simultaneously is a race condition. Git's index file isn't designed for concurrent writers. The result: corrupted index, lost work, manual recovery.

The fix: workers write files. A team lead commits. One agent owns the git index at any time. Workers produce artifacts. The lead stages, reviews, and commits them.

Worker A → writes files → ready for review
Worker B → writes files → ready for review
Worker C → writes files → ready for review
          ↓
Team Lead → reviews → stages → commits (one at a time)

This is the same pattern as a merge queue in CI. Serialize the commits. Parallelize the work.
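A minimal sketch of the pattern, assuming workers report finished work on an in-process queue. The names `ReadyChange`, `run_lead`, and `git_commit` are hypothetical and not part of Gas Town; the only real commands used are `git add` and `git commit`.

```python
import queue
import subprocess
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReadyChange:
    worker: str       # which worker produced the files
    bead_id: str      # issue the files belong to (hypothetical ID format)
    paths: List[str]  # files the worker wrote


def git_commit(change: ReadyChange, repo_dir: str = ".") -> None:
    """Stage and commit one change. Only the lead ever calls this."""
    subprocess.run(["git", "add", "--", *change.paths], cwd=repo_dir, check=True)
    message = f"{change.bead_id}: work from {change.worker}"
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)


def worker(name: str, bead_id: str, paths: List[str], ready: queue.Queue) -> None:
    # A real worker would write its files here. It never touches git.
    ready.put(ReadyChange(name, bead_id, paths))


def run_lead(ready: queue.Queue, commit: Callable[[ReadyChange], None]) -> int:
    """Drain the queue on one thread so commits happen one at a time."""
    committed = 0
    while True:
        change = ready.get()
        if change is None:  # sentinel: every worker has finished
            return committed
        commit(change)
        committed += 1


def demo():
    ready = queue.Queue()
    log = []  # stands in for the git history in this demo
    threads = [
        threading.Thread(
            target=worker,
            args=(f"worker-{c}", f"ps-0{i}", [f"src/{c}.py"], ready),
        )
        for i, c in enumerate("abc", start=1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ready.put(None)
    count = run_lead(ready, lambda change: log.append(change.bead_id))
    return count, sorted(log)
```

Passing `git_commit` instead of the lambda would make the lead write real commits. The work runs in parallel; the only code path that touches the index runs on a single thread.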

Lesson 2: Pre-Identify Shared Files

Parallel agents will collide on shared files. Config files, type definitions, index modules, route registrations. Anything that multiple features touch.

Discovering this during implementation means merge conflicts, duplicated work, and agents overwriting each other's changes. Discovering it during planning means you can choose a coordination strategy up front:

1. Assign shared files to one agent who handles all changes
2. Sequence the work so shared-file changes happen in order
3. Define interfaces so agents work against contracts, not shared state

Option 1 is simplest. Option 3 is best. In practice, I use a mix depending on how intertwined the shared files are.

The pattern: during planning, grep for files that appear in multiple beads. Flag them. Decide the coordination strategy before anyone writes code.
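The planning check can be sketched in a few lines, assuming each bead already lists the files it expects to touch. The `plan` mapping and the function name `shared_files` are illustrative, not part of beads.

```python
from collections import defaultdict


def shared_files(bead_files):
    """Return {file: [bead IDs]} for every file touched by more than one bead."""
    touched = defaultdict(set)
    for bead_id, files in bead_files.items():
        for path in files:
            touched[path].add(bead_id)
    return {
        path: sorted(beads)
        for path, beads in sorted(touched.items())
        if len(beads) > 1
    }


# Hypothetical plan: bead ID -> files that bead expects to modify.
plan = {
    "ps-01": ["src/routes.py", "src/search.py"],
    "ps-02": ["src/routes.py", "src/profile.py"],
    "ps-03": ["src/types.py", "src/profile.py"],
}

for path, beads in shared_files(plan).items():
    print(f"{path}: touched by {', '.join(beads)}")
```

Every file this prints needs a coordination decision before any agent starts writing.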

Lesson 3: One Commit Per Issue

Every bead gets its own commit. No batching. No "fixed a few things" commits.

This isn't just cleanliness. It's operational infrastructure.

When something breaks in production, git bisect needs atomic commits to find the cause. When a feature needs to be reverted, a clean single commit reverts cleanly. When you're reviewing agent output, one commit per issue lets you evaluate each unit of work independently.

# Clean revert
git revert abc123   # Reverts exactly one feature

# vs. batched commits
git revert def456   # Reverts three features, two of which were fine

The agents resist this, by the way. They want to batch. They want to "also fix this other thing I noticed." Fight it. One bead, one commit.
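One way to enforce this mechanically is to reject any commit message that doesn't reference exactly one bead. The sketch below assumes bead IDs look like `ps-03` (lowercase prefix, hyphen, digits); that format is inferred from the examples in this post and the regex is a guess, not the beads specification.

```python
import re

# Assumed bead ID shape, e.g. "ps-03". This will also match unrelated
# tokens such as "utf-8", so a real check should use the tracker's own IDs.
BEAD_ID = re.compile(r"\b[a-z]{2,}-\d+\b")


def references_one_bead(message: str) -> bool:
    """True if the commit message mentions exactly one distinct bead ID."""
    return len(set(BEAD_ID.findall(message))) == 1


print(references_one_bead("ps-03: add route registration"))  # one bead
print(references_one_bead("ps-03, ps-04: misc fixes"))       # batched
print(references_one_bead("fixed a few things"))             # no bead
```

A function like this could be called from a commit-msg hook or from the lead agent before it commits.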

Lesson 4: Fewer Workers, More Waves

My first instinct was maximum parallelism. Five agents, five features, everything at once. Fast.

Reality: foundation work can't be parallelized effectively. When multiple agents need to modify the same architectural layer (adding a new module type, changing a shared interface, restructuring a directory) parallel execution creates cascading merge conflicts.

The fix: waves.

Wave 1 (2 agents): Foundation — types, interfaces, shared config
Wave 2 (3 agents): Features — independent modules built on wave 1
Wave 3 (2 agents): Integration — connect features, update routes, final tests

Each wave completes and commits before the next wave starts. Within a wave, agents work on files that don't overlap. Between waves, the shared state is stable.

More waves, fewer agents per wave, clear handoff points. On paper the wall-clock time is slower, but there are dramatically fewer failures. Net throughput ends up higher because you're not spending half your time resolving conflicts.
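A rough sketch of how waves could be planned, assuming each task declares the files it touches and the tasks it depends on. `Task` and `plan_waves` are hypothetical names. The greedy rule places each task in the earliest wave that comes after all of its dependencies and has no file overlap with tasks already in that wave.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    files: set
    deps: set = field(default_factory=set)


def plan_waves(tasks):
    """Group tasks into waves. Tasks must be listed after their dependencies."""
    waves = []   # each wave is a list of Task
    placed = {}  # task name -> index of the wave it landed in
    for task in tasks:
        # Earliest wave strictly after every dependency's wave.
        i = max((placed[dep] + 1 for dep in task.deps), default=0)
        while True:
            if i == len(waves):
                waves.append([])
            # Within a wave, no two tasks may touch the same file.
            if all(task.files.isdisjoint(other.files) for other in waves[i]):
                break
            i += 1
        waves[i].append(task)
        placed[task.name] = i
    return [[t.name for t in wave] for wave in waves]


tasks = [
    Task("types", {"types.py"}),
    Task("config", {"config.py"}),
    Task("feature_a", {"a.py"}, {"types"}),
    Task("feature_b", {"b.py"}, {"types", "config"}),
    Task("routes_a", {"routes.py"}, {"feature_a"}),
    Task("routes_b", {"routes.py"}, {"feature_b"}),
]

for number, names in enumerate(plan_waves(tasks), start=1):
    print(f"Wave {number}: {', '.join(names)}")
```

The two route tasks land in separate waves because they share `routes.py`. The sketch does not cap the number of agents per wave; that limit would be an extra condition in the placement loop.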

Lesson 5: The Hook System

Agents need a way to find work. Not "here's a task," which requires a human in the loop for every assignment. The system should be self-serve.

The hook pattern: available work hangs on a hook. Agents check the hook, grab work, execute, return results.

gt hook              # What's on my hook?
gt sling ps-03 neo   # Assign work to an agent
bd ready             # What work is available?

The hook is a coordination primitive. It answers "what should I work on next?" without requiring a human to answer that question every time. Agents pull work; humans stock the hook.

This scales. Five agents checking the hook in parallel will each grab different work. The hook is the load balancer.
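For agents to each grab different work, claiming has to be atomic. Here is a minimal in-memory sketch of that property, assuming a single process with one thread per agent. The `Hook` class is illustrative and says nothing about how `gt hook` is implemented.

```python
import threading
from collections import deque


class Hook:
    """Holds ready work. A lock makes each claim atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = deque()
        self.claimed = {}  # bead ID -> agent that claimed it

    def stock(self, bead_id):
        # Humans (or a planner) add available work.
        with self._lock:
            self._ready.append(bead_id)

    def claim(self, agent):
        # Agents pull the next bead, or get None when the hook is empty.
        with self._lock:
            if not self._ready:
                return None
            bead_id = self._ready.popleft()
            self.claimed[bead_id] = agent
            return bead_id


def run_agents(hook, agents):
    """Run one thread per agent until the hook is empty."""
    results = {agent: [] for agent in agents}

    def loop(agent):
        while (bead_id := hook.claim(agent)) is not None:
            results[agent].append(bead_id)  # a real agent would do the work here

    threads = [threading.Thread(target=loop, args=(agent,)) for agent in agents]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the pop and the ownership record happen under the same lock, no bead can be handed to two agents. How evenly the work spreads depends on how long each bead takes.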

Lesson 6: Knowledge Crosses Rigs

A learning from the personal_site rig applies to the agentops rig. A pattern discovered in ocpeast matters for ocphpc. Knowledge doesn't respect repository boundaries.

The flywheel operates at the workspace level, not the rig level. When an agent discovers that "acceptance checks must be token-specific, not category-level," that learning applies everywhere, not just the repo where it was discovered.

This means the inject system needs to be cross-rig. An agent starting work on the CI container should have access to learnings from the Kubernetes platform rigs, because the failure patterns are similar.

ao inject            # Loads workspace-level knowledge, not just rig-level

The cross-pollination effect is real. Patterns from infrastructure work improved my application development. Patterns from writing tooling improved my infrastructure automation. The knowledge flywheel works best when it's not siloed.
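To make the workspace-level idea concrete, here is a hedged sketch of a lookup that ignores rig boundaries and filters by topic instead. The `Learning` record, the tags, and the `inject` function are assumptions for illustration; they are not the data model behind `ao inject`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Learning:
    rig: str         # rig where the learning was discovered
    tags: frozenset  # topics it applies to (hypothetical tagging scheme)
    text: str


def inject(learnings, current_rig, tags):
    """Return learnings from any rig that match the tags, own rig first."""
    wanted = set(tags)
    relevant = [item for item in learnings if item.tags & wanted]
    # Stable sort: learnings from the current rig come first,
    # everything else keeps its original order.
    return sorted(relevant, key=lambda item: item.rig != current_rig)


store = [
    Learning("ocpeast", frozenset({"kubernetes"}), "learning from a platform rig"),
    Learning("personal_site", frozenset({"frontend"}), "learning from the site rig"),
    Learning("agentops", frozenset({"ci", "kubernetes"}), "learning from the tooling rig"),
]

for item in inject(store, current_rig="agentops", tags={"kubernetes"}):
    print(f"[{item.rig}] {item.text}")
```

The filter is on topic, not on rig, which is the point: the agent in `agentops` still sees what `ocpeast` learned.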

What Breaks

Honesty section. Here's what still doesn't work well:

Context overflow. Large repos with deep dependency trees blow past the 40% context budget. Agents start hallucinating file paths, inventing APIs, and confidently referencing code that doesn't exist. The fix is aggressive scoping: never load the whole repo, only load what's relevant to the current bead.
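Aggressive scoping can be sketched as a budget check before anything is loaded. This assumes the caller already has a priority-ordered list of files relevant to the bead, and it estimates tokens at roughly four characters each, which is a crude heuristic and not a real tokenizer. The function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)


def scope_context(files, relevant, window_tokens, budget_fraction=0.40):
    """Load relevant files in priority order until the budget is spent.

    files: {path: contents}; relevant: paths ordered most important first.
    Returns (loaded paths, skipped paths, tokens used).
    """
    budget = int(window_tokens * budget_fraction)
    used = 0
    loaded, skipped = [], []
    for path in relevant:
        cost = estimate_tokens(files[path])
        if used + cost <= budget:
            loaded.append(path)
            used += cost
        else:
            skipped.append(path)  # too large for what's left of the budget
    return loaded, skipped, used
```

Anything in the skipped list is a signal to narrow the bead or summarize the file, not to raise the budget.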

Cross-repo dependencies. When a change in one rig requires a coordinated change in another, the orchestration gets manual fast. No good automation for "update the API in rig A, then update the client in rig B, then test the integration." This is a gap.

Agent drift. Long sessions where the agent gradually loses the plot. Starts strong, makes good progress, then somewhere around the 40-minute mark begins making changes that don't serve the original goal. The fix is short sessions with hard scope boundaries. But it means more session overhead.
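A hard scope boundary can be checked mechanically at the end of a session. The sketch below assumes each bead declares glob patterns for the files it is allowed to change; the patterns and the `out_of_scope` helper are hypothetical. It uses Python's standard `fnmatch`, where `*` also matches path separators.

```python
from fnmatch import fnmatch


def out_of_scope(changed_paths, allowed_patterns):
    """Return the changed paths that match none of the bead's allowed patterns."""
    return [
        path
        for path in changed_paths
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical bead scope and the files an agent actually changed.
allowed = ["src/routes/*", "tests/routes/*"]
changed = ["src/routes/search.py", "tests/routes/test_search.py", "README.md"]

print(out_of_scope(changed, allowed))
```

A non-empty result is a cheap drift signal: the session touched something the bead never asked for.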

The Scale Test

31 repos isn't a toy example. It includes:

Kubernetes platform management (3 clusters)
GPU infrastructure (100+ GPUs)
Application deployment (50+ AI applications)
Developer tooling (CLI tools, MCP servers, automation)
Knowledge management (the flywheel itself)

The orchestration patterns that work at this scale aren't sophisticated. They're simple, enforced, and relentlessly consistent. Serialize commits. Pre-identify shared files. One commit per issue. Waves over parallelism. Knowledge crosses boundaries.

No magic. Just operational discipline applied to a new domain.
