Devlog #5: When the Platform Catches Up
In devlog 1, the multi-agent setup collapsed under its own weight. In devlog 2, I consolidated from 60+ agents down to 4. In devlog 3, the tools changed under my feet. In devlog 4, I learned the spec is the leverage. This is what happened when I tried to replace everything I built.
Claude 4.6 shipped native Agent Teams. Spawn teammates, send messages, track tasks, shut them down. All built into Claude Code. No MCP servers. No tmux. No external plumbing.
I've been building this exact thing for months.
So I did what anyone would do. I turned off our stack and ran native teams on a real task. Three agents, a feature branch, a Friday afternoon. The messaging was instant. Zero setup. I felt the kind of stupid you feel when you realize you've been hand-rolling something the framework just... ships.
That lasted about two hours.
I closed the terminal to grab coffee. Came back, reopened Claude Code. Everything was gone. The task list, the agent state, the coordination context. Evaporated. I ran bd ready out of habit. My beads were still there. The native work wasn't.
That's when the thesis crystallized.
The plumbing was never the hard part.
What Worked (And Where It Broke)
Credit where it's due. The messaging is genuinely better than what we had. Real peer-to-peer DMs instead of polling a mailbox file. Broadcast with automatic delivery instead of write-and-hope. The shutdown protocol is cleaner than killing tmux panes and hoping nothing was mid-write.
Setup is zero. One environment variable. Our stack requires tmux, an MCP server for Agent Mail, named session management, and a beads directory. Native teams require... nothing.
I spawned three agents on a refactoring task. They picked up work, messaged each other, tracked dependencies. It felt like watching someone rebuild your garage from nicer materials.
Then I needed to validate what they'd done.
There's no validation layer. No /vibe check. No pre-mortem. No "prove this actually works before you mark it done." The agents said "complete" and I was supposed to believe them. If you've followed this series since devlog 1, you know how that ends. It's the "tests passing" lie, except now it's three agents lying in parallel.
I let one agent run overnight. In the morning, it had closed 4 tasks. Two were actually done. One was half-done with a passing test that didn't test what it claimed. One was just... wrong. Confidently, cleanly wrong.
That's 50% accuracy with no audit trail. Our stack catches this. Native teams don't even try.
Plumbing vs. Orchestration
Here's the distinction that keeps surviving every rebuild.
Plumbing is how agents talk to each other. Spawn, message, shutdown. Pipes and sockets. Native teams handle this well.
Orchestration is why they talk. What work to assign. When to validate. How to learn from what just happened. When to stop. What to do when an agent gets stuck. How to recover when you close the terminal (or it closes itself).
Native Teams:
Spawn → Message → Hope → Shutdown
Our Stack:
Research → Plan → Pre-mortem → Spawn → Wave orchestration
→ Validation contracts → Knowledge extraction → Compound
One is plumbing. The other is a system.
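If the distinction feels abstract, here's a rough sketch of the two shapes in Python. Every name in it is invented for illustration, and the validation stub is a stand-in for whatever gate you actually run; the point is only where the checking happens.

```python
# Illustrative only: none of these names come from Claude Code or our stack.

def agent_claims_done(task):
    # Stand-in for "the agent said complete." Always optimistic.
    return True

def validate(task):
    # Stand-in for a real gate: re-run tests, diff the claim against reality.
    # Here it just refuses to believe tasks that name no test at all.
    return bool(task.get("test"))

def plumbing_only(tasks):
    # Native-teams shape: spawn, message, hope, shutdown.
    return [{"id": t["id"], "done": agent_claims_done(t)} for t in tasks]

def orchestrated(tasks, knowledge):
    # Our-stack shape: inject prior context, do the work, gate the claim,
    # then record what was learned so the next run starts further ahead.
    results = []
    for t in tasks:
        t = {**t, "context": knowledge.get(t["id"], [])}
        claimed = agent_claims_done(t)
        verified = claimed and validate(t)
        results.append({"id": t["id"], "done": verified})
        knowledge.setdefault(t["id"], []).append("what actually broke here")
    return results

tasks = [{"id": "refactor-auth", "test": "test_auth.py"}, {"id": "fix-cache"}]
print(plumbing_only(tasks))     # everything "done"
print(orchestrated(tasks, {}))  # only the verifiable work counts
```

The first function happily reports everything done. The second only counts what it can check, and it writes down what it learned on the way out.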
Claude just shipped Docker. The Kubernetes layer is still ours.
Knowledge Compounding: The Gap That Matters
Native teams are a great tutorial zone. Zero setup, instant feedback, you learn the patterns fast. But there's no endgame. No progression system. No XP that persists between runs.
Every native team starts from zero. Spawn agents, do work, session ends, knowledge evaporates. It's permadeath. Every run is a fresh character with no save file, no inherited gear, no memory of the last dungeon.
Our stack compounds.
In The REPL Is Dead, I reported the numbers: 1,005 closed issues, 47 reusable formulas, searchable history across every session. When I start a new task, I search past work first. The system doesn't just execute. It accumulates.
Devlog 4 made the case for spec versioning: my v1 specs now start at the sophistication level of my old v3s. Not because I got smarter. Because I'm querying prior failures. That's compounding.
With native teams, I'd lose that. Every Monday morning would be floor 1 of the roguelike again.
Here's what the flywheel looks like in practice. An agent closes an issue. /forge extracts what it learned: what assumptions were wrong, what context was missing, what pattern emerged. That learning goes into the knowledge pool. Next session, /inject preloads it.
The agent that picks up related work starts with context its predecessor earned.
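For concreteness, here's roughly what that loop reduces to if you strip it down to a JSON file. The real /forge and /inject do more than this; the file layout, function names, and issue ID below are mine, not the stack's.

```python
# A minimal sketch of the flywheel, assuming a plain JSON file as the pool.
import json
from pathlib import Path

POOL = Path("knowledge_pool.json")

def forge(issue_id: str, learnings: list[str]) -> None:
    """After an issue closes, append what was learned to the pool."""
    pool = json.loads(POOL.read_text()) if POOL.exists() else {}
    pool.setdefault(issue_id, []).extend(learnings)
    POOL.write_text(json.dumps(pool, indent=2))

def inject(related_ids: list[str]) -> str:
    """Before the next session, preload relevant learnings as context."""
    pool = json.loads(POOL.read_text()) if POOL.exists() else {}
    lines = [f"- {note}" for rid in related_ids for note in pool.get(rid, [])]
    return "Context from prior issues:\n" + "\n".join(lines)

forge("bd-1042", ["cache invalidation assumed a single writer; it doesn't have one"])
print(inject(["bd-1042"]))  # the next agent starts with this, not from zero
```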
Native teams have none of this. Not the extraction. Not the pooling. Not the injection. Not even the persistence to make extraction possible.
I think this is the gap that matters most. That's an n=1 claim from personal projects, so ask me again in a month. But sessions that start with injected context produce fewer rework cycles. The pattern holds.

I've rebuilt the transport layer three times. The accumulated learnings from 1,005 issues didn't care which messaging system delivered them.
The Trust Problem Scales
I expected this to matter less than it did.
Our stack has a principle baked into the foundation: never trust agent claims. Mandatory verification before marking anything complete. The /vibe check runs automated validation gates. The /pre-mortem simulates failures before you burn implementation cycles.
I thought native teams might not need this, that maybe Claude 4.6 was reliable enough.
It isn't. Not at scale. Not overnight. Not when you're running three agents and can't babysit all of them.
The wave orchestration (the /crank FIRE loop that finds ready work, implements, reports, evaluates) biases toward forward progress because it has gates. Native teams are ad-hoc. You spawn agents and they coordinate through good intentions.
Sometimes that works. Usually it doesn't.
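To make the contrast concrete, here's the shape of a gated loop. Only the FIRE sequence (find ready work, implement, report, evaluate) comes from the actual /crank command; everything else below, the field names and the blocking behavior, is an illustrative guess.

```python
def fire_loop(backlog, implement, evaluate, max_waves=10):
    """Run waves until no ready work remains or the wave budget is spent."""
    for _ in range(max_waves):
        # Find: only unblocked, unfinished work is eligible this wave.
        ready = [t for t in backlog if not t["done"] and not t.get("blocked")]
        if not ready:
            break
        for task in ready:
            report = implement(task)               # the agent's claim
            task["done"] = evaluate(task, report)  # the gate, not the claim
            if not task["done"]:
                task["blocked"] = True             # park it instead of lying
    return backlog

# Hypothetical usage: the evaluator, not the agent, decides what counts as done.
backlog = [{"id": "refactor-auth", "done": False}]
fire_loop(backlog,
          implement=lambda t: "claims complete",
          evaluate=lambda t, r: "complete" in r)
```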
I should've known better. Devlog 1 documented this exact failure mode: "18% of tasks needed a second pass. If you're not validating AI output, you're not shipping code. You're shipping hope."
That was true with one agent. It's more true with three.
Try It
Don't take my word for it. Feel the gap yourself.
- Enable native teams: set CLAUDE_AGENT_TEAMS=1 in your environment
- Spawn 3 agents on a real task, not a demo, something you care about
- Let them work for an hour. Watch the messaging. Appreciate the zero-setup.
- Close the terminal.
- Reopen it.
Everything's gone. The task list. The agent context. The work assignments.
Now build the simplest persistence layer you can. A JSON file that survives the session. A git-backed issue that remembers what was assigned. Anything that outlasts the terminal.
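Something as small as this already changes the experience. The file name and schema here are arbitrary; surviving the terminal is the only point.

```python
# The simplest possible "outlasts the session": a JSON file of assignments
# that gets reloaded on the next run.
import json
from pathlib import Path

STATE = Path(".team_state.json")

def load_state() -> dict:
    return json.loads(STATE.read_text()) if STATE.exists() else {"tasks": {}}

def save_state(state: dict) -> None:
    STATE.write_text(json.dumps(state, indent=2))

state = load_state()
state["tasks"]["refactor-auth"] = {"assigned_to": "agent-2", "status": "in_progress"}
save_state(state)
# Close the terminal, reopen it, run this again: the assignment is still there.
```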
That's the gap. Once you feel it, you'll understand why the orchestration layer exists.
What's Next
Native teams are real, and they'll get better. Persistence will come. Maybe knowledge management after that. The platform is climbing the stack: single-agent chat, then tool use, then subagents, now teams.
The smart move is probably hybrid: take the better messaging and zero-setup, layer our orchestration on top. That's the next experiment.
Last devlog I said the spec is the leverage. That's still true. But the spec only matters if the system remembers what it learned from the last one. Native teams don't. Not yet.
Devlog 6 will be about the knowledge flywheel with actual metrics: what compounds, what doesn't, and whether the whole thesis holds up when I measure it honestly instead of just believing it.
The plumbing got absorbed. The orchestration didn't. The knowledge won't.
Previous: Devlog #4: Why Your AI Coding Specs Keep Failing. Next: the knowledge flywheel, measured.