
AgentOps: It's Not Agent Orchestration. It's Context Orchestration.

February 14, 2026 · 14 min read
#ai-agents #agentops #context-orchestration #open-source #vibe-coding #coding-agents #knowledge-flywheel

Everyone, including me, has been calling it "agent orchestration."

Anthropic released agent teams in Opus 4.6. Gas Town. Ralph. Multi-agent this, orchestration that. I've written thousands of words about it. I have a whole article called Multi-Agent Orchestration. I was part of the problem.

But after months of shipping code with these systems, I've realized we've been naming the wrong thing. I'm actively building this stuff, running multi-agent teams on real production work every day, and the thing that actually matters isn't how agents talk to each other. It's what's in their context window when they start working.

The high-leverage problem is context. Loading the right knowledge into the right window at just the right moment. That's where agents succeed or fail. Not in the routing, not in the message passing, not in the communication protocol between them. In what they know when they sit down to work. That's not agent orchestration. That's context orchestration. The industry is pouring effort into the wrong one.

The agent orchestration trap


The Context Window IS the Intelligence

Nobody frames this correctly.

Your agent has a context window. That's it. Every time it starts, it's a blank slate with a text box. It doesn't know what you worked on yesterday. It doesn't know what broke last time. It doesn't know which of the three middleware stacks in your codebase is the right one. It has whatever you put into the window, and nothing else.

The context window is the agent's working memory, its reasoning space, its entire world. Every tool call, every API integration? Scaffolding. The intelligence lives in the window.

So when your agents fail, look at what was actually in the window:

  • Bad answers? Wrong context loaded.
  • Contradictions? Context wasn't shared between agents.
  • Hallucinations? Context was too sparse.
  • Drifting off-task? Context signal-to-noise ratio collapsed.

Every single one of those is a context engineering problem, not a routing problem. You can have the most elegant message bus in the world and your agents will still fail if the window is wrong.

Stop asking "how do I orchestrate my agents?" Start asking "what does this agent need to know right now?"
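
To make that question concrete, here's a toy sketch of what "what does this agent need to know right now" looks like in code. The names are hypothetical (there's no real knowledge_base API implied here); the point is that you assemble each window per task, smallest relevant set first, instead of dumping everything.

```python
def build_window(task_description: str, knowledge_base, token_budget: int = 8000) -> str:
    """Assemble the smallest set of knowledge items relevant to this task."""
    candidates = knowledge_base.search(task_description)  # assumed retrieval call
    window, used = [], 0
    for item in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if used + item.tokens > token_budget:
            break  # stop at the budget instead of dumping everything the agent has ever seen
        window.append(item.text)
        used += item.tokens
    return "\n\n".join(window)
```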

The context window is the intelligence


Why This Matters at Scale

People have figured this out in every other domain, but not yet for agents: you can't individually brief every worker.

If you have a hundred agents, or even five running in parallel, you cannot hand-craft the context for each one. You need a system that ensures each agent gets exactly what it needs, when it needs it, without you being in the loop. That's not optional. That's the only way any of this works.

Think about how any real organization operates. You don't walk up to every person and explain the mission, the constraints, the history, the current state. You build systems: onboarding, documentation, institutional knowledge, review processes. When someone shows up to work, they already have what they need to make good decisions. The organization's intelligence isn't in any single person. It's in the system that loads the right context into the right person at the right moment.

That's what context orchestration is. When you have it, something changes: you stop hoping your agents will do the right thing and start knowing they will. Not because they're smarter. Because you engineered the system that fills their context window. You give your intent, and you can trust the output, because you've built the pipeline that turns intent into the right context for every agent at every step.

That's what I've been working toward. Most of the ecosystem is still building the other thing.

You can't brief every worker


The Three Gaps

The tooling right now is focused on two things: what goes into the context window before the agent starts (specs, plans, context engineering), and how to route work between agents (message buses, state machines). Both important.

But the rest? Validation, learning, the refinement loop that makes each successive context window smarter? Almost nobody's building it. You write the spec. You run the plan. The agent pushes code. Then what? You eyeball it. You run it again when something breaks. You fix the same bug from last week because the agent doesn't remember solving it. The refinement loop, the actual work, you're still doing entirely by hand.

It's like molding clay on a wheel. You shape it, check it, reshape it, check it again. That's not a failure of the process. That's the process. Right now, most of the tooling stops after the first shape.

I spent months testing every framework, every approach, every tool I could find. I kept running into the same three gaps, all context problems:

Validation. Not "does it compile" validation. Judgment validation. Your spec said "add middleware." Your codebase has three middleware stacks. Nobody asked "what could go wrong?" before the agent picked one and wrote 400 lines against it. After it built the thing, nobody checked whether the edge cases were handled. The happy path looked beautiful. Everything else? Broken. The validation context was never loaded.

Learning. You solved a tricky auth bug on Monday. Wednesday the agent makes the exact same mistake. Groundhog Day, except instead of learning piano and ice sculpting, it's rediscovering the same dead-end solutions you already tried last week. Most memory plugins give you recall. Almost none give you learning. The agent doesn't retain judgment across sessions. It just carries notes. The right knowledge never made it into the window.

Closing the loop. What did this session teach us? What should we work on next? That extraction step, pulling learnings out of the work and feeding them back so the next context window starts smarter, is where compounding happens. It's the part everyone skips.

These aren't spec problems. These aren't routing problems. These are context lifecycle problems: everything that happens between "agent wrote code" and "code is actually good and the system learned something."


How AgentOps Was Born

I didn't sit down and design this system. I fell into it.

I was doing the full cycle by hand. Research the codebase, write a plan, run a pre-mortem on the plan, decompose it into waves, crank through the implementation, validate the output, extract learnings. Every time. By hand. I called it "the pottery barn" because I was literally sitting at the wheel shaping clay all day. It was working. The cycle was genuinely producing better code than anything else I tried.

But after a few weeks of the pottery barn, I started getting annoyed. Why am I manually loading context that the system already produced? Why am I running the same validation steps every time? Why am I hand-crafting context scoping that follows the same pattern every time? (This is the part where I spent an entire Saturday doing the exact same six steps I'd done on Friday and thought "I am the bot now.")

The pottery barn

So I started building skills to automate each step. /research to explore the codebase and load prior knowledge. /plan to decompose work. /pre-mortem to validate before building. /crank to orchestrate the workers. /vibe to validate the output. /post-mortem to extract learnings.

Then I started wiring them together. If I always run pre-mortem after plan, why not automate that? If I always extract learnings after implementation, why not make it automatic? That's where the hooks came from. Enforcement that fires whether you remember or not. The system does the right thing by default because I got tired of remembering to do the right thing manually.
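
The enforcement idea itself is small enough to sketch. This is a hypothetical illustration of the shape, not the actual hook code: when one stage finishes, the follow-up fires whether or not anyone remembers it.

```python
# Hypothetical sketch of the enforcement pattern, not the plugin's real hooks:
# when a stage completes, its follow-up runs automatically.
FOLLOW_UPS = {
    "plan": "pre-mortem",
    "implementation": "post-mortem",
}

def on_stage_complete(stage: str, run_skill) -> None:
    follow_up = FOLLOW_UPS.get(stage)
    if follow_up:
        run_skill(follow_up)  # the system does the right thing by default
```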

Somewhere in the middle of all that, I realized what I'd actually built. Not an agent orchestrator. A context orchestrator. Every skill, every hook, every piece of the flywheel: they all do one thing. Put the right context in the right window at the right time. When I give it my intent, I'm not hoping. I know what it's going to do, because I built the pipeline that determines what every agent sees.


What It Actually Does

Here's what context orchestration looks like when it's actually engineered.

AgentOps is a Claude Code plugin suite: 34 skills, 12 enforcement hooks, and a knowledge flywheel. The whole thing is designed around one principle. Treat the context window like a network perimeter.

Least-privilege loading. Agents receive only the context necessary for the task. Not everything, just what matters right now.

Validation gates. Context is vetted for relevance and currency before entering the window. Stale knowledge decays. Bad patterns get demoted.

Auditability. If an agent fails, you can reconstruct exactly what was in the window at that moment.
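
Here's a rough sketch of how those three properties combine. This is illustrative code, not the plugin's internals: filter what's relevant, gate on freshness and confidence, and log everything that gets admitted so a failure can be traced back to its window.

```python
import time
from dataclasses import dataclass

@dataclass
class Learning:
    text: str
    relevance: float    # 0..1: how well this matches the current task
    confidence: float   # 0..1: how much validation this has survived
    created_at: float   # unix timestamp of when it was learned

def admit_to_window(task_id: str, candidates: list[Learning],
                    audit_log: list[dict], max_age_days: float = 30.0) -> list[str]:
    """Gate and record everything that enters the context window."""
    now = time.time()
    admitted = []
    for item in candidates:
        age_days = (now - item.created_at) / 86400
        if age_days > max_age_days or item.confidence < 0.5:
            continue  # validation gate: stale or low-confidence knowledge stays out
        admitted.append(item.text)
        # auditability: record what was loaded, for which task, and why
        audit_log.append({"task": task_id, "text": item.text,
                          "relevance": item.relevance, "age_days": age_days})
    return admitted
```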

One command runs the full cycle:

/rpi "add rate limiting to the API"

Six things happen:

Research. Before touching code, the system checks what it already knows. Previous session learned something relevant? It loads it. Just-in-time, least-privilege context. Not everything the agent has ever seen, just the stuff that matters for this task right now. (I wasted absurd amounts of time re-explaining things the agent had already figured out last Tuesday.)

Plan. Breaks the work into tracked issues with dependency waves. What runs in parallel, what blocks what. Structure, not vibes.

Pre-mortem. My favorite part. Before writing a single line of code, multiple AI judges review the plan and ask "what could go wrong?" This is where you catch the bad assumptions, the missing requirements, the scope creep. A validation gate on the plan's context before it becomes code. Problems found on paper cost nothing. Problems found in code cost your whole evening.

Crank. Spawns workers to build it. Each worker gets fresh context and its own issue, least-privilege by design. The lead coordinates and commits. It's like being a raid leader who just calls out mechanics while the team executes. You're managing work, not writing code.

Vibe. The quality gate. Multiple judges review the code before it enters your repo. An automated code review that can't be guilt-tripped into approving a sketchy PR at 5pm on Friday.

Post-mortem. Extracts what the system learned and proposes what to work on next. This is the part that closes the loop. The learnings from this cycle become the intelligence loaded into the next context window.

Then you do it again. The cycle IS the product. Not any single step. The fact that each cycle makes the next context window smarter than the last.
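
If you squint, the whole cycle is a loop where each stage's output becomes the next stage's context. Here's the shape of one pass; every function below is a stand-in for the corresponding skill, not the actual plugin code.

```python
def rpi_cycle(intent: str, knowledge_base):
    """One pass: research -> plan -> pre-mortem -> crank -> vibe -> post-mortem."""
    context = research(intent, knowledge_base)      # load prior learnings, explore the codebase
    plan = make_plan(intent, context)               # tracked issues with dependency waves
    risks = pre_mortem(plan)                        # judges ask "what could go wrong?"
    plan = revise(plan, risks)                      # fix problems while they're still on paper
    results = crank(plan)                           # parallel workers, fresh context per issue
    verdict = vibe(results)                         # quality gate before anything lands in the repo
    learnings = post_mortem(results, verdict)       # what did this cycle teach the system?
    knowledge_base.store(learnings)                 # the next window starts smarter than this one
```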

The context orchestrator


The Knowledge Flywheel

This is the part that makes the cycle compound. It's the thing I'm most proud of.

Every session, the system learns. Not "dumps text into a file" learns. It validates what it discovered, weighs it by confidence, and injects the relevant pieces into future sessions. Monday teaches Tuesday. Tuesday teaches Wednesday.

In RPGs, there's a difference between a character who levels up and one who just gets new gear. Gear is what most memory plugins give you: stuff you carry around. The flywheel is actual XP. Your agent is leveling up. It's not just remembering "don't use the old middleware." It's developing judgment about middleware patterns in general.

The mechanics: it forges learnings from session transcripts, pools them by quality, applies freshness decay so stale knowledge fades, and injects the relevant ones when you start new work. The good learnings get promoted. The bad ones get demoted. Over time, your agent's context window fills with the right knowledge by default. Not because you remembered to paste it in, but because the system learned what matters.
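
The decay-and-promotion part is the easiest piece to picture. A toy version of the scoring, with made-up parameter values, looks something like this:

```python
import math
import time

def score_learning(confidence: float, relevance: float, created_at: float,
                   half_life_days: float = 14.0) -> float:
    """Weight a stored learning by quality, relevance, and freshness.

    Freshness decays exponentially: after one half-life a learning counts half as
    much, after two a quarter. Promotions raise confidence, demotions lower it,
    and anything below the injection threshold stops getting loaded.
    """
    age_days = (time.time() - created_at) / 86400
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return confidence * relevance * freshness
```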

The flywheel spins

I wrote a deep dive on the flywheel if you want the full mechanics. Short version: your agents compound knowledge instead of starting from scratch every time. After a few weeks, the system has seen enough of your codebase that it stops making the same mistakes. That's what an intelligent context window looks like. Not you at the pottery barn every Saturday, but the system getting smarter each pass.


Where It's Going

Right now you give AgentOps a task and it runs the cycle. That's useful. But that's not the end goal.

The end goal is that you give it your intent (your goals, your product documents, what you're trying to build) and it runs continuous cycles until the work is done. Not one cycle. As many as it takes. It picks the next task based on what the last cycle taught it. Each pass makes the context window more intelligent.

I have an /evolve skill that's getting close to this. It measures where the gaps are, picks the worst one, runs a full cycle on it, extracts learnings, and loops. Give it a product doc and walk away. Come back to progress.

It's not there yet. But so far it's doing pretty well.

If the context lifecycle is fully automated (loading, validation, learning), then there's no reason the system can't run the next cycle on its own. You define what you want to build, and the system figures out how to get there through successive refinement. You can trust it because you've engineered the context pipeline. Every agent gets exactly what it needs, validation catches what slips, learnings compound so the system gets smarter. You're not hoping. You're not micromanaging. You built the system that makes each context window intelligent, and now you let it work.
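
Mechanically, that outer loop isn't complicated; the hard part is everything inside it. A hedged sketch of the shape, with stand-in functions rather than the real /evolve implementation:

```python
def evolve(product_doc: str, knowledge_base, max_cycles: int = 10):
    """Run successive refinement cycles until the gaps close or the budget runs out."""
    for _ in range(max_cycles):
        gaps = measure_gaps(product_doc, knowledge_base)  # where does reality fall short of intent?
        if not gaps:
            break                                         # nothing left worth a full cycle
        target = max(gaps, key=lambda g: g.severity)      # pick the worst gap first
        rpi_cycle(target.as_task(), knowledge_base)       # research -> ... -> post-mortem
        # the cycle's learnings are already stored, so the next
        # measurement starts from a smarter baseline
```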

Intent to trust


Other Context Tools

Beyond the core cycle, these handle specific context problems:

  • /council: Multi-perspective context validation. Spawn parallel AI judges to review anything: code, plans, articles. Debate mode where judges argue with each other.

  • /research: Context loading before work begins. Deep codebase exploration, knowledge base search, structured report.

  • /implement: Single-issue execution with full context lifecycle. Quality gates without spinning up a whole epic.

  • /swarm: Least-privilege parallel execution. The engine underneath /crank.

  • /quickstart: Walks you through your own codebase in under ten minutes.

34 skills total. Browse them after install.


What's Still Broken

I'm not going to pretend this is perfect.

I'm the only user so far. The individual patterns are proven: autonomous loops, multi-model review, knowledge persistence. But wiring them together into one context orchestration system is my hypothesis. I think it's better than the sum of the parts. I've also thought that about three other systems I built that turned out to be Rube Goldberg machines. Fair warning.

The flywheel can inject stale learnings. Last Thursday it confidently applied a middleware pattern from three weeks ago that had been deprecated in a dependency upgrade I'd done on Monday. Broke the build. Spent 90 minutes rolling it back instead of watching the new season of Severance. Decay modeling helps, but it's not fully solved.

Even with all the validation gates, stuff slips through. You still need good monitoring, good rollback, good hygiene. The cycle makes it better, not perfect.

I'm sharing it because I want other people stress-testing it. The more people running this on real work, the faster we find the gaps.


Try It

npx skills@latest add boshu2/agentops --all -g

Run /quickstart. It walks you through your own codebase in under ten minutes. Fork it, break it, make it yours.

Don't want the full system? The minimum viable context loop is three things: persist learnings so they load automatically next session, gate plans through a "what could go wrong?" check before code gets written, and track work in something that outlives the session. That's the smallest engineered cycle that compounds.
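
If you only want the smallest version, here's roughly what those three pieces amount to. This is a hypothetical sketch (any storage or issue tracker works; `judge` is whatever model call you already have):

```python
import json
from pathlib import Path

LEARNINGS = Path("learnings.jsonl")   # 1. persistence that outlives the session
BACKLOG = Path("backlog.jsonl")       # 3. work tracking that outlives the session

def load_learnings() -> list[dict]:
    """Pull these into the window at the start of every session."""
    if not LEARNINGS.exists():
        return []
    return [json.loads(line) for line in LEARNINGS.read_text().splitlines() if line.strip()]

def save_learning(text: str, confidence: float = 0.5) -> None:
    """Persist what this session taught you before the context disappears."""
    with LEARNINGS.open("a") as f:
        f.write(json.dumps({"text": text, "confidence": confidence}) + "\n")

def gate_plan(plan: str, judge) -> str:
    """2. Pre-mortem gate: ask what could go wrong before any code gets written."""
    return judge("Review this plan and list what could go wrong:\n\n" + plan)

def track_task(title: str, status: str = "open") -> None:
    """3. Record work somewhere that survives the session."""
    with BACKLOG.open("a") as f:
        f.write(json.dumps({"title": title, "status": status}) + "\n")
```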

If you're running AI agents on real work, I want to hear what breaks.

Your turn


Where This Came From

I spent months testing every approach I could find and writing about what works when you ship real code with AI agents.