
AgentOps Is Context Orchestration

The reliable-agent problem is what context survives, loads, and guides the next execution.

February 14, 2026·13 min read
#ai-agents #agentops #context-orchestration #open-source #vibe-coding #coding-agents #knowledge-flywheel

This essay is part of the reliable AI-assisted delivery trail: proof, method, and judgment for making fast AI work reviewable and safe to ship.

Everyone, including me, has been calling it "agent orchestration."

Anthropic shipped agent teams. My workspace. Ralph. Multi-agent this, orchestration that. I've written thousands of words about it. I have a whole article called Multi-Agent Orchestration. I was part of the problem.

After months of shipping code with these systems, I've realized we've been naming the wrong thing. I'm actively building this stuff, running multi-agent teams on real production work every day, and the thing that actually matters is what's in the context window when an agent starts working.

The high-leverage problem is context: loading the right knowledge into the right window at the right moment. That's where agents succeed or fail. The routing, the message passing, the communication protocol don't move the needle. What they know when they sit down to work does. Call it context orchestration.

The agent orchestration trap


The Context Window IS the Intelligence

Your agent has a context window. That's it. Every time it starts, it's a blank slate with a text box. It doesn't know what you worked on yesterday. It doesn't know what broke last time. It doesn't know which of the three middleware stacks in your codebase is the right one. It has whatever you put into the window, and nothing else.

The context window is the agent's working memory, its reasoning space, its entire world. Tool calls and API integrations are scaffolding. The intelligence lives in the context.

So when your agents fail, look at what was actually in the window:

  • Bad answers? Wrong context loaded.
  • Contradictions? Context wasn't shared between agents.
  • Hallucinations? Context was too sparse.
  • Drifting off-task? Context signal-to-noise ratio collapsed.

Every one of those is a context engineering problem. The most elegant message bus in the world still ships agents that fail if the window is wrong.

The right question is "what does this agent need to know right now?"

The context window is the intelligence


Why This Matters at Scale

Every other domain learned this long ago, but agent builders haven't yet: you can't individually brief every worker.

If you have a hundred agents, or even five running in parallel, you cannot hand-craft the context for each one. You need a system that ensures each agent gets exactly what it needs, when it needs it, without you being in the loop. That's the only way any of this works.

Real organizations operate this way. You don't walk up to every person and explain the mission, the constraints, the history, the current state. You build systems: onboarding, documentation, institutional knowledge, review processes. When someone shows up to work, they already have what they need to make good decisions. The organization's intelligence isn't in any single person; it lives in the system that loads the right context into the right person at the right moment.

That's what context orchestration is. When you have it, something changes: you stop hoping your agents will do the right thing and start knowing they will, because you engineered the system that fills their context window. You give it your intent and trust the output, because you built the pipeline that turns intent into the right context for every agent at every step.

That's what I've been working toward. Most of the ecosystem is still building the other thing.

You can't brief every worker


The Three Gaps

The tooling right now is focused on two things: what goes into the context window before the agent starts (specs, plans, context engineering), and how to route work between agents (message buses, state machines). Both important.

Validation, learning, the refinement loop that makes each successive context window smarter: almost nobody's building it. You write the spec. You run the plan. The agent pushes code. Then what? You eyeball it. You run it again when something breaks. You fix the same bug from last week because the agent doesn't remember solving it. The refinement loop, the actual work, you're still doing entirely by hand.

It's like molding clay on a wheel. Shape it, check it, reshape it, check it again. That is the process. Right now, most of the tooling stops after the first shape.

I spent months testing every framework, every approach, every tool I could find. I kept running into the same three gaps, all context problems:

Validation. Judgment validation, not "does it compile" validation. Your spec said "add middleware." Your codebase has three middleware stacks. Nobody asked "what could go wrong?" before the agent picked one and wrote 400 lines against it. After it built the thing, nobody checked whether the edge cases were handled. The happy path looked beautiful. Everything else was broken. The validation context was never loaded.

Learning. You solved a tricky auth bug on Monday. Wednesday the agent makes the exact same mistake. Groundhog Day, except instead of learning piano and ice sculpting, it's rediscovering the same dead-end solutions you already tried last week. Most memory plugins give you recall. Almost none give you learning. The agent doesn't retain judgment across sessions; it just carries notes. The right knowledge never made it into the window.

Closing the loop. What did this session teach us? What should we work on next? That extraction step is where compounding happens. Everyone skips it.

These are context lifecycle problems: everything that happens between "agent wrote code" and "code is actually good and the system learned something."


How AgentOps Was Born

I didn't sit down and design this system. I fell into it.

I was doing the full cycle by hand. Research the codebase, write a plan, run a pre-mortem on the plan, decompose it into waves, crank through the implementation, validate the output, extract learnings. Every time. By hand. I called it "the pottery barn" because I was literally sitting at the wheel shaping clay all day. It was working. The cycle produced better code than anything else I tried.

After a few weeks of the pottery barn, I started getting annoyed. Why am I manually loading context that the system already produced? Why am I running the same validation steps every time? Why am I hand-crafting context scoping that follows the same pattern every time? (This is the part where I spent an entire Saturday doing the exact same six steps I'd done on Friday and thought "I am the bot now.")

The pottery barn

So I started building skills to automate each step. /research to explore the codebase and load prior knowledge. /plan to decompose work. /pre-mortem to validate before building. /crank to orchestrate the workers. /vibe to validate the output. /post-mortem to extract learnings.

Then I started wiring them together. If I always run pre-mortem after plan, why not automate that? If I always extract learnings after implementation, why not make it automatic? That's where the hooks came from: enforcement that fires whether you remember or not. The system does the right thing by default because I got tired of remembering to do the right thing manually.

Somewhere in the middle of all that, I realized what I'd actually built was a context orchestrator. Every skill, every hook, every piece of the flywheel does one thing: put the right context in the right window at the right time. When I give it my intent, I know what it's going to do, because I built the pipeline that determines what every agent sees.


What It Actually Does

This is what context orchestration looks like when it's actually engineered.

AgentOps is a Claude Code plugin suite: skills, enforcement hooks, and a knowledge flywheel. The whole thing is designed around one principle: treat the context window like a network perimeter.

Least-privilege loading. Agents receive only the context necessary for the task. Just what matters right now.

Validation gates. Context is vetted for relevance and currency before entering the window. Stale knowledge decays. Bad patterns get demoted.

Auditability. If an agent fails, you can reconstruct exactly what was in the window at that moment.
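To make the perimeter idea concrete, here is a minimal Python sketch. It is not AgentOps's actual code: the `ContextItem` and `ContextPerimeter` names, the tag-matching rule, and the fixed `max_age` cutoff are all assumptions chosen for illustration. It shows the three properties together: only task-relevant items are admitted, stale items are rejected, and every decision is logged so the window can be reconstructed later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ContextItem:
    """One candidate piece of knowledge (hypothetical shape, not the AgentOps schema)."""
    text: str
    tags: set[str]
    recorded_at: datetime


@dataclass
class ContextPerimeter:
    """Decides what may enter a context window and remembers why."""
    max_age: timedelta
    audit_log: list[dict] = field(default_factory=list)

    def load(self, task_tags: set[str], candidates: list[ContextItem],
             now: datetime) -> list[ContextItem]:
        admitted = []
        for item in candidates:
            # Least privilege: admit only items that share a tag with the task.
            relevant = bool(item.tags & task_tags)
            # Validation gate: reject anything older than max_age.
            fresh = (now - item.recorded_at) <= self.max_age
            decision = "admitted" if (relevant and fresh) else "rejected"
            # Auditability: log every decision, including the rejections.
            self.audit_log.append({
                "text": item.text,
                "relevant": relevant,
                "fresh": fresh,
                "decision": decision,
            })
            if decision == "admitted":
                admitted.append(item)
        return admitted
```

A real system would likely replace the hard age cutoff with a graded relevance score, but the shape stays the same: one choke point that filters, validates, and records.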

One command runs the full cycle:

/rpi "add rate limiting to the API"

Six things happen:

Research. Before touching code, the system checks what it already knows. Previous session learned something relevant? It loads it. Just-in-time, least-privilege context. Just the stuff that matters for this task right now. (I wasted absurd amounts of time re-explaining things the agent had already figured out last Tuesday.)

Plan. Breaks the work into tracked issues with dependency waves. What runs in parallel, what blocks what. Structure, not vibes.

Pre-mortem. My favorite part. Before writing a single line of code, multiple AI judges review the plan and ask "what could go wrong?" This is where you catch the bad assumptions, the missing requirements, the scope creep. A validation gate on the plan's context before it becomes code. Problems found on paper cost nothing. Problems found in code cost your whole evening.

Crank. Spawns workers to build it. Each worker gets fresh context and its own issue, least-privilege by design. The lead coordinates and commits. It's like being a raid leader who just calls out mechanics while the team executes. You manage the work and let the workers write the code.

Vibe. The quality gate. Multiple judges review the code before it enters your repo. An automated code review that can't be guilt-tripped into approving a sketchy PR at 5pm on Friday.

Post-mortem. Extracts what the system learned and proposes what to work on next. The learnings from this cycle become the intelligence loaded into the next context window.

Then you do it again. The cycle IS the product. Each pass makes the next context window smarter than the last.
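The Plan and Crank steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the plugin's implementation: `dependency_waves`, `build_worker_context`, `run_worker`, and `crank` are hypothetical names, and `run_worker` is a stand-in for a real agent session. The sketch groups issues into waves where everything in a wave can run in parallel, then gives each worker a fresh context containing only its own issue and the notes scoped to it.

```python
from concurrent.futures import ThreadPoolExecutor


def dependency_waves(blocked_by: dict[str, set[str]]) -> list[list[str]]:
    """Group issues into waves; every issue in a wave can run in parallel."""
    remaining = {issue: set(deps) for issue, deps in blocked_by.items()}
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        # An issue is ready once everything blocking it has finished.
        ready = sorted(i for i, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError(f"cycle or unknown blocker among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for issue in ready:
            del remaining[issue]
    return waves


def build_worker_context(issue: str, notes_by_issue: dict[str, list[str]]) -> dict:
    """Fresh, least-privilege context: this issue and its own notes, nothing else."""
    return {"issue": issue, "notes": list(notes_by_issue.get(issue, []))}


def run_worker(context: dict) -> str:
    """Stand-in for an agent session; it can see only the context it was handed."""
    return f"{context['issue']}: done with {len(context['notes'])} note(s)"


def crank(blocked_by: dict[str, set[str]],
          notes_by_issue: dict[str, list[str]]) -> list[str]:
    """Run the waves in order; within a wave, run the workers in parallel."""
    results: list[str] = []
    for wave in dependency_waves(blocked_by):
        contexts = [build_worker_context(issue, notes_by_issue) for issue in wave]
        with ThreadPoolExecutor(max_workers=len(contexts)) as pool:
            # pool.map preserves input order, so the lead sees a stable sequence.
            results.extend(pool.map(run_worker, contexts))
    return results
```

For a plan where `routes` is blocked by `config` and `limiter`, and `load-test` is blocked by `routes`, this produces three waves: the two unblocked issues together, then `routes`, then `load-test`.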

The context orchestrator


The Knowledge Flywheel

Every session, the system extracts learnings, weighs them by confidence, applies freshness decay, and injects the relevant pieces into future sessions. Monday teaches Tuesday. Tuesday teaches Wednesday. Over time, your agent's context window fills with the right knowledge by default because the system learned what matters.
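One way to picture the weighting and decay is a score per learning. The sketch below is an assumption-heavy illustration, not the flywheel's real formula: exponential decay with a 14-day half-life, the `min_score` cutoff, and the `Learning` fields are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Learning:
    text: str
    confidence: float  # 0.0 to 1.0, how much the system trusts this learning
    age_days: float    # days since it was extracted or last confirmed


def score(learning: Learning, half_life_days: float = 14.0) -> float:
    """Confidence weighted by freshness: the score halves every half_life_days."""
    return learning.confidence * 0.5 ** (learning.age_days / half_life_days)


def select_for_injection(learnings: list[Learning], top_k: int = 3,
                         min_score: float = 0.2,
                         half_life_days: float = 14.0) -> list[Learning]:
    """Keep the highest-scoring learnings; drop anything that has decayed too far."""
    scored = [(score(item, half_life_days), item) for item in learnings]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in kept[:top_k]]
```

Under these assumed numbers, a learning with confidence 0.9 scores 0.45 after two weeks and about 0.11 after six, at which point it falls below the cutoff and stops being injected unless something re-confirms it.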

The flywheel spins

The full mechanics are in the deep dive. Short version: after a few weeks, the system has seen enough of your codebase that it stops making the same mistakes.


Where It's Going

Right now you give AgentOps a task and it runs the cycle. That's useful. But that's not the end goal.

The end goal: you give it your intent, your goals and product documents, and it runs continuous cycles until the work is done. As many cycles as it takes. It picks the next task based on what the last cycle taught it. Each pass makes the context window more intelligent.

I have an /evolve skill that's getting close to this. It measures where the gaps are, picks the worst one, runs a full cycle on it, extracts learnings, and loops. Give it a product doc and walk away. Come back to progress.
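The loop described above (measure the gaps, pick the worst, run a cycle, extract learnings, repeat) has a simple skeleton. This Python sketch is a guess at the control flow only, not the /evolve skill itself: the gap scores, the `good_enough` threshold, the `max_cycles` cap, and the `run_cycle` callback signature are all assumptions.

```python
from typing import Callable

# Hypothetical cycle signature: (area, current_gap, learnings_so_far) -> (new_gap, lesson)
CycleFn = Callable[[str, float, list[str]], tuple[float, str]]


def evolve(gaps: dict[str, float], run_cycle: CycleFn,
           good_enough: float = 0.25,
           max_cycles: int = 10) -> tuple[dict[str, float], list[str]]:
    """Repeatedly work on the worst gap until every gap is small enough."""
    gaps = dict(gaps)  # copy so the caller's measurements are not mutated
    learnings: list[str] = []
    for _ in range(max_cycles):
        if not gaps:
            break
        worst = max(gaps, key=gaps.get)  # pick the area with the largest gap
        if gaps[worst] <= good_enough:
            break                        # even the worst gap is acceptable
        # Each cycle sees what earlier cycles learned, so context compounds.
        gaps[worst], lesson = run_cycle(worst, gaps[worst], list(learnings))
        learnings.append(lesson)
    return gaps, learnings


def halving_cycle(area: str, gap: float, learnings: list[str]) -> tuple[float, str]:
    """Stub cycle for demonstration: pretends each pass closes half the gap."""
    return gap / 2, f"lesson from {area}"
```

The `max_cycles` cap matters in practice: a cycle that fails to shrink its gap would otherwise be picked again forever.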

It's not there yet, but each pass closes the gap.

If the context lifecycle is fully automated (loading, validation, learning), then the system can run the next cycle on its own. You define what you want to build, and the system figures out how to get there through successive refinement. You can trust it because you've engineered the context pipeline. Every agent gets exactly what it needs, validation catches what slips, learnings compound so the system gets smarter. You built the system that makes each context window intelligent, and now you let it work.

Intent to trust


Other Context Tools

Beyond the core cycle, these handle specific context problems:

  • /council: Multi-perspective context validation. Spawn parallel AI judges to review anything: code, plans, articles. Debate mode where judges argue with each other.

  • /research: Context loading before work begins. Deep codebase exploration, knowledge base search, structured report.

  • /implement: Single-issue execution with full context lifecycle. Quality gates without spinning up a whole epic.

  • /swarm: Least-privilege parallel execution. The engine underneath /crank.

  • /quickstart: Walks you through your own codebase in under ten minutes.

There are more where these came from. Browse them after install.
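The judge pattern behind /council, and behind the pre-mortem and vibe gates in the core cycle, can be sketched as a function that fans an artifact out to several reviewers and aggregates their verdicts. Everything here is hypothetical: the `Verdict` shape, the `quorum` parameter, and the two keyword-matching judges, which stand in for real model calls. Debate mode is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    judge: str
    approve: bool
    concerns: list[str]


Judge = Callable[[str], Verdict]


def council(artifact: str, judges: list[Judge],
            quorum: float = 1.0) -> tuple[bool, list[str]]:
    """Run every judge in parallel; pass only if enough of them approve."""
    if not judges:
        raise ValueError("a council needs at least one judge")
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        verdicts = list(pool.map(lambda judge: judge(artifact), judges))
    approvals = sum(v.approve for v in verdicts)
    concerns = [f"{v.judge}: {c}" for v in verdicts for c in v.concerns]
    return approvals / len(verdicts) >= quorum, concerns


def scope_judge(plan: str) -> Verdict:
    """Toy judge: flags a plan that mentions middleware without naming a stack."""
    ambiguous = "middleware" in plan and "stack" not in plan
    concerns = ["plan does not name a middleware stack"] if ambiguous else []
    return Verdict("scope", not ambiguous, concerns)


def edge_case_judge(plan: str) -> Verdict:
    """Toy judge: flags a plan that never mentions edge cases."""
    missing = "edge case" not in plan
    concerns = ["no edge cases listed"] if missing else []
    return Verdict("edge-cases", not missing, concerns)
```

With the default quorum of 1.0, a bare "add middleware" plan fails both judges and comes back with two concerns, which is exactly the feedback you want before 400 lines get written against the wrong stack.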


What's Still Broken

This has real gaps.

I now run this on real production work at the highest reliability and security bar I've worked under. The individual patterns are proven: autonomous loops, multi-model review, knowledge persistence. Wiring them together into one context orchestration system is the bet I'm still pressure-testing. I think it's better than the sum of the parts. I've also thought that about three other systems I built that turned out to be Rube Goldberg machines. Fair warning.

The flywheel can inject stale learnings. Last Thursday it cheerfully applied a middleware pattern from three weeks ago that had been deprecated in a dependency upgrade I'd done on Monday. Broke the build. Spent 90 minutes rolling it back instead of shipping anything else that evening. Decay modeling helps, but it's not fully solved.

Even with all the validation gates, stuff slips through. You still need good monitoring, good rollback, good hygiene. The cycle improves quality without guaranteeing it.

I'm sharing it because I want other people stress-testing it. The more people running this on real work, the faster we find the gaps.


Try It

npx skills@latest add boshu2/agentops --all -g

Run /quickstart. It walks you through your own codebase in under ten minutes. Fork it, break it, make it yours.

Don't want the full system? The minimum viable context loop is three things: persist learnings so they load automatically next session, gate plans through a "what could go wrong?" check before code gets written, and track work in something that outlives the session. That's the smallest engineered cycle that compounds.
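Those three pieces fit in a few dozen lines. The following Python sketch is one possible shape under stated assumptions, not a prescribed layout: the `learnings.json` and `work.json` file names, the `start_session` and `end_session` functions, and the rule that an empty risk list blocks the session are all hypothetical.

```python
import json
from pathlib import Path


def _read(path: Path) -> list:
    """Load a JSON list from disk, or an empty list if nothing was saved yet."""
    return json.loads(path.read_text()) if path.exists() else []


def start_session(workdir: Path, plan: str, risks: list[str]) -> dict:
    """Gate the plan, then build the starting context from persisted state."""
    if not risks:
        # The gate: no code until someone has answered "what could go wrong?"
        raise ValueError("plan gate: list at least one risk before writing code")
    return {
        "plan": plan,
        "risks": risks,
        # Learnings from earlier sessions load automatically.
        "learnings": _read(workdir / "learnings.json"),
        # Tracked work outlives the session that created it.
        "open_work": [w for w in _read(workdir / "work.json") if not w["done"]],
    }


def end_session(workdir: Path, new_learnings: list[str], work: list[dict]) -> None:
    """Persist what this session learned and the current state of tracked work."""
    learnings = _read(workdir / "learnings.json")
    learnings += [item for item in new_learnings if item not in learnings]
    (workdir / "learnings.json").write_text(json.dumps(learnings, indent=2))
    (workdir / "work.json").write_text(json.dumps(work, indent=2))
```

Call `start_session` before the agent writes anything and `end_session` when it stops. The second session then begins with the first session's learnings and unfinished work already in its context.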

This is the discipline I built for engineers. The same loop, made safe and plain, is what I'm translating for people who aren't.

If you're running AI agents on real work, I want to hear what breaks.

Your turn


Where This Came From

I spent months testing every approach I could find and writing about what works when you ship real code with AI agents.