
12-Factor AgentOps

November 30, 2025 · 12 min read
#ai-agents #devops #infrastructure #open-source #vibe-coding #shift-left #coding-agents

AI agents hallucinate. They lose context mid-session. They claim success on code that doesn't compile. Debugging their output takes longer than writing it yourself.

These are operational problems. Infrastructure had the same problems fifteen years ago, and we fixed them. Not by making servers smarter, but by building practices around unreliable components.

12-Factor AgentOps is DevOps for vibe-coding: shift-left validation for coding agents. Catch it before you ship it.

It applies the same playbook that made infrastructure reliable to AI coding workflows. For why this matters, see The REPL Is Dead, my prediction for where coding agents are headed.


The Core Insight

AI agents fail in the same ways infrastructure used to fail. And we already know how to fix it.

Servers crashed. We didn't make them crash-proof. We built Kubernetes. Deploys broke production. We didn't make deploys perfect. We built CI/CD. Config drift caused outages. We didn't eliminate drift. We built GitOps.

The gap with AI agents isn't intelligence. It's operational discipline.

I've deployed 50+ production AI applications across DoD and Intelligence Community environments. 100+ GPUs in air-gapped networks where "just restart it" isn't an option and "it works on my machine" gets you escorted out. Every failure pattern I've seen with coding agents (context overflow, silent degradation, confident wrong answers) has a direct parallel in infrastructure. And infrastructure already solved it.


Shift-Left Validation

Traditional workflow:

Write code → Ship → CI catches problems → Fix → Repeat

Shift-left workflow:

/pre-mortem → Implement → /vibe → Commit → Knowledge compounds

The validation loop happens BEFORE code ships, not after. This is DevOps applied to coding agents.

| DevOps Principle | Coding Agent Application |
| --- | --- |
| Infrastructure as Code | Prompts and workflows as versioned artifacts |
| Shift-Left Testing | /pre-mortem before implementation |
| Continuous Integration | /vibe checks before every commit |
| Post-Mortems | /retro to extract and compound learnings |
| Observability | Knowledge flywheel tracks what works |

The Three Core Skills

1. /pre-mortem: Simulate failures before implementing

"What could go wrong with this plan?"

Run BEFORE implementing. Identifies risks, missing requirements, edge cases. Validation starts before code exists.

In practice: before my agents touch a K8s operator, /pre-mortem catches things like "this CRD change breaks backward compatibility" or "the reconciler will infinite-loop if the finalizer isn't idempotent." These aren't hypothetical. They're bugs I've shipped. The pre-mortem turns experience into prevention.

2. /vibe: Validate before you commit

"Does this code do what you intended?"

The semantic vibe check. Not just syntax. Does the implementation match intent? Run BEFORE every commit.

This is the admission controller for your codebase. In K8s, you don't let a pod into the cluster without validation. Same principle: no commit enters the repo without a vibe check. Across 7,413 commits in 4 months, the ones that caused rework were almost always the ones I skipped this step on.
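
A minimal way to make that gate mechanical is a git pre-commit hook. The sketch below assumes a hypothetical `vibe-check` wrapper around the /vibe skill; the hook mechanism itself is standard git.

```bash
#!/usr/bin/env bash
# .git/hooks/pre-commit -- block commits that haven't passed the vibe check.
# "vibe-check" is a placeholder for however you invoke the /vibe skill.
set -euo pipefail

# Nothing staged, nothing to check.
if git diff --cached --quiet; then
  exit 0
fi

# Run the semantic check against the staged diff; a failure blocks the commit.
if ! git diff --cached | vibe-check --stdin; then
  echo "vibe check failed: fix the issues or rerun /vibe before committing." >&2
  exit 1
fi
```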

3. /retro: Extract learnings to compound knowledge

"What did we learn that makes the next session better?"

Closes the loop. Extracts learnings, feeds the flywheel. Every session makes the next one better.

This is the postmortem that doesn't require an outage. In infrastructure, we wait for something to break, then write the postmortem. In AgentOps, every session gets a retro, good or bad. The learnings compound. Session 50 is dramatically better than session 1, not because the model improved, but because the operational knowledge did.


Coding Agent Specific

This framework focuses specifically on coding agents, AI assistants that write, modify, and review code:

  • Claude Code running in terminal/IDE
  • AI pair programming sessions
  • Code generation with validation workflows
  • Agents using Read, Edit, Write, Bash for development

We are NOT building:

  • A framework for customer service chatbots
  • A platform for RAG-based Q&A systems
  • An SDK for multi-modal agents
  • A solution for general autonomous production agents

For general autonomous agents, see 12-Factor Agents by Dex Horthy; that's where we point readers who need general agent patterns. This framework stays coding-specific.


The 12 Factors

None of this is new. It's infrastructure patterns applied to a new domain. We already know how to make unreliable things reliable.

Foundation (I-IV)

I. Automated Tracking: Track everything in git.

Infrastructure parallel: Infrastructure as Code.

Every agent action, every decision, every output, versioned and traceable. Not in a chat log that disappears. In git, where it's searchable, diffable, and survives the session.

I track agent work through Beads, a git-backed issue tracker where every issue is a markdown file in the repo. When an agent closes an issue, the evidence lives in the commit. When it claims something works, I can git log the proof. Across 31 repos, this is the difference between "I think that agent did something" and "here's exactly what happened."
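
If commits reference the issue ID in their messages (the `bd-142` format below is illustrative, not necessarily Beads' convention), pulling that proof is one command:

```bash
# Show every commit, with its diff stat, that references a given issue ID.
# The "bd-142" ID format is illustrative; adapt the pattern to your tracker.
issue_id="bd-142"
git log --all --stat --grep="${issue_id}"

# Or confirm the issue file itself changed when the agent claimed completion.
git log --oneline -- "issues/${issue_id}.md"
```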

II. Context Loading: Stay under 40% context.

Infrastructure parallel: Memory management.

You don't run a server at 95% RAM utilization. You don't pack a container until the OOM killer shows up. Same rule applies to agent context windows.

Load documentation just-in-time, not up front. Compress aggressively. Start fresh with each workflow phase. The 40% threshold is where I've consistently seen the quality cliff. Not gradual degradation, a cliff. More on this below.

III. Focused Agents: One agent, one job.

Infrastructure parallel: Microservices.

Stop asking one agent to do everything. A raid team doesn't send the healer to tank the boss. You have a tank, DPS, and healers because specialization works. Same with agents.

In my multi-agent setup (Gas Town), each worker (called a "polecat") gets one issue. Not a feature. Not an epic. One issue. Write the JWT validation middleware. That's it. The polecat doesn't know about the rest of the auth system. It doesn't need to. Focused agents produce better code because they stay within the context they can actually reason about.
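
A rough sketch of the dispatch idea, not Gas Town's actual interface: `run_worker` is a hypothetical wrapper that hands one agent exactly one issue and nothing else.

```bash
#!/usr/bin/env bash
# One agent, one job: each worker is launched with a single issue file as its
# entire brief. run_worker is a hypothetical wrapper around your agent CLI.
set -euo pipefail

for issue in issues/ready/*.md; do
  # The worker sees this one issue -- not the epic, not the rest of the plan.
  run_worker --prompt-file "${issue}" --workdir "$(mktemp -d)" &
done
wait
```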

IV. Continuous Validation: Check at every step.

Infrastructure parallel: CI/CD pipelines.

Don't trust the agent. Verify at every step. Not at the end. At every step.

This is the pipeline: /pre-mortem before work starts, /vibe before every commit, automated tests after every change. Three gates. If the agent claims it's done and any gate fails, it's not done. Build passes are not delivery. I've had agents produce code that compiles, passes tests, and is completely wrong. The vibe check catches semantic correctness that tests miss.
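
Here's a sketch of the three gates as one wrapper script; `premortem-check` and `vibe-check` are placeholders for whatever implements each gate in your setup.

```bash
#!/usr/bin/env bash
# Three gates in order; any failure stops the run. "Done" means all three passed.
# premortem-check and vibe-check are placeholders for your own gate implementations.
set -euo pipefail

premortem-check plan.md             # Gate 1: simulate failures before work starts

# ... the agent implements the change here ...

git diff | vibe-check --stdin       # Gate 2: does the diff match the stated intent?
make test                           # Gate 3: automated tests after the change
echo "all gates passed"
```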

Operations (V-VIII)

V. Measure Everything: Observe agent behavior.

Infrastructure parallel: Prometheus/Grafana.

You can't improve what you can't measure. Track completion rates, rework rates, context utilization, cost per issue. Without metrics, you're vibing in the dark.

The AgentOps flywheel (ao) tracks session quality over time. Which prompts produce clean first-pass code? Which patterns cause rework? After hundreds of sessions, the data tells you things intuition misses, like which factor of complexity causes the sharpest quality dropoff.
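
Even without the flywheel tooling, a crude proxy for one of these metrics falls straight out of git history. The fix/revert heuristic below is an approximation, not the ao metric:

```bash
# Rough rework-rate proxy: what fraction of recent commits are fixes or reverts?
# The "fix|revert" grep is a crude heuristic, not the flywheel's actual metric.
total=$(git rev-list --count HEAD --since="30 days ago")
rework=$(git log --since="30 days ago" --oneline -i -E --grep='fix|revert' | wc -l)
echo "rework rate: ${rework}/${total} commits in the last 30 days"
```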

VI. Resume Work: Save state, pick up later.

Infrastructure parallel: Persistent volumes.

Sessions crash. Context fills up. Machines restart. If your agent workflow can't survive an interruption, it's a toy.

Every piece of in-progress work lives in the issue tracker, not in the agent's head. When a session dies (and they do), the next session reads the hook, sees the state, and continues. No re-explaining. No context reconstruction. The work survives the worker. This is the persistent volume for agent state.
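
One minimal form of this is a state file that lives in the repo and gets committed with the work. The path and fields below are illustrative, not a Beads convention:

```bash
# Persist in-progress state in the repo, not in the agent's head.
# The .agent/state path, the bd-142 ID, and the fields are all illustrative.
mkdir -p .agent/state
cat > .agent/state/bd-142.md <<'EOF'
status: in-progress
last-step: JWT middleware written, tests not yet run
next-step: run make test, then /vibe before committing
EOF
git add .agent/state/bd-142.md
git commit -m "checkpoint: bd-142 in progress"
```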

VII. Smart Routing: Send to right specialist.

Infrastructure parallel: Load balancing.

Not every task needs your most expensive model. Routing simple linting to Opus is like routing health checks through your WAF. Match the task to the capability.

In practice, I use different configurations for different work. Research spikes get the full context treatment. Mechanical refactors get lightweight agents with tight prompts. The routing decision is part of the dispatch, made at planning time, not runtime.
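
The decision itself can be a dumb lookup made at dispatch time. The tier names below are placeholders, not a recommendation of specific models:

```bash
# Dispatch-time routing: match the task type to a capability tier.
# Tier names are placeholders; map them to whatever models/configs you run.
route_task() {
  case "$1" in
    lint|format|rename)    echo "tier=small  context=minimal" ;;
    refactor|bugfix)       echo "tier=medium context=module"  ;;
    research|architecture) echo "tier=large  context=full"    ;;
    *)                     echo "tier=medium context=module"  ;;
  esac
}

route_task "lint"          # -> tier=small  context=minimal
route_task "architecture"  # -> tier=large  context=full
```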

VIII. Human Validation: Humans approve critical steps.

Infrastructure parallel: Change management.

Agents don't get to merge to main unsupervised. Period.

This is the change advisory board for your codebase. Automated gates catch the mechanical failures. Human review catches the "this works but it's the wrong approach" failures. In air-gapped DoD environments, there's no "move fast and break things." Every change gets human eyes. That discipline transfers directly to agent workflows.
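
Server-side branch protection is the real enforcement; a local pre-push hook like the sketch below is a cheap backstop that keeps an agent (or a tired human) from pushing to main directly.

```bash
#!/usr/bin/env bash
# .git/hooks/pre-push -- refuse direct pushes to main; changes arrive via reviewed PRs.
# This is a local backstop; branch protection on the remote should enforce it too.
while read -r local_ref local_sha remote_ref remote_sha; do
  if [ "${remote_ref}" = "refs/heads/main" ]; then
    echo "Direct push to main blocked: open a PR and get human review." >&2
    exit 1
  fi
done
```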

Improvement (IX-XII)

IX. Mine Patterns: Extract learnings.

Infrastructure parallel: Postmortems.

Every session contains signal. Most people throw it away. The retro extracts it. The flywheel stores it. The next session loads it.

I have 35+ extracted learnings from agent sessions. Things like "workers write files, team lead commits (prevents git index corruption)" and "pre-identify shared files during planning (merge issues or assign same worker)." These aren't theoretical. They're bugs I hit, extracted into rules that prevent the next team from hitting them.
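
The mechanics can be as simple as a dated append to a learnings file that every future session loads. `LEARNINGS.md` below is an illustrative name, not a required convention:

```bash
# Append a dated learning to the file that future sessions load.
# LEARNINGS.md is an illustrative name, not a required convention.
add_learning() {
  printf -- "- %s: %s\n" "$(date +%F)" "$1" >> LEARNINGS.md
}

add_learning "workers write files, team lead commits (prevents git index corruption)"
git add LEARNINGS.md && git commit -m "retro: capture learning"
```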

X. Small Iterations: Continuous improvement.

Infrastructure parallel: Kaizen.

Big changes break things. Small changes compound. Ship a factor at a time. Measure. Adjust. Repeat.

Don't try to adopt all 12 factors in a weekend. Start with tracking (I), context management (II), and focused agents (III). Add validation (IV) when those feel natural. The framework is modular by design, same as the 12-Factor App. You don't need all of them on day one.

XI. Fail-Safe Checks: Prevent repeat mistakes.

Infrastructure parallel: Admission controllers.

Once you know a failure mode, encode it. Make it impossible to repeat. This is the admission webhook for your agent workflow, policy enforcement that doesn't depend on anyone remembering.

Example: agents used to overwrite shared files when running in parallel. Now the dispatch phase marks file ownership per worker. The check is automatic. The bug can't recur. This is how you turn one-time fixes into permanent improvements.
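
A sketch of that check: compare what a worker actually touched against the files it was assigned at dispatch. The ownership-file convention below is hypothetical:

```bash
#!/usr/bin/env bash
# Admission-style check: a worker may only touch files it was assigned at dispatch.
# .agent/ownership/<worker>.txt (one path per line) is a hypothetical convention.
set -euo pipefail
worker="$1"

touched=$(git diff --name-only HEAD~1 HEAD)
while IFS= read -r file; do
  if [ -n "${file}" ] && ! grep -qxF "${file}" ".agent/ownership/${worker}.txt"; then
    echo "BLOCKED: ${worker} modified unowned file: ${file}" >&2
    exit 1
  fi
done <<< "${touched}"
```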

XII. Package Patterns: Bundle what works.

Infrastructure parallel: Helm charts.

When a workflow works, package it. Make it repeatable. Make it shareable.

In Gas Town, these are "formulas," workflow templates that encode proven approaches. Need to add a new API endpoint with tests? There's a formula for that. Need to refactor a module with backward compatibility? Formula. The packaging is what turns individual skill into organizational capability. One person figures it out, everyone benefits.
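
As a sketch (not Gas Town's actual layout), a formula can be as little as a parameterized template directory stamped out per task, here with `envsubst` from GNU gettext:

```bash
# Stamp out a workflow from a packaged template ("formula").
# The formulas/ layout and ENDPOINT_NAME variable are illustrative.
export ENDPOINT_NAME="users"
mkdir -p "work/add-endpoint-${ENDPOINT_NAME}"
for tmpl in formulas/add-api-endpoint/*.md; do
  envsubst < "${tmpl}" > "work/add-endpoint-${ENDPOINT_NAME}/$(basename "${tmpl}")"
done
```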

> TIP: Start with Factors I-III. Add others as you scale. You don't need all 12 on day one.


The 40% Rule

Both humans and AI fall off a cliff when overloaded. Not gradual decline. A cliff.

For AI agents, the threshold is around 40% of their context window. Beyond that, hallucinations spike and reasoning degrades. Chroma's research confirmed this isn't paranoia. It's measurable.

In practice, this means:

  • Never exceed 40% context utilization in a single workflow phase
  • Load documentation just-in-time (JIT) rather than pre-loading everything
  • Compress information aggressively before feeding it to agents
  • Start fresh with each new workflow phase to maintain peak performance

The parallel to infrastructure: you don't run servers at 95% CPU utilization. You don't stuff a pod until the OOM killer intervenes. Same principle, different domain.
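
A back-of-the-envelope version of the budget check, assuming a 200k-token window and the rough chars/4 heuristic rather than a real tokenizer:

```bash
# Back-of-the-envelope budget check before loading a document into a session.
# chars/4 is a crude token estimate, not a tokenizer; 200000 is an assumed window size.
context_window=200000
budget=$(( context_window * 40 / 100 ))

doc="docs/architecture.md"                      # illustrative path
doc_tokens=$(( $(wc -c < "${doc}") / 4 ))

if [ "${doc_tokens}" -gt "${budget}" ]; then
  echo "~${doc_tokens} tokens > ${budget} budget: summarize or load sections just-in-time"
fi
```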

I've watched agents produce perfect code at 30% context, then produce confident garbage at 60%. The failure isn't obvious. The code still looks right. It compiles. It might even pass some tests. But the logic is wrong in ways that take longer to debug than rewriting from scratch. The 40% rule exists because I've been on the wrong side of that cliff too many times.


The Origin

These factors came from building, not theorizing. 7,413 commits across 31 repos in 4 months. 50+ production AI applications in environments where failure means more than a Slack alert. 100+ GPUs in air-gapped networks where you can't just "roll it back."

The infrastructure patterns (validation gates, context management, pattern extraction) were already in my muscle memory from years of K8s operations. The same operational philosophy that Gene Kim described in The Phoenix Project (flow, feedback, continuous learning) applies directly to coding agents.

The toolchain that emerged:

  • Gas Town (gt): Multi-agent orchestration. The control plane.
  • Beads (bd): Git-backed issue tracking. The state store.
  • AgentOps (ao): Knowledge flywheel. The observability layer.

Each tool maps to infrastructure I've operated before. Gas Town is the Kubernetes. Beads is the etcd. AgentOps is the Prometheus. Different domain, same architecture.


The Takeaway

We didn't make infrastructure reliable by making servers better. We made it reliable by building operational practices around unreliable components.

Coding agents are the new unreliable component. The solution is the same: operational discipline, shifted left.

The 12 factors aren't aspirational. They're extracted from production use, from sessions that failed and sessions that shipped, from days of burned tokens and from weeks where the flywheel hummed. Every factor exists because its absence caused a specific, concrete failure.


Try It

```bash
# Clone the framework
git clone https://github.com/boshu2/12-factor-agentops

# Start with the quick-load summary (AI sessions)
cat docs/00-SUMMARY.md

# Or dive into specific factors
cat factors/02-context-loading.md        # The 40% rule
cat factors/04-continuous-validation.md  # Validation gates
```

Links: 12factoragentops.com · GitHub · Gene Kim's Vibe Coding · Original 12-Factor App · 12-Factor Agents (general agents)

The 12 factors aren't theory. They're what works when you stop treating coding agents like magic and start treating them like infrastructure. Shift-left validation. Catch it before you ship it.