Getting Started with Vibe Coding: An AgentOps Quickstart
The principles that make AI-assisted coding reliable: bookkeeping, context, validation gates, the knowledge flywheel, and the first command to run.
This essay is part of the reliable AI-assisted delivery trail: proof, method, and judgment for making fast AI work reviewable and safe to ship. Start with the curated writing paths or inspect the proof.
Vibe coding is the entry. AgentOps is how it stays reliable across more than one session. This article is the principle-level quickstart; the full operating loop is at /workflow.
Andrej Karpathy named it: let the AI write code while you direct. That works for a one-off. It falls apart by session three.
The failures are operational. Agents forget what they tried. Plans pass review and ship bugs anyway. Lessons evaporate between sessions. Longer prompts make this worse past about 40% context utilization. Generation is cheap; proving the output is correct, safe, and worth shipping is the scarce part. That gap is what these practices close. What helps: fewer tokens, each paid for by intent, evidence, or a constraint that changes the next run.
AgentOps applies the practices software teams already trust (XP, BDD, DDD, TDD, CI/CD) to the workflow your agents actually run. It keeps a file-backed record of what happened, so the next session starts loaded.
The Four Layers
Each layer solves a different failure mode. All four compound.
| Layer | Failure mode | What changes |
|---|---|---|
| Bookkeeping | Agents forget what they tried, why they changed course, what mattered | .agents/ captures findings, decisions, verdicts, retros, and post-mortems. The work leaves a trace. |
| Context Compiler | Every session starts from zero | ao context assemble builds phase-scoped packets. ao lookup retrieves decay-ranked knowledge on demand. The agent starts loaded. |
| Validation Gates | Agents ship confident garbage | /pre-mortem, /vibe, /council: multi-model consensus blocks plans and code before you ship them. |
| Knowledge Flywheel | Lessons disappear between sessions | /forge extracts learnings from the bookkeeping trail. /evolve fixes the worst gap. Session 15 starts with everything session 1 learned. |
All state lives in local .agents/ as plain text you can grep, diff, and review. No hosted control plane. Runtime-neutral across Claude Code, Codex CLI, Cursor, and OpenCode.
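To make "decay-ranked" concrete, here is a minimal sketch of one way such a ranking could work. It is not the actual `ao lookup` implementation: the `Learning` record, the `relevance` score, and the 30-day half-life are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Learning:
    slug: str
    relevance: float  # hypothetical match score in [0, 1]
    age_days: float   # days since the learning was last confirmed


def decay_score(item: Learning, half_life_days: float = 30.0) -> float:
    """Relevance discounted by age: the score halves every half_life_days."""
    return item.relevance * math.pow(0.5, item.age_days / half_life_days)


def rank(items: list[Learning], limit: int = 3) -> list[Learning]:
    """Return the top `limit` items by decayed score, best first."""
    return sorted(items, key=decay_score, reverse=True)[:limit]


# Illustrative corpus; the slugs and numbers are made up.
corpus = [
    Learning("retry-needs-jitter", relevance=0.9, age_days=90),
    Learning("login-rate-limit", relevance=0.6, age_days=3),
    Learning("old-build-flag", relevance=0.8, age_days=365),
]
for item in rank(corpus):
    print(f"{decay_score(item):.4f}  {item.slug}")
```

With these numbers the recent, moderately relevant learning outranks the older, more relevant ones, which is the property a decay ranking is meant to provide.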
The Operating Loop
One repeatable loop. Every skill is one move inside it. No artifact exists unless it advances the loop.
BDD-shaped intent
→ vertical slices (each one a behavior, not a layer)
→ TDD per slice (first failing test, then implementation)
→ conflict-free parallel wave (only if write scopes don't collide)
→ bead closed when acceptance examples pass
→ evidence + learning captured under the promotion ratchet
A few rules carry the rest:
- Behavior is the unit of work, not a layer. A slice cuts vertically through whatever layers it needs to demonstrate one Given/When/Then.
- The first failing test is the contract. Code without a failing test has no acceptance surface; the agent can't know when it's done.
- Parallelism is explicit ownership. Default to sequential. Run a wave only when the write scopes are provably disjoint.
- Context crosses boundaries as artifacts, not as accumulated chat.
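The "write scopes don't collide" rule can be checked mechanically before a wave starts. The sketch below assumes each slice declares the paths it may write to; the slice names, the paths, and the prefix-based overlap test are hypothetical, and a path check like this is only a proxy for true disjointness.

```python
from itertools import combinations
from pathlib import PurePosixPath


def overlaps(a: str, b: str) -> bool:
    """True if the paths are equal or one is a parent directory of the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def colliding_slices(scopes: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return every pair of slices whose declared write scopes overlap."""
    collisions = []
    for (name_a, paths_a), (name_b, paths_b) in combinations(scopes.items(), 2):
        if any(overlaps(a, b) for a in paths_a for b in paths_b):
            collisions.append((name_a, name_b))
    return collisions


# Hypothetical wave: two slices both want to write under api/auth/.
wave = {
    "search-endpoint": {"api/search.py", "tests/test_search.py"},
    "login-throttle": {"api/auth/"},
    "auth-cleanup": {"api/auth/session.py"},
}
collisions = colliding_slices(wave)
print("run in parallel" if not collisions else f"run sequentially: {collisions}")
```

Defaulting to sequential falls out naturally: any non-empty collision list means the wave does not run.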
Your First Cycle
Three steps. About fifteen minutes.
1. Install
# Claude Code
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace
# Codex CLI (macOS / Linux / WSL)
curl -fsSL https://raw.githubusercontent.com/boshu2/agentops/main/scripts/install-codex.sh | bash
Then install the ao CLI for repo seeding and health checks:
brew tap boshu2/agentops https://github.com/boshu2/homebrew-agentops
brew install agentops
ao doctor
ao doctor is the canonical health check. Non-zero exit means a real problem.
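Because the check reports through its exit code, it is easy to wire into a script or CI step. A minimal sketch follows; the `is_healthy` helper is hypothetical, and the only behavior it relies on is the one stated above, that a non-zero exit means a real problem.

```python
import subprocess
import sys


def is_healthy(command: list[str]) -> bool:
    """Run a health-check command and treat a zero exit code as healthy."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        # The binary is not installed or not on PATH.
        print(f"{command[0]} not found on PATH", file=sys.stderr)
        return False
    if result.returncode != 0:
        # Surface whatever the tool printed so the failure is reviewable.
        print(result.stdout + result.stderr, file=sys.stderr)
    return result.returncode == 0
```

Calling `is_healthy(["ao", "doctor"])` before a session starts gives a single boolean to gate on.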
2. Seed the repo
cd <your-repo>
ao quickstart
This creates .agents/ and prints the single next action for your state. Re-runnable. Idempotent.
3. Run one full loop
/rpi "a small goal"
Pick something you'd normally finish in 30 to 60 minutes: one endpoint, one component, one bug. /rpi runs the six phases automatically:
/research → /plan → /pre-mortem → /crank → /vibe → /retro
Each phase leaves a file in .agents/. When it's done, ls .agents/runs/ shows the receipt.
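If you want the same receipt check from a script, a small helper can list the newest entries. This is a sketch that assumes only what is stated above, that receipts land under `.agents/runs/`; it makes no assumption about file names or contents, and `latest_receipts` is a hypothetical helper.

```python
from pathlib import Path


def latest_receipts(repo_root: str, limit: int = 5) -> list[str]:
    """Names of the most recently modified entries in .agents/runs/, newest first."""
    runs_dir = Path(repo_root) / ".agents" / "runs"
    if not runs_dir.is_dir():
        return []  # no loop has completed in this repo yet
    entries = sorted(runs_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    return [entry.name for entry in entries[:limit]]


print(latest_receipts("."))
```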
If you'd rather drive each phase by hand, run them individually. Same discipline, more steering.
Behavioral Discipline
The skills enforce four habits.
| Habit | What it prevents |
|---|---|
| Think before coding | Hidden assumptions, silent confusion, wrong interpretation |
| Simplicity first | Speculative flexibility, bloated abstractions, oversized patches |
| Surgical changes | Drive-by refactors, unrelated edits, noisy diffs |
| Goal-driven execution | Weak verification, "looks done" changes, proof by assertion |
A concrete example. User says: "Make search faster." The default agent picks a meaning (latency? throughput? perceived?), adds caching, and ships a larger patch than the question justified. The disciplined agent asks which metric, picks the smallest change that moves it, and verifies against the metric that actually mattered.
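Verifying "against the metric that actually mattered" might look like the sketch below. It assumes the user answered "p95 latency" and that a 10% improvement is the agreed goal; both are hypothetical choices for illustration, as are the helper names.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    """95th percentile: the last of the 19 cut points that split data into 20 groups."""
    return statistics.quantiles(samples_ms, n=20)[-1]


def meets_goal(before_ms: list[float], after_ms: list[float],
               min_improvement: float = 0.10) -> bool:
    """True if p95 latency dropped by at least `min_improvement` (a fraction)."""
    return p95(after_ms) <= p95(before_ms) * (1 - min_improvement)
```

The point is that the acceptance check is written down before the change, so "looks faster" cannot stand in for proof.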
The Promotion Ratchet
Most observations are noise. The ratchet is the rule that decides what survives.
| Trigger | Goes to |
|---|---|
| Noticed once | Stays in the session handoff. Dies when the handoff ages out. |
| Repeats twice across sessions | .agents/learnings/<slug>.md |
| Changes how a skill or plan runs | .agents/planning-rules/ or the relevant skill |
| Changes the contract | A test, a gate, or a hook |
Promotion is the mechanism that keeps .agents/ lean. Most observations die at the first row, so the surviving corpus stays small, dense, and load-bearing.
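The table reads naturally as a small decision function. The sketch below is one possible encoding, not the tool's own logic: the `Observation` fields are hypothetical, and checking the strongest trigger first is an assumption, since the table does not say which row wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    slug: str
    sessions_seen: int = 1               # distinct sessions where it showed up
    changes_skill_or_plan: bool = False  # would alter how a skill or plan runs
    changes_contract: bool = False       # would alter what "done" means


def promote(obs: Observation) -> str:
    """Map an observation to its destination, strongest trigger first (assumed order)."""
    if obs.changes_contract:
        return "a test, a gate, or a hook"
    if obs.changes_skill_or_plan:
        return ".agents/planning-rules/ or the relevant skill"
    if obs.sessions_seen >= 2:
        return f".agents/learnings/{obs.slug}.md"
    return "session handoff"


# Made-up observations, one per row of the table.
for obs in [
    Observation("flaky-test-once"),
    Observation("retry-needs-jitter", sessions_seen=2),
    Observation("plan-must-list-write-scopes", changes_skill_or_plan=True),
    Observation("login-must-rate-limit", changes_contract=True),
]:
    print(f"{obs.slug} -> {promote(obs)}")
```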
Validate Before You Ship
Three skills cover the failure modes a single agent misses.
/pre-mortem: simulate failures before implementing. Runs before code exists. Catches things like "this CRD breaks backward compatibility" or "the reconciler infinite-loops if the finalizer isn't idempotent."
/vibe: quick sanity check on recent work. Use before every commit on anything non-trivial.
/council: multi-judge consensus when stakes are high. Mix Claude and Codex judges; record the verdict to .agents/council/. Fresh context per judge catches what one agent can't.
> /council --mixed validate this PR
[council] sealed evidence packet → 6 judges across Claude Code and Codex CLI
[claude/judge-1] WARN, rate limiting missing on /login
[codex/judge-1] WARN, token bucket refill lacks jitter under burst
Consensus: WARN, fix /login rate limit and add refill jitter before shipping
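How individual verdicts roll up into a consensus is not spelled out here, so the following is only a sketch under a stated assumption: the most severe verdict wins, on a PASS < WARN < FAIL scale. Only WARN appears in the transcript above; the other two levels and the `Verdict` type are hypothetical.

```python
from dataclasses import dataclass

# Assumed severity ordering; only WARN appears in the transcript above.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


@dataclass
class Verdict:
    judge: str
    level: str
    finding: str


def consensus(verdicts: list[Verdict]) -> tuple[str, list[str]]:
    """Most severe level wins; return it with every non-PASS finding."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    level = max((v.level for v in verdicts), key=SEVERITY.__getitem__)
    findings = [v.finding for v in verdicts if v.level != "PASS"]
    return level, findings


level, findings = consensus([
    Verdict("claude/judge-1", "WARN", "rate limiting missing on /login"),
    Verdict("codex/judge-1", "WARN", "token bucket refill lacks jitter under burst"),
])
print(f"Consensus: {level}; " + "; ".join(findings))
```

A worst-verdict rule is deliberately conservative: one judge with fresh context can block a change the others waved through.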
Common Mistakes
I made all of these.
1. Treating .agents/ as documentation. It's an executable context library that the next session reads. If you find yourself hand-writing files into it, the skills aren't doing their job; let them produce the files and review the diffs.
2. Running /crank on a plan you didn't pre-mortem. Parallel execution amplifies bad plans. Spend the cheap minute on /pre-mortem before you spend the expensive hour on cleanup.
3. Letting learnings file themselves. Apply the promotion ratchet above every time something feels notable. Skip the gate and the learnings folder fills with half-thoughts nobody re-reads.
4. Skipping the first failing test. Without it, the slice has no acceptance surface. The agent will declare victory on code that doesn't compile.
5. One giant prompt. As noted up top, reliability degrades once context utilization passes about 40%. Break work into slices with clean boundaries; let the compiler hand each slice the bounded packet it needs.
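For the last item, the roughly 40% figure from the introduction can be turned into a simple budget check. This is a sketch: the helper names and the 200,000-token window in the example are hypothetical, and how tokens are counted depends on the model you run.

```python
def utilization(tokens_used: int, window_tokens: int) -> float:
    """Fraction of the context window currently in use."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return tokens_used / window_tokens


def over_budget(tokens_used: int, window_tokens: int, threshold: float = 0.40) -> bool:
    """True once usage passes the threshold, the signal to split into smaller slices."""
    return utilization(tokens_used, window_tokens) > threshold


# Hypothetical numbers: 90,000 tokens used in a 200,000-token window is 45%.
print(over_budget(90_000, 200_000))
```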
What's Next
| When you're comfortable with… | Try… |
|---|---|
| One /rpi cycle | Multi-session work with bd (beads) issue tracking |
| Hand-driven phases | /swarm for wave-based parallel execution |
| Single-agent loops | /council --mixed for multi-model validation |
| Reactive sessions | ao daemon and a nightly /dream so the corpus compounds while you sleep |
The full method is at /workflow. The doctrine, including why this works the way it does, is at 12-Factor AgentOps.
Try It
# Install
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace
brew install boshu2/agentops/agentops
# Seed
cd <your-repo>
ao quickstart
# Run
/rpi "a small goal"
# Inspect the receipt
ls .agents/runs/
The first session won't tell you much. By the third, the corpus starts pulling its weight. By the tenth, the agent will surface a prior decision you'd already forgotten.
The goal is a repo that remembers, so each session is harder for the agent to get wrong than the one before.
This is the engineering edge of a larger project: go to the AI frontier, learn what actually works, then translate it into safe practice for people who aren't engineers. This essay is the engineer-facing half; the translation is the rest.
Related
- 12-Factor AgentOps: The doctrine. DevOps for vibe coding.
- The Knowledge Flywheel: How sessions teach each other over time.
- Building This Website: the discipline applied to a real Next.js build. 48 hours to production, with the bookkeeping to prove it.
- AI Partner: the same discipline, translated for people who don't write code.