
Getting Started with Vibe Coding: An AgentOps Quickstart

The principles that make AI-assisted coding reliable: bookkeeping, context, validation gates, the knowledge flywheel, and the first command to run.

May 20, 2026 · 8 min read
#ai-development #agentops #vibe-coding #developer-tools #tutorial #workflow

This essay is part of the reliable AI-assisted delivery trail: proof, method, and judgment for making fast AI work reviewable and safe to ship. Start with the curated writing paths or inspect the proof.

Vibe coding is the entry point. AgentOps is how it stays reliable across more than one session. This article is the principle-level quickstart; the full operating loop is at /workflow.

Andrej Karpathy named it: let the AI write code while you direct. That works for a one-off. It falls apart by session three.

The failures are operational. Agents forget what they tried. Plans pass review and ship bugs anyway. Lessons evaporate between sessions. Longer prompts make this worse past about 40% context utilization. Generation is cheap; proving the output is correct, safe, and worth shipping is the scarce part. That gap is what these practices close. What helps: fewer tokens, each paid for by intent, evidence, or a constraint that changes the next run.

AgentOps applies the practices software teams already trust (XP, BDD, DDD, TDD, CI/CD) to the workflow your agents actually run. It keeps a file-backed record of what happened, so the next session starts loaded.


The Four Layers

Each layer solves a different failure mode. All four compound.

| Layer | Failure mode | What changes |
| --- | --- | --- |
| Bookkeeping | Agents forget what they tried, why they changed course, what mattered | .agents/ captures findings, decisions, verdicts, retros, and post-mortems. The work leaves a trace. |
| Context Compiler | Every session starts from zero | ao context assemble builds phase-scoped packets. ao lookup retrieves decay-ranked knowledge on demand. The agent starts loaded. |
| Validation Gates | Agents ship confident garbage | /pre-mortem, /vibe, /council: multi-model consensus blocks plans and code before you ship them. |
| Knowledge Flywheel | Lessons disappear between sessions | /forge extracts learnings from the bookkeeping trail. /evolve fixes the worst gap. Session 15 starts with everything session 1 learned. |

All state lives in local .agents/ as plain text you can grep, diff, and review. No hosted control plane. Runtime-neutral across Claude Code, Codex CLI, Cursor, and OpenCode.
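Because the state is plain text, ordinary tooling can search it. Here is a minimal sketch, assuming only that .agents/ is a directory of text files. The function name search_agents is hypothetical and not part of the ao CLI.

```python
from pathlib import Path


def search_agents(root: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each case-insensitive match."""
    base = Path(root)
    hits: list[tuple[str, int, str]] = []
    if not base.is_dir():
        return hits
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle.lower() in line.lower():
                hits.append((path.relative_to(base).as_posix(), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical query; prints nothing if .agents/ does not exist yet.
    for rel, number, line in search_agents(".agents", "rate limit"):
        print(f"{rel}:{number}: {line}")
```

This is the same thing grep -rni does. The point is that nothing about the store needs a special client.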


The Operating Loop

One repeatable loop. Every skill is one move inside it. No artifact exists unless it advances the loop.

BDD-shaped intent
  → vertical slices (each one a behavior, not a layer)
  → TDD per slice (first failing test, then implementation)
  → conflict-free parallel wave (only if write scopes don't collide)
  → bead closed when acceptance examples pass
  → evidence + learning captured under the promotion ratchet
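
The "TDD per slice" step is easier to see with one acceptance example. This sketch assumes a hypothetical slice, "search matches titles regardless of case"; search_titles and the test class are illustrative names, not anything AgentOps ships. In practice the test is written and seen to fail before the implementation exists. Both appear together here only so the block runs as written.

```python
import unittest


def search_titles(titles: list[str], query: str) -> list[str]:
    """Smallest implementation that satisfies the acceptance example below."""
    wanted = query.casefold()
    return [title for title in titles if wanted in title.casefold()]


class SearchSlice(unittest.TestCase):
    def test_query_matches_regardless_of_case(self):
        # Given a catalogue with mixed-case titles
        titles = ["Rate Limiting", "Token Bucket", "rate cards"]
        # When the user searches in lowercase
        result = search_titles(titles, "rate")
        # Then both matching titles come back, in catalogue order
        self.assertEqual(result, ["Rate Limiting", "rate cards"])


def run_slice() -> bool:
    """Run the slice's acceptance test and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SearchSlice)
    outcome = unittest.TextTestRunner(verbosity=0).run(suite)
    return outcome.wasSuccessful()
```

The slice closes when run_slice() returns True, which gives the agent a concrete stopping condition instead of a judgment call.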

A few rules carry the rest:

  • Behavior is the unit of work, not a layer. A slice cuts vertically through whatever layers it needs to demonstrate one Given/When/Then.
  • The first failing test is the contract. Code without a failing test has no acceptance surface; the agent can't know when it's done.
  • Parallelism is explicit ownership. Default to sequential. Run a wave only when the write scopes are provably disjoint.
  • Context crosses boundaries as artifacts, not as accumulated chat.
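
The parallelism rule above can be checked mechanically. This is a minimal sketch, assuming each worker declares its write scope as a list of repo-relative paths. The names scopes_collide and wave_is_safe are hypothetical, and this is not how /swarm itself decides.

```python
from itertools import combinations
from pathlib import PurePosixPath


def scopes_collide(a: str, b: str) -> bool:
    """Two paths collide when they are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def wave_is_safe(write_scopes: dict[str, list[str]]) -> bool:
    """True only if no path owned by one worker overlaps a path owned by another."""
    for (_, left), (_, right) in combinations(write_scopes.items(), 2):
        if any(scopes_collide(x, y) for x in left for y in right):
            return False
    return True
```

A wave where one worker owns src/api and another owns src/api/login.py is rejected, because the directory contains the file. When the check fails, fall back to sequential.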

Your First Cycle

Three steps. About fifteen minutes.

1. Install

# Claude Code
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace

# Codex CLI (macOS / Linux / WSL)
curl -fsSL https://raw.githubusercontent.com/boshu2/agentops/main/scripts/install-codex.sh | bash

Then install the ao CLI for repo seeding and health checks:

brew tap boshu2/agentops https://github.com/boshu2/homebrew-agentops
brew install agentops
ao doctor

ao doctor is the canonical health check. Non-zero exit means a real problem.
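
Since the exit code is the signal, a script or CI step can gate on it. This sketch assumes nothing about ao doctor beyond the non-zero-on-failure behaviour described above; health_check_passes is a hypothetical helper name.

```python
import subprocess


def health_check_passes(command: list[str]) -> bool:
    """Run a health-check command and treat any non-zero exit as failure."""
    try:
        completed = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        return False  # the tool is not installed or not on PATH
    return completed.returncode == 0
```

Call it as health_check_passes(["ao", "doctor"]) and stop the pipeline when it returns False.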

2. Seed the repo

cd <your-repo>
ao quickstart

This creates .agents/ and prints the single next action for your state. Re-runnable. Idempotent.

3. Run one full loop

/rpi "a small goal"

Pick something you'd normally finish in 30 to 60 minutes: one endpoint, one component, one bug. /rpi runs the six phases automatically:

/research → /plan → /pre-mortem → /crank → /vibe → /retro

Each phase leaves a file in .agents/. When it's done, ls .agents/runs/ shows the receipt.
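
To inspect receipts from a script rather than the shell, a short sketch follows. It assumes only that .agents/runs/ is a directory; the layout inside it is not specified here, so the helper (hypothetically named list_receipts) lists entry names without interpreting them.

```python
from pathlib import Path


def list_receipts(runs_dir: str = ".agents/runs") -> list[str]:
    """Return entry names under the runs directory, newest first by mtime."""
    base = Path(runs_dir)
    if not base.is_dir():
        return []
    entries = sorted(base.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    return [entry.name for entry in entries]
```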

If you'd rather drive each phase by hand, run them individually. Same discipline, more steering.


Behavioral Discipline

The skills enforce four habits.

| Habit | What it prevents |
| --- | --- |
| Think before coding | Hidden assumptions, silent confusion, wrong interpretation |
| Simplicity first | Speculative flexibility, bloated abstractions, oversized patches |
| Surgical changes | Drive-by refactors, unrelated edits, noisy diffs |
| Goal-driven execution | Weak verification, "looks done" changes, proof by assertion |

A concrete example. User says: "Make search faster." The default agent picks a meaning (latency? throughput? perceived?), adds caching, and ships a larger patch than the question justified. The disciplined agent asks which metric, picks the smallest change that moves it, and verifies against the metric that actually mattered.
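
"Verifies against the metric" can be as small as the sketch below, which assumes the agreed metric is median latency of a single operation. The helper names and the 10% minimum gain are illustrative assumptions, not AgentOps defaults.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(operation: Callable[[], object], runs: int = 20) -> float:
    """Median wall-clock time of `operation` in milliseconds over `runs` calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        started = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - started) * 1000.0)
    return statistics.median(samples)


def improved(before_ms: float, after_ms: float, min_gain: float = 0.10) -> bool:
    """True when the change cut latency by at least `min_gain` (a fraction)."""
    return after_ms <= before_ms * (1.0 - min_gain)
```

Measure before the change, measure after, and accept the patch only if improved() holds. If the metric that mattered was throughput or perceived speed, swap the measurement and keep the gate.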


The Promotion Ratchet

Most observations are noise. The ratchet is the rule that decides what survives.

| Trigger | Goes to |
| --- | --- |
| Noticed once | Stays in the session handoff. Dies when the handoff ages out. |
| Repeats twice across sessions | .agents/learnings/<slug>.md |
| Changes how a skill or plan runs | .agents/planning-rules/ or the relevant skill |
| Changes the contract | A test, a gate, or a hook |

Promotion is the mechanism that keeps .agents/ lean. Most observations die at the first row, so the surviving corpus stays small, dense, and load-bearing.
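
The ratchet reads naturally as a small decision function. This sketch mirrors the table's four rows. The Observation fields and the rule that the strongest trigger wins when several apply are assumptions made for illustration; the table itself does not state a precedence.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    slug: str
    sessions_seen: int = 1            # distinct sessions where it came up
    changes_skill_or_plan: bool = False
    changes_contract: bool = False


def promote(obs: Observation) -> str:
    """Map an observation to its destination, checking the strongest trigger first."""
    if obs.changes_contract:
        return "a test, a gate, or a hook"
    if obs.changes_skill_or_plan:
        return ".agents/planning-rules/ or the relevant skill"
    if obs.sessions_seen >= 2:
        return f".agents/learnings/{obs.slug}.md"
    return "session handoff"
```

An observation seen once with no other trigger falls through to the handoff, which is where most of them are meant to die.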


Validate Before You Ship

Three skills cover the failure modes a single agent misses.

/pre-mortem: simulate failures before implementing. Runs before code exists. Catches things like "this CRD breaks backward compatibility" or "the reconciler infinite-loops if the finalizer isn't idempotent."

/vibe: quick sanity check on recent work. Use before every commit on anything non-trivial.

/council: multi-judge consensus when stakes are high. Mix Claude and Codex judges; record the verdict to .agents/council/. Fresh context per judge catches what one agent can't.

> /council --mixed validate this PR

[council] sealed evidence packet → 6 judges across Claude Code and Codex CLI
[claude/judge-1] WARN, rate limiting missing on /login
[codex/judge-1]  WARN, token bucket refill lacks jitter under burst
Consensus: WARN, fix /login rate limit and add refill jitter before shipping

Common Mistakes

I made all of these.

1. Treating .agents/ as documentation. It's an executable context library that the next session reads. If you find yourself hand-writing files into it, the skills aren't doing their job; let them produce the files and review the diffs.

2. Running /crank on a plan you didn't pre-mortem. Parallel execution amplifies bad plans. Spend the cheap minute on /pre-mortem before you spend the expensive hour on cleanup.

3. Letting learnings file themselves. Apply the promotion ratchet above every time something feels notable. Skip the gate and the learnings folder fills with half-thoughts nobody re-reads.

4. Skipping the first failing test. Without it, the slice has no acceptance surface. The agent will declare victory on code that doesn't compile.

5. One giant prompt. As noted up top, reliability degrades once context utilization passes about 40%. Break work into slices with clean boundaries; let the context compiler hand each slice the bounded packet it needs.
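
The threshold in mistake 5 is simple enough to check before a prompt goes out. This sketch takes the roughly 40% figure from earlier in this article as its default. Where the token count comes from and how large the window is depend on the model and tooling, so both are plain inputs here, and the function names are hypothetical.

```python
def context_utilization(tokens_used: int, window_size: int) -> float:
    """Fraction of the context window already consumed."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return tokens_used / window_size


def over_budget(tokens_used: int, window_size: int, threshold: float = 0.40) -> bool:
    """True when utilization exceeds the threshold and the work should be split."""
    return context_utilization(tokens_used, window_size) > threshold
```

For example, 90,000 tokens in a 200,000-token window is 45% utilization, so over_budget returns True and the slice should be cut smaller.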


What's Next

| When you're comfortable with… | Try… |
| --- | --- |
| One /rpi cycle | Multi-session work with bd (beads) issue tracking |
| Hand-driven phases | /swarm for wave-based parallel execution |
| Single-agent loops | /council --mixed for multi-model validation |
| Reactive sessions | ao daemon and a nightly /dream so the corpus compounds while you sleep |

The full method is at /workflow. The doctrine, including why this works the way it does, is at 12-Factor AgentOps.


Try It

# Install
claude plugin marketplace add boshu2/agentops
claude plugin install agentops@agentops-marketplace
brew install boshu2/agentops/agentops

# Seed
cd <your-repo>
ao quickstart

# Run
/rpi "a small goal"

# Inspect the receipt
ls .agents/runs/

The first session won't tell you much. By the third, the corpus starts pulling its weight. By the tenth, the agent will surface a prior decision you'd already forgotten.

The goal is a repo that remembers, so each session is harder for the agent to get wrong than the one before.

This is the engineering edge of a larger project: go to the AI frontier, learn what actually works, then translate it into safe practice for people who aren't engineers. This essay is the engineer-facing half; the translation is the rest.