Devlog #6: The Product Was Never the Flywheel
I bet twice on a system that gets smarter on its own: first a knowledge corpus, then a log of the gatekeeper's own misses. I instrumented both bets. Both came back unproven. The thing that survived every honest test was the gatekeeper that refuses to call anything done without independent proof.
In devlog 1, I ran 8 agents in parallel and watched them collapse. In devlog 2, I cut 60 agents down to 4. In devlog 3, the tools changed under my feet. In devlog 4, the spec turned out to be the leverage. In devlog 5, the platform absorbed my plumbing and I ended on a question. This is me answering it.
Four months ago I ended devlog 5 with one honest question:
"Does the Knowledge Flywheel hold up under honest metrics, or just under belief?"
I'd been telling everyone the flywheel was the moat. Sessions compound. Each one stands on the shoulders of the last. The model is commodity; the accumulated knowledge is the asset. I wrote a whole essay about it. I believed it.
So I built the instrument and measured it.
The number came back negative.
The test I should have run a year ago
The flywheel claim is testable. It says: an agent with my accumulated knowledge does better than the same agent without it. That's an A/B test. Two arms.
- Control: the agent runs the task cold. No corpus.
- Treatment: the same agent, same task, with the curated knowledge injected.
Then you grade both on the same rubric, with a judge from a different model family so it can't just rubber-stamp its own house style. You isolate the arms so the control can't cheat and read the corpus off disk. You screen for the case where the task is so easy the corpus can't possibly help.
I built all of that. It took weeks, and most of the work was making the ruler honest, not running it. A ruler you can fool is worse than no ruler.
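The two-arm design above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the harness described here: `run_ab`, `ABResult`, the `Agent` and `Judge` signatures, and the `ceiling` threshold are all hypothetical names, and real arm isolation (keeping the corpus off the control's disk) is outside what a sketch like this can show.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence

# Hypothetical shapes, assumed for illustration only:
#   agent(task, corpus_or_None) -> (output_text, tokens_used)
#   judge(task, output_text)    -> rubric score in [0, 1], from a different model family
Agent = Callable[[str, Optional[str]], tuple[str, int]]
Judge = Callable[[str, str], float]


@dataclass(frozen=True)
class ABResult:
    control_score: float    # mean rubric score, no corpus
    treatment_score: float  # mean rubric score, corpus injected
    token_ratio: float      # treatment tokens / control tokens
    screened_out: int       # tasks dropped by the ceiling screen

    @property
    def delta(self) -> float:
        # Negative means injecting the corpus made the output worse.
        return self.treatment_score - self.control_score


def run_ab(tasks: Sequence[str], agent: Agent, judge: Judge,
           corpus: str, ceiling: float = 0.95) -> ABResult:
    control, treatment = [], []
    control_tokens = treatment_tokens = 0
    screened_out = 0
    for task in tasks:
        # Control arm: same agent, same task, no corpus passed in.
        cold_output, cold_tokens = agent(task, None)
        cold_score = judge(task, cold_output)
        # Ceiling screen (assumed rule): if the cold run is already near the
        # top of the rubric, the corpus has no room to help, so skip the task.
        if cold_score >= ceiling:
            screened_out += 1
            continue
        # Treatment arm: identical call with the curated knowledge injected.
        warm_output, warm_tokens = agent(task, corpus)
        control.append(cold_score)
        treatment.append(judge(task, warm_output))
        control_tokens += cold_tokens
        treatment_tokens += warm_tokens
    if not control:
        raise ValueError("every task hit the ceiling; nothing left to compare")
    # Assumes the agent reports a positive token count for each run.
    return ABResult(mean(control), mean(treatment),
                    treatment_tokens / control_tokens, screened_out)
```

The one design choice worth noting is that both arms go through the same `judge` call, so any bias in the judge applies equally to control and treatment.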
Then I ran it.
The treatment arm scored lower than the control. Injecting the knowledge made the output worse. The agent without my hard-won corpus scored near the top of the rubric on its own. The corpus-on arm paid five times the tokens to do slightly worse work.
That's not a rounding error. That's the bet failing.
Why it failed (it's not what I wanted to hear)
My first instinct was to blame the plumbing. Bad retrieval. Thin learnings. Wrong injection. Some of that was real and I fixed it. The number didn't move.
The actual reason is harder.
A frontier model already knows most of what's in my corpus. I'd been writing down operational truths like "acceptance checks must be token-specific" and "merge early, merge often" and treating them as proprietary gold. They're not proprietary. They're in the training data. When I hand a 2026-era model my notes on how to write a good spec, I'm telling a chess engine how knights move. It nods politely and plays the same move it was going to play anyway, except now its context window is full of my notes.
On a single, cleanly-defined task, the corpus has no room to add value, because the model isn't the thing failing.
I was grinding the wrong stat again. Devlog 4 warned about this and I did it anyway: I optimized the part I controlled and felt productive, instead of measuring whether it mattered.
So I wrote it down where it counts. The internal record now says, in plain language: the corpus moat is unproven. Not disproven, unproven. A weaker model, or a long campaign of hundreds of tasks instead of one, might still show the compounding. But I am not allowed to claim a moat I can't measure. The ruler refused to certify it, and I let the ruler win.
If you read the flywheel essay as "this is proven," read this paragraph as the correction. The flywheel as a personal practice is still worth the two minutes. The flywheel as a moat is a hypothesis that just failed its first honest exam.
What didn't fail
One thing held up.
Across every rebuild in this series, one discipline never got absorbed by the platform and never failed a test: not trusting the agent's claim of "done." Devlog 1 called it the gatekeeper. Devlog 5 watched native agent teams ship without one and close four tasks overnight when only two were actually done.
That gatekeeper is the product. I stopped hedging about it. The whole system now has one job: take stochastic output and decide whether it's proven enough to trust. Nothing an agent generates is "done" until something independent proves it. No verdict, not done.
But I got something wrong in devlog 1. I described the gatekeeper as type checks, linting, complexity analysis, builds. Static tools. Those are real and they matter, and they catch exactly one kind of wrong: the mechanically broken kind. The code that won't compile, won't type-check, or trips a lint rule.
They are useless against the dangerous kind. The code that compiles, types clean, passes the test it wrote for itself, and is confidently, beautifully wrong about what it was supposed to do. A linter has no opinion about whether the code does what you asked. It can't. That's not its job.
So the gatekeeper grew a second stage that a static tool can't be: an independent reviewer that never saw the work get made. Fresh context. A different model family from the one that wrote the code. Handed the change, handed the acceptance contract, and asked one question: does this satisfy what was promised? It can't rubber-stamp, because it has no memory of writing the thing and no stake in it being right.
The QC sensor on the conveyor belt catches the malformed part. This is the inspector who walks the floor and never saw the part get stamped. The foundry needs both.
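A minimal sketch of the two-stage gate, under assumptions: `gate`, `Verdict`, and the callable shapes are hypothetical, and the model-family labels are plain strings supplied by the caller. It only shows the ordering and the "no verdict, not done" default, not how a real reviewer is invoked.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical shapes, assumed for illustration only:
#   static check: change -> True if it passes (type check, lint, build, ...)
#   reviewer:     (change, acceptance_contract) -> True if the contract is satisfied
StaticCheck = Callable[[str], bool]
Reviewer = Callable[[str, str], bool]


@dataclass(frozen=True)
class Verdict:
    done: bool
    reason: str


def gate(change: str, contract: str,
         static_checks: Sequence[tuple[str, StaticCheck]],
         reviewer: Reviewer,
         author_family: str, reviewer_family: str) -> Verdict:
    # Independence precondition: the reviewer must not share a model family
    # with whatever wrote the change, or it may rubber-stamp its own style.
    if reviewer_family == author_family:
        return Verdict(False, "reviewer shares a model family with the author")
    # Stage 1: static tools. These catch the mechanically broken kind of wrong.
    for name, check in static_checks:
        if not check(change):
            return Verdict(False, f"static check failed: {name}")
    # Stage 2: fresh-context review against the acceptance contract.
    # The reviewer receives only the change and the contract, nothing else.
    if not reviewer(change, contract):
        return Verdict(False, "independent review: contract not satisfied")
    return Verdict(True, "static checks and independent review passed")
```

Every early return is a "not done," so the only path to `done=True` runs through both stages.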
I reached for the next flywheel. The same ruler killed it too.
Now the part I'm not proud of.
The negative result stung, so I did what I always do. I reached for the next beautiful structure. If knowledge wasn't the thing that compounds, maybe the gatekeeper's own mistakes were.
The idea is good. Call it an escape: the gate looks at a change, says "confirmed, this is done," the change ships, and later it turns out the verdict was wrong. The gate missed it. That miss is the most valuable artifact the system can produce, because it's a real example of the looks-right-but-wrong failure no linter sees. Turn each escape into a check, and the gatekeeper gets smarter from its own misses instead of from my notes. The mechanism works. I watched a miss become a check that blocks it next time, end to end, on the real system. That part is proven.
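The miss-becomes-a-check loop can be sketched as a small registry. This is an illustration under assumptions, not the real system: `EscapeLog`, `record_escape`, and `blocked_by` are hypothetical names, and a check here is just a function from a change to pass or fail.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical shape: a check takes a change and returns True if it is acceptable.
Check = Callable[[str], bool]


@dataclass
class EscapeLog:
    # Maps an escape identifier to the regression check derived from that miss.
    checks: dict[str, Check] = field(default_factory=dict)

    def record_escape(self, escape_id: str, regression_check: Check) -> None:
        # An escape is a change the gate approved that later proved wrong.
        # Recording it means keeping a check that would have caught it.
        self.checks[escape_id] = regression_check

    def blocked_by(self, change: str) -> list[str]:
        # Return the ids of every past escape whose check this change fails.
        return [eid for eid, check in self.checks.items() if not check(change)]
```

A gate would call `blocked_by` before issuing a verdict and refuse "done" whenever the returned list is non-empty.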
Then I asked the only question that matters: does it compound? Does the escape log fill up and pay off over time?
I put it under the same ruler. The number: across 130 real verdicts in production, the gate produced zero escapes. It caught everything at review.
Sit with why, because it's the whole point. A gatekeeper good enough to catch the confident-wrong stuff catches it before it ever becomes an escape. The better the gate, the fewer misses it has to learn from. Self-improvement from escapes is anti-correlated with how good the gate already is. To even generate a miss in the lab I had to hand the work to a weak model, and when I gave it a strong model writing genuinely subtle bugs, the gate caught those too, three for three. The one real miss I could manufacture didn't reliably stay fixed on re-measure.
I wanted that to be the flywheel. It isn't, yet. It's a second unproven hypothesis, and this one comes with a structural reason it might never pay off: a line good enough to catch every defect at inspection has no defect log to compound from. I reached past a true answer, that the gate works, to grab a prettier one. The ruler caught me doing it. That's twice now.
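One way to size the zero-in-130 figure is the standard bound for zero events in n trials. The sketch below assumes the verdicts are independent and that any escape would eventually have been noticed; neither assumption is established here, and the function name is hypothetical.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Upper bound on a per-trial event rate after 0 events in n trials.

    Solves (1 - p) ** n = 1 - confidence for p. Assumes independent trials.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Zero escapes across 130 verdicts: the true escape rate could still be
# anywhere up to this bound at 95% confidence.
bound = zero_event_upper_bound(130)
print(f"{bound:.1%}")  # prints 2.3%
```

Under those assumptions, 130 clean verdicts are consistent with an escape rate as high as roughly one in 44, which fits calling the escape log unproven rather than ruled out.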
So, the honest scoreboard. Two flywheels. Two bets on a system that gets smarter by itself. Both unproven.
Updating the prior devlogs
Since the premise of this one is "be honest about what held up," the scorecard:
- Devlog 1 (the gatekeeper, the factory): holds, and got sharper. The factory is real. The gatekeeper is real and is now the whole point. The one correction: static analysis is only the floor. The hard layer is an independent reviewer with no stake in the answer.
- Devlog 4 (the spec is the leverage): still true, and now load-bearing. The spec's real job turned out to be the acceptance contract, the thing the gatekeeper checks "done" against. A spec you can't verify against is just a wish.
- Devlog 5 (knowledge compounding is the gap that matters most): half right. Knowledge compounding is real as a habit and worthless as a moat against a current model, on the evidence I have. The gap that matters most turned out to be verification, not memory.
- The Knowledge Flywheel essay: the practice section stands. The "this is your moat, the system gets smarter every day" framing is now a hypothesis under measurement, and the first measurement was negative. I'd rather tell you that than let the essay age into a claim I quietly stopped believing.
The product was never the flywheel
The thing that survived both bets wasn't a flywheel at all.
It was the gatekeeper. The discipline of not trusting the agent's claim of done until something independent proves it. No verdict, not done. That doesn't need to compound to be worth everything. It just needs to be the check that stands between confident-wrong output and your main branch, and that, unlike either flywheel, is proven and ships today.
I kept wanting the product to be a self-improving cathedral, a thing that gets cleverer every night while I sleep. It's a gatekeeper. Smaller, less romantic, and real. The corpus was supposed to be the moat. The self-improvement was supposed to be the magic. Both are still hypotheses. The gatekeeper is the thing I rebuilt every single time because I couldn't ship without it.
The thing you can't delete is usually the product.
What this means if you don't write code
Same as it ever was, just earned harder this time.
The model will hand you work and call it done. Whether it holds up is a separate question, and the model is the last thing that should answer it. Generation got cheap and it's getting cheaper; the scarce skill is checking that what came out is right and safe to ship. That skill doesn't require you to understand how the model works, any more than QC at a factory requires you to understand semiconductor physics. You need to know what "correct" looks like and you need something independent checking against it.
And when the check misses, you turn that miss into a check that catches it next time. That's the right instinct, and it's the mechanism this whole system runs on. Whether enough misses ever pile up to make the system smarter on its own is a question I'm still measuring. The part that already works, the part you can use today, is having something independent check the work and refuse to call it done until it's proven.
What's next
I have a system that catches the agent saying "done" when it isn't. That part is real, and it's the product. What I don't have is proof that it gets better by itself. Both the knowledge corpus and the escape log are bets I've now failed to confirm, and the escape one may be structurally self-starving, because a good gatekeeper catches its own misses before they ever become data. Whether that ever changes is the next number I owe you. Same discipline as this whole devlog: build the ruler, run it, publish the number even when it's the one I didn't want.
I spent two devlogs and a lot of belief chasing the flywheel. The answer that survived contact with an honest ruler is plainer than I wanted it to be: the gatekeeper that won't call anything done without proof. That's the fire worth carrying.
Previous: Devlog #5: When the Platform Catches Up. Related: When It's Lying, the same trust problem for people who don't write code.