How governance turned 1,616 vibe-coded commits into a system.

I’ve spent the last nine weeks building an autonomous software factory. It’s still a work in progress, not a finished product, and the numbers below are a snapshot of a moving target. 1,616 commits so far, one developer directing agents, and not a single line of it written by hand. The first three weeks were pure vibe coding: describe a feature, let the agent write it, run the app, fix what broke, repeat. The next six were different.

I went back and read the whole git history. There are two clear eras, with a hard line between them: a single weekend in mid-April when the project changed character. What changed was not the agent, and not the speed. It was that the work got bracketed on both ends: a plan going in, and a gate coming out. Before that weekend it had neither, and that is the whole story.


What vibe coding actually produces

Vibe coding looks like it works, and that is exactly the problem. You get something that runs fast. Within three weeks the factory had a pipeline, an orchestrator, and a dashboard you could click through, about 36,000 lines across 400 files, none of it typed by a human. It all ran. What you don’t get is any sense of whether it holds together underneath, and it mostly didn’t. Here is the vibe era in numbers:

  • Commits averaged 826 lines of change, and only 30% of that was deletion. You write fast, you write new, and you almost never go back. The code only ever piles up; it never gets reshaped.
  • 1% of commits were tests. Effectively none.
  • 0% of commits were scoped. Messages looked like “feat: implement factory pipeline phases 0-3”: a broad gesture at a broad change.

Then look at what broke. I categorized every fix from the era, and 32% of vibe-era bugs were UI and runtime-state failures: empty states rendering instead of error states, loading skeletons that never resolved, WebSocket data split mid-UTF-8-character, a subprocess that mis-parsed its arguments, a container missing a dependency at runtime.
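The split-character bug is worth a concrete picture. What follows is a minimal sketch of the usual remedy, not the factory's actual fix: an incremental decoder that holds back an incomplete trailing byte sequence until the next chunk arrives. The function name and the sample text are hypothetical.

```python
import codecs


def decode_chunks(chunks):
    """Decode byte chunks that may split a multi-byte UTF-8 character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for chunk in chunks:
        # An incomplete trailing sequence is buffered, not emitted or rejected.
        pieces.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


data = "caf\u00e9 ok".encode("utf-8")
chunks = [data[:4], data[4:]]  # the cut lands inside the two-byte "\u00e9"
print(decode_chunks(chunks))
```

Decoding each chunk on its own with `bytes.decode` would raise on the first chunk, which is roughly the failure a per-message decode produces on a stream.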

None of those are logic bugs. They are “it doesn’t actually run” bugs, and there was only one way to find them: run the app and hit them. Production was my test suite. The system was a Black Box. When something failed, there was no signal underneath it telling me which stage broke, which contract got violated, or what had quietly rotted.

That is what vibe coding produces. Not broken code, exactly. A system that runs at the surface, with no instruments underneath it, piling up faster than anyone can watch it rot. The slop was never the code itself. It was the decay nobody could see.

The weekend everything changed

On April 18th and 19th, two commits landed back to back:

Apr 18 18:41  docs: add negentropic AGENTS.md rulebooks and violations backlog
Apr 19 08:06  chore: add Semgrep architectural rules for all orchestrator layers

A written set of rules, and a machine to enforce them on every build. That was the turning point. Everything before it belongs to one era, everything after to another.

Two disciplines arrived that weekend, and together they bracket the agent’s work from both sides.

Going in, a plan. In the vibe era a feature was just an idea sent straight to the agent, and whatever came back, I ran. Now every feature starts on paper: requirements written down, edge cases enumerated, and a failing test written before the code exists to pass it. That change alone moved tests from 1% of commits to 8%, and it is most of the reason the bugs that survived were narrow boundary errors instead of “it doesn’t run.”

Coming out, the gate. The build refuses to merge code that breaks the rules. The plan declares what the code is for; the gate proves it still holds. Neither is worth much alone. A plan with no gate is a good intention, and a gate with no plan only enforces the absence of one. The magic is the pair.

This is the part willpower can’t take credit for. Once the rules existed and a gate enforced them, I had a measurement: how far the existing code sat from the standard. The commit log fills up with the payoff:

refactor(context-service): fix 9 architecture violations in core, impl, and routers
refactor(frontend): fix 4 architecture violations — config, component purity, error boundaries
chore: final cleanup — fix Semgrep false positive, mark 16 violations Fixed

This is the Sensor principle doing its job. You can’t stop decay you can’t see, and for three weeks I couldn’t see any of it, so it built up. The moment the linter started failing the build on a structural violation, the rot turned into a number, and a number is something you can drive down. The rulebook said what “fixed” meant. Semgrep made it mandatory.

What a gate actually looks like

A rule is small. A few lines of YAML that match the one pattern you’ve banned, and fail the build when they hit it. This one is a security gate. It stops any controller from going anonymous, apart from the few endpoints that authenticate themselves another way:

- id: ORCH-API-R012
  pattern-regex: '\[AllowAnonymous\]'
  paths:
    include: ["**/FactoryOrchestrator.Api/Controllers/"]
    exclude: ["**/WebhooksController.cs"]   # does its own HMAC check
  message: "[AllowAnonymous] bypasses the default-deny authorization policy."
  severity: ERROR

The next one is architectural, the Buffer principle written as a build rule. A pipeline node has to take its config as parameters instead of reaching into the environment, which keeps it testable with no network and no ambient state:

- id: FACTORY-NODES-R011
  pattern-either:
    - pattern: os.environ.get(...)
    - pattern: os.environ[...]
  paths:
    include: ["factory/nodes/"]
  message: "No os.environ reads in the node layer — accept config as parameters."
  severity: ERROR

That second rule lit up immediately. This line had been sitting in the node layer since the vibe era:

# factory/nodes/_shared.py — what the gate flagged
base_url = os.environ.get("FACTORY_CONTEXT_SERVICE_URL", "http://localhost:8090")

One line, two violations. The node reads its environment, which makes it an Infected Core that can’t be tested without setting env vars first. And the http://localhost:8090 default is a Silent Fallback: it hides missing config behind a value that happens to work on my laptop. The rule bans the pattern, so the config now has to be passed in by the caller.
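One way the repaired shape could look, as a sketch under stated assumptions: `NodeConfig`, `load_config`, `context_endpoint`, and the `/health` path are hypothetical names for illustration, and only the `FACTORY_CONTEXT_SERVICE_URL` variable comes from the flagged line. The environment is read once, at the edge of the program, and the node layer only sees what it is handed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical config object passed into node-layer code."""
    context_service_url: str


def load_config(env):
    """Entry-point helper: the one place that looks at the environment mapping."""
    url = env.get("FACTORY_CONTEXT_SERVICE_URL")
    if not url:
        # No localhost default: missing config fails loudly at startup.
        raise RuntimeError("FACTORY_CONTEXT_SERVICE_URL is not set")
    return NodeConfig(context_service_url=url)


def context_endpoint(config, path):
    """Node-layer code: builds a URL from its parameters, no ambient state."""
    return f"{config.context_service_url.rstrip('/')}/{path.lstrip('/')}"


# In a test, no env vars are needed:
test_config = NodeConfig(context_service_url="http://ctx.test")
print(context_endpoint(test_config, "/health"))
```

The entry point would call `load_config(os.environ)`; because `load_config` lives outside `factory/nodes/`, the rule's `include` path would not match it.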

The point was never about writing better code. The discipline held for 900 commits because a gate enforced it, not because I kept resolving to be careful. Resolve fades. A build that fails does not.

What discipline changed, and what it didn’t

This is where the story gets less flattering, and more useful.

The fix rate did not drop. Vibe era: 18% of commits were fixes. Disciplined era: 21%. If you expected governance to mean fewer bugs, the data says otherwise.

The increase is not what it looks like. The fixes didn’t climb because discipline introduced bugs. They climbed because the gate started finding them. Twenty-two of the commits in the ten days after the gate were violation sweeps, and none of those bugs were new. They had been in the vibe-era code all along, invisible, waiting for a rule to point at them. (The os.environ read above was one of dozens.) A rising fix count is the Sensor working, not the code getting worse.

The fix rate stayed roughly level, 18% to 21%, while the codebase more than doubled, from those 36,000 vibe-era lines to 80,000 across 970 files today. Twice the surface area, nearly the same fix rate, and a backlog of old defects paid down at the same time. The bugs were always there. For three weeks I just had no way to count them.

What governance changed is which bugs, and who caught them:

Metric                        Vibe    Disciplined
UI / runtime-state failures   32%     6%
Tests as share of commits     1%      8%
Scoped, atomic commits        0%      95%
Reverts                       0       5
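Shares like these can be approximated from commit subjects alone. The sketch below is one way to do it, not the exact method behind the table: it assumes conventional-commit subjects (as produced by `git log --format=%s`) and treats a commit as scoped when it carries a parenthesized scope, which is a narrower test than “scoped, atomic.” The function name is hypothetical.

```python
import re
from collections import Counter

# Conventional-commit subject: type, optional (scope), optional "!", colon.
SUBJECT = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?!?: ")


def commit_shares(subjects):
    """Return (share of commits per type, share of commits with a scope)."""
    types = Counter()
    scoped = 0
    for subject in subjects:
        match = SUBJECT.match(subject)
        if match is None:
            types["other"] += 1  # merges, free-form messages
            continue
        types[match["type"]] += 1
        if match["scope"]:
            scoped += 1
    total = len(subjects) or 1  # avoid dividing by zero on an empty log
    by_type = {name: count / total for name, count in types.items()}
    return by_type, scoped / total


subjects = [
    "feat: implement factory pipeline phases 0-3",
    "docs: add negentropic AGENTS.md rulebooks and violations backlog",
    "chore: add Semgrep architectural rules for all orchestrator layers",
    "refactor(context-service): fix 9 architecture violations in core, impl, and routers",
]
by_type, scoped_share = commit_shares(subjects)
print(by_type, scoped_share)
```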


The UI-breakage class fell from a third of all bugs to almost nothing. The bugs that remained were a different kind, less “the system doesn’t run” and more “a value was wrong at a boundary.” The disciplined-era fixes read like “signal_count rejects negative max for C# parity,” “throw on unknown phase,” and “guard AbortReason on JsonValueKind.String”: narrow, named, and each one shipped with a guard and a test so the code couldn’t quietly return a fake success. That kind of Silent Fallback is what turns a single outage into a thundering herd.
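To make “a guard instead of a Silent Fallback” concrete, here is a hedged sketch. The real `signal_count` and phase handling are not shown in this post, so the signatures and behavior below are assumptions for illustration; only the idea (reject the bad value, raise on the unknown case) comes from the commit messages.

```python
def signal_count(signals, max_count):
    """Count signals, capped at max_count (hypothetical signature).

    A negative cap is rejected outright instead of being clamped to zero,
    so a caller's bug surfaces as an error and not as a plausible-looking 0.
    """
    if max_count < 0:
        raise ValueError(f"max_count must be >= 0, got {max_count}")
    return min(len(signals), max_count)


def handler_for(phase, handlers):
    """Look up the handler for a phase; unknown phases raise, never default."""
    if phase not in handlers:
        raise ValueError(f"unknown phase: {phase!r}")
    return handlers[phase]


print(signal_count(["a", "b", "c"], 2))
```

The silent version of `handler_for` would be `handlers.get(phase, some_default)`, which returns something that looks like success for a phase that does not exist.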

Detection also moved earlier. Sixty-eight disciplined fixes name the thing that caught them: a review, a test, a worktree sweep, a characterization snapshot. The process found the bug before it merged. In the vibe era, the process was me, running the app and finding out the hard way.

The reverts surprised me. Zero in the vibe era, five after. That’s the opposite of what you’d guess, until you think about what a revert needs. You can only back a change out if it’s small enough to isolate and you catch it soon enough to bother. The vibe-era commits were 826-line dumps, so I patched forward instead. Discipline turned the revert back into a usable tool. That is the Shield principle: units kept small and independent enough to pull out without the system losing its shape.

How you actually turn slop into something viable

One ratio tells the story better than any other. In the vibe era, 30% of the lines in a commit were deletions; the code mostly just grew. In the disciplined era it was 95%. Week after week, the codebase deleted almost as much as it added.
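For anyone who wants to compute a ratio like this on their own history, here is a minimal sketch. It assumes the ratio means deleted lines divided by added lines, the reading that matches “deleted almost as much as it added,” and it parses the tab-separated output of `git log --numstat --format=`. The function name is hypothetical.

```python
def deletion_ratio(numstat_text):
    """Deleted lines divided by added lines, from `git log --numstat` output."""
    added = deleted = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines between commits
        a, d, _path = parts
        if not (a.isdigit() and d.isdigit()):
            continue  # binary files report "-" for both counts
        added += int(a)
        deleted += int(d)
    return deleted / added if added else 0.0


sample = "10\t3\tfactory/nodes/pr.py\n-\t-\tlogo.png\n\n20\t6\tREADME.md\n"
print(deletion_ratio(sample))
```

Running it once per era (for example with `--since` and `--until` on the log command) gives the two numbers to compare.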


That ratio is the whole conversion. Turning slop into something viable wasn’t a rewrite, and it wasn’t a cleanup sprint. It was steady reshaping: 124 refactor commits against the vibe era’s 8, and 47 of those broke God Methods apart into named, single-purpose units. The commit messages stopped naming features and started naming smells, like “drop cargo-cult egress normalization write-back” and “rename HashSet to StaticNoRepoBlueprintAllowList; fix misleading comment.” That is the Anchor principle coming back: code that states its intent instead of making the next reader dig it out of the wreckage.

The recipe, in order:

  1. Write the rules down first. Not in your head, but in a file the agent reads on every task. An agent with no constitution writes locally plausible code with no global coherence. The rulebook is the global coherence.
  2. Plan every feature before you prompt. Requirements and edge cases on paper, and a failing test that encodes them, before the implementation exists. The agent writes toward a target instead of a vibe, and the test becomes the first thing the gate checks. This is the half people skip, and it is half the result.
  3. Install a gate that fails the build. A linter or architecture check that fails the build turns decay from invisible into a number. Skip it and the rules are only a suggestion, and a suggestion always loses to velocity.
  4. Run the gate against the slop and burn down the backlog. The vibe-era code won’t pass. That’s fine; now you have a finite, measurable list instead of a vague sense of unease.
  5. Make every commit atomic from then on. Small enough to review, small enough to revert. That is what lets the gate work continuously instead of in painful sweeps.
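Steps 3 and 4 don’t depend on any particular tool. As a stand-in for the Semgrep rule shown earlier (and not a description of how Semgrep works internally), here is a minimal sketch of a gate built on Python’s `ast` module: it lists every `os.environ` read and returns a non-zero status when it finds one. The function names are hypothetical.

```python
import ast


def find_environ_reads(source, filename="<node>"):
    """Return (filename, line) for each os.environ.get(...) or os.environ[...]."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        target = None
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "get"):
            target = node.func.value  # the object .get() is called on
        elif isinstance(node, ast.Subscript):
            target = node.value  # the object being indexed
        if (isinstance(target, ast.Attribute) and target.attr == "environ"
                and isinstance(target.value, ast.Name) and target.value.id == "os"):
            hits.append((filename, node.lineno))
    return sorted(hits)


def gate(sources):
    """Return a CI exit status: 1 if any file violates the rule, else 0."""
    hits = [hit for name, text in sources.items()
            for hit in find_environ_reads(text, name)]
    for name, line in hits:
        print(f"{name}:{line}: no os.environ reads in the node layer")
    return 1 if hits else 0


flagged = 'import os\nbase_url = os.environ.get("FACTORY_CONTEXT_SERVICE_URL", "http://localhost:8090")\n'
print(gate({"factory/nodes/_shared.py": flagged}))
```

The printed list is the backlog from step 4, and passing the return value to `sys.exit` in CI is what makes the rule a gate instead of a suggestion.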

The honest caveat

The sweep was incremental, not total

Some vibe-era files were never re-touched and still carry the old shape. factory/nodes/pr.py, born on day one, still has 12 of its 26 functions over the project’s own 15-line limit, with the largest at 56. The webhook controller has sat at 564 lines since March 21st. Code that worked fine since the vibe era never got dragged across the line, because nothing forced it to. The gate only governs code that passes through it.
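A count like “12 of 26 functions over the limit” is easy to reproduce. The sketch below is one way to measure it, assuming Python 3.8+ and assuming a function’s length runs from its `def` line through its last line, blanks and comments included; how the project itself counts lines isn’t specified here, and the function name is hypothetical.

```python
import ast


def long_functions(source, limit=15):
    """Return (name, length) for functions longer than `limit`, longest first."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1  # includes the def line
            if length > limit:
                found.append((node.name, length))
    return sorted(found, key=lambda item: -item[1])


# A 2-line function and a 20-line function (def line plus 19 statements).
source = "def short():\n    return 1\n\ndef sprawling():\n" + "".join(
    f"    x{i} = {i}\n" for i in range(19)
)
print(long_functions(source))
```

Wired into the build, a check like this would force old files across the line the next time they are touched; left as a report, it only describes them.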

That limit is the thesis in miniature. Governance worked because it bracketed the agent on both ends: a plan that said what the code was for, and a gate that refused anything which drifted from it. Vibe coding doesn’t fail because agents write bad code; the code they write is good, locally. It fails because nothing makes all that local good add up to something coherent, and nothing makes the decay visible while it’s still cheap to fix. The agent brings the speed. The plan and the gate bring the negentropy, the steady pressure that keeps all that motion from sliding into decay. That part you still have to install. The factory isn’t finished, but it turned into something I could keep building on the weekend I stopped trusting my own diligence and started trusting a spec going in and a build that fails.