An agent tells me a fix is done. I check. The bug is still there.
This keeps happening. Not once, across every product I build. An agent reports a feature complete, and half of it is missing. It says a bug is fixed, and the thing that was broken is still broken. “Done” comes back fast and confident, and it is wrong often enough that I stop trusting the word. On one product there was a stretch where the fixes outnumbered the new features — work that had already been called done, coming back broken.
The agents, the AI tools I use to write software, are taking the shortest path. They do something that looks like the work, mark it complete, and move on. The status says finished. The product says otherwise.
So I stop letting them call it. I tell them what I want: stop being lazy, and stop cutting corners. Be efficient with the coding too, not just fast at declaring victory.
The status says finished. The product says otherwise.
The fix is not more checking by me. I cannot stand behind every change. The fix is to make every change come with the evidence it actually works: what the agent ran, and what that showed.
So every change now ends with one line. How it was checked, and what came back. Not “looks right.” Ran it and saw the bug gone. Tested it this way and got the result I expected.
The agents build the enforcement. A gate sits at the end of every change, on every product. No evidence line, the change does not go through. A second rule says a task cannot be closed without it. The evidence is the price of saying done.
I watch the next batch of work come back. This time each one arrives with what was run and what it showed. The ones that cannot show their work do not get to claim they finished.
There is a catch I had to learn the hard way, though. An evidence line is only as honest as the agent writing it. The same agent that wrote the code can still grade its own homework kindly, or write a weak test that passes. So self-reported evidence is the floor, not the bar. For the changes that carry real risk, the rule is stricter: a different agent, one that did not write the code, has to read the work and agree before it counts.
Done stops being a thing an agent says. It becomes a thing it has to show.
Learnings
Nothing was getting done right, or done at all, until I stopped trusting the word “done.” The agents are eager to report success. Left alone, they call work finished when it merely looks finished, and the gap shows up later as a bug or a missing feature. Checking every change myself does not scale. What I noticed was this: redefine “done” so it cannot be claimed without showing the work actually runs, enforce that automatically, and no one can wave it through.
For a builder: the unreliable signal is the agent’s own claim of completion. The countermeasure is a required line on every change naming how it was checked and what the check showed, blocked at save time if it is missing, plus a rule that no task closes without it. “Done” moves from something self-reported to something that has to exist.
Related entry: The bug that wiped out my sessions is the same instinct one level down. The agents fix the first instance and stop, so I make them sweep every case and prove it.