The Smart Data Sanitizer Bake-Off: What we learned from three different team approaches
We locked three teams in a room for four hours (they were, of course, free to step out to eat or use the washroom) with the same brief: build a tool that takes a messy CSV file and a plain-English instruction, lets an LLM generate the transformation code, runs it against up to 100,000 rows, and returns something clean – plus a summary of what changed.
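For concreteness, here’s roughly the shape of the pipeline every team was building. This is our sketch, not any team’s code; llm_generate_code is a placeholder for whatever model client you’d actually wire in.

```python
import pandas as pd

def llm_generate_code(prompt: str) -> str:
    """Placeholder for a real model call; wire up your own client here."""
    raise NotImplementedError

def sanitize(csv_path: str, instruction: str) -> tuple[pd.DataFrame, str]:
    df = pd.read_csv(csv_path)

    # 1. Ask the model for a clean(df) function, given the plain-English brief.
    code = llm_generate_code(
        f"Write a Python function clean(df) that: {instruction}. "
        f"Columns: {list(df.columns)}"
    )

    # 2. Execute the generated code in a scratch namespace.
    #    (A real tool would sandbox this; exec'ing model output is risky.)
    namespace: dict = {}
    exec(code, namespace)
    cleaned = namespace["clean"](df.copy())

    # 3. Summarize what changed, per shared column (assumes rows stay aligned).
    shared = [c for c in df.columns if c in cleaned.columns]
    changed = {c: int((cleaned[c].astype(str) != df[c].astype(str)).sum())
               for c in shared}
    summary = ", ".join(f"{c}: {n} cells" for c, n in changed.items() if n)
    return cleaned, summary
```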
The point wasn’t to ship a product. It was to figure out what working with AI feels like when the stakes are real, the data is ugly, and the clock is running.
So, we ran it as a bake-off.
Three teams, three philosophies: Team Vibe, Team Prius, and Team Craft
Team Vibe went full vibe coding. Humans direct, AI writes everything. ZERO code written by humans. Maximum trust in the model.
Team Prius took the hybrid road. Humans in the driver’s seat, AI riding shotgun. Pair programming, but the pair is half silicon.
Team Craft wrote every line by hand. No Copilot. No Claude. Just a developer, a terminal, and the problem.
We scored each tool out of 100: accuracy (40 points), robustness (30), scalability (20), and how clearly the tool explained itself – the “vibe” score (10).
What the outputs showed
Accuracy wasn’t judged on vibes or code quality: we compared cleaned CSVs row by row against the originals, across 1k-, 10k-, and 100k-row datasets.
Team Vibe came out swinging. Dates fully normalized to ISO. A precise 0.453592 kg/lb conversion factor. City names 100% corrected – typos, casing, stray whitespace, all handled. They even added an isAnomalous flag plus columns tracking what was cleaned – a genuinely thoughtful touch for auditability.
Then the 100k run landed. Anomaly detection silently stopped working after row ~51,858. No crash. No error. Just silence for the remaining ~48,000 rows. In a RegTech context, that’s the kind of bug that stays invisible until an auditor finds it for you.
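We didn’t root-cause the bug on the day, so the snippet below is illustrative, not Team Vibe’s code. But it shows how easily generated code produces exactly this signature: one blanket try/except around a row loop, and processing quietly stops at the first bad value.

```python
import pandas as pd

values: list = [1.0] * 100_000
values[51_858] = "oops"                  # one malformed value, deep in the file
df = pd.DataFrame({"value": values})

df["isAnomalous"] = None                 # to be filled row by row
try:
    for i, row in df.iterrows():
        df.at[i, "isAnomalous"] = row["value"] < 0
except Exception:
    pass  # the first bad row aborts the loop; no error ever surfaces

print(df["isAnomalous"].notna().sum())   # 51858 flagged, then silence
```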
Team Prius normalized dates cleanly. But their unit conversion used a rounded 0.454 factor instead of the precise 0.453592 – a small, systematic error, the kind that compounds. City normalization fixed only 30–43% of dirty values; “TORONTO,” “DENVER,” and “Austiin” sailed through untouched. No anomaly detection. No tracking columns.
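How bad is 0.454 versus 0.453592? Roughly 0.09% high on every converted value – which sounds negligible until you remember it applies to every single row:

```python
precise = 0.453592            # kg per lb
rounded = 0.454

rel_error = (rounded - precise) / precise
print(f"{rel_error:.4%}")     # 0.0899% high, on every conversion

# Per metric tonne of converted weight, that's ~0.9 kg of phantom mass:
print(f"{1000 * rel_error:.2f} kg")   # 0.90
```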
Team Craft – the solo developer with no AI – won.
Dates, units, cities: 100% clean at every scale. Precise conversion factor. Anomaly detection ran across the full 100k and caught every one of the 1,870 negative values and 1,805 high-value outliers. Zero misses.
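We don’t have Team Craft’s source to publish, but catching what they caught doesn’t require anything exotic. A minimal vectorized sketch – the column name and the 5,000 cutoff are our assumptions, borrowed from the edge-case data, not their actual threshold:

```python
import pandas as pd

def flag_anomalies(df: pd.DataFrame, col: str = "weight_kg",
                   upper: float = 5000.0) -> pd.DataFrame:
    """Flag negatives and implausibly high values in one vectorized pass."""
    out = df.copy()
    values = pd.to_numeric(out[col], errors="coerce")  # unparseable -> NaN
    # NaN compares False on both sides, so unparseable rows aren't flagged here.
    out["isAnomalous"] = (values < 0) | (values > upper)
    return out
```

The design point: a vectorized pass either processes every row or fails loudly. It can’t quietly stop at row 51,858 the way a per-row loop can.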
They were also the only team to submit results against the edge-case file – 100k rows of intentionally broken data. Impossible dates. Null cities. Literal not_a_date strings. Values of 5,000+ kg/hr. The tool handled it gracefully, preserving the unparseable rather than silently corrupting it.
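“Preserving the unparseable” is a one-line policy once you commit to it. A minimal sketch with pandas, assuming a hypothetical date column named order_date:

```python
import pandas as pd

def normalize_dates(df: pd.DataFrame, col: str = "order_date") -> pd.DataFrame:
    """Normalize parseable dates to ISO; leave garbage exactly as found."""
    out = df.copy()
    parsed = pd.to_datetime(out[col], errors="coerce")  # not_a_date -> NaT
    iso = parsed.dt.strftime("%Y-%m-%d")
    # Where parsing failed, keep the original value instead of writing NaT.
    out[col] = iso.where(parsed.notna(), out[col])
    return out
```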
What we learned
The hand-coder had a head start that had nothing to do with typing speed. Before writing a line, they’d already thought the problem through. The task was well-defined in their head. The AI teams were still negotiating the problem space while Team Craft was already building. Preparation beat parallelism.
The Mythical Man-Month applies to AI agents, too. Both AI teams had 3+ people, including non-developers. Brooks’s insight was that adding people to a late project makes it later – coordination overhead compounds. It turns out that’s true whether you’re coordinating humans, or humans trying to steer multiple AI agents toward a coherent output. Vibe coding especially would have been stronger as one developer and one AI – not a committee with a chatbot. The hybrid team felt this less, but it was still there.
The silent failure is the scariest failure. Team Vibe’s 100k bug didn’t throw an error. It just stopped. When LLM-generated code fails quietly at scale, you need a robustness layer – a self-healing loop, a validation pass, something – or you won’t find out until the regulator does. The “AI wrote it and the tests passed” feeling is not the same as “it works.”
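Concretely, a minimal robustness layer might look like the sketch below: run the generated code, then refuse to accept silence. The invariants are just examples, and run_generated_code / regenerate_with_feedback are placeholders for your LLM plumbing – the retry loop is the “self-healing” part.

```python
import pandas as pd

def validate(original: pd.DataFrame, cleaned: pd.DataFrame) -> list[str]:
    """Cheap invariants – the kind that would have caught the row-51,858 bug."""
    problems = []
    if len(cleaned) != len(original):
        problems.append(f"row count changed: {len(original)} -> {len(cleaned)}")
    if "isAnomalous" in cleaned.columns:
        unset = int(cleaned["isAnomalous"].isna().sum())
        if unset:
            problems.append(f"isAnomalous never set on {unset} rows")
    return problems

# Self-healing loop, sketched: feed failures back to the model and retry.
for attempt in range(3):
    cleaned = run_generated_code(df)          # placeholder: LLM step
    problems = validate(df, cleaned)
    if not problems:
        break
    regenerate_with_feedback(problems)        # placeholder: re-prompt
else:
    raise RuntimeError(f"still failing after retries: {problems}")
```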
We’re running it again
In the fall, we’ll run the same three approaches again – but with teams of one. Multiple single-developer vibe-coding and hybrid teams, pitted against one hand-coder. That removes the coordination tax entirely; we’ll see what happens.
Our hunch: vibe coding closes the gap significantly when it’s one brain, one AI, and no overhead. The 4-hour format punished coordination, and the AI-heavy workflows required the most of it. Give a single developer a good model and a clear problem, and the picture might look very different.
The goal was never to prove that one approach wins. It was to understand the real tradeoffs of shipping AI-assisted code in a domain where silent failures matter. We came away with better questions than we started with – which, honestly, is what a good hackathon is supposed to do.
More in the fall. Different teams, same ugly data.