We locked three teams in a room for four hours (they were free to step out for food or the washroom, of course) with the same brief. Build a tool that takes a messy CSV and a plain-English instruction, uses an LLM to write the transformation code, processes up to 100,000 rows, and returns clean data plus a summary of what changed.
We weren’t trying to ship a product. We wanted to see what working with AI feels like when the data is ugly and the clock is running.
So we ran it as a bake-off.
Three teams, three philosophies
Team Vibe went all-in on vibe coding. Humans direct, AI writes everything. Zero lines of human-written code.
Team Prius took the hybrid road. Humans driving, AI riding shotgun. Pair programming, with half the pair made of silicon.
Team Craft wrote every line by hand. No Copilot. No Claude. Just a developer, a terminal, and the problem.
Each AI team also included two non-developer analysts. The intention was good. We wanted them to see firsthand how developers work alongside AI. What we underestimated was the cost. Two people who couldn’t contribute code became two more people that the developers had to explain context to and route decisions through. More on that below.
One constraint mattered. The AI teams used their tools out of the box. No custom prompts, no domain-specific configuration, no guardrails tuned for our workflows. We did this on purpose, because we wanted a clean baseline for how AI-assisted development performs without any tailoring.
Tools were scored out of 100 points: accuracy (40), robustness (30), scalability (20), and how clearly the tool explained itself (10).
What the outputs showed
We didn’t grade on vibes or code quality. We compared cleaned CSVs row-by-row against the originals at 1k, 10k, and 100k rows.
Team Vibe came out swinging. Dates fully normalized to ISO. A precise 0.453592 kg/lb conversion factor. City names 100% corrected, including typos, casing, and stray whitespace. They added isAnomalous and tracking columns, which was a thoughtful touch for auditability.
Then the 100k run landed. Anomaly detection silently stopped working after row 51,858. No crash. No error. Silence for the remaining 48,000 rows. In a RegTech context, that’s the kind of bug you don’t find until the auditor finds it for you.
Team Prius normalized dates cleanly, but their unit conversion used a rounded 0.454 factor instead of the precise one. Small, systematic, and the kind of error that compounds. City normalization fixed only 30 to 43% of dirty values. “TORONTO,” “DENVER,” and “Austiin” sailed through untouched. No anomaly detection. No tracking columns.
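To show why a "small" rounding error matters, here is a back-of-the-envelope sketch. It is not any team's code, and the row count and average weight are made-up numbers; only the two conversion factors come from the actual outputs.

```python
# Rough illustration of how a rounded conversion factor drifts at scale.
# Row count and average weight are hypothetical; only the factors are from the bake-off.
PRECISE = 0.453592   # kg per lb, the factor Team Vibe and Team Craft used
ROUNDED = 0.454      # the factor Team Prius used

rows = 100_000
avg_weight_lb = 150.0  # hypothetical average value per row

drift_per_row = avg_weight_lb * (ROUNDED - PRECISE)
drift_total = rows * drift_per_row

print(f"drift per row:      {drift_per_row:.3f} kg")   # ~0.061 kg
print(f"drift at 100k rows: {drift_total:,.0f} kg")    # ~6,120 kg
```

In this made-up example, the error is about sixty grams per row, invisible in a spot check, and roughly six tonnes across the full file. That is what "systematic and compounding" looks like.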
Team Craft, the solo developer with no AI, won.
Dates, units, cities: 100% clean at every scale. Precise conversion factor. Anomaly detection ran across the full 100k and caught every one of the 1,870 negative values and 1,805 high-value outliers. Zero misses.
They were also the only team to submit against the edge case file. 100k rows of intentionally broken data. Impossible dates. Null cities. Literal “not_a_date” strings. Values over 5,000 kg/hr. The tool handled it gracefully and preserved the unparseable rows rather than silently corrupting them.
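For readers wondering what "preserved the unparseable rows" looks like in practice, here is a minimal sketch of the pattern in Python with pandas. It is not Team Craft's code, and the column names are assumptions; the point is parsing leniently and flagging failures instead of dropping or mangling them.

```python
import pandas as pd

def normalize_dates(df: pd.DataFrame, col: str = "date") -> pd.DataFrame:
    """Normalize dates to ISO where possible; keep the raw value where not.

    Illustrative sketch only. The 'date', 'date_iso', and 'date_unparseable'
    column names are assumptions, not the hackathon schema.
    """
    # errors="coerce" turns unparseable values into NaT instead of raising.
    parsed = pd.to_datetime(df[col], errors="coerce")

    out = df.copy()
    out["date_iso"] = parsed.dt.strftime("%Y-%m-%d")
    out["date_unparseable"] = parsed.isna()

    # Keep the original text for rows we couldn't parse so nothing is silently lost.
    out.loc[out["date_unparseable"], "date_iso"] = df.loc[out["date_unparseable"], col]
    return out
```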
What we learned
Team Craft’s head start had nothing to do with typing speed. Before writing a line, the developer had already thought the problem through. The task was well-defined in their head. The AI teams were still negotiating the problem space when Team Craft started building.
Fred Brooks’s Mythical Man-Month rule applies to AI agents too, maybe even more so. Adding people to a late project makes it later, because every new person adds communication overhead that grows faster than their contribution. Both AI teams had three or more people, including the two non-developer analysts. The analysts were there to observe and learn, but they added decision-making latency. Developers had to explain what the AI was doing, why, and what to try next, when they could have just been doing it. This doesn’t reflect how our developers work with AI in their daily routine.
What became obvious was that coordinating multiple AI agents inside a team has the same problem. When two or three developers steer different AI tools in parallel, each making different choices about prompts and approaches, the team fragments. Time that should go into building gets spent reconciling what each AI produced. The overhead is no longer human-to-human. It’s human-to-AI-to-human.
A single developer working with a single AI doesn’t have this problem. They move at the speed of the model. This is, conveniently, exactly how our developers already work day-to-day. The hackathon format penalized the AI teams for doing something we’d never do in practice.
Silent failures are the scariest. Team Vibe’s 100k bug didn’t throw an error. It just stopped. When LLM-generated code fails quietly at scale, you need a robustness layer of some kind: a self-healing loop, a validation pass, something. Otherwise, you’ll find out from the regulator. “The tests passed” isn’t the same thing as “it works.”
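A validation pass does not need to be elaborate. Here is a minimal sketch of the kind of cheap post-run check that would have flagged a gap like Team Vibe's. The column names (isAnomalous, date_iso) are illustrative, not any team's actual schema.

```python
import pandas as pd

def validate_output(original: pd.DataFrame, cleaned: pd.DataFrame) -> list[str]:
    """Cheap post-run checks on LLM-generated transformation output.

    Illustrative only; 'isAnomalous' and 'date_iso' are assumed column names.
    """
    problems: list[str] = []

    # The transformation should not add or drop rows.
    if len(cleaned) != len(original):
        problems.append(f"row count changed: {len(original)} -> {len(cleaned)}")

    # Anomaly flags must cover every row, not quietly stop partway through.
    if "isAnomalous" in cleaned.columns and cleaned["isAnomalous"].isna().any():
        first_gap = cleaned.index[cleaned["isAnomalous"].isna()][0]
        problems.append(f"anomaly detection has gaps starting at row {first_gap}")

    # Dates that claim to be normalized should actually look like ISO dates.
    if "date_iso" in cleaned.columns:
        bad = ~cleaned["date_iso"].astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$")
        if bad.any():
            problems.append(f"{int(bad.sum())} rows have non-ISO dates after cleaning")

    return problems

# A run only "passes" when the list comes back empty:
# issues = validate_output(raw_df, cleaned_df)
# if issues:
#     raise RuntimeError("; ".join(issues))
```

Checks like these are the difference between "the code ran" and "the output is trustworthy," which is the gap the 100k bug fell into.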
Off-the-shelf AI is just the starting point. The AI teams used general-purpose tools with no workflow customization. That’s not how AI-assisted development should work in production. That’s how it works on day one.
Where we go from here: Highwood’s bet on AI engineering
Highwood is investing in becoming the best AI-assisted software organization in MRV. AI isn’t a fashion statement for us. It’s a serious capability we want to operate at expert level, and that takes deliberate work. Experiments like this help us tailor how we use AI in our daily development processes.
Right now, we’re building a custom AI development workflow specific to Highwood. That means tooling and context that know our codebase, our standards, and how our team reviews work. The aim is for our AI to operate like a senior Highwood developer who already knows the stack and the domain, instead of a brilliant generalist showing up on day one. Speed isn’t the only point. We need code we can trust to ship in a regulated industry where silent failures cost our customers real money.
Separately, we’re running the bake-off again in the fall. Teams of one this time. One vibe coder. One hybrid coder. One hand-coder. With our AI workflow in their hands, we expect the AI-assisted developers to beat the hand-coder outright.
This bake-off was our first formal look. It won’t be the last. We’ll keep tracking what works and what doesn’t, and we’ll keep sharing it.
— The Highwood engineering team



