Shipping Large Refactors with AI — What It Really Takes

Length: 9 min

Published: April 25, 2026

The industry spent a decade telling engineers to stop doing big refactors. Small PRs, continuous delivery, trunk-based development. Then AI-assisted coding landed, and the math changed. In April 2026, a DX Heroes engineer merged 82 commits on a single branch against a production application after about six hours of focused work. Without AI, the same person estimates it would have taken "easily a month." The refactor covered a contracts migration to Zod as the single source of truth, a canvas refactor sweep, new email templates, and a Stripe lifecycle split. When it merged, 4,216 web tests, 2,119 API tests, and 128 new contract tests all passed.

The question enterprise clients keep asking us is not "can AI write code?" but "can we let AI drive a structural refactor through our production system?" The answer isn't binary. It is a capability a team can learn, and the discipline around the AI matters more than the model inside it.

Why large refactors are back on the table

Three things are happening at once. Tech debt from 2020–2024 has accumulated to the point where incremental cleanup doesn't move the needle. Legacy libraries that teams lived with for years are hitting end-of-life, forcing migrations. And AI now changes the cost of exploration. The expensive part of a big refactor used to be discovering what breaks where. An agent with access to the codebase, the test suite, and a clear milestone plan can do that discovery faster than a human pair can.

What hasn't changed is the risk model. A structural refactor still touches code paths you're not actively thinking about. The only difference is that an AI can touch more of them in a shorter time. That cuts both ways.

Case 1: 82 commits, one pull request

The Plantory.ai refactor was, from the outside, 82 commits on a single branch merged into main. From the inside, it was a sequence of milestones. Claude Code committed each one as soon as its test suite turned green.

The broader product context is in the Plantory.ai case study, and the architecture behind the AI-native build is covered in the Plantory playbook.

"Realistically it took about six hours of clean time, and without AI it would have taken me significantly longer — I'd say easily a month. Why did it split into eighty-two commits? Because AI, specifically Claude Code, committed each single milestone according to the refactor breakdown. Each commit ran my test suite — unit, integration, and end-to-end — to ensure nothing broke. I even wrote tests beforehand, so that after the refactor we'd know everything works the same as it should."

— Prokop Simek, Co-founder at DX Heroes

The shape matters. Eighty-two commits are not eighty-two decisions. They're one decision, the contracts migration and the canvas sweep, expressed as eighty-two safe checkpoints. Each checkpoint is recoverable. Each one has green tests behind it. The branch was merged directly from local because, by the time it was done, there was no uncertainty left to resolve in CI.
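The checkpoint pattern described here can be sketched as a small loop: apply one milestone, run the suite, commit only when it is green, and stop at the first red result. This is a minimal illustration, not the tooling used on the Plantory branch; `Milestone`, `Hooks`, and `runMilestones` are hypothetical names, and the hooks are assumed to wrap whatever test runner and version-control commands a team already uses.

```typescript
// Hypothetical names throughout; nothing here is taken from the Plantory codebase.
interface Milestone {
  name: string;
  apply: () => void; // performs the code change for this milestone
}

interface Hooks {
  runTests: () => boolean;           // assumed to return true only when the full suite is green
  commit: (message: string) => void; // assumed wrapper around the version-control commit
  rollback: () => void;              // assumed to discard uncommitted changes
}

interface RunResult {
  committed: string[];     // milestones that reached a green checkpoint
  failedAt: string | null; // first milestone whose tests went red, if any
}

function runMilestones(milestones: Milestone[], hooks: Hooks): RunResult {
  const committed: string[] = [];
  for (const milestone of milestones) {
    milestone.apply();
    if (!hooks.runTests()) {
      // Red suite: drop the uncommitted work and stop at the last green checkpoint.
      hooks.rollback();
      return { committed, failedAt: milestone.name };
    }
    hooks.commit(`refactor: ${milestone.name}`);
    committed.push(milestone.name);
  }
  return { committed, failedAt: null };
}
```

Because the loop stops at the first failure, every entry in `committed` corresponds to a state in which the suite passed, which is what makes each commit a recoverable checkpoint rather than a separate decision.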

This is the opposite of the "AI wrote 1,000 lines in one PR, please review" failure mode. The agent was structured as a long-running, disciplined executor of a plan, not as an autonomous author.

Case 2: AI-assisted accuracy on a classification system

Our team at Třinecké Železárny is building a technical classification system for industrial inquiries. The system reads images and PDFs and outputs forty parameter values per inquiry. End-to-end accuracy is sitting around 99%. That number did not arrive by accident.

"When AI suggests a change to the scoring logic, we do a human review. We look for signs of overfitting to the specific error we're trying to fix. We often read the entire modified prompt and look for contradictions. We also use AI to help us review that. And end-to-end tests catch regressions — they'll flag if we fixed one class of input and broke another."

— Jakub Vacek, Developer at DX Heroes

That mix of human and AI review raises a question clients ask all the time: how much of this work is AI and how much is human?

"At the start of the project the split was roughly 50/50. The closer we get to production, the more the human share grows. It also varies by part of the codebase — AI doesn't handle complex business logic well, but UI changes are fine."

— Jakub Vacek, Developer at DX Heroes

That split, heavy AI contribution early and heavy human contribution as production approaches, mirrors what we see on every serious engagement. AI is fastest where the problem is understood and the tests are expressive. It slows down or regresses where the problem is underspecified and the tests don't cover the business constraint you actually care about.
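The regression check described in this case, flagging a change that fixes one class of input and breaks another, can be sketched as a per-class accuracy comparison between a baseline run and a candidate run. This is an illustrative sketch under assumptions: the `CaseResult` shape and the class labels are hypothetical, and the real system's test harness is not shown in this article.

```typescript
// Hypothetical result shape: one record per end-to-end test case.
interface CaseResult {
  inputClass: string; // illustrative label, e.g. "pdf" or "photo"
  correct: boolean;
}

type Tally = { ok: number; all: number };

const hasKey = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Accuracy per input class, as a fraction between 0 and 1.
function accuracyByClass(results: CaseResult[]): Record<string, number> {
  const tallies: Record<string, Tally> = {};
  for (const result of results) {
    if (!hasKey(tallies, result.inputClass)) {
      tallies[result.inputClass] = { ok: 0, all: 0 };
    }
    const tally = tallies[result.inputClass];
    tally.all += 1;
    if (result.correct) tally.ok += 1;
  }
  const accuracy: Record<string, number> = {};
  for (const cls of Object.keys(tallies)) {
    accuracy[cls] = tallies[cls].ok / tallies[cls].all;
  }
  return accuracy;
}

// Classes where the candidate is worse than the baseline by more than `tolerance`.
function regressedClasses(
  baseline: CaseResult[],
  candidate: CaseResult[],
  tolerance: number = 0
): string[] {
  const before = accuracyByClass(baseline);
  const after = accuracyByClass(candidate);
  return Object.keys(before)
    .filter((cls) => {
      // A class missing from the candidate run is treated as fully failing.
      const now = hasKey(after, cls) ? after[cls] : 0;
      return now < before[cls] - tolerance;
    })
    .sort();
}
```

Comparing per class rather than overall matters because a change can leave the aggregate accuracy unchanged while moving errors from one kind of input to another.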

What holds the system together

Across both projects, three things do the load-bearing work: tests, contracts, and a review gate.

Tests. Not "tests exist": expressive tests that fail loudly on the thing you care about. The Plantory branch merged safely because the combined suite, 4,216 web tests, 2,119 API tests, and 128 brand-new contract tests, actually exercised the refactored surfaces. The Třinecké Železárny (TRZ) classifier holds roughly 99% accuracy because end-to-end tests catch regressions across the parameter space, not just on the happy path.

Contracts. The Plantory refactor centered on making Zod schemas the single source of truth for API contracts. The class-validator library was banned. A CI gate enforces the single-source rule. That matters more than the line count: after the refactor, the contract is executable, not documentary. If the runtime drifts, CI fails. If a change proposal drifts from the contract, the PR fails.

Review gate. Nothing reaches main without a human signing off on the structural intent. An AI can write a forty-parameter classifier. It cannot decide whether adding a forty-first parameter is worth the business complexity. That call stays with a human reviewer looking at the diff, the tests, and the prompt.

If any one of these three is weak, the refactor should not leave your laptop.
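To make "executable, not documentary" concrete, here is a minimal, dependency-free sketch of a schema acting as the single source of truth: the static type is derived from the same definition that performs the runtime check. In the Plantory refactor this role is played by Zod schemas; the tiny validator helpers below are hypothetical stand-ins and are not Zod's API, and the `zoneSchema` fields are illustrative only.

```typescript
// Hypothetical helpers; these stand in for a schema library and are not Zod's API.
type Validator<T> = (value: unknown) => value is T;
type Schema = Record<string, Validator<any>>;

// Derive the static type from the schema, so the type cannot drift from the runtime check.
type Infer<S extends Schema> = {
  [K in keyof S]: S[K] extends Validator<infer T> ? T : never;
};

const isString = (value: unknown): value is string => typeof value === "string";
const isFiniteNumber = (value: unknown): value is number =>
  typeof value === "number" && isFinite(value);

// The single definition. Field names are illustrative only.
const zoneSchema = { id: isString, x: isFiniteNumber, y: isFiniteNumber };
type Zone = Infer<typeof zoneSchema>; // { id: string; x: number; y: number }

// Returns one message per field that violates the schema; empty means it conforms.
function contractErrors(schema: Schema, input: unknown): string[] {
  if (typeof input !== "object" || input === null) return ["expected an object"];
  const record = input as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of Object.keys(schema)) {
    if (!schema[key](record[key])) errors.push(`invalid field: ${key}`);
  }
  return errors;
}

function conforms<S extends Schema>(schema: S, input: unknown): input is Infer<S> {
  return contractErrors(schema, input).length === 0;
}

function describeZone(input: unknown): string {
  if (!conforms(zoneSchema, input)) {
    return "rejected: " + contractErrors(zoneSchema, input).join(", ");
  }
  const zone: Zone = input; // narrowed by the same schema that defines Zone
  return `zone ${zone.id} at (${zone.x}, ${zone.y})`;
}
```

The design point is that there is no second, hand-written `Zone` interface to keep in sync: changing `zoneSchema` changes both the compile-time type and the runtime check, so drift shows up as a failing build or a failing validation rather than as stale documentation.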

Where AI genuinely falls short: a learning moment

One part of the Plantory refactor is worth singling out because it maps directly to a failure mode enterprise clients ask about. A canvas sub-feature (A2.3) passed its tests after refactoring, then revealed a bug in production: zones were being placed hundreds of meters off-canvas.

"With canvas A2.3 the AI refactor passed tests but broke the contract. What happened there is that I didn't have sufficient test coverage. I let the AI — specifically Claude Opus 4.7 — write the tests first and then refactor. That part worked. What was interesting is that Opus 4.6 had told me previously that coverage was sufficient, but it was actually only 40%. It apparently skipped running some commands. Opus 4.7 caught that. So the failure wasn't 'AI refactored and broke something.' It was 'an earlier model told me I was safe when I wasn't.'"

— Prokop Simek, Co-founder at DX Heroes

The lesson is not that one model is better than another. The lesson is that AI-reported coverage is not coverage. If you're going to let AI drive a structural refactor, you need an independent signal: a command that actually runs, a report that you read with your own eyes, and a CI gate that enforces the number. An agent that confidently says "coverage is adequate" is not the same as coverage being adequate. Make the tool prove it.
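One way to "make the tool prove it" is a gate that works from raw counts in a coverage report instead of from anyone's summary of it. The sketch below assumes totals shaped like the `total` entry of an Istanbul-style `json-summary` report; the `CoverageTotals` interface, the `coverageFailures` function, and the strict treatment of empty metrics are assumptions made for illustration.

```typescript
// Assumed shape: the `total` entry of an Istanbul-style json-summary report.
interface Metric {
  total: number;
  covered: number;
}

interface CoverageTotals {
  lines: Metric;
  statements: Metric;
  functions: Metric;
  branches: Metric;
}

// Returns one message per metric that misses the threshold; empty means the gate passes.
function coverageFailures(totals: CoverageTotals, minPercent: number): string[] {
  const names: Array<keyof CoverageTotals> = ["lines", "statements", "functions", "branches"];
  const failures: string[] = [];
  for (const name of names) {
    const metric = totals[name];
    if (metric.total === 0) {
      // Deliberately strict assumption: an empty metric may mean the run was skipped.
      failures.push(`${name}: nothing was measured`);
      continue;
    }
    // Recompute the percentage from raw counts rather than trusting a reported figure.
    const percent = (metric.covered / metric.total) * 100;
    if (percent < minPercent) {
      failures.push(`${name}: ${percent.toFixed(1)}% is below ${minPercent}%`);
    }
  }
  return failures;
}
```

In CI, a script along these lines would read the report file written by the coverage run and exit non-zero when the returned list is not empty. Many test runners also offer built-in coverage thresholds, which serve the same purpose; the point is that the number comes from a command that actually ran.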

A five-step playbook for a safe big-bang refactor with AI

Based on what we're seeing work and what we're seeing fail, here's the sequence we recommend to clients considering an AI-led structural refactor:

1. Write the tests first, against the contract, not the code. Before touching the implementation, make sure your suite fails for any behavior change that matters to the business. If you can't name the tests that guard the refactor, you're not ready to start.
2. Define milestones, not tasks. Give the agent a sequence of green-tests checkpoints, each one recoverable. The agent commits per milestone. You review per milestone. Eighty-two safe commits beats one big unreviewable one.
3. Make contracts executable. Runtime validation, schema-as-source-of-truth, CI gates that fail on drift. If your contracts are documentary rather than executable, a refactor is a bet, not a plan.
4. Independent coverage verification. Do not accept an AI's claim about test coverage. Run the coverage command. Read the number. Gate CI on it. Assume any self-reporting by the model is unreliable on this one specific question.
5. Review the diff and the prompt. For changes to business logic, don't just review the generated code. Read the prompt that produced it. Overfitting to a specific failure is often visible in the prompt before it shows up in the code.
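For the first step, one common way to pin behavior before a refactor is a characterization check: run the old and the new implementation on the same inputs and report every difference. The sketch below is a generic illustration, not the test setup used on either project; `behaviorDiff` and the two clamp functions are hypothetical.

```typescript
interface Mismatch<I, O> {
  input: I;
  expected: O; // what the legacy implementation returned
  actual: O;   // what the refactored implementation returned
}

function behaviorDiff<I, O>(
  legacy: (input: I) => O,
  refactored: (input: I) => O,
  inputs: I[]
): Mismatch<I, O>[] {
  const mismatches: Mismatch<I, O>[] = [];
  for (const input of inputs) {
    const expected = legacy(input);
    const actual = refactored(input);
    // Structural comparison via JSON; adequate for plain data, sensitive to key order.
    if (JSON.stringify(expected) !== JSON.stringify(actual)) {
      mismatches.push({ input, expected, actual });
    }
  }
  return mismatches;
}

// Hypothetical example pair: the refactored version mishandles negative input.
const legacyClamp = (n: number): number => Math.min(100, Math.max(0, n));
const refactoredClamp = (n: number): number => (n > 100 ? 100 : n);
```

Calling `behaviorDiff(legacyClamp, refactoredClamp, [-5, 0, 50, 150])` reports a single mismatch, for the input `-5`. The check is only as good as its inputs: boundary values and the cases the business cares about need to be in the list, which is the "name the tests that guard the refactor" requirement in practice.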

Compounding interest, not a revolution

The headline from the Plantory branch isn't "AI did eighty-two commits in six hours." It's that the team had a test suite expressive enough, contracts strict enough, and review discipline sharp enough that eighty-two commits could land safely at all. The refactor is a symptom of good infrastructure. Without that infrastructure, the same eighty-two commits would be a production incident in waiting.

Clients ask us how to get there. Our honest answer is that the payoff on tests, contracts, and review gates compounds: slowly at first, then suddenly. The teams shipping six-hour refactors today are teams that paid the discipline tax two years ago.

If your team is thinking about a structural refactor and you want an honest read on whether the guardrails are in place before you give an agent the keys, get in touch. We help engineering teams go from "we're trying AI" to "we can ship a big refactor before lunch" without skipping the parts that make it safe.
