AI-Assisted Legacy Modernization Works – When the Workflow Does
How engineering teams turn AI from a coding shortcut into a structured delivery system –
and what phased, human-first modernization looks like in practice.

AI-assisted legacy modernization is genuinely effective – when the workflow around it is deliberately designed. That statement deserves to be said plainly, because the engineering conversation has become oddly polarized between vendors promising complete automation and sceptics cataloguing every failure mode. The practical reality sits in neither camp. A structured AI refactoring workflow produces measurable productivity improvements of 30–60%, which can be the difference between a modernization program that delivers in months and one that drifts for years. The gap is not in the AI. It is in whether the workflow has been built.
This article explains what that workflow design looks like – specifically, the phased, human-first methodology that consistently unlocks AI's real contribution to legacy modernization, and why the sequence and governance of that process matter more than any model selection decision.
1 · AI Delivers – on the Right Refactoring Tasks
AI coding tools succeed at atomic refactoring tasks more than 80% of the time – and fail at compound tasks roughly six times out of ten. A February 2026 study from Concordia University tested large language models – including GPT-5.1-Codex with full repository access – against a structured benchmark. On atomic, single-operation refactorings, the success rate was 82.6%. On compound, multi-step transformations requiring coordinated changes across files, call sites, and module boundaries, it was 39.4%. CodeScene's independent replication reached 37%.
This is not a reason to distrust AI tools. It is a reason to understand their scope. Compound refactoring is not a single-model problem – it is a systems problem that requires human architectural judgment to sequence, scope, and validate. When that judgment is present, when the workflow defines which tasks AI performs and under what governance, the 80%+ success rate on bounded tasks compounds into substantial delivery acceleration across a full modernization program.

Teams that adopted AI without a governing workflow saw refactoring activity collapse – not improve. GitClear's longitudinal study of 211 million changed lines of code found that refactoring as a share of code changes dropped from 25% in 2021 to under 10% by 2024 in those teams, while code duplication rose from 8.3% to 12.3%. The constructive implication: teams with structured workflows maintained refactoring discipline and used AI to accelerate it. The tool is the same. The workflow makes the difference.
2 · Why AI Refactoring Underdelivers – The Workflow Gap
Most AI-assisted refactoring underperformance is caused by five missing workflow elements, not by model limitations. Engineering teams that adopt AI tools and find the results underwhelming describe the same experience: AI helps with writing code, but not with the actual refactoring. This is a precise diagnosis. AI coding tools were designed for greenfield, single-developer, file-scoped work. Legacy modernization is the opposite: inherited, collaborative, architecturally complex, and coupled in undocumented ways.
Five structural absences explain most of the underperformance. The first is the absence of a shared context – when every engineer prompts AI independently, without a common architecture document, output varies dramatically and conflicts with existing codebase patterns. The second is the absence of a phased sequence – AI is used wherever convenient rather than at defined points in a structured process, so the highest-risk modules receive the least structured attention. The third is the absence of review gates – without explicit validation criteria for AI-generated refactoring output, errors accumulate silently. The fourth is the absence of a behavioral baseline – without golden master tests locking current system behavior before structural changes begin, there is no way to verify that refactoring preserved what it was supposed to preserve. The fifth is the absence of system-level measurement – individual developer velocity is tracked while change failure rate and deployment frequency stay flat or worsen.
None of these absences are about the AI model. All of them are about engineering process. And all of them are directly addressable through phased workflow design.
"AI is a force multiplier for well-structured teams – and a productivity trap for undisciplined ones."
GitLab Global DevSecOps Report 2025/26
3 · Phased, Human-First AI Modernization – What It Looks Like in Practice
A phased, human-first AI modernization workflow defines where AI operates, on what tasks, in what sequence, and under what human governance – producing 30–60% productivity improvements that ad-hoc AI adoption does not. The organizing principle is straightforward: humans direct the work; AI accelerates execution. In practice, this requires a deliberate five-phase structure where AI's role in each phase is explicitly defined, not left to individual discretion.

Phase 1 · Scope and risk mapping – Human-led.
The "do-not-break" list – public interfaces, regulatory constraints, performance SLAs, and business-critical flows – must be defined before any AI tool is opened. This institutional knowledge exists nowhere in the codebase. Legacy systems encode decades of decisions that AI cannot reconstruct from static analysis. Skipping this phase is the single most common cause of production incidents in AI-assisted modernization programs.
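One way to make the "do-not-break" list operational is to store it as data and check every proposed change against it. The sketch below is illustrative only: the `ProtectedArea` type, the path patterns, and the `protected_hits` helper are hypothetical names, and a real list would come out of the human-led scoping session rather than from code.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ProtectedArea:
    pattern: str  # glob over repository paths (hypothetical layout)
    reason: str   # why engineers flagged this area in Phase 1


# Hypothetical entries mirroring the categories named in the text.
DO_NOT_BREAK = [
    ProtectedArea("api/public/*", "public interface"),
    ProtectedArea("billing/*", "business-critical flow"),
    ProtectedArea("reporting/regulatory/*", "regulatory constraint"),
]


def protected_hits(changed_paths):
    """Return (path, reason) pairs for changed files inside a protected area."""
    hits = []
    for path in changed_paths:
        for area in DO_NOT_BREAK:
            # fnmatchcase's "*" also matches "/" so nested files are covered.
            if fnmatchcase(path, area.pattern):
                hits.append((path, area.reason))
    return hits


if __name__ == "__main__":
    for path, reason in protected_hits(["billing/engine/invoice.py", "docs/readme.md"]):
        print(f"needs human sign-off: {path} ({reason})")
```

A check like this does not replace the scoping conversation; it only keeps its outcome visible each time a change is proposed.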
Phase 2 · Dependency mapping – AI-accelerated.
This is where AI delivers its clearest early win in legacy modernization. Profiling repositories, generating dependency graphs, summarizing module responsibilities – work that previously took weeks compresses to days. Critical qualification: AI-generated dependency maps must be validated against production runtime behavior before use. Static analysis and runtime behavior diverge in most legacy systems. The AI produces the draft; engineers validate it.
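The validation step can be pictured as a set comparison between the AI-drafted dependency edges and the edges actually observed in production. This is a minimal sketch under assumptions: edges are modelled as (caller, callee) pairs, the module names are invented, and how runtime edges are collected (tracing, logs, APM) is left out entirely.

```python
def compare_dependency_maps(static_edges, runtime_edges):
    """Split dependency edges by whether static draft and runtime data agree."""
    static_edges = set(static_edges)
    runtime_edges = set(runtime_edges)
    return {
        # Seen in both sources: safe to rely on.
        "confirmed": static_edges & runtime_edges,
        # Observed in production but absent from the AI draft
        # (reflection, config-driven wiring, and similar).
        "missing_from_draft": runtime_edges - static_edges,
        # In the draft but never observed; possibly dead code,
        # possibly a rarely exercised path. Needs an engineer's call.
        "unobserved_at_runtime": static_edges - runtime_edges,
    }


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    draft = [("orders", "billing"), ("orders", "inventory")]
    observed = [("orders", "billing"), ("billing", "ledger")]
    for bucket, edges in compare_dependency_maps(draft, observed).items():
        print(bucket, sorted(edges))
```

The two disagreement buckets are where engineering review time is best spent, since they mark the places where static analysis and runtime behavior diverge.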
Phase 3 · Behavioral baselining – Human-reviewed, AI-supported.
Before any structural change is committed, current system behavior must be locked with golden master tests – characterization tests that capture what the system does, including unexpected behaviors. These provide the regression safety net that makes AI-assisted refactoring verifiable. AI can propose test fixtures and edge cases; human review ensures tests assert the right invariants, not just that code runs.
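A golden master test can be very small. The sketch below records the current outputs of a function on a first run and compares against that recording on later runs. Everything in it is hypothetical: `legacy_invoice_total` is a stand-in for real legacy code, and `check_golden_master` is an illustrative helper rather than a specific library's API.

```python
import json
from pathlib import Path


def legacy_invoice_total(amount_cents, customer_type):
    """Stand-in for legacy code, including an undocumented quirk."""
    if amount_cents < 0:
        return 0  # quirk: negative amounts are silently zeroed
    if customer_type == "wholesale":
        return amount_cents - amount_cents // 10
    return amount_cents


def check_golden_master(func, cases, snapshot_path):
    """Record outputs on the first run; afterwards return the mismatching cases."""
    actual = {json.dumps(case): func(*case) for case in cases}
    path = Path(snapshot_path)
    if not path.exists():
        # First run: lock in what the system does today, quirks included.
        path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return []
    expected = json.loads(path.read_text())
    return [
        key for key, value in actual.items()
        if key not in expected or expected[key] != value
    ]


if __name__ == "__main__":
    cases = [(1000, "retail"), (1000, "wholesale"), (-50, "retail")]
    mismatches = check_golden_master(legacy_invoice_total, cases, "golden_invoice.json")
    print("behavior changed for:", mismatches)
```

The negative-amount case is the kind of input human review should insist on: a refactoring that "cleans up" the quirk would pass a naive test suite but fail against the recorded baseline.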
Phase 4 · Incremental refactoring – AI-directed, human-governed.
This phase delivers the highest AI productivity contribution in the workflow: mechanical extractions, component library conversions, syntax upgrades, test suite generation for isolated modules. Each pull request carries one clear architectural intent. CI regression gates are mandatory before merge. Architectural sequencing – what to refactor next and in what order – is a human decision made in regular architecture sessions. AI executes decisions that humans have already made.
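The two merge rules in this phase, one architectural intent per pull request and a mandatory CI regression gate, can be expressed as a simple pre-merge check. This is a sketch, not a real CI integration: the `PullRequest` fields and the `merge_blockers` function are assumptions, and in practice the inputs would come from the team's own review tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PullRequest:
    title: str
    intents: List[str]             # declared architectural intents (hypothetical labels)
    regression_suite_passed: bool  # result of the CI regression gate


def merge_blockers(pr: PullRequest) -> List[str]:
    """Return the reasons a PR may not merge; an empty list means it may."""
    blockers = []
    if len(pr.intents) != 1:
        blockers.append(
            f"expected exactly one architectural intent, found {len(pr.intents)}"
        )
    if not pr.regression_suite_passed:
        blockers.append("CI regression gate has not passed")
    return blockers


if __name__ == "__main__":
    pr = PullRequest("Big cleanup", ["extract-method", "rename-module"], False)
    for reason in merge_blockers(pr):
        print("blocked:", reason)
```

Returning reasons rather than a bare boolean keeps the gate explainable, which matters when reviewers need to decide how to split an oversized change.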
Phase 5 · Controlled release and measurement – Human-governed.
Modernization outcomes are measured at the system level: change failure rate, mean time to recovery, lead time to change. Canary rollouts and documented rollback plans make incremental delivery safe to sustain continuously. If individual velocity rises while system metrics stay flat, the workflow needs adjustment – a human judgment, informed by data.
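The system-level metrics named above can be computed from a plain log of deployments. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; real programs would typically pull these from their delivery pipeline, and exact metric definitions vary between teams.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def system_metrics(deployments):
    """Compute change failure rate, mean lead time, and MTTR (both in hours)."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    failures = [d for d in deployments if d.failed]
    lead = [_hours(d.deployed_at - d.committed_at) for d in deployments]
    recovery = [_hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at]
    return {
        "change_failure_rate": len(failures) / len(deployments),
        "mean_lead_time_hours": sum(lead) / len(lead),
        "mttr_hours": sum(recovery) / len(recovery) if recovery else None,
    }


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1)
    day = timedelta(hours=24)
    log = [Deployment(t0, t0 + day) for _ in range(3)]
    log.append(Deployment(t0, t0 + day, failed=True, restored_at=t0 + day + timedelta(hours=2)))
    print(system_metrics(log))
```

Tracking these numbers alongside individual velocity is what makes the "velocity up, system flat" pattern visible in the first place.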
4 · What Structured AI Workflow Delivers – Three Engagement Examples
Phased, human-first AI workflow achieves approximately 85% project success rates against modernization objectives – compared to roughly 30% for big-bang rewrites and inconsistent results for ad-hoc AI adoption. Time to first measurable delivery value lands in the two-to-four-month range. The 30–60% productivity improvements claimed for AI coding tools are achievable, but specifically through structured workflow implementation – not through tool deployment alone.
Three brief examples from Altimi engagements illustrate what that looks like in practice.
A fintech platform with a nine-year-old payment core. Eight months of AI tool adoption had produced higher individual output velocity and a worsening system picture: no behavioral baseline, billing engine test coverage at 14% behind a headline figure of 76%, and three silent regressions reaching production in a single quarter. The structured intervention established a golden master test suite covering critical payment paths, a runtime-validated dependency map, and a sequenced 90-day plan isolating the billing engine behind a stable API boundary. Change failure rate dropped from 18% to 5% in the following quarter. AI did not change. The workflow did.
An enterprise SaaS with no review gates. PR cycle time climbed from six days to nineteen over eight months of growing AI adoption – because AI-generated output was being merged faster than teams could review it, with no structured criteria for evaluating compound refactoring output. A three-tier review protocol targeting AI's specific failure modes, combined with a single-intent PR policy and mandatory CI regression gates, returned cycle time to seven days within six weeks. Deployment frequency tripled. The same AI tools were in use throughout.
A B2B platform attempting monolith decomposition. Eighteen months of AI usage directly against a tightly coupled monolith produced limited structural improvement. The reason: AI cannot effectively modernize a tightly coupled monolith directly, but it can very effectively modernize a module once that module has been isolated behind a stable API contract. Establishing a Strangler Fig boundary before continuing AI-assisted decomposition was the intervention. The first isolated module was delivered in three weeks. Twelve followed over five months.
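For readers unfamiliar with the Strangler Fig boundary mentioned in the third example, the core mechanic is a routing facade: callers talk to one stable entry point, and routes are moved from the legacy implementation to new modules one at a time. The sketch below is a generic illustration with hypothetical names (`StranglerFacade`, the route strings, the handlers); it is not a description of the engagement's actual implementation.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class StranglerFacade:
    """Stable entry point that routes to migrated modules, else to the legacy system."""

    def __init__(self, legacy: Handler):
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, route: str, handler: Handler) -> None:
        # Called once a module has been isolated, baselined, and verified.
        self._migrated[route] = handler

    def handle(self, route: str, request: dict) -> dict:
        handler = self._migrated.get(route, self._legacy)
        return handler(request)


def legacy_monolith(request: dict) -> dict:
    return {"served_by": "legacy", "echo": request}


def new_invoicing_module(request: dict) -> dict:
    return {"served_by": "invoicing-v2", "echo": request}


if __name__ == "__main__":
    facade = StranglerFacade(legacy_monolith)
    print(facade.handle("/invoices", {"id": 1})["served_by"])
    facade.migrate("/invoices", new_invoicing_module)
    print(facade.handle("/invoices", {"id": 1})["served_by"])
    print(facade.handle("/reports", {"id": 2})["served_by"])
```

Because callers only ever see the facade, each migrated route becomes a bounded, independently verifiable unit, which is the kind of task where AI assistance performs well.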
5 · Human-First Is Not a Constraint – It's the Architecture
"Human-first AI" describes an architectural choice about where human judgment lives in the delivery process – and teams that make that choice deliberately get better outcomes than teams that leave it implicit. The phrase can sound like a hedge, a way of qualifying enthusiasm for tools not quite ready to operate autonomously. In legacy modernization, it is the opposite: the deliberate design decision that makes AI investment productive.
Martin Fowler's articulation of the Strangler Fig pattern identifies four activities that create conditions for any successful AI-assisted modernization: understand the outcomes the work needs to achieve, break the problem into smaller independently deliverable parts, execute those parts with quality gates, and evolve the engineering organization around the new delivery model. None of these are code generation tasks. They require system-level thinking, stakeholder alignment, and architectural decision-making under uncertainty – capabilities that currently live with engineers, not models. AI is most valuable when it operates inside the structure those four activities create. It is least valuable – and most risky – when deployed before that structure exists.
The design skill at the center of this work – recognizing when a structure no longer fits the problem, knowing what better structure looks like, and sequencing the path from current to target state while keeping the system running – has not been replaced by AI. It has become more important. Engineers who build this skill deliberately, and who direct AI tools toward architecturally meaningful outcomes rather than locally coherent fragments, are the ones whose organizations will realize the productivity improvements the tooling offers.
"The workflow is the product. The AI is the tool that makes the workflow faster."
Engineering Lead, enterprise SaaS modernization engagement
The competitive moat in legacy modernization is not the AI model – that is commoditizing rapidly. It is the structured engineering methodology that makes any sufficiently capable model perform reliably, phase by phase, with human judgment at every decision point that matters. That methodology is buildable. It is learnable. And it produces results that ad-hoc AI adoption, however enthusiastic, consistently does not.
---
A Note from the Author
I wrote this article because the conversations I keep having with engineering leaders and technical decision-makers across modernization programs point to the same pattern: the AI tools are rarely the problem. The workflow around them almost always is.
Over the past two years, I have seen teams adopt capable models, invest in tooling, and still end up with rising change failure rates, growing code duplication, and modernization timelines that drift. The gap is not in the technology. It is in the absence of a deliberate delivery structure – one that defines where AI operates, on what tasks, and under what human governance.
That is what this article is about. Not a pitch for any particular tool or platform – but a practical account of what phased, human-first methodology looks like when it is actually built, and what it consistently delivers when it is.
If the workflow gaps in Section 2 sound familiar, or if a modernization program you are running is producing individual velocity without system-level improvement – I am happy to compare notes.
– Miłosz Ciupiał, Head of Delivery @ Altimi
Why does AI-assisted legacy modernization often fail to deliver?
AI-assisted legacy modernization fails primarily because of missing workflow structure, not model limitations. The five most common gaps are: no shared architectural context injected at AI interactions, no phased delivery sequence, no review gates for AI-generated output, no behavioral baselining before structural changes begin, and no system-level measurement (change failure rate, MTTR) beyond individual developer velocity.
What is a human-first AI refactoring workflow?
A human-first AI refactoring workflow is a phased delivery methodology where engineers direct architectural decisions, sequencing, and risk governance – while AI accelerates bounded execution tasks within each phase. In practice: humans define scope and risk (Phase 1), AI accelerates dependency mapping (Phase 2), humans review behavioral baselines (Phase 3), AI executes incremental refactoring under CI gates (Phase 4), and humans govern release and measurement (Phase 5).
What is behavioral baselining in legacy modernization?
Behavioral baselining is the practice of locking current system behavior with golden master (characterization) tests before any structural code changes begin. It captures what the system does – including unexpected or undocumented behaviors – and provides a regression safety net that makes it possible to verify whether refactoring preserved the intended behavior. Skipping this step is one of the most common causes of silent regressions in AI-assisted refactoring programs.
How much productivity improvement can structured AI workflow deliver?
A structured AI refactoring workflow consistently produces 30–60% productivity improvements in legacy modernization programs, with time to first measurable delivery value in the two-to-four-month range. On project outcomes, phased, human-first workflows reach approximately 85% success against modernization objectives, compared with roughly 30% for big-bang rewrites and inconsistent results for ad-hoc AI adoption without workflow structure. The productivity gain is specific to structured workflow implementation – not to AI tool deployment alone.
What is the Strangler Fig pattern and why does it matter for AI modernization?
The Strangler Fig pattern, articulated by Martin Fowler, is an architectural approach to legacy modernization where new functionality is built around the edges of an existing system, gradually replacing it while keeping the legacy system running. It matters critically for AI-assisted modernization because AI tools cannot effectively refactor a tightly coupled monolith – but they can very effectively modernize modules that have been isolated behind stable API contracts using the Strangler Fig approach. Establishing this boundary before applying AI is a prerequisite, not an optional step.
What metrics should engineering teams track during AI-assisted refactoring?
Engineering teams should track system-level DORA metrics: change failure rate (target: below 5–10%), mean time to recovery (MTTR), lead time to change, and deployment frequency – alongside test coverage on critical paths and PR cycle time. Individual developer velocity is insufficient as a standalone metric; it can rise while system health worsens if AI generates high-volume output without adequate review gates and regression testing.



