Selenium → Playwright Migration for an Enterprise QE Programme

Engagement context

I was engaged as Test Architect on a multi-client delivery programme at a large Australian systems integrator. It was one of several programmes delivered during a six-year QA leadership tenure that spanned the organisation's cross-company test function. The QE function had inherited a five-year-old Selenium-Java test suite running across multiple product lines: TestNG and Maven, a self-hosted Selenium Grid on a rack of VMs, and wall-clock regression times that made daily gating impractical. The suite carried all the marks of accumulated technical debt: XPath chains tied to DOM depth rather than semantics, explicit-wait blocks scattered across every page object, a parallel-run setup that produced concurrent-login collisions, and a flake rate in the ten-to-twenty percent band that had eroded developer trust to the point where CI failures were routinely ignored. The grid also carried a real infrastructure cost, visible on the project ledger. The business asked for faster, cheaper, and more reliable testing, without losing release confidence during the transition.

Framework redesign over literal port — with explicit exit criteria at every gate. No big-bang cutover; release confidence was never staked on Playwright until it had earned it.

The win

"I redesigned and delivered the migration from Selenium-Java to Playwright-TypeScript as a full framework redesign — not a line-for-line port. That meant establishing a four-tier locator strategy, a storageState auth model that eliminated per-test login overhead and concurrent-login flake, an in-process worker parallelism model that replaced the grid entirely, and a phased cutover plan with explicit pass-rate exit criteria at every gate. The result was a materially faster, substantially less flaky suite that the team trusted enough to run multiple times daily rather than once overnight. The grid was decommissioned. The infra cost came off the books. The team shipped the Playwright framework as a pattern they owned, not a black box I handed them."

Phase 0 — Assessment and cull (Weeks 1–3)

The first three weeks were not about writing Playwright code. They were about understanding what existed and deciding what deserved to be migrated.


I ran the suite five times in succession and recorded per-test pass/fail patterns, producing three cohorts: always-pass, always-fail, and flaky-intermittent. Within the flaky cohort, I classified by root cause using the 4-R taxonomy — race and timing issues, resource and state contamination, downstream reliability, and UI rendering sync gaps. That classification immediately revealed that the majority of flake was in the timing category: explicit-wait chains written defensively that masked the actual synchronisation problem rather than solving it.
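Once per-test outcomes are extracted, splitting five runs into cohorts is a mechanical step. The minimal TypeScript sketch below assumes the TestNG results have already been parsed into a map from test id to one pass/fail outcome per run. The types and function names are hypothetical, and the 4-R root-cause tagging of the flaky cohort remains a human judgement.

```typescript
type Outcome = "pass" | "fail";
type Cohort = "always-pass" | "always-fail" | "flaky-intermittent";

// Hypothetical input shape: test id -> one outcome per successive run of the same suite.
type RunHistory = Record<string, Outcome[]>;

function classifyRuns(runs: RunHistory): Record<string, Cohort> {
  const cohorts: Record<string, Cohort> = {};
  for (const testId of Object.keys(runs)) {
    const outcomes = runs[testId];
    if (outcomes.length === 0) {
      throw new Error(`No runs recorded for ${testId}`);
    }
    const passes = outcomes.filter((o) => o === "pass").length;
    if (passes === outcomes.length) {
      cohorts[testId] = "always-pass";
    } else if (passes === 0) {
      cohorts[testId] = "always-fail";
    } else {
      // Mixed results across identical runs: this test goes forward for 4-R root-cause review.
      cohorts[testId] = "flaky-intermittent";
    }
  }
  return cohorts;
}

// Fraction of tests that were intermittent across the recorded runs.
function flakeRate(cohorts: Record<string, Cohort>): number {
  const ids = Object.keys(cohorts);
  if (ids.length === 0) return 0;
  return ids.filter((id) => cohorts[id] === "flaky-intermittent").length / ids.length;
}
```

Any test with mixed results across identical runs is flaky by definition. The sketch therefore needs no threshold, only the run count.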

The inventory phase also surfaced a meaningful proportion of tests that were testing backend validation logic through the browser — wrong-layer tests that belonged at the API or unit level, not in a Selenium suite. I classified these for pull-down rather than migration, trimming the migration scope materially before work began.

The Phase 0 output was a framework decision brief (ADR) covering four architectural choices: locator strategy (role-first, then text, then test-id, then CSS, with XPath banned except for edge cases), auth strategy (storageState fixtures per role), parallelism model (workers-per-runner with per-test browser contexts), and CI shape (containerised runners, sharded with --shard=i/N). Decisions in writing, before the first Playwright line, meant the migration cohorts later had a clear standard to review against.
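The locator priority order is easier to enforce at review time if tooling can rank a locator expression mechanically. The sketch below shows a hypothetical review helper that operates on source strings. `getByRole`, `getByText`, `getByTestId`, and `locator` are real Playwright methods; the classifier itself is illustrative and regex-based, and a production check would more likely inspect the AST. It relies on Playwright's rule that selectors prefixed with `xpath=`, or starting with `//` or `..`, are treated as XPath.

```typescript
// 1 = most preferred tier; "banned" marks XPath, allowed only as a documented exception.
type LocatorTier = 1 | 2 | 3 | 4 | "banned" | "unknown";

function locatorTier(expr: string): LocatorTier {
  // Any XPath segment makes the whole expression non-compliant.
  if (/\blocator\(\s*['"`](xpath=|\/\/|\.\.)/.test(expr)) return "banned";
  const found: number[] = [];
  if (/\bgetByRole\(/.test(expr)) found.push(1);
  if (/\bgetByText\(/.test(expr)) found.push(2);
  if (/\bgetByTestId\(/.test(expr)) found.push(3);
  if (/\blocator\(/.test(expr)) found.push(4); // CSS or other engine selectors
  if (found.length === 0) return "unknown";
  // A chained locator is only as resilient as its weakest segment.
  return found.reduce((a, b) => Math.max(a, b)) as 1 | 2 | 3 | 4;
}
```

A reviewer or CI step could then reject any PR line scoring `"banned"` and ask for justification on tier 4.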

Phase 0 — Test inventory classification

Cohort A: Always-pass

Retained for migration as genuine user-journey E2E

Cohort B: Flaky-intermittent (4-R classified)

Race/timing — dominant category; root cause: explicit-wait sprawl
Resource/state — concurrent login contention; addressed by storageState model
Downstream reliability — third-party dependencies; addressed by controlled test data
Rendering sync — addressed by Playwright auto-wait

Cohort C: Always-fail / dead

Archived before migration starts

Cohort D: Wrong-layer (backend logic through browser)

Pulled down to API / unit test layer; not migrated

Phase 1 — Pilot and pattern lock (Weeks 3–7)

The pilot scope was thirty to fifty tests covering one critical user journey end-to-end. The constraint was deliberate: small enough to move fast, representative enough to validate the framework choices from the ADR.

Decisions in writing before the first Playwright line. Each rejected option was faster to build and worse to live with — the ADR is where the migration budget is actually spent.

The greenfield Playwright framework was TypeScript from the start — no Java binding, no compromises. Page objects used lazy locator getters rather than eagerly-resolved element fields, which meant locators resolved at action time and avoided the stale-element class of failure that Selenium POMs routinely produced. Auth fixtures used storageState — a global-setup script authenticated once per role, serialised the browser storage state to playwright/.auth/, and tests consumed it via test.use({ storageState }). Per-test login cost and concurrent-login contention were structurally eliminated.
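The sketch below is a minimal `playwright.config.ts` for the auth and parallelism wiring described above. It assumes a hypothetical `./global-setup.ts` that logs in once per role and writes files such as `playwright/.auth/admin.json` via `page.context().storageState({ path })`. The option names are Playwright's; the file names, role names, and worker count are illustrative.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests',
  // Hypothetical script: authenticates once per role and serialises storage state to playwright/.auth/.
  globalSetup: './global-setup',
  // Each test already gets a fresh browser context; this also parallelises tests within a file.
  fullyParallel: true,
  // Illustrative worker count on CI runners; locally Playwright chooses a default.
  workers: process.env.CI ? 4 : undefined,
  reporter: [['html', { open: 'never' }]],
});
```

A spec then opts into a role with `test.use({ storageState: 'playwright/.auth/admin.json' })` at file or `describe` level, so no individual test performs a UI login.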

CI integration ran containerised Playwright workers — no Hub, no Grid, no per-session handshake overhead. Sharding distributed the pilot across runners with --shard=i/N. HTML report artefacts per shard gave reviewers a test-by-test breakdown without digging into CI logs.
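Sharding is a command-line concern. Each runner invokes `npx playwright test --shard=i/N`, where the shard index is 1-based. The helper below is hypothetical and not part of Playwright. It shows how a CI script might generate one command per runner in a matrix.

```typescript
// Builds one argument list per CI runner; Playwright's shard index starts at 1, not 0.
function shardCommands(totalShards: number): string[][] {
  if (totalShards < 1 || Math.floor(totalShards) !== totalShards) {
    throw new Error(`totalShards must be a positive integer, got ${totalShards}`);
  }
  const commands: string[][] = [];
  for (let i = 1; i <= totalShards; i++) {
    commands.push(["npx", "playwright", "test", `--shard=${i}/${totalShards}`]);
  }
  return commands;
}
```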

The pilot exit criteria were explicit: flake rate below two percent and per-test wall-clock within a defined ceiling at the ninety-fifth percentile. Both were met. Pattern lock then followed — page object template, locator template, fixture template, CI workflow template, and a code-review checklist for migration PRs. The checklist was the enforcement mechanism: every migration PR required a sign-off against the pattern before merge. The cheapest moment to reject anti-patterns is the first fifty tests, not the last five hundred.
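The pilot exit gate combines a flake threshold with a 95th-percentile duration check. The minimal sketch below assumes per-test durations in milliseconds and a flake rate already expressed as a fraction. The ceiling is a parameter because its actual value is not stated here. The percentile uses the nearest-rank definition, one of several in common use.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentileNearestRank(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

interface PilotGate {
  maxFlakeRate: number; // e.g. 0.02 for "below two percent"
  p95CeilingMs: number; // hypothetical: the engagement's actual ceiling is not given
}

function pilotExitMet(flakeRate: number, durationsMs: number[], gate: PilotGate): boolean {
  return flakeRate < gate.maxFlakeRate && percentileNearestRank(durationsMs, 95) <= gate.p95CeilingMs;
}
```

Using p95 rather than the mean keeps a few slow outliers from hiding behind many fast tests. They are exactly the tests that drive wall-clock for a sharded run.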

Phase 2 — Parallel suite and ramp (Weeks 7–18)

The ramp phase ran Selenium and Playwright in parallel. Selenium remained the release gate throughout — no release confidence was staked on Playwright until it had earned it. Playwright ran in shadow mode: informational, not blocking, with results visible to the team but not gating deploys.


Phase 2 — per-cohort migration sequence


Each weekly cohort followed a fixed linear sequence, and every migration PR required sign-off against the code-review checklist before it could merge.

Migration ran in weekly cohorts of forty to eighty tests, grouped by feature area. Each cohort was pair-authored: a QA engineer who owned the Selenium tests sat with a migration lead to work through the Playwright rewrite, transferring both the test intent and the domain knowledge. Each cohort's Selenium counterpart was deleted once the Playwright equivalent ran green, so replacement was atomic rather than cumulative. Per-cohort wall-clock and flake metrics were tracked against the Phase 0 Selenium baseline so the improvement signal was visible at every step.
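The atomic-replacement rule, which deletes a Selenium test only once its Playwright equivalent runs green, can be checked per cohort. The sketch below assumes a hypothetical traceability record linking each Playwright test to the Selenium test it replaces. The field and function names are illustrative.

```typescript
interface MigratedTest {
  playwrightId: string;
  replacesSeleniumId: string; // hypothetical traceability field recorded in the migration PR
  greenInCi: boolean;         // latest result for the Playwright equivalent
}

// Splits a cohort's Selenium tests into those safe to delete and those that must stay for now.
function deletionPlan(cohort: MigratedTest[]): { deletable: string[]; retained: string[] } {
  // One Selenium test may be split into several Playwright tests; all of them must be green.
  const allGreen: Record<string, boolean> = {};
  for (const t of cohort) {
    const id = t.replacesSeleniumId;
    const seen = Object.prototype.hasOwnProperty.call(allGreen, id);
    allGreen[id] = (seen ? allGreen[id] : true) && t.greenInCi;
  }
  const ids = Object.keys(allGreen);
  return {
    deletable: ids.filter((id) => allGreen[id]),
    retained: ids.filter((id) => !allGreen[id]),
  };
}
```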

The cutover criterion was specified in advance: Playwright shadow-mode pass rate at or above ninety-eight percent for two consecutive weeks, with coverage equivalence verified on critical user journeys. Writing the criterion before the ramp started meant the cutover decision was data-driven, not political. When the criterion was met, the conversation was short.
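Because the cutover criterion was written down in advance, it can be evaluated as a plain function over weekly shadow-mode results. The minimal sketch below assumes weekly pass rates as fractions in chronological order and a separately verified coverage-equivalence flag for critical journeys. The names are hypothetical.

```typescript
interface WeeklyShadowResult {
  week: string;     // label such as an ISO week
  passRate: number; // fraction of Playwright shadow runs passing, 0..1
}

// True only when the most recent `consecutiveWeeks` weeks all meet the threshold
// and coverage equivalence on critical journeys has been confirmed.
function cutoverReady(
  weeks: WeeklyShadowResult[],
  criticalJourneysCovered: boolean,
  threshold = 0.98,
  consecutiveWeeks = 2,
): boolean {
  if (!criticalJourneysCovered || weeks.length < consecutiveWeeks) return false;
  return weeks.slice(-consecutiveWeeks).every((w) => w.passRate >= threshold);
}
```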

Team capability uplift ran in parallel with the migration cohorts. A two-day TypeScript primer covered only the subset Playwright actually demands (async/await, arrow functions, destructuring, and type narrowing), not the full language. Pair programming in weeks one through four spread framework knowledge across the team rather than concentrating it. The single biggest unlock was the trace-viewer demo. Engineers saw failure triage drop from twenty minutes of re-running and attaching a debugger to two minutes of opening a trace and reading DOM snapshots, network calls, and console output in one view. From that point, the language transition stopped being the resistance point. The trace viewer sold the migration more effectively than any slide.
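For engineers coming from Java, the four primer topics fit in one short, framework-free snippet. This is an illustrative sketch, not course material from the engagement. The `LoginOutcome` type and both functions are hypothetical, and an awaited resolved promise stands in for a real browser action.

```typescript
// Discriminated union of the kind a page-level helper might return.
type LoginOutcome =
  | { kind: "ok"; landingPath: string }
  | { kind: "error"; message: string };

// async/await with an arrow function; the awaited promise stands in for a browser step.
const attemptLogin = async (username: string): Promise<LoginOutcome> => {
  await Promise.resolve();
  return username === "locked-user"
    ? { kind: "error", message: "Account locked" }
    : { kind: "ok", landingPath: "/dashboard" };
};

// Destructuring plus type narrowing on the `kind` discriminant.
const summarise = (outcome: LoginOutcome): string => {
  if (outcome.kind === "ok") {
    const { landingPath } = outcome; // narrowed to the "ok" variant
    return `landed on ${landingPath}`;
  }
  const { message } = outcome; // narrowed to the "error" variant
  return `failed: ${message}`;
};
```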

Phase 3 — Cutover and decommission (Weeks 18–26)

Gate swap was clean. Playwright became the release gate in the week the cutover criterion was met. Selenium ran in shadow for two to three additional weeks as a regression-on-regression check, confirming that no genuine issues were escaping the new suite that the old one would have caught. After two consecutive production releases gated by Playwright with no escaped defects, Selenium was decommissioned.
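The regression-on-regression check during the Selenium shadow period amounts to a per-build comparison. Any journey that Selenium failed while Playwright passed is a potential escape to triage, though the Selenium failure may itself be flake. The sketch below assumes both suites' results are keyed by a shared journey identifier. That identifier is a hypothetical convention.

```typescript
type JourneyResults = Record<string, "pass" | "fail">;

// Journeys the old suite flagged but the new gate let through: candidates for escape review.
function potentialEscapes(selenium: JourneyResults, playwright: JourneyResults): string[] {
  return Object.keys(selenium).filter(
    (journey) => selenium[journey] === "fail" && playwright[journey] === "pass",
  );
}

// Journeys Selenium still exercises but Playwright reports nothing for: a coverage gap.
function uncoveredJourneys(selenium: JourneyResults, playwright: JourneyResults): string[] {
  return Object.keys(selenium).filter(
    (journey) => !Object.prototype.hasOwnProperty.call(playwright, journey),
  );
}
```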

Infrastructure decommission followed: the Grid VMs were wound down, the CI pipeline was cleaned of Selenium stages, and the Selenium repository was tagged and archived rather than deleted — institutional reference, not active maintenance overhead. The grid cost came off the project ledger.

ROI was re-measured against the Phase 0 baseline across four dimensions: wall-clock per run, flake rate, CI infrastructure cost per run, and runs-per-day actually executed. All four moved in the right direction, materially. The runs-per-day metric was the one that compounded: a shorter, more trusted suite ran more frequently, which tightened the feedback loop, which caught issues earlier, which reduced the cost of fixing them. That is where the migration's long-term value accumulated — not in the one-time grid decommission.
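Re-measuring against the baseline is a per-dimension percentage change, but "the right direction" differs by metric. Lower is better for wall-clock, flake rate, and cost per run; higher is better for runs per day. The minimal sketch below uses hypothetical field names. It computes the comparison only and does not reproduce the engagement's figures.

```typescript
interface SuiteMetrics {
  wallClockMinutes: number;
  flakeRate: number;  // fraction, 0..1
  costPerRun: number; // in whatever currency unit the ledger uses
  runsPerDay: number;
}

type Dimension = keyof SuiteMetrics;

// Lower is better for everything except how often the suite actually runs.
const higherIsBetter: Record<Dimension, boolean> = {
  wallClockMinutes: false,
  flakeRate: false,
  costPerRun: false,
  runsPerDay: true,
};

// Percentage change per dimension, and whether it moved in the desired direction.
function compareToBaseline(baseline: SuiteMetrics, current: SuiteMetrics) {
  const dims = Object.keys(higherIsBetter) as Dimension[];
  return dims.map((d) => {
    const pctChange = baseline[d] === 0 ? NaN : ((current[d] - baseline[d]) / baseline[d]) * 100;
    const improved = higherIsBetter[d] ? current[d] > baseline[d] : current[d] < baseline[d];
    return { dimension: d, pctChange, improved };
  });
}
```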

What I'd do differently

Cull earlier and harder. I pulled down wrong-layer tests in Phase 0 but was conservative under stakeholder pressure to preserve test count. With clearer conviction I would have been more aggressive: a smaller suite of correctly-layered, well-structured tests is a better outcome than a larger migrated suite that still carries the wrong-layer bloat in a new language. The pull-down decision is the right conversation to have before migration, not after.

Lock the ADR before the stakeholder kickoff, not after. On this engagement the framework decision brief was drafted during Phase 0 but only formally approved partway into Phase 1. That gap created a brief window where pattern choices were re-litigated in migration PRs. Writing the ADR, socialising it, and closing dissent before the pilot starts saves the team a cycle of rework on the first cohort.

Run the trace-viewer demo in week one, not week four. The TypeScript resistance evaporated the moment engineers saw the trace-viewer. Showing it earlier would have reduced the ambient friction in the ramp period — and produced better-quality migration PRs sooner, because engineers who believed in the tooling wrote better tests.

Architectural patterns I now apply

Across this and similar migration engagements, five patterns have crystallised into defaults I reach for on every new programme. None of them is a style preference; each is a structural decision that prevents a class of maintenance cost. The locator priority order, the storageState model, and the shadow-mode cutover criterion all exist because the alternative produces measurable debt.

Framework redesign over literal port. A line-for-line Selenium-to-Playwright port preserves the Selenium anti-patterns in a new runtime. The migration is the cheapest moment to redesign locator strategy, auth model, and parallelism — deferring that is deferred debt, not deferred work. See the Selenium → Playwright migration pattern for the reference architecture.
storageState auth fixtures as the default auth model. Per-test UI login is slow and generates concurrent-login flake in parallel suites. A global-setup script that authenticates once per role and writes storage state to playwright/.auth/ eliminates both problems structurally. The pattern scales across roles and environments without modification.
Four-tier locator priority enforced at review time. Role-based locators first (getByRole), then visible text (getByText), then test-id attributes (getByTestId), then CSS — with XPath banned by default. The priority is not a style preference; it is a flake-prevention decision. Locators that survive DOM structure changes are locators that do not generate maintenance tickets.
Parallel-suite shadow mode with explicit cutover criteria. No big-bang cutover. Run the new suite in shadow mode — informational, not blocking — until it earns gate status against pre-defined pass-rate and coverage criteria. The criteria are written before the ramp starts, so the cutover is a data event, not a negotiation.
Cull before you migrate. The migration window is the cheapest moment to delete dead tests, pull wrong-layer tests down to the API or unit tier, and right-size the suite before it moves. A migrated suite that inherits Selenium's bloat is a migration that missed its moment.

The reference architecture distilled from this and similar engagements — framework structure, CI sharding model, storageState auth recipe, and locator strategy — is codified in the Selenium → Playwright migration pattern.