AI-Augmented Playwright Test Pipeline for a Large Australian Energy Company

Engagement context

I was engaged as Senior Test Automation Lead at a large Australian energy retailer with national customer-facing digital surfaces and back-office systems. The business ran customer portals, loyalty and payment flows, and financial settlement systems in parallel — all under an accelerating release cadence driven by a concurrent cloud modernisation programme. The QA function had grown organically and carried a high proportion of brittle tests that eroded CI gate trust. Test authoring throughput could not keep pace with feature delivery, and engineering time was disproportionately absorbed by manual failure triage. I was brought in to close that gap: more coverage, faster authoring, richer failure context, and a performance signal baked into the same pipeline.

Each stage of the resulting pipeline is independently observable, individually replaceable, and audit-traceable rather than a black box. The architectural value is composability: any stage can be upgraded or replaced without rewriting the pipeline.

The win

"I designed and shipped the end-to-end AI-augmented QE pipeline — 6 production-deployed AI-assisted stages working together: user story → AC clarification (LLM) → test generation (LLM-assisted Playwright) → execution with Dynatrace observability → App Insights log validation (LLM-on-logs) → LLM-driven failure triage with Tier 4 self-healing locator re-binding → K6-based performance regression detection. That's the full surface — not just 'we used AI to write some tests'. The architectural value is that each stage is independently observable, individually replaceable, and audit-traceable — not a black box."

What was built (6-stage pipeline)

1. AI-assisted test generation

The team's recurring bottleneck was the translation gap between user story and executable test: happy-path coverage was reasonable but edge cases were inconsistent, and test authoring time was a hard ceiling on velocity. I built a multi-stage generation pipeline where an LLM first expands acceptance criteria into a structured scenario list — happy path, negative paths, and edge cases the human author would likely have missed — then generates TypeScript Playwright code matched to the framework's conventions. Every prompt was grounded in the team's standards documentation (locator strategy, page object patterns, fixture imports) so generated code matched house style without a rewrite pass. Authors shifted from typing tests to reviewing and curating AI-generated tests, a higher-leverage activity that materially accelerated test authoring throughput and expanded edge case coverage across critical flows.
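As an illustration of grounding each prompt in the standards documentation, the sketch below assembles a generation prompt from one expanded scenario plus the house conventions. The `Scenario` and `HouseStandards` shapes, the function name, and the example values are all hypothetical; the actual prompt templates used on the engagement are not reproduced here.

```typescript
// Hypothetical shapes: the real scenario schema and standards format are assumptions.
type ScenarioKind = "happy" | "negative" | "edge";

interface Scenario {
  id: string;
  kind: ScenarioKind;
  description: string;
}

interface HouseStandards {
  locatorStrategy: string;
  pageObjectPattern: string;
  fixtureImports: string[];
}

// Build one generation prompt per scenario, embedding the team's standards so
// the generated Playwright code follows house style without a rewrite pass.
function buildGenerationPrompt(scenario: Scenario, standards: HouseStandards): string {
  return [
    "Generate one TypeScript Playwright test for the scenario below.",
    `Scenario (${scenario.kind}): ${scenario.description}`,
    "Follow these team standards exactly:",
    `- Locator strategy: ${standards.locatorStrategy}`,
    `- Page objects: ${standards.pageObjectPattern}`,
    `- Import fixtures from: ${standards.fixtureImports.join(", ")}`,
    "Return only code, with no commentary.",
  ].join("\n");
}

// Example values are illustrative only.
const examplePrompt = buildGenerationPrompt(
  {
    id: "PAY-01-neg-1",
    kind: "negative",
    description: "Expired card is rejected with an inline error",
  },
  {
    locatorStrategy: "getByRole first, then getByTestId",
    pageObjectPattern: "one class per page under pages/",
    fixtureImports: ["fixtures/auth", "fixtures/payments"],
  },
);
```

Keeping the standards as structured data rather than free text means a change to the locator strategy is made once and reaches every subsequent prompt.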

2. End-to-end pipeline with LLM-driven AC clarification

Even with generation in place, the upstream bottleneck remained: ambiguous acceptance criteria. "System should validate payment correctly" cannot be tested without interpretation, and manual clarification cycles with BAs and PMs were slow. I added a discrete AC clarification stage at the front of the pipeline — the LLM analyses each AC, flags ambiguity, proposes clarifications, and surfaces them to the BA for confirmation before generation runs. Every clarification decision is logged and tied to the test cases generated from it, providing a full audit trail: when a test fails, you can trace back to which AC interpretation it was enforcing.

The audit trail is the key architectural property: a failing test can be traced back to the AC interpretation it was enforcing, and to whether that interpretation was reviewed and confirmed by the BA.
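One way to picture the audit trail is as a small record store that links each generated test to the clarification it was generated from. The sketch below is a minimal in-memory version; the class name, field names, and identifiers such as `AC-12` are hypothetical, and a real implementation would presumably persist these records.

```typescript
// Hypothetical record shape for one clarification decision.
interface ClarificationRecord {
  acId: string;
  originalText: string;
  clarifiedText: string;
  confirmedBy: string | null; // null until a BA confirms the interpretation
  confirmedAt: string | null; // ISO-8601 timestamp of the confirmation
}

class ClarificationAuditTrail {
  private records = new Map<string, ClarificationRecord>();
  private testToAc = new Map<string, string>(); // testId -> acId

  record(entry: ClarificationRecord): void {
    this.records.set(entry.acId, entry);
  }

  // Tie a generated test to the AC interpretation it enforces.
  linkTest(testId: string, acId: string): void {
    if (!this.records.has(acId)) throw new Error(`Unknown AC: ${acId}`);
    this.testToAc.set(testId, acId);
  }

  // Generation should only run once the BA has confirmed the interpretation.
  isConfirmed(acId: string): boolean {
    return this.records.get(acId)?.confirmedBy != null;
  }

  // On failure, recover the interpretation the test was enforcing.
  traceTest(testId: string): ClarificationRecord | undefined {
    const acId = this.testToAc.get(testId);
    return acId === undefined ? undefined : this.records.get(acId);
  }
}

// Illustrative usage with made-up identifiers.
const trail = new ClarificationAuditTrail();
trail.record({
  acId: "AC-12",
  originalText: "System should validate payment correctly",
  clarifiedText: "Reject expired cards with an inline error and no charge",
  confirmedBy: "ba.reviewer",
  confirmedAt: "2025-10-01T09:30:00Z",
});
trail.linkTest("payment.spec.ts#expired-card", "AC-12");
```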

User Story → AC-clarified, AI-augmented test pipeline

Stage 1: AC Analysis

LLM analyses each AC → flags ambiguity + proposes clarifications
Clarifications surfaced to BA for human review

Stage 2: AC Expansion

LLM expands clarified ACs into scenarios
Happy path + edge cases + negative paths

Stage 3: Test Generation

Playwright TypeScript code per scenario
Matches framework conventions (locator strategy, page objects, fixtures)

Stage 4: Execution + Reporting

Pipeline-integrated runs with Dynatrace observability
App Insights log validation (LLM-on-logs)
LLM-driven failure triage with Tier 4 self-healing locator re-binding

This end-to-end coupling also made AC ambiguity measurable for the first time. Even when the pipeline did not auto-resolve an ambiguity, surfacing the problem upstream improved PM discipline on story quality over time.
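The staged layout above can be sketched as a list of interchangeable stages run by a small driver that records an audit entry per stage. This is a simplified, synchronous illustration: the `Stage` interface, the `runPipeline` function, and the stub stages are hypothetical, and real stages calling an LLM or a test runner would be asynchronous.

```typescript
// Shared context passed from stage to stage (hypothetical shape).
interface PipelineContext {
  story: string;
  artifacts: Record<string, unknown>;
}

interface AuditEntry {
  stage: string;
  ok: boolean;
  durationMs: number;
}

// Any object with a name and a run function can be swapped in as a stage.
interface Stage {
  name: string;
  run(ctx: PipelineContext): PipelineContext;
}

function runPipeline(
  stages: Stage[],
  initial: PipelineContext,
): { ctx: PipelineContext; audit: AuditEntry[] } {
  const audit: AuditEntry[] = [];
  let ctx = initial;
  for (const stage of stages) {
    const started = Date.now();
    try {
      ctx = stage.run(ctx);
      audit.push({ stage: stage.name, ok: true, durationMs: Date.now() - started });
    } catch {
      // Stop at the first failing stage; the audit log shows where it stopped.
      audit.push({ stage: stage.name, ok: false, durationMs: Date.now() - started });
      break;
    }
  }
  return { ctx, audit };
}

// Stub stages standing in for the LLM-backed ones.
const stubStages: Stage[] = [
  {
    name: "ac-analysis",
    run: (ctx) => ({ ...ctx, artifacts: { ...ctx.artifacts, clarifiedAcs: ["AC-12"] } }),
  },
  {
    name: "ac-expansion",
    run: (ctx) => ({
      ...ctx,
      artifacts: { ...ctx.artifacts, scenarios: ["happy", "negative", "edge"] },
    }),
  },
];

const result = runPipeline(stubStages, { story: "As a customer I can pay a bill", artifacts: {} });
```

Because the driver only depends on the `Stage` interface, replacing one stage (for example a different model behind AC expansion) does not touch the others.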

3. Dynatrace observability integration

Functional pass/fail told the team whether the test worked; it did not tell them whether the system worked under realistic conditions. I integrated Dynatrace directly into the test lifecycle: the framework emits Dynatrace-compatible correlation tags on every HTTP request — test-run ID, build ID, suite, and test ID — so every trace in Dynatrace is filterable by test. On failure, the unified dashboard surfaces the test output, the Dynatrace distributed trace for the failing user flow, and the application logs around the failure window together. Engineers stopped context-switching between tools to triage. I also applied tiered telemetry depth: full distributed traces on test failure; production-equivalent sampling on pass to avoid paying for spans that would never be read. The same Playwright scripts were later redeployed as Dynatrace synthetic probes against staging and production, giving the team a continuous quality signal outside CI runs.

The tiered telemetry decision is the cost-discipline insight: paying for full distributed traces on every passing test run is wasteful. Full traces only on failure; production-equivalent sampling on pass. Same signal where it matters, negligible cost where it does not.
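The correlation tagging described in this section can be sketched as a helper that turns a test's identity into a request header. The header name `x-dynatrace-test` and the keys `SI`, `TSN`, `LSN`, and `LTN` follow Dynatrace's load-test tagging convention as I understand it and should be treated as an assumption to verify against current Dynatrace documentation. The mapping of run ID, build ID, suite, and test ID onto those keys is also hypothetical, not the engagement's actual scheme.

```typescript
interface TestRunIdentity {
  runId: string;
  buildId: string;
  suite: string;
  testId: string;
}

// Values must not contain the separators used by the header format.
function sanitize(value: string): string {
  return value.replace(/[;=]/g, "_");
}

// ASSUMPTION: header name and key names follow Dynatrace's load-test tagging
// convention; the field-to-key mapping below is illustrative only.
function correlationHeaders(id: TestRunIdentity): Record<string, string> {
  const pairs = [
    "SI=playwright", // source identifier
    `TSN=${sanitize(id.testId)}`, // test step name -> individual test
    `LSN=${sanitize(id.suite)}`, // script name -> suite
    `LTN=${sanitize(id.runId)}-${sanitize(id.buildId)}`, // test name -> run + build
  ];
  return { "x-dynatrace-test": pairs.join(";") };
}

const headers = correlationHeaders({
  runId: "run-42",
  buildId: "build-7",
  suite: "payments",
  testId: "expired-card",
});
```

In a Playwright suite, a map like this could be applied per test with `page.setExtraHTTPHeaders(headers)` so that every request issued by the test carries the tags.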

4. LLM-driven log validation

Application Insights emitted high log volume. Traditional alerting covered known-bad patterns via regex and error-rate thresholds, but missed novel anomalies where a flow completed with a success status but the wrong outcome. I applied an LLM-as-Judge pattern to log validation: log streams are sampled and batched, and the LLM grades each batch against an anomaly rubric covering novel patterns, sequences suggesting erroneous flow, and unexpected sensitive data in logs. Output is structured JSON with an anomaly score, evidence, and a suggested action. Cost discipline was central: a faster and cheaper model handles the first-pass scan; escalation to a more capable model is triggered only when the anomaly score exceeds a threshold. Flagged anomalies feed a feedback loop: false positives refine the rubric, and true positives seed new test cases. This shifted log review from manual triage to LLM-driven triage, with SRE review concentrated on the small set of escalated anomalies.

The cost discipline is the architectural insight: SRE review is not eliminated, it is concentrated on the small set of anomalies that survived the cheaper model's first pass. The feedback loop means false positives sharpen the rubric over time rather than accumulating as noise.
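A minimal sketch of the structured verdict and the escalation gate follows. The field names (`anomalyScore`, `evidence`, `suggestedAction`), the 0 to 1 score range, and the example threshold are assumptions for illustration; the source only states that the output is JSON with a score, evidence, and a suggested action.

```typescript
// Hypothetical verdict schema returned by the first-pass model.
interface AnomalyVerdict {
  anomalyScore: number; // assumed to be in the range 0..1
  evidence: string[];
  suggestedAction: string;
}

// Validate the model's raw JSON output before trusting it downstream.
function parseVerdict(raw: string): AnomalyVerdict {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Verdict must be a JSON object");
  }
  const fields = data as Record<string, unknown>;
  const score = fields["anomalyScore"];
  const evidence = fields["evidence"];
  const action = fields["suggestedAction"];
  if (typeof score !== "number" || score < 0 || score > 1) {
    throw new Error("anomalyScore must be a number between 0 and 1");
  }
  if (!Array.isArray(evidence) || !evidence.every((e) => typeof e === "string")) {
    throw new Error("evidence must be an array of strings");
  }
  if (typeof action !== "string") {
    throw new Error("suggestedAction must be a string");
  }
  return { anomalyScore: score, evidence: evidence as string[], suggestedAction: action };
}

type Route = "close" | "escalate";

// Escalate to the more capable model only when the score exceeds the threshold.
function routeAfterFirstPass(verdict: AnomalyVerdict, threshold: number): Route {
  return verdict.anomalyScore > threshold ? "escalate" : "close";
}

const firstPass = parseVerdict(
  '{"anomalyScore":0.82,"evidence":["payment marked success, ledger entry missing"],"suggestedAction":"escalate to SRE"}',
);
const route = routeAfterFirstPass(firstPass, 0.7);
```

Rejecting malformed verdicts at the parse step keeps a badly formatted model response from being silently treated as "no anomaly".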

5. LLM-driven failure triage + Tier 4 self-healing

Test failures name the symptom ("expected X, got Y"), not the cause. Correlating a failure with code changes, environment state, APM traces, and historical similar failures was a slow, experience-dependent manual process. I built a four-stage LLM-driven triage pipeline. First, the LLM classifies the failure category (locator drift, data issue, environment instability, real bug, or known flake pattern). Second, it retrieves the relevant context for that category (DOM snapshot for locator drift; git diff for real bugs; Dynatrace traces for environment issues). Third, it generates a root-cause hypothesis with a proposed next action. Fourth, for locator-drift failures with high-confidence classification, it applies Tier 4 self-healing: the LLM inspects the current DOM against the test's intent description and proposes a new locator that is auto-applied on the next run.

The tiered locator recovery model, confidence gating, and audit trail design used here are documented as a reusable reference in the Agentic Self-Healing Test Framework pattern, which covers the full architecture including the Promptfoo eval harness and promotion-to-source pipeline. The architectural patterns behind this triage pipeline, particularly LangGraph orchestration with deterministic fallback and confidence-scored decisions with explainability, were originally proven in the AgentQE Continuum R&D project before being applied here in a production enterprise context.

Every self-heal is logged to the quality dashboard, and tests that self-heal repeatedly are flagged for human review, which is the guard against self-healing masking a genuine regression. This brought locator-drift maintenance to near-zero for stable test intent, and redirected engineer attention from flake noise to real bugs.

The guard against regression masking is structural: tests that self-heal repeatedly are flagged for human review rather than silently absorbing a genuine behaviour change. Self-healing is a residual handler, not a default mode.
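The two guards described here, confidence gating before a heal and review flagging after repeated heals, can be sketched as a pair of pure functions. The category labels, the confidence scale, and the threshold values are hypothetical; the source does not state the actual cut-offs.

```typescript
type FailureCategory =
  | "locator-drift"
  | "data-issue"
  | "environment"
  | "real-bug"
  | "known-flake";

interface Classification {
  category: FailureCategory;
  confidence: number; // assumed to be in the range 0..1
}

// Self-healing is a residual handler: only high-confidence locator drift qualifies.
function shouldSelfHeal(c: Classification, minConfidence: number): boolean {
  return c.category === "locator-drift" && c.confidence >= minConfidence;
}

// Hypothetical shape of one entry in the self-heal log.
interface HealEvent {
  testId: string;
  oldLocator: string;
  newLocator: string;
  healedAt: string; // ISO-8601 timestamp
}

// Flag tests that have healed more than maxHeals times, so a genuine
// behaviour change is not silently absorbed by repeated re-binding.
function testsNeedingReview(log: HealEvent[], maxHeals: number): string[] {
  const counts = new Map<string, number>();
  for (const event of log) {
    counts.set(event.testId, (counts.get(event.testId) ?? 0) + 1);
  }
  const flagged: string[] = [];
  counts.forEach((count, testId) => {
    if (count > maxHeals) flagged.push(testId);
  });
  return flagged;
}

// Illustrative log with made-up locators.
const healLog: HealEvent[] = [
  { testId: "login", oldLocator: "#submit", newLocator: "role=button[name='Sign in']", healedAt: "2025-10-01T10:00:00Z" },
  { testId: "login", oldLocator: "role=button[name='Sign in']", newLocator: "role=button[name='Log in']", healedAt: "2025-10-03T10:00:00Z" },
  { testId: "login", oldLocator: "role=button[name='Log in']", newLocator: "data-testid=login-btn", healedAt: "2025-10-05T10:00:00Z" },
  { testId: "checkout", oldLocator: "#pay", newLocator: "role=button[name='Pay']", healedAt: "2025-10-02T10:00:00Z" },
];
const flaggedTests = testsNeedingReview(healLog, 2);
```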

6. K6 performance baselines from APM data

Performance regression detection was end-of-cycle: JMeter runs against staging caught issues late, after they had propagated through multiple merges. I replaced the primary performance harness with K6, which integrated cleanly with the TypeScript-native framework stack, and derived workload models directly from Dynatrace APM data rather than synthetic assumptions. Production traffic characteristics (user distribution, time-of-day patterns, and channel mix) were extracted from the observability layer and used to build realistic K6 scenarios.

CI integration followed a tiered execution model: a lightweight performance smoke on flagged PRs, a full baseline comparison post-merge to main, and a nightly full-suite run against staging. The same PR / post-merge / nightly cadence is documented as a reference architecture in the K6 Performance Testing Pattern, which also covers APM-derived scenario design and the baseline comparison approach in full. K6 results are compared against a statistically derived baseline, so regressions surface in PR review rather than in a post-release incident. For protocol coverage where K6 does not reach (JDBC, certain messaging protocols), JMeter was retained deliberately rather than deprecated prematurely.

Deriving workload models from Dynatrace APM data rather than synthetic assumptions means K6 scenarios reflect actual production traffic patterns — not best-guess load profiles. Regressions surface in PR review, not in a post-release incident.
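The baseline comparison can be illustrated with a simple statistical check. The source does not say which statistic was used, so the sketch below assumes one common choice: flag a regression when the current p95 latency exceeds the historical mean by more than k sample standard deviations. The function names, the default `k = 3`, and the sample numbers are hypothetical.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (n - 1 denominator).
function sampleStdDev(values: number[]): number {
  const m = mean(values);
  const variance =
    values.reduce((acc, v) => acc + (v - m) ** 2, 0) / (values.length - 1);
  return Math.sqrt(variance);
}

interface BaselineCheck {
  baselineMs: number; // mean of historical p95 values
  limitMs: number; // baseline plus k standard deviations
  regressed: boolean;
}

// ASSUMPTION: mean + k * stddev over historical p95 values is one possible
// baseline rule, not necessarily the one used on the engagement.
function checkAgainstBaseline(
  historyP95Ms: number[],
  currentP95Ms: number,
  k = 3,
): BaselineCheck {
  if (historyP95Ms.length < 2) {
    throw new Error("Need at least two historical runs to derive a baseline");
  }
  const baselineMs = mean(historyP95Ms);
  const limitMs = baselineMs + k * sampleStdDev(historyP95Ms);
  return { baselineMs, limitMs, regressed: currentP95Ms > limitMs };
}

// Illustrative history of p95 latencies in milliseconds.
const history = [200, 210, 190, 205, 195];
const slowRun = checkAgainstBaseline(history, 260);
const normalRun = checkAgainstBaseline(history, 215);
```

A rule like this tolerates normal run-to-run noise while still failing a PR whose latency sits well outside the historical spread.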

Engagement summary

Field	Detail
Duration	Sep 2025 – current
Role	Senior Test Automation Lead
Reporting line	QA Lead / Engineering Manager
Team	Cross-functional QA pod (onshore + offshore)

Reference

Reference from QA Lead / Engineering Manager available on request at the screening stage.