AI-Driven Quality Engineering Architect · Available for new engagements · Australia

SNK Digital
Energy · Sep 2025 – current

AI-Augmented Playwright Test Pipeline for a Large Australian Energy Company

End-to-end AI-augmented QE pipeline — 6 production-deployed stages, each independently observable, replaceable, and audit-traceable.

Playwright · TypeScript · AI-driven test gen · Promptfoo · LLM-as-Judge · Dynatrace · K6 · Azure DevOps

Engagement context

I was engaged as Senior Test Automation Lead at a large Australian energy retailer with national customer-facing digital surfaces and back-office systems. The business ran customer portals, loyalty and payment flows, and financial settlement systems in parallel — all under an accelerating release cadence driven by a concurrent cloud modernisation programme. The QA function had grown organically and carried a high proportion of brittle tests that eroded CI gate trust. Test authoring throughput could not keep pace with feature delivery, and engineering time was disproportionately absorbed by manual failure triage. I was brought in to close that gap: more coverage, faster authoring, richer failure context, and a performance signal baked into the same pipeline.

Six-stage AI-augmented QE pipeline shown as a continuum from user story in to quality signal out. Stage 1 AI test generation: LLM expands acceptance criteria into happy path, negative, and edge case scenarios then generates Playwright TypeScript code. Stage 2 AC clarification: LLM flags ambiguity, proposes clarifications, surfaces to BA before generation runs with full audit trail. Stage 3 Dynatrace observability: correlation tags on every HTTP request, test output and Dynatrace distributed trace and logs unified in one dashboard. Stage 4 log validation LLM-as-Judge: batched log streams graded against an anomaly rubric, fast model first then capable model on threshold breach. Stage 5 failure triage and Tier 4 self-healing: classifies failure category, retrieves context, generates hypothesis, auto-applies high-confidence locator re-bindings. Stage 6 K6 performance baseline: APM-derived workload models with tiered execution on PRs, post-merge, and nightly.

Each stage is independently observable, individually replaceable, and audit-traceable — not a black box. The architectural value is composability: any stage can be upgraded or replaced without rewriting the pipeline.
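As a rough illustration of the composability property, the sketch below models each stage as a value behind one shared interface, with the runner recording an audit entry per stage. The `Stage` and `AuditEntry` types, the stage names, and the string payloads are hypothetical, and stages are synchronous for brevity; this is not the engagement's actual code.

```typescript
// Hypothetical shapes for illustration; not the production pipeline's types.
interface AuditEntry {
  stage: string;
  input: string;
  output: string;
}

interface Stage {
  name: string;
  run(input: string): string;
}

// Runs stages in order and records what each one received and produced,
// so every stage stays observable and audit-traceable on its own.
function runPipeline(stages: Stage[], input: string): { output: string; audit: AuditEntry[] } {
  const audit: AuditEntry[] = [];
  let current = input;
  for (const stage of stages) {
    const output = stage.run(current);
    audit.push({ stage: stage.name, input: current, output });
    current = output;
  }
  return { output: current, audit };
}

// Replacing a stage swaps one array element by name; the runner is untouched.
function replaceStage(stages: Stage[], replacement: Stage): Stage[] {
  return stages.map((s) => (s.name === replacement.name ? replacement : s));
}

const baseStages: Stage[] = [
  { name: "clarify", run: (s) => `${s} [clarified]` },
  { name: "generate", run: (s) => `${s} [tests]` },
];
const upgradedStages = replaceStage(baseStages, { name: "generate", run: (s) => `${s} [tests v2]` });
const pipelineResult = runPipeline(upgradedStages, "story");
console.log(pipelineResult.output);
```

The design point is that an upgrade touches only the stage list, while the audit log keeps the same shape before and after the swap.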

The win

"I designed and shipped the end-to-end AI-augmented QE pipeline — 6 production-deployed AI-assisted stages working together: user story → AC clarification (LLM) → test generation (LLM-assisted Playwright) → execution with Dynatrace observability → App Insights log validation (LLM-on-logs) → LLM-driven failure triage with Tier 4 self-healing locator re-binding → K6-based performance regression detection. That's the full surface — not just 'we used AI to write some tests'. The architectural value is that each stage is independently observable, individually replaceable, and audit-traceable — not a black box."

What was built (6-stage pipeline)

1. AI-assisted test generation

The team's recurring bottleneck was the translation gap between user story and executable test: happy-path coverage was reasonable but edge cases were inconsistent, and test authoring time was a hard ceiling on velocity. I built a multi-stage generation pipeline where an LLM first expands acceptance criteria into a structured scenario list — happy path, negative paths, and edge cases the human author would likely have missed — then generates TypeScript Playwright code matched to the framework's conventions. Every prompt was grounded in the team's standards documentation (locator strategy, page object patterns, fixture imports) so generated code matched house style without a rewrite pass. Authors shifted from typing tests to reviewing and curating AI-generated tests, a higher-leverage activity that materially accelerated test authoring throughput and expanded edge case coverage across critical flows.
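A minimal sketch of how prompt grounding and scenario coverage might look in code. The type names, prompt wording, and sample standards are assumptions for illustration; the actual prompts and standards documents are not shown in this case study.

```typescript
// Hypothetical types and prompt wording; not the real prompts.
type ScenarioKind = "happy" | "negative" | "edge";

interface Scenario {
  kind: ScenarioKind;
  title: string;
}

interface FrameworkStandards {
  locatorStrategy: string;
  pageObjectPattern: string;
  fixtureImports: string[];
}

// Builds a code-generation prompt for one scenario, embedding the team's
// standards so the model's output follows house conventions.
function buildGenerationPrompt(scenario: Scenario, standards: FrameworkStandards): string {
  return [
    "Generate a Playwright test in TypeScript.",
    `Scenario (${scenario.kind}): ${scenario.title}`,
    `Locator strategy: ${standards.locatorStrategy}`,
    `Page object pattern: ${standards.pageObjectPattern}`,
    `Import fixtures from: ${standards.fixtureImports.join(", ")}`,
  ].join("\n");
}

// Coverage check on an expanded scenario list: which kinds are still missing?
function missingKinds(scenarios: Scenario[]): ScenarioKind[] {
  const all: ScenarioKind[] = ["happy", "negative", "edge"];
  return all.filter((k) => !scenarios.some((s) => s.kind === k));
}

// Illustrative inputs only.
const sampleStandards: FrameworkStandards = {
  locatorStrategy: "role-based locators first, test IDs as fallback",
  pageObjectPattern: "one page object class per screen",
  fixtureImports: ["./fixtures/auth"],
};
const happyScenario: Scenario = { kind: "happy", title: "Customer pays a bill with a saved card" };
const declinedScenario: Scenario = { kind: "negative", title: "Payment is declined by the card issuer" };

const generationPrompt = buildGenerationPrompt(happyScenario, sampleStandards);
const gaps = missingKinds([happyScenario, declinedScenario]);
console.log(generationPrompt);
console.log(gaps);
```

A gap check like `missingKinds` gives the reviewing author a quick signal that the expansion step skipped a scenario category before any code is generated.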

2. End-to-end pipeline with LLM-driven AC clarification

Even with generation in place, the upstream bottleneck remained: ambiguous acceptance criteria. "System should validate payment correctly" cannot be tested without interpretation, and manual clarification cycles with BAs and PMs were slow. I added a discrete AC clarification stage at the front of the pipeline — the LLM analyses each AC, flags ambiguity, proposes clarifications, and surfaces them to the BA for confirmation before generation runs. Every clarification decision is logged and tied to the test cases generated from it, providing a full audit trail: when a test fails, you can trace back to which AC interpretation it was enforcing.

AC clarification pipeline shown as inputs, process, and outputs. Inputs on the left: raw user story acceptance criteria, BA or PM confirmation loop, and framework locator and page object standards. Centre LLM process: analyses each AC, flags ambiguity, proposes clarifications, surfaces to BA for human confirmation, then logs decision and proceeds. Outputs on the right: audit trail linking AC interpretation to test case, scenario list covering happy path plus edge and negative cases, and generated Playwright TypeScript code matched to framework conventions.

The audit trail is the key architectural property: when a test fails in production, you can trace back to which AC interpretation it was enforcing — and whether that interpretation was reviewed and confirmed by the BA.
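The audit trail can be pictured as two linked record sets: clarification decisions and the tests generated from them. The sketch below assumes hypothetical record shapes, field names, and sample values, and shows the confirmation gate plus the trace from a failing test back to its AC interpretation.

```typescript
// Hypothetical record shapes; field names are assumptions for illustration.
interface ClarificationRecord {
  acId: string;
  originalText: string;
  clarifiedText: string;
  confirmedBy: string | null; // null until the BA confirms
}

interface GeneratedTest {
  testId: string;
  acId: string;
}

// Generation is gated on BA confirmation: unconfirmed ACs are held back.
function readyForGeneration(records: ClarificationRecord[]): ClarificationRecord[] {
  return records.filter((r) => r.confirmedBy !== null);
}

// Given a failing test, find the AC interpretation it was enforcing.
function traceFailure(
  testId: string,
  tests: GeneratedTest[],
  records: ClarificationRecord[],
): ClarificationRecord | undefined {
  const test = tests.find((t) => t.testId === testId);
  if (!test) return undefined;
  return records.find((r) => r.acId === test.acId);
}

// Illustrative data only.
const clarifications: ClarificationRecord[] = [
  {
    acId: "AC-1",
    originalText: "System should validate payment correctly",
    clarifiedText: "Reject card payments when the card has expired",
    confirmedBy: "BA reviewer",
  },
  {
    acId: "AC-2",
    originalText: "System should handle refunds",
    clarifiedText: "Refunds go back to the original payment method",
    confirmedBy: null,
  },
];
const generatedTests: GeneratedTest[] = [{ testId: "T-1", acId: "AC-1" }];

const traced = traceFailure("T-1", generatedTests, clarifications);
console.log(traced?.clarifiedText, traced?.confirmedBy);
```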

User Story → AC-clarified, AI-augmented test pipeline

Stage 1: AC Analysis

  • LLM analyses each AC → flags ambiguity + proposes clarifications
  • Clarifications surfaced to BA for human review

Stage 2: AC Expansion

  • LLM expands clarified ACs into scenarios
  • Happy path + edge cases + negative paths

Stage 3: Test Generation

  • Playwright TypeScript code per scenario
  • Matches framework conventions (locator strategy, page objects, fixtures)

Stage 4: Execution + Reporting

  • Pipeline-integrated runs with Dynatrace observability
  • App Insights log validation (LLM-on-logs)
  • LLM-driven failure triage with Tier 4 self-healing locator re-binding

This end-to-end coupling also made AC ambiguity measurable for the first time. Even where the pipeline cannot resolve an ambiguity automatically, surfacing it upstream improved PM discipline on story quality over time.

3. Dynatrace observability integration

Functional pass/fail told the team whether the test worked; it did not tell them whether the system worked under realistic conditions. I integrated Dynatrace directly into the test lifecycle: the framework emits Dynatrace-compatible correlation tags on every HTTP request — test-run ID, build ID, suite, and test ID — so every trace in Dynatrace is filterable by test. On failure, the unified dashboard surfaces the test output, the Dynatrace distributed trace for the failing user flow, and the application logs around the failure window together. Engineers stopped context-switching between tools to triage. I also applied tiered telemetry depth: full distributed traces on test failure; production-equivalent sampling on pass to avoid paying for spans that would never be read. The same Playwright scripts were later redeployed as Dynatrace synthetic probes against staging and production, giving the team a continuous quality signal outside CI runs.
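A minimal sketch of the correlation-tag idea, assuming the tags travel as HTTP request headers. The header names below are hypothetical placeholders rather than Dynatrace-defined names, and the real tag format used on the engagement is not shown in this write-up.

```typescript
// Hypothetical context shape and header names, for illustration only.
interface TestRunContext {
  runId: string;
  buildId: string;
  suite: string;
  testId: string;
}

// Produces the per-test header map attached to outgoing requests, so a trace
// can be filtered back to the run, build, suite, and test that caused it.
function correlationHeaders(ctx: TestRunContext): Record<string, string> {
  return {
    "x-test-run-id": ctx.runId,
    "x-test-build-id": ctx.buildId,
    "x-test-suite": ctx.suite,
    "x-test-id": ctx.testId,
  };
}

const sampleHeaders = correlationHeaders({
  runId: "run-001",
  buildId: "build-42",
  suite: "payments",
  testId: "pay-bill-saved-card",
});
console.log(sampleHeaders);
```

In Playwright, a map like this could be applied per test with `page.setExtraHTTPHeaders(...)`, or for a whole project through the `extraHTTPHeaders` option; which of the two fits depends on whether the test ID needs to vary per test.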

Four Dynatrace integration design decisions shown as cards. Decision 1 correlation tags on every HTTP request: test-run ID, build ID, suite, and test ID emitted so every trace is filterable by test. Decision 2 unified failure dashboard: test output plus Dynatrace distributed trace plus application logs combined for the failing flow. Decision 3 tiered telemetry depth (highlighted): full distributed traces on failure, production-equivalent sampling on pass to avoid cost for spans never read. Decision 4 Playwright scripts as synthetic probes: same scripts redeployed as Dynatrace synthetic probes against staging and production for continuous quality signal outside CI.

The tiered telemetry decision is the cost-discipline insight: paying for full distributed traces on every passing test run is wasteful. Full traces only on failure; production-equivalent sampling on pass. Same signal where it matters, negligible cost where it does not.
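The tiering rule itself is small enough to state as a function. This sketch assumes a hypothetical `TelemetryPlan` shape and a caller-supplied production sampling rate; how the plan is enforced (for example, which captured detail is retained or discarded) is not described in the source and is left out.

```typescript
// Hypothetical types; the sampling rate is supplied by the caller.
type TestOutcome = "passed" | "failed";

interface TelemetryPlan {
  captureFullTrace: boolean;
  sampleRate: number; // fraction of spans kept, between 0 and 1
}

// Full distributed traces on failure; production-equivalent sampling on pass.
function telemetryPlan(outcome: TestOutcome, productionSampleRate: number): TelemetryPlan {
  if (outcome === "failed") {
    return { captureFullTrace: true, sampleRate: 1 };
  }
  return { captureFullTrace: false, sampleRate: productionSampleRate };
}

const onFailure = telemetryPlan("failed", 0.05);
const onPass = telemetryPlan("passed", 0.05);
console.log(onFailure, onPass);
```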

4. LLM-driven log validation

Application Insights emitted a high volume of logs. Traditional alerting covered known-bad patterns via regex and error-rate thresholds, but missed novel anomalies where a flow completed with a success status but the wrong outcome. I applied an LLM-as-Judge pattern to log validation: log streams are sampled and batched, and the LLM grades each batch against an anomaly rubric covering novel patterns, sequences suggesting erroneous flow, and unexpected sensitive data in logs. Output is structured JSON with an anomaly score, evidence, and a suggested action. Cost discipline was central: a faster, cheaper model handles the first-pass scan, and escalation to a more capable model is triggered only when the anomaly score exceeds a threshold. Flagged anomalies feed a feedback loop in which false positives refine the rubric and true positives seed new test cases. This moved log review from manual triage to LLM-driven triage, with SRE review concentrated on the small set of escalated anomalies.

LLM-as-Judge log validation shown as a two-model pipeline. Input is high-volume App Insights log stream sampled and batched. Stage 1 faster and cheaper model: grades each batch against anomaly rubric covering novel patterns, sequences suggesting erroneous flow, and sensitive data in logs, outputs structured JSON with anomaly score, evidence, and suggested action. A threshold gate checks if the score exceeds threshold: if not, log only; if yes, escalate to stage 2 more capable model for deep analysis. Output feeds SRE review queue for escalated anomalies only. False positives refine the rubric; true positives seed new test cases.

The cost discipline is the architectural insight: SRE review is not eliminated, it is concentrated on the small set of anomalies that survived the cheaper model's first pass. The feedback loop means false positives sharpen the rubric over time rather than accumulating as noise.
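The two-model gate can be sketched as follows. The judges are injected as plain functions and stubbed here, standing in for model calls; the `JudgeVerdict` fields mirror the structured output described above, while the threshold value, the stub logic, and the sample log lines are assumptions. A real client would be asynchronous.

```typescript
// Verdict fields follow the description above; everything else is hypothetical.
interface JudgeVerdict {
  anomalyScore: number; // 0 (normal) to 1 (highly anomalous)
  evidence: string[];
  suggestedAction: string;
}

type Judge = (batch: string[]) => JudgeVerdict;

interface BatchReview {
  verdict: JudgeVerdict;
  escalated: boolean;
}

// Splits a sampled log stream into fixed-size batches.
function batchLogs(lines: string[], size: number): string[][] {
  if (size < 1) throw new Error("batch size must be at least 1");
  const batches: string[][] = [];
  for (let i = 0; i < lines.length; i += size) {
    batches.push(lines.slice(i, i + size));
  }
  return batches;
}

// The cheap judge runs first; the capable judge runs only when the
// first-pass score exceeds the threshold.
function gradeBatch(batch: string[], fast: Judge, capable: Judge, threshold: number): BatchReview {
  const firstPass = fast(batch);
  if (firstPass.anomalyScore <= threshold) {
    return { verdict: firstPass, escalated: false };
  }
  return { verdict: capable(batch), escalated: true };
}

// Stub judges standing in for the two models.
const fastJudge: Judge = (batch) => {
  const hits = batch.filter((line) => line.indexOf("card_number=") !== -1);
  return {
    anomalyScore: hits.length > 0 ? 0.8 : 0.1,
    evidence: hits,
    suggestedAction: hits.length > 0 ? "escalate" : "none",
  };
};
const capableJudge: Judge = (batch) => ({
  anomalyScore: 0.95,
  evidence: batch,
  suggestedAction: "raise with SRE",
});

const sampleLines = ["payment ok", "payment ok", "payment ok card_number=4111", "refund ok"];
const reviews = batchLogs(sampleLines, 2).map((b) => gradeBatch(b, fastJudge, capableJudge, 0.5));
console.log(reviews.map((r) => r.escalated));
```

Only the batch containing the card number reaches the second judge, which is the cost property the pattern is after: the expensive call happens once rather than per batch.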

5. LLM-driven failure triage + Tier 4 self-healing

Test failures name the symptom ("expected X, got Y"), not the cause. Correlating a failure with code changes, environment state, APM traces, and historical similar failures was a slow, experience-dependent manual process. I built a four-stage LLM-driven triage pipeline. First, the LLM classifies the failure category (locator drift, data issue, environment instability, real bug, or known flake pattern). Second, it retrieves the relevant context for that category (DOM snapshot for locator drift; git diff for real bugs; Dynatrace traces for environment issues). Third, it generates a root-cause hypothesis with a proposed next action. The fourth stage applies only to locator-drift failures classified with high confidence: Tier 4 self-healing, where the LLM inspects the current DOM against the test's intent description and proposes a new locator that is auto-applied on the next run.

Every self-heal is logged to the quality dashboard, and tests that self-heal repeatedly are flagged for human review, which is the guard against self-healing masking a genuine regression. This brought locator-drift maintenance to near-zero for stable test intent, and redirected engineer attention from flake noise to real bugs.

The tiered locator recovery model, confidence gating, and audit trail design used here are documented as a reusable reference in the Agentic Self-Healing Test Framework pattern, which covers the full architecture including the Promptfoo eval harness and promotion-to-source pipeline. The architectural patterns behind this triage pipeline, particularly LangGraph orchestration with deterministic fallback and confidence-scored decisions with explainability, were originally proven in the AgentQE Continuum R&D project before being applied here in a production enterprise context.
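The classify-then-retrieve routing and the confidence gate can be sketched as a lookup plus one condition. The category and context names follow the text; the context source for data issues, the `0.9` threshold, and the type names are assumptions, since the source does not state them.

```typescript
type FailureCategory =
  | "locator-drift"
  | "data-issue"
  | "environment-instability"
  | "real-bug"
  | "known-flake";

type ContextSource =
  | "dom-snapshot"
  | "test-data-state" // assumption: the source names no context for data issues
  | "dynatrace-traces"
  | "git-diff"
  | "failure-history";

// Each category pulls only the context that is relevant to it.
const contextFor: Record<FailureCategory, ContextSource> = {
  "locator-drift": "dom-snapshot",
  "data-issue": "test-data-state",
  "environment-instability": "dynatrace-traces",
  "real-bug": "git-diff",
  "known-flake": "failure-history",
};

// Hypothetical threshold; the real confidence gate value is not stated.
const SELF_HEAL_MIN_CONFIDENCE = 0.9;

interface Classification {
  category: FailureCategory;
  confidence: number; // 0 to 1
}

interface TriagePlan {
  contextSource: ContextSource;
  selfHealEligible: boolean;
}

// Self-healing is considered only for confidently classified locator drift.
function planTriage(c: Classification): TriagePlan {
  return {
    contextSource: contextFor[c.category],
    selfHealEligible: c.category === "locator-drift" && c.confidence >= SELF_HEAL_MIN_CONFIDENCE,
  };
}

const confidentDrift = planTriage({ category: "locator-drift", confidence: 0.95 });
const uncertainDrift = planTriage({ category: "locator-drift", confidence: 0.6 });
const confidentBug = planTriage({ category: "real-bug", confidence: 0.99 });
console.log(confidentDrift, uncertainDrift, confidentBug);
```

Keeping the gate as an explicit condition means a high-confidence real bug can never be routed into self-healing, however sure the classifier is.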

Four-stage failure triage pipeline shown as a continuum from cheap to expensive. Stage 1 classify: failure categories are locator drift, data issue, environment instability, real bug, or known flake pattern. Stage 2 retrieve context per category: DOM snapshot for locator drift, git diff for real bugs, Dynatrace traces for environment issues, history for flake patterns. Stage 3 hypothesise: LLM generates root-cause hypothesis and proposed next action as structured output to dashboard. Stage 4 Tier 4 self-heal highlighted: for high-confidence locator-drift classification, LLM inspects current DOM against test intent description and proposes a new locator auto-applied on next run. Every self-heal is logged; tests that self-heal repeatedly are flagged for human review.

The guard against regression masking is structural: tests that self-heal repeatedly are flagged for human review rather than silently absorbing a genuine behaviour change. Self-healing is a residual handler, not a default mode.
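A minimal sketch of the repeat-heal guard, assuming a hypothetical `HealEvent` log shape and a caller-chosen limit on how many heals a test may accumulate before a human looks at it. The limit value and locator strings are illustrative.

```typescript
// Hypothetical log entry shape for one applied self-heal.
interface HealEvent {
  testId: string;
  oldLocator: string;
  newLocator: string;
  confidence: number;
}

// Returns the tests whose heal count exceeds the limit, so repeated healing
// is surfaced for review instead of silently absorbing a behaviour change.
function testsNeedingReview(log: HealEvent[], maxHeals: number): string[] {
  const counts = new Map<string, number>();
  for (const event of log) {
    counts.set(event.testId, (counts.get(event.testId) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > maxHeals)
    .map(([testId]) => testId);
}

// Illustrative log only.
const healLog: HealEvent[] = [
  { testId: "T-1", oldLocator: "#pay", newLocator: "#pay-now", confidence: 0.95 },
  { testId: "T-2", oldLocator: "#login", newLocator: "#sign-in", confidence: 0.93 },
  { testId: "T-1", oldLocator: "#pay-now", newLocator: "#submit-payment", confidence: 0.92 },
  { testId: "T-1", oldLocator: "#submit-payment", newLocator: "#confirm", confidence: 0.91 },
];

const flaggedTests = testsNeedingReview(healLog, 2);
console.log(flaggedTests);
```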

6. K6 performance baselines from APM data

Performance regression detection was end-of-cycle: JMeter runs against staging caught issues late, after they had propagated through multiple merges. I replaced the primary performance harness with K6, which integrated cleanly with the TypeScript-native framework stack, and derived workload models directly from Dynatrace APM data rather than synthetic assumptions. Production traffic distributions (user distributions, time-of-day patterns, and channel mix) were extracted from the observability layer and used to build realistic K6 scenarios. CI integration followed a tiered execution model: a lightweight performance smoke on flagged PRs, a full baseline comparison post-merge to main, and a nightly full-suite run against staging. The same PR / post-merge / nightly cadence is documented as a reference architecture in the K6 Performance Testing Pattern, which covers the tiered execution model, APM-derived scenario design, and baseline comparison approach in full. K6 results are compared against a statistically-derived baseline, so regressions surface in PR review rather than in a post-release incident. For protocol coverage where K6 does not reach (JDBC, certain messaging protocols), JMeter was retained deliberately rather than deprecated prematurely.
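The baseline comparison can be illustrated with one simple statistical rule. The source does not say which statistic was used, so the rule below (flag a run whose p95 latency exceeds the historical mean by more than `k` sample standard deviations) is an assumption, as are the function names and sample numbers.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample standard deviation (n - 1 in the denominator).
function sampleStdDev(values: number[]): number {
  const m = mean(values);
  const squared = values.reduce((sum, v) => sum + (v - m) * (v - m), 0);
  return Math.sqrt(squared / (values.length - 1));
}

// Upper bound of the baseline band: mean + k standard deviations.
function baselineLimit(history: number[], k: number): number {
  return mean(history) + k * sampleStdDev(history);
}

// A run regresses when its p95 latency exceeds the baseline band.
// With fewer than two historical runs there is no spread to compare against.
function isRegression(history: number[], currentP95: number, k: number = 3): boolean {
  if (history.length < 2) return false;
  return currentP95 > baselineLimit(history, k);
}

// Illustrative p95 latencies in milliseconds from earlier runs.
const p95History = [100, 102, 98, 101, 99];
console.log(isRegression(p95History, 104)); // inside the band
console.log(isRegression(p95History, 110)); // outside the band
```

For this sample history the mean is 100 ms and the sample standard deviation is about 1.58 ms, so the `k = 3` limit sits a little under 105 ms.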

K6 performance baseline three-tier execution model shown on a timeline axis. Tier 1 on flagged pull requests: lightweight performance smoke, fast and cheap, catches gross regressions before merge. Tier 2 post-merge to main: full baseline comparison, results compared against statistically-derived baseline, regressions surface in PR review. Tier 3 nightly on staging: full suite with APM-derived workload models covering user distributions, time-of-day patterns, and channel mix from Dynatrace production data.

Deriving workload models from Dynatrace APM data rather than synthetic assumptions means K6 scenarios reflect actual production traffic patterns — not best-guess load profiles. Regressions surface in PR review, not in a post-release incident.
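A sketch of turning an observed channel mix into k6 scenario definitions. The output objects use the fields of k6's `constant-arrival-rate` executor (`rate`, `timeUnit`, `duration`, `preAllocatedVUs`, `exec`); the channel names, percentages, target rate, and the two-VUs-per-request-per-second allocation are assumptions, not figures from the engagement.

```typescript
// Hypothetical input: share of production traffic per channel, in percent.
interface ChannelShare {
  channel: string;
  sharePercent: number;
}

// Shape matching k6's constant-arrival-rate executor options.
interface ArrivalScenario {
  executor: "constant-arrival-rate";
  rate: number; // iterations started per timeUnit
  timeUnit: string;
  duration: string;
  preAllocatedVUs: number;
  exec: string; // name of the exported k6 function to run
}

// Splits a total target rate across channels in proportion to observed traffic.
function toScenarios(
  mix: ChannelShare[],
  totalRatePerSecond: number,
  duration: string,
): Record<string, ArrivalScenario> {
  const totalShare = mix.reduce((sum, m) => sum + m.sharePercent, 0);
  const scenarios: Record<string, ArrivalScenario> = {};
  for (const m of mix) {
    const rate = Math.max(1, Math.round((m.sharePercent / totalShare) * totalRatePerSecond));
    scenarios[m.channel] = {
      executor: "constant-arrival-rate",
      rate,
      timeUnit: "1s",
      duration,
      preAllocatedVUs: rate * 2, // assumption: simple headroom rule
      exec: m.channel,
    };
  }
  return scenarios;
}

// Illustrative mix only.
const channelMix: ChannelShare[] = [
  { channel: "web", sharePercent: 60 },
  { channel: "mobile", sharePercent: 30 },
  { channel: "api", sharePercent: 10 },
];
const k6Scenarios = toScenarios(channelMix, 50, "5m");
console.log(k6Scenarios);
```

The returned object could be assigned to `options.scenarios` in a k6 script, provided the script exports a function for each `exec` name.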

Engagement summary

Field            Detail
Duration         Sep 2025 – current
Role             Senior Test Automation Lead
Reporting line   QA Lead / Engineering Manager
Team             Cross-functional QA pod (onshore + offshore)

Reference

Reference from QA Lead / Engineering Manager available on request at screening stage.

Matching your brief? Get in touch.