AI-Driven Quality Engineering Architect · Available for new engagements · Australia

SNK Digital
Open R&D · April 2026 – ongoing

AgentQE Continuum — Event-Driven, Self-Healing QA Validation System

Event-driven LangGraph pipeline with confidence-scored decisions, multi-CI ingestion, pgvector RAG, and a self-healing fix sub-graph — full event-to-fix lifecycle.

LangGraph · Claude API · TypeScript · PostgreSQL · pgvector · pg-boss · Playwright · Next.js · ts-morph

Project context

Every client engagement I take on in AI-augmented testing runs into the same friction: CI failures land, engineers triage manually, and the gap between "test failed" and "root cause understood" consumes disproportionate engineering time. I wanted to prove out a complete architectural answer to that problem — not as a deliverable for a client, but as something I owned end-to-end and could iterate on without external constraints.

AgentQE Continuum is that answer. I designed and built it as a personal engineering R&D project: an event-driven pipeline that connects to a GitHub Actions webhook, processes Playwright test failures through a LangGraph analysis pipeline, applies a confidence-scored decision engine, and, in the planned Phase 3, will generate and verify autonomous fixes. Every architectural pattern in the system (the LangGraph node topology, the deterministic fallback strategy, the pg-boss job queue design, the multi-provider LLM tier routing) is something I now carry into client engagements with production evidence behind it rather than theory.

The project documentation defines 6 work streams and 87 planned tasks. Work is ongoing.

Six stat tiles summarising AgentQE Continuum: a 7-node LangGraph analysis pipeline; 3 decision outputs — escalate, report, and annotate_pr; 6 work streams across the project; 87 planned tasks in the docs; 166 Vitest tests at Phase 1 completion; and Phase 4 shipped on 2026-05-19.

The project spans every layer from event ingestion to autonomous fix generation — a complete architectural answer proven end-to-end before being carried into client engagements.

The win

"I designed and built a full event-to-decision pipeline in TypeScript — GitHub webhook to LangGraph to confidence-scored output — that handles the complete CI failure lifecycle: HMAC-verified ingestion, Playwright JSON parsing with flake detection, pgvector RAG retrieval of historical failure patterns, LangGraph 7-node analysis graph with AI clustering and deterministic Levenshtein fallback, a decision engine that emits escalate / report / annotate_pr with explainable confidence scores, Slack and Jira notification on escalation, and a self-healing sub-graph (Phase 3) that classifies failure type, generates a ts-morph AST fix, triggers a CI verification run, and merges only if no previously-passing test regresses. The key architectural contribution is that every decision is explainable without AI involvement — the system never produces a black box output."

Five-layer pipeline continuum from left to right: Layer 1 Event Normalisation with POST API, HMAC-SHA256 verification, idempotency, and PR advisory lock; Layer 2 Ingestion with Playwright JSON parsing, Git blame, and pgvector RAG retrieval; Layer 3 Analysis — a 7-node LangGraph pipeline from ingestSignals through to notify; Layer 4 Decision Engine with computeConfidence, escalation threshold above 10 tests or confidence below 0.30, and circuit breaker opening after 3 failures for 30 seconds; Layer 5 Action Layer dispatching PR comments, Slack alerts, Jira tickets, retry triggers, and the Phase 3 fix sub-graph.

The pipeline is a strict left-to-right flow: a webhook enters at Layer 1 and a confidence-scored, human-readable action exits at Layer 5. The LangGraph analysis layer is the only non-deterministic component — every other layer is deterministic and auditable without AI involvement.

High-level architecture

AgentQE Continuum — Pipeline (Event → Ingestion → Analysis → Decision → Action)

Event normalisation layer

  • POST /api/v1/events — single ingestion endpoint for all CI providers
  • GitHub Actions / Jenkins / GitLab CI / Woodpecker adapters (header-routed)
  • HMAC-SHA256 signature verification per provider
  • Idempotency via PostgreSQL unique constraint on event ID
  • PR-level PostgreSQL advisory lock — prevents concurrent pipeline runs on the same pull request
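
As an illustration of the signature check listed above, here is a minimal sketch using Node's built-in crypto module. The function name verifySignature and the GitHub-style "sha256=<hex>" header format are assumptions for the example; other providers sign differently, and this is not the project's actual adapter code.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: checks a "sha256=<hex>" signature header against the
// raw request body. Header format and function name are assumptions.
function verifySignature(
  rawBody: string,
  secret: string,
  header: string | undefined,
): boolean {
  const prefix = "sha256=";
  if (!header || !header.startsWith(prefix)) return false;

  // Recompute the HMAC over the exact bytes that were received.
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(header.slice(prefix.length), "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The comparison runs in constant time so that an attacker cannot learn the expected digest byte by byte from response timing. The check has to run on the raw body, before any JSON parsing re-serialises it.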

Ingestion layer

  • Playwright JSON report parsing — extract signals, detect flaky tests
  • Git blame and diff fetching for failing test files
  • pgvector similarity search against historical failure_patterns (RAG context, Phase 2)
  • NormalisedSignal[] persisted to PostgreSQL; artifact URLs stored for screenshots, videos, logs
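
Flake detection can be sketched as a pure function over per-attempt statuses. The TestAttempts shape below is a simplified, hypothetical flattening (the real Playwright JSON report nests results more deeply), and the rule "failed at least once, passed on the last attempt" is one common definition of flaky, assumed here rather than taken from the project.

```typescript
// Status values mirror Playwright's per-attempt result statuses.
type AttemptStatus = "passed" | "failed" | "timedOut" | "skipped" | "interrupted";
type Verdict = "passed" | "failed" | "flaky" | "skipped";

// Hypothetical flattened shape: one entry per test, attempts in retry order.
interface TestAttempts {
  title: string;
  file: string;
  attempts: AttemptStatus[];
}

function classify(test: TestAttempts): Verdict {
  const ran = test.attempts.filter((status) => status !== "skipped");
  if (ran.length === 0) return "skipped";

  const last = ran[ran.length - 1];
  const anyFailure = ran.some((status) => status !== "passed");

  // Passed in the end but not on every attempt: treat as flaky.
  if (last === "passed") return anyFailure ? "flaky" : "passed";
  return "failed";
}

// Split a report into signals worth analysing and flaky tests to track.
function partition(tests: TestAttempts[]): { failed: TestAttempts[]; flaky: TestAttempts[] } {
  return {
    failed: tests.filter((t) => classify(t) === "failed"),
    flaky: tests.filter((t) => classify(t) === "flaky"),
  };
}
```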

Analysis layer — LangGraph 7-node pipeline

  • ingestSignals → deterministic parse and validate
  • validateContracts → Ajv + OpenAPI 3.x schema validation on API response signals
  • retrieveContext → pgvector + Voyage AI voyage-3 top-K exemplar retrieval (no-op when VOYAGE_API_KEY absent)
  • clusterFailures → AI semantic grouping (Claude Sonnet high tier) with RAG context; Levenshtein 70% threshold fallback when LLM unavailable
  • generateSummary → AI narrative (Claude Haiku medium tier); template fallback in degraded mode
  • makeDecision → deterministic confidence scoring with AI hybrid; emits escalate / report / annotate_pr
  • notify → Slack webhook + Jira ticket creation for escalated clusters only
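
The deterministic fallback for clusterFailures can be sketched as follows. This is a minimal example assuming that the 70% threshold is applied to a normalised Levenshtein similarity between error messages and that clustering is a greedy single pass; the function names and the greedy strategy are assumptions for illustration.

```typescript
// Classic two-row dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Greedy clustering: each message joins the first cluster whose first member
// (its representative) is at least `threshold` similar, else starts a new one.
function clusterByMessage(messages: string[], threshold = 0.7): string[][] {
  const clusters: string[][] = [];
  for (const message of messages) {
    const home = clusters.find((c) => similarity(c[0], message) >= threshold);
    if (home) home.push(message);
    else clusters.push([message]);
  }
  return clusters;
}
```

A greedy pass depends on input order and is quadratic in the worst case, which is acceptable for the size of a single CI run. Its value as a fallback is that the same input always produces the same clusters.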

Decision engine

  • computeConfidence() — error consistency + stack-origin match + historical pattern match, minus mixed-error and degraded penalties
  • Escalation threshold: >10 affected tests or confidence score below 0.30
  • Every decision record carries a degraded boolean — downstream consumers always know whether AI was involved
  • Circuit breaker on LLM calls: opens for 30 seconds after 3 consecutive failures; pipeline continues in deterministic fallback mode
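
The scoring and escalation rules above can be sketched as a small pure function. The escalation thresholds (more than 10 affected tests, or confidence below 0.30) come from the list; the weights, penalty sizes, and the 0.70 boundary between report and annotate_pr are hypothetical values chosen only to make the example concrete.

```typescript
type Decision = "escalate" | "report" | "annotate_pr";

interface Signals {
  errorConsistency: number; // 0..1
  stackOriginMatch: number; // 0..1
  historicalMatch: number;  // 0..1
  mixedErrors: boolean;
  degraded: boolean;        // true when an AI node fell back to its deterministic path
}

// Hypothetical weights and penalties, not the project's real values.
const WEIGHTS = { errorConsistency: 0.4, stackOriginMatch: 0.3, historicalMatch: 0.3 };
const MIXED_ERROR_PENALTY = 0.15;
const DEGRADED_PENALTY = 0.1;
const REPORT_CUTOFF = 0.7; // assumed boundary between report and annotate_pr

function computeConfidence(s: Signals): number {
  const raw =
    WEIGHTS.errorConsistency * s.errorConsistency +
    WEIGHTS.stackOriginMatch * s.stackOriginMatch +
    WEIGHTS.historicalMatch * s.historicalMatch -
    (s.mixedErrors ? MIXED_ERROR_PENALTY : 0) -
    (s.degraded ? DEGRADED_PENALTY : 0);
  return Math.min(1, Math.max(0, raw)); // clamp to [0, 1]
}

function makeDecision(s: Signals, affectedTests: number) {
  const confidence = computeConfidence(s);
  let action: Decision;
  if (affectedTests > 10 || confidence < 0.3) action = "escalate";
  else if (confidence >= REPORT_CUTOFF) action = "report";
  else action = "annotate_pr";
  // The degraded flag travels with every decision record.
  return { action, confidence, degraded: s.degraded };
}
```

Because the score is a plain weighted sum, each decision can be explained by listing the inputs and penalties that produced it.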

Action layer

  • PR comment (create or update via GitHub App + Octokit)
  • Slack alert via @slack/webhook (escalated decisions only)
  • Jira ticket creation via Jira Cloud REST API (escalated decisions only)
  • Retry trigger (re-run failed CI jobs)
  • Self-healing fix sub-graph (Phase 3): ts-morph AST ops → git branch + commit → CI verification → merge or discard
AgentQE Continuum — event-to-action call sequence

The pipeline is fully asynchronous: the webhook endpoint returns immediately and the analysis runs via pg-boss. Every AI node has a deterministic fallback path, and every decision carries a degraded boolean so downstream consumers always know whether an LLM was involved.

What was built (phase-by-phase)

Phase timeline with five markers. Phase 1: MVP event-to-decision backbone with 7-node LangGraph pipeline, 166 Vitest tests at completion. Phase 2: RAG retrieval and full action layer with pgvector plus Voyage AI voyage-3 1024-dimensional embeddings. Phase 2.5: self-hosted Grafana LGTM observability stack with 10 Prometheus alert rules and worker role split into api, pipeline, actions, and scheduler. Phase 3: autonomous fix engine — planned and architecturally specified, next active development phase. Phase 4: Next.js 15 dashboard with GitHub OAuth, RBAC, and prompt version management, shipped 2026-05-19.

The build sequence reflects deliberate prioritisation: the deterministic backbone (Phase 1) had to be proven before RAG and actions were added (Phase 2), and the observability layer (Phase 2.5) was added before the dashboard (Phase 4) so the data was there to surface.

AgentQE Continuum — Phases 1–4

Phase 1 — MVP: event-to-decision backbone

  • Full webhook ingestion path: HMAC verification, idempotency, PR advisory lock
  • Playwright JSON parser with flake detection
  • LangGraph pipeline (7 nodes) with a deterministic fallback on every AI node (Levenshtein clustering, template summary)
  • Confidence-scored decision engine (escalate / report / annotate_pr)
  • PR comment action via GitHub App
  • pg-boss async job queue — analysis pipeline runs fully asynchronously, decoupled from the webhook response
  • Prometheus metrics at /metrics; Pino structured JSON logging with per-module child loggers
  • Vitest unit + integration test suite (166 tests at Phase 1 completion)

Phase 2 — RAG retrieval + full action layer

  • retrieveContext node: pgvector + Voyage AI voyage-3 1024-dim embeddings for semantic failure pattern retrieval
  • embed-patterns pg-boss background worker — embeds newly-seen failure patterns after each run
  • Slack webhook notification for escalated clusters
  • Jira Cloud ticket creation with deduplication (clusters.jira_ticket_key prevents re-filing)
  • LangGraph checkpoint store (PostgreSQL-backed) wired for the fix sub-graph
  • Parallel node execution: clusterFailures + analyseDiff run concurrently

Phase 2.5 — Observability + worker role split

  • Self-hosted Grafana LGTM stack (Grafana, Loki, Tempo, Prometheus, Alertmanager, Alloy) — three-pillar telemetry with no SaaS dependency
  • Every Pino log line carries trace_id and span_id — log-to-trace correlation in one click in Grafana Explore
  • 10 Prometheus alert rules covering pipeline failure rate, LLM circuit-breaker state, dead-letter queue buildup, and DB pool saturation
  • Worker role split (WORKER_ROLE env var): api / pipeline / actions / scheduler — same Docker image, independently scalable subsystems

Phase 4 — Next.js dashboard (shipped 2026-05-19)

  • Next.js 15 dashboard with GitHub OAuth (NextAuth)
  • Run timeline, cluster drill-down, flakiness dashboard, admin console
  • Pattern library browser, learning reports, override analytics
  • RBAC: Developer / QA Engineer / Engineering Manager / DevOps with per-route access matrix
  • Prompt version management UI at /admin/prompts
  • Full audit log at /admin/audit (append-only, every admin write action)

Architectural patterns proven (the portable lessons)

A grid of five portable architectural patterns. Pattern 1: LangGraph with deterministic fallback at every node — clusterFailures uses Levenshtein at 70 percent similarity threshold when the LLM is unavailable, generateSummary falls back to a template, circuit breaker coordinates the switch after 3 consecutive failures opening for 30 seconds. Pattern 2: confidence-scored decisions with explainability — four weighted signals produce escalate, report, or annotate_pr outputs with a degraded boolean on every record. Pattern 3: pg-boss with PR-level PostgreSQL advisory locks preventing concurrent webhook race conditions. Pattern 4: prompt versioning with the database as source of truth — re-analyse button replays the pipeline against already-ingested signals. Pattern 5: multi-provider LLM tier routing behind a single abstraction boundary — provider and model resolved from env vars per tier with zero code changes to switch.

These five patterns recur in every AI-augmented client engagement. Proving them in a system I own end-to-end means I can recommend them with production evidence rather than theory.

LangGraph as orchestration with deterministic fallback at every node. The pipeline is designed so that every AI node has a deterministic path that fires when the LLM is unavailable. clusterFailures falls back to Levenshtein distance clustering (70% similarity threshold); generateSummary falls back to a template. The circuit breaker coordinates the switch: after 3 consecutive LLM failures it opens for 30 seconds, and the pipeline runs in fallback mode until it closes. No event is dropped; every decision carries a degraded boolean. This pattern is directly portable to any client pipeline where LLM availability cannot be guaranteed. The Tier 4 LLM re-binding sub-graph in particular has been generalised into a reusable reference architecture: the Agentic Self-Healing Test Framework pattern doc covers the tiered locator recovery model, confidence gating, and audit trail design that originated here.
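
A minimal sketch of the breaker-plus-fallback pattern, using the 3-failure and 30-second figures from the text. The class and function names are hypothetical, the clock is injected so the behaviour can be tested, and the reset to a fully closed state after the cool-down is a simplification (there is no half-open probing).

```typescript
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly openMs: number;
  private readonly now: () => number;

  constructor(threshold = 3, openMs = 30000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.openMs = openMs;
    this.now = now;
  }

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.openMs) {
      // Cool-down elapsed: close the breaker and start counting afresh.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) this.openedAt = this.now();
  }
}

// Runs the LLM-backed path unless the breaker is open or the call throws;
// either way a value is produced and tagged with whether it was degraded.
async function withFallback<T>(
  breaker: CircuitBreaker,
  primary: () => Promise<T>,
  fallback: () => T,
): Promise<{ value: T; degraded: boolean }> {
  if (breaker.isOpen()) return { value: fallback(), degraded: true };
  try {
    const value = await primary();
    breaker.recordSuccess();
    return { value, degraded: false };
  } catch {
    breaker.recordFailure();
    return { value: fallback(), degraded: true };
  }
}
```

The wrapper always returns a result, which is what lets the pipeline keep processing events while the LLM is down, and the degraded flag is set at the point where the fallback is chosen.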

Confidence-scored decision engine with explicit explainability. The decision engine computes confidence from four weighted signals (error consistency, stack-trace origin match, historical pattern match, degraded mode penalty) and maps the result to a human-readable action with supporting evidence. Engineers can inspect exactly why the system chose to escalate rather than report. This is more trustworthy than an LLM emitting a string decision with no audit trail, and it is the pattern I bring to client triage tooling.

pg-boss for resilient job queuing with PR-level locking. Choosing a PostgreSQL-native queue eliminated a Redis dependency and unified the operational surface. pg-boss handles job deduplication, dead-letter queues, cron scheduling, and retry with backoff. PR-level advisory locks in the event processor prevent race conditions when multiple webhooks arrive for the same pull request concurrently — a problem that emerges in high-velocity repos and is invisible until it bites in production.
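
One way to sketch the PR-level lock is to derive a stable 64-bit key from the repository and PR number and pass it to PostgreSQL's pg_advisory_xact_lock(bigint). The key derivation below (first 8 bytes of a SHA-256 digest) and the names advisoryLockKey and LOCK_SQL are assumptions for illustration; the source does not say how the project computes its lock keys.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key derivation: hash "<repo>#<pr>" and read the first 8 bytes
// as a signed 64-bit integer, returned as a decimal string so it can be bound
// as a bigint query parameter without precision loss.
function advisoryLockKey(repo: string, prNumber: number): string {
  const digest = createHash("sha256").update(`${repo}#${prNumber}`).digest();
  return digest.readBigInt64BE(0).toString();
}

// Transaction-scoped advisory lock: released automatically on COMMIT or
// ROLLBACK, so a crashed worker cannot leave the PR locked.
const LOCK_SQL = "SELECT pg_advisory_xact_lock($1::bigint)";

// Usage inside an open transaction, assuming a node-postgres client:
//   await client.query(LOCK_SQL, [advisoryLockKey("acme/web", 42)]);
```

A second webhook for the same pull request blocks on the same key until the first transaction finishes, while webhooks for other pull requests hash to different keys and proceed in parallel. Hash collisions between unrelated PRs are possible but only cause occasional unnecessary serialisation, never incorrect results.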

Prompt versioning with the database as the source of truth. Prompts are versioned and managed via the /admin/prompts dashboard, not committed as static strings. The re-analyse button on the run detail view re-runs the pipeline with the current prompt version against already-ingested signals — this is how prompt regression testing works in a live system. Human overrides feed back into prompt tuning via the learning layer.

Multi-provider LLM tier routing with a single abstraction boundary. llm-client.ts exposes a single call surface; the provider (Anthropic, OpenAI, Groq, Together, Fireworks, OpenRouter, Ollama, or any OpenAI-compatible backend), model, and temperature are resolved at startup from env vars per tier (high / medium / low). Switching the clustering model from Claude Sonnet to Llama 3.3 served by Groq is an env var change with zero code changes. I now apply this abstraction to any client system where LLM cost or vendor lock-in is a concern.
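
The per-tier resolution can be sketched as below. The env var naming scheme (LLM_<TIER>_PROVIDER, LLM_<TIER>_MODEL, LLM_<TIER>_TEMPERATURE), the default temperature of 0, and the function names are all hypothetical; the source only states that provider, model, and temperature are resolved from env vars per tier.

```typescript
type Tier = "high" | "medium" | "low";

interface TierConfig {
  provider: string;
  model: string;
  temperature: number;
}

type Env = Record<string, string | undefined>;

// Hypothetical naming scheme: LLM_HIGH_PROVIDER, LLM_HIGH_MODEL, and so on.
function resolveTier(tier: Tier, env: Env): TierConfig {
  const prefix = `LLM_${tier.toUpperCase()}_`;
  const provider = env[`${prefix}PROVIDER`];
  const model = env[`${prefix}MODEL`];
  if (!provider || !model) {
    throw new Error(`Missing ${prefix}PROVIDER or ${prefix}MODEL`);
  }

  const rawTemperature = env[`${prefix}TEMPERATURE`];
  const temperature = rawTemperature === undefined ? 0 : Number(rawTemperature);
  if (!Number.isFinite(temperature)) {
    throw new Error(`${prefix}TEMPERATURE is not a number`);
  }
  return { provider, model, temperature };
}

// Resolve all tiers once at startup so misconfiguration fails fast.
function loadTiers(env: Env): Record<Tier, TierConfig> {
  return {
    high: resolveTier("high", env),
    medium: resolveTier("medium", env),
    low: resolveTier("low", env),
  };
}
```

Pipeline nodes then ask for a tier rather than a vendor, so swapping the model behind a tier touches configuration only.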

Stack rationale

Four stack decision cards each showing a chosen approach and the rejected alternative. Decision 1: TypeScript-only monorepo over a Python service boundary, for end-to-end type safety with a single language, single type system, and single pnpm workspace. Decision 2: pgvector over a dedicated vector database — 1024-dimensional Voyage AI embeddings stored in the same PostgreSQL instance as events, clusters, and decisions. Decision 3: LangGraph over LangChain expression chains — graph topology maps to the pipeline, PipelineState eliminates custom plumbing, and the PostgreSQL checkpointer supports mid-execution suspension while awaiting a CI webhook. Decision 4: pg-boss over Redis-based queues — runs on existing PostgreSQL, advisory locks for PR-level deduplication, scheduler role must never exceed one replica.

Every stack choice reduces the operational surface. The rejected alternatives are faster to set up, but each adds a dependency or a cross-language boundary whose cost compounds over time.

TypeScript-only monorepo (no Python). The entire system — Express API, LangGraph pipeline, Next.js dashboard — runs on Node.js with strict TypeScript. A single language, a single type system, and a single pnpm workspace means end-to-end type safety without a Python service boundary. The LangGraph TypeScript SDK (@langchain/langgraph) is a first-class API, not a wrapper — this was a deliberate bet that the TS ecosystem for AI pipelines had matured enough to go language-pure. Many of the same architectural bets — TypeScript-native framework, LLM-driven triage, confidence-scored decisions — were applied in a production energy-sector client engagement; the AI-Augmented Playwright Test Pipeline case study shows how these patterns translate under the constraints of a large enterprise QA function.

pgvector over a dedicated vector database. Storing 1024-dimensional Voyage AI embeddings in the same PostgreSQL instance that holds events, clusters, and decisions keeps the operational stack simple: one database, one backup policy, one connection pool. PGVectorStore from @langchain/community handles upsert and HNSW index queries. The retrieval problem — find failure patterns similar to this new failure — is vector similarity search, not graph traversal, so a dedicated vector store adds complexity without benefit.
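
To make the retrieval step concrete, the sketch below computes in memory what a cosine-distance top-K query asks the database to do: rank stored patterns by cosine similarity to the query embedding and keep the best K. It is an illustration of the computation only, with hypothetical names; in the system described here the search runs inside PostgreSQL via pgvector, not in application code.

```typescript
interface StoredPattern {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Brute-force top-K: score everything, sort descending, keep the first k.
function topK(query: number[], patterns: StoredPattern[], k: number): StoredPattern[] {
  return patterns
    .map((pattern) => ({ pattern, score: cosineSimilarity(query, pattern.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.pattern);
}
```

An HNSW index returns approximately the same ranking without scoring every row, which is why the index matters once the pattern table grows.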

LangGraph over LangChain expression chains. LangGraph's graph-based execution model maps directly to the pipeline topology (ingest → validate → cluster → summarise → decide). Built-in state management (PipelineState) eliminates custom plumbing for passing context between nodes. Parallel node execution (clusterFailures + analyseDiff concurrently) is a one-line change. LangGraph's PostgreSQL checkpointer is used specifically for the fix sub-graph, where the graph needs to suspend mid-execution while waiting for a CI webhook — a checkpoint-based design that LangChain chains cannot express.

pg-boss over Redis-based queues. Redis adds an operational dependency and a separate persistence boundary. pg-boss runs on the existing PostgreSQL instance, supports advisory locks (critical for PR-level deduplication), and provides dead-letter queues and cron scheduling without a separate scheduler service. The scheduler worker role must never be scaled past one replica — pg-boss does not deduplicate cron-job inserts by handler, so two schedulers fire every job twice. This is a known and documented constraint, not a deficiency.

Current status

Two side-by-side cards. Left card, green, shows Live and Fully Operational capabilities: webhook ingestion with HMAC verification and idempotency, LangGraph 7-node analysis pipeline, confidence-scored decisions with PR comments, Slack alerts and Jira ticket creation, pgvector RAG with Voyage AI voyage-3 embeddings, self-hosted Grafana LGTM observability stack with 10 alert rules, and Next.js 15 dashboard shipped 2026-05-19. Right card, slate with dashed border, shows Phase 3 — specified but not yet implemented: autonomous fix generation engine using ts-morph AST operations, guardrails of test files only with max 50 lines across 5 files and 120-second timeout, and four autonomy levels from L0 shadow through to L3 auto-merge.

The pipeline is fully operational from webhook to action. Phase 3 is the next boundary: moving from a system that explains and notifies to one that proposes and fixes — with the guardrails already designed into the architecture.

Phases 1, 2, 2.5, and 4 are complete. Phase 4, the Next.js dashboard, shipped on 2026-05-19. The pipeline is fully operational: webhook ingestion, LangGraph analysis, confidence-scored decisions, PR comments, Slack alerts, Jira tickets, RAG retrieval, and the observability stack are all live.

Phase 3 — the autonomous fix generation engine (app/src/fix/ directory, ts-morph AST operations, verification gate, autonomy level progression) — is planned and architecturally specified but not yet implemented. The fix sub-graph structure, guardrails (test files only, max 50 lines across 5 files, 120-second timeout), and the four autonomy levels (L0 shadow through L3 auto-merge) are fully designed in the architecture document. Implementation is the next phase of active development.
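
Since Phase 3 is not yet implemented, the following is only a sketch of how the stated guardrails and the "no previously-passing test regresses" merge gate could be expressed as pure checks. The limits (5 files, 50 changed lines, 120 seconds) come from the text; the type names, the test-file filename pattern, and the choice to count a test missing from the verification run as a regression are assumptions.

```typescript
interface FileChange {
  path: string;
  changedLines: number;
}

type Outcome = "passed" | "failed";

const LIMITS = { maxFiles: 5, maxChangedLines: 50, timeoutMs: 120000 };

// Assumed convention for what counts as a test file.
const TEST_FILE = /\.(spec|test)\.[cm]?[jt]sx?$/;

// Returns human-readable violations; an empty list means the fix may proceed.
function guardrailViolations(changes: FileChange[]): string[] {
  const violations: string[] = [];
  const nonTest = changes.filter((c) => !TEST_FILE.test(c.path));
  if (nonTest.length > 0) {
    violations.push(`non-test files touched: ${nonTest.map((c) => c.path).join(", ")}`);
  }
  if (changes.length > LIMITS.maxFiles) {
    violations.push(`too many files: ${changes.length} > ${LIMITS.maxFiles}`);
  }
  const total = changes.reduce((sum, c) => sum + c.changedLines, 0);
  if (total > LIMITS.maxChangedLines) {
    violations.push(`too many changed lines: ${total} > ${LIMITS.maxChangedLines}`);
  }
  return violations;
}

// Tests that passed before the fix and do not pass in the verification run.
function regressedTests(
  before: Record<string, Outcome>,
  after: Record<string, Outcome>,
): string[] {
  return Object.keys(before).filter(
    (name) => before[name] === "passed" && after[name] !== "passed",
  );
}

function mayMerge(changes: FileChange[], before: Record<string, Outcome>, after: Record<string, Outcome>): boolean {
  return guardrailViolations(changes).length === 0 && regressedTests(before, after).length === 0;
}
```

Keeping these checks deterministic and separate from fix generation means an LLM-proposed change can never widen its own limits.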

The patterns in this system are what I now bring to client engagements. AgentQE Continuum is where they were proven.
