Project context
Every client engagement I take on in AI-augmented testing runs into the same friction: CI failures land, engineers triage manually, and the gap between "test failed" and "root cause understood" consumes disproportionate engineering time. I wanted to prove out a complete architectural answer to that problem — not as a deliverable for a client, but as something I owned end-to-end and could iterate on without external constraints.
AgentQE Continuum is that answer. I designed and built it as a personal engineering R&D project: an event-driven pipeline that connects to a GitHub Actions webhook, processes Playwright test failures through a LangGraph analysis pipeline, applies a confidence-scored decision engine, and (in the planned Phase 3) will generate and verify autonomous fixes. Every architectural pattern in the system (the LangGraph node topology, the deterministic fallback strategy, the pg-boss job queue design, the multi-provider LLM tier routing) is something I now carry into client engagements with production evidence behind it rather than theory.
The project spans 6 work streams and 87 planned tasks across the docs. It is ongoing.
The project spans every layer from event ingestion to autonomous fix generation — a complete architectural answer proven end-to-end before being carried into client engagements.
The win
"I designed and built a full event-to-decision pipeline in TypeScript — GitHub webhook to LangGraph to confidence-scored output — that handles the complete CI failure lifecycle: HMAC-verified ingestion, Playwright JSON parsing with flake detection, pgvector RAG retrieval of historical failure patterns, LangGraph 7-node analysis graph with AI clustering and deterministic Levenshtein fallback, a decision engine that emits
escalate/report/annotate_pr with explainable confidence scores, Slack and Jira notification on escalation, and a self-healing sub-graph (Phase 3) that classifies failure type, generates a ts-morph AST fix, triggers a CI verification run, and merges only if no previously-passing test regresses. The key architectural contribution is that every decision is explainable without AI involvement: the system never produces a black-box output."
The pipeline is a strict left-to-right flow: a webhook enters at Layer 1 and a confidence-scored, human-readable action exits at Layer 5. The LangGraph analysis layer is the only non-deterministic component — every other layer is deterministic and auditable without AI involvement.
High-level architecture
AgentQE Continuum — Pipeline (Event → Ingestion → Analysis → Decision → Action)
Event normalisation layer
- POST /api/v1/events: single ingestion endpoint for all CI providers
- GitHub Actions / Jenkins / GitLab CI / Woodpecker adapters (header-routed)
- HMAC-SHA256 signature verification per provider
- Idempotency via PostgreSQL unique constraint on event ID
- PR-level PostgreSQL advisory lock — prevents concurrent pipeline runs on the same pull request
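The signature check in this layer can be sketched as follows. This is a minimal illustration, assuming a GitHub-style header value of the form sha256=<hex digest> computed over the raw request body; the function names verifySignature and sign are hypothetical and not taken from the project.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: checks a "sha256=<hex>" signature header against the
// raw request body. Names are illustrative, not the project's own.
function verifySignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith("sha256=")) return false;

  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHeader.slice("sha256=".length), "hex");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Test-side counterpart: produces the header value a sender would attach.
function sign(rawBody: string, secret: string): string {
  const digest = createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
  return "sha256=" + digest;
}
```

The comparison runs on the raw body rather than re-serialised JSON, because any change in whitespace or key order would alter the digest.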
Ingestion layer
- Playwright JSON report parsing — extract signals, detect flaky tests
- Git blame and diff fetching for failing test files
- pgvector similarity search against historical failure_patterns (RAG context, Phase 2)
- NormalisedSignal[] persisted to PostgreSQL; artifact URLs stored for screenshots, videos, logs
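Flake detection over a Playwright JSON report might look like the sketch below. The interfaces are a simplified subset of the report shape and should be treated as an assumption; the rule used here (a test is flaky when it failed at least once and passed on its final attempt) is one common definition, not necessarily the one the project applies.

```typescript
// Simplified, assumed subset of the Playwright JSON report structure.
interface TestResult {
  status: "passed" | "failed" | "timedOut" | "skipped" | "interrupted";
  retry: number;
}
interface TestEntry { results: TestResult[] }
interface Spec { title: string; file: string; tests: TestEntry[] }
interface Suite { specs: Spec[]; suites?: Suite[] }

type Outcome = "passed" | "failed" | "flaky" | "skipped";

// Classify one test from its attempts, ordered by retry index.
function classify(test: TestEntry): Outcome {
  const statuses = [...test.results]
    .sort((a, b) => a.retry - b.retry)
    .map((r) => r.status);
  if (statuses.length === 0 || statuses.every((s) => s === "skipped")) return "skipped";

  const last = statuses[statuses.length - 1];
  const anyFailure = statuses.some((s) => s === "failed" || s === "timedOut");
  if (last === "passed") return anyFailure ? "flaky" : "passed";
  return "failed";
}

// Walk nested suites and return "file > title" for every flaky spec.
function collectFlaky(suites: Suite[]): string[] {
  const flaky: string[] = [];
  for (const suite of suites) {
    for (const spec of suite.specs) {
      if (spec.tests.some((t) => classify(t) === "flaky")) {
        flaky.push(`${spec.file} > ${spec.title}`);
      }
    }
    flaky.push(...collectFlaky(suite.suites ?? []));
  }
  return flaky;
}
```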
Analysis layer — LangGraph 7-node pipeline
- ingestSignals → deterministic parse and validate
- validateContracts → Ajv + OpenAPI 3.x schema validation on API response signals
- retrieveContext → pgvector + Voyage AI voyage-3 top-K exemplar retrieval (no-op when VOYAGE_API_KEY absent)
- clusterFailures → AI semantic grouping (Claude Sonnet high tier) with RAG context; Levenshtein 70% threshold fallback when LLM unavailable
- generateSummary → AI narrative (Claude Haiku medium tier); template fallback in degraded mode
- makeDecision → deterministic confidence scoring with AI hybrid; emits escalate/report/annotate_pr
- notify → Slack webhook + Jira ticket creation for escalated clusters only
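The deterministic fallback for clusterFailures can be illustrated with a small Levenshtein-based grouping. This is a sketch under assumptions: similarity is taken as 1 minus edit distance divided by the longer string's length, and clustering is a greedy single pass against each cluster's first member. The source states only the 70% threshold, so the exact normalisation and grouping strategy are guesses.

```typescript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means the strings are identical.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Greedy single-pass clustering: each message joins the first cluster whose
// representative (its first member) is at least `threshold` similar.
function clusterByMessage(messages: string[], threshold = 0.7): string[][] {
  const clusters: string[][] = [];
  for (const message of messages) {
    const home = clusters.find((c) => similarity(c[0], message) >= threshold);
    if (home) home.push(message);
    else clusters.push([message]);
  }
  return clusters;
}
```

Greedy assignment depends on input order, which is acceptable for a fallback path where the priority is producing some grouping without an LLM call.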
Decision engine
- computeConfidence(): error consistency + stack-origin match + historical pattern match, minus mixed-error and degraded penalties
- Escalation threshold: >10 affected tests or confidence score below 0.30
- Every decision record carries a degraded boolean, so downstream consumers always know whether AI was involved
- Circuit breaker on LLM calls: 3 consecutive failures open the circuit for 30 seconds; pipeline continues in deterministic fallback mode
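A rough sketch of how such a scoring function and action mapping could fit together is shown below. The escalation rule (more than 10 affected tests, or confidence below 0.30) comes from the list above. Everything else is an assumption made for illustration: the weights, the penalty sizes, the 0.70 cut-off between report and annotate_pr, and the field names.

```typescript
type Action = "escalate" | "report" | "annotate_pr";

interface ClusterSignals {
  errorConsistency: number;       // 0..1
  stackOriginMatch: number;       // 0..1
  historicalPatternMatch: number; // 0..1
  mixedErrors: boolean;
  degraded: boolean;              // true when an AI node fell back
  affectedTests: number;
}

// ASSUMPTION: illustrative weights and penalties, not the project's values.
const WEIGHTS = { errorConsistency: 0.4, stackOriginMatch: 0.3, historicalPatternMatch: 0.3 };
const PENALTIES = { mixedErrors: 0.15, degraded: 0.1 };

function computeConfidence(s: ClusterSignals): number {
  const base =
    WEIGHTS.errorConsistency * s.errorConsistency +
    WEIGHTS.stackOriginMatch * s.stackOriginMatch +
    WEIGHTS.historicalPatternMatch * s.historicalPatternMatch;
  const penalty =
    (s.mixedErrors ? PENALTIES.mixedErrors : 0) + (s.degraded ? PENALTIES.degraded : 0);
  return Math.min(1, Math.max(0, base - penalty));
}

function decide(s: ClusterSignals): { action: Action; confidence: number; degraded: boolean; reasons: string[] } {
  const confidence = computeConfidence(s);
  const reasons: string[] = [];
  if (s.affectedTests > 10) reasons.push(`${s.affectedTests} affected tests exceeds 10`);
  if (confidence < 0.3) reasons.push(`confidence ${confidence.toFixed(2)} is below 0.30`);

  let action: Action;
  if (reasons.length > 0) action = "escalate";
  // ASSUMPTION: the 0.70 boundary between report and annotate_pr is hypothetical.
  else action = confidence >= 0.7 ? "annotate_pr" : "report";

  return { action, confidence, degraded: s.degraded, reasons };
}
```

Returning the reasons alongside the action is what makes the outcome inspectable without re-running any model.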
Action layer
- PR comment (create or update via GitHub App + Octokit)
- Slack alert via @slack/webhook (escalated decisions only)
- Jira ticket creation via Jira Cloud REST API (escalated decisions only)
- Retry trigger (re-run failed CI jobs)
- Self-healing fix sub-graph (Phase 3): ts-morph AST ops → git branch + commit → CI verification → merge or discard
The pipeline is fully asynchronous: the webhook endpoint returns immediately and the analysis runs via pg-boss. Every AI node has a deterministic fallback path, and every decision carries a degraded boolean so downstream consumers always know whether an LLM was involved.
What was built (phase-by-phase)
The build sequence reflects deliberate prioritisation: the deterministic backbone (Phase 1) had to be proven before RAG and actions were added (Phase 2), and the observability layer (Phase 2.5) was added before the dashboard (Phase 4) so the data was there to surface.
AgentQE Continuum — Phases 1–4
Phase 1 — MVP: event-to-decision backbone
- Full webhook ingestion path: HMAC verification, idempotency, PR advisory lock
- Playwright JSON parser with flake detection
- LangGraph pipeline (7 nodes) with Levenshtein fallback throughout
- Confidence-scored decision engine (escalate/report/annotate_pr)
- PR comment action via GitHub App
- pg-boss async job queue — analysis pipeline runs fully asynchronously, decoupled from the webhook response
- Prometheus metrics at /metrics; Pino structured JSON logging with per-module child loggers
- Vitest unit + integration test suite (166 tests at Phase 1 completion)
Phase 2 — RAG retrieval + full action layer
- retrieveContext node: pgvector + Voyage AI voyage-3 1024-dim embeddings for semantic failure pattern retrieval
- embed-patterns pg-boss background worker: embeds newly-seen failure patterns after each run
- Slack webhook notification for escalated clusters
- Jira Cloud ticket creation with deduplication (clusters.jira_ticket_key prevents re-filing)
- LangGraph checkpoint store (PostgreSQL-backed) wired for the fix sub-graph
- Parallel node execution: clusterFailures + analyseDiff run concurrently
Phase 2.5 — Observability + worker role split
- Self-hosted Grafana LGTM stack (Grafana, Loki, Tempo, Prometheus, Alertmanager, Alloy) — three-pillar telemetry with no SaaS dependency
- Every Pino log line carries trace_id and span_id, giving one-click log-to-trace correlation in Grafana Explore
- 10 Prometheus alert rules covering pipeline failure rate, LLM circuit-breaker state, dead-letter queue buildup, and DB pool saturation
- Worker role split (WORKER_ROLE env var): api / pipeline / actions / scheduler; same Docker image, independently scalable subsystems
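The role split can be pictured as a small start-up switch. The four role names and the WORKER_ROLE variable come from the list above; the parsing rules (trimming, case-insensitivity, failing fast when the value is missing or unknown) and the function name are assumptions for illustration.

```typescript
const WORKER_ROLES = ["api", "pipeline", "actions", "scheduler"] as const;
type WorkerRole = (typeof WORKER_ROLES)[number];

// ASSUMPTION: a missing or unknown role is treated as a start-up error.
function parseWorkerRole(raw: string | undefined): WorkerRole {
  const value = (raw ?? "").trim().toLowerCase();
  const match = WORKER_ROLES.find((role) => role === value);
  if (!match) {
    throw new Error(`WORKER_ROLE must be one of ${WORKER_ROLES.join(", ")}; got "${raw ?? ""}"`);
  }
  return match;
}

// Hypothetical description of what each role would start in the shared image.
const ROLE_DESCRIPTION: Record<WorkerRole, string> = {
  api: "HTTP server and webhook ingestion",
  pipeline: "analysis job consumers",
  actions: "PR comment, Slack, and Jira job consumers",
  scheduler: "cron job registration (single replica)",
};
```

A process would call parseWorkerRole(process.env.WORKER_ROLE) once at boot and start only the subsystem that matches.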
Phase 4 — Next.js dashboard (shipped 2026-05-19)
- Next.js 15 dashboard with GitHub OAuth (NextAuth)
- Run timeline, cluster drill-down, flakiness dashboard, admin console
- Pattern library browser, learning reports, override analytics
- RBAC: Developer / QA Engineer / Engineering Manager / DevOps with per-route access matrix
- Prompt version management UI at /admin/prompts
- Full audit log at /admin/audit (append-only, every admin write action)
Architectural patterns proven (the portable lessons)
These five patterns recur in every AI-augmented client engagement. Proving them in a system I own end-to-end means I can recommend them with production evidence rather than theory.
LangGraph as orchestration with deterministic fallback at every node. The pipeline is designed so that every AI node has a deterministic path that fires when the LLM is unavailable. clusterFailures falls back to Levenshtein distance clustering (70% similarity threshold); generateSummary falls back to a template. The circuit breaker — 3 consecutive failures open for 30 seconds — coordinates the switch. No event is dropped; every decision carries a degraded boolean. This pattern is directly portable to any client pipeline where LLM availability cannot be guaranteed. The Tier 4 LLM re-binding sub-graph in particular has been generalised into a reusable reference architecture — the Agentic Self-Healing Test Framework pattern doc covers the tiered locator recovery model, confidence gating, and audit trail design that originated here.
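The breaker-plus-fallback behaviour described here can be sketched in a few lines. The thresholds (3 consecutive failures, 30 seconds open) are from the text; the class shape, the injectable clock, and the withFallback helper are hypothetical and only show one way the pieces could fit.

```typescript
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly openMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, openMs = 30_000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.now = now;
  }

  isOpen(): boolean {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.openMs) {
      // Cool-down elapsed: close the circuit and let the next call through.
      this.openedAt = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAt = this.now();
  }
}

// Runs the LLM-backed path unless the circuit is open; any failure yields the
// deterministic result with degraded set to true.
async function withFallback<T>(
  breaker: CircuitBreaker,
  primary: () => Promise<T>,
  fallback: () => T,
): Promise<{ value: T; degraded: boolean }> {
  if (breaker.isOpen()) return { value: fallback(), degraded: true };
  try {
    const value = await primary();
    breaker.recordSuccess();
    return { value, degraded: false };
  } catch {
    breaker.recordFailure();
    return { value: fallback(), degraded: true };
  }
}
```

Because the fallback result is always returned, a failing provider slows nothing down after the third error: calls skip the LLM entirely until the cool-down ends.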
Confidence-scored decision engine with explicit explainability. The decision engine computes confidence from four weighted signals (error consistency, stack-trace origin match, historical pattern match, degraded mode penalty) and maps the result to a human-readable action with supporting evidence. Engineers can inspect exactly why the system chose to escalate rather than report. This is more trustworthy than an LLM emitting a string decision with no audit trail, and it is the pattern I bring to client triage tooling.
pg-boss for resilient job queuing with PR-level locking. Choosing a PostgreSQL-native queue eliminated a Redis dependency and unified the operational surface. pg-boss handles job deduplication, dead-letter queues, cron scheduling, and retry with backoff. PR-level advisory locks in the event processor prevent race conditions when multiple webhooks arrive for the same pull request concurrently — a problem that emerges in high-velocity repos and is invisible until it bites in production.
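One way to take a PR-level lock is PostgreSQL's transaction-scoped pg_try_advisory_xact_lock, keyed on a hash of the repository and pull request number. The sketch below assumes that approach; the key derivation, the Queryable interface, and the function names are hypothetical, and the project may derive its keys differently.

```typescript
import { createHash } from "node:crypto";

// Derive two signed 32-bit keys from "owner/repo#number" for the
// two-argument form of the advisory lock functions.
function prLockKeys(repoFullName: string, prNumber: number): [number, number] {
  const digest = createHash("sha256").update(`${repoFullName}#${prNumber}`).digest();
  return [digest.readInt32BE(0), digest.readInt32BE(4)];
}

// Minimal structural stand-in for a database client inside a transaction.
interface Queryable {
  query(sql: string, params: unknown[]): Promise<{ rows: Array<{ locked: boolean }> }>;
}

// Returns true when this transaction now holds the lock for the pull request.
// The lock is released automatically when the transaction commits or rolls back.
async function tryLockPullRequest(
  tx: Queryable,
  repoFullName: string,
  prNumber: number,
): Promise<boolean> {
  const [k1, k2] = prLockKeys(repoFullName, prNumber);
  const result = await tx.query(
    "SELECT pg_try_advisory_xact_lock($1::int, $2::int) AS locked",
    [k1, k2],
  );
  return result.rows[0]?.locked === true;
}
```

A hash collision between two pull requests would only cause them to serialise unnecessarily, so a 64-bit key space is a comfortable trade-off here.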
Prompt versioning with the database as the source of truth. Prompts are versioned and managed via the /admin/prompts dashboard, not committed as static strings. The re-analyse button on the run detail view re-runs the pipeline with the current prompt version against already-ingested signals — this is how prompt regression testing works in a live system. Human overrides feed back into prompt tuning via the learning layer.
Multi-provider LLM tier routing with a single abstraction boundary. llm-client.ts exposes a single call surface; the provider (Anthropic, OpenAI, Groq, Together, Fireworks, OpenRouter, Ollama, or any OpenAI-compatible backend), model, and temperature are resolved at startup from env vars per tier (high / medium / low). Switching the clustering model from Claude Sonnet to a Llama 3.3 model served by Groq is a configuration change with zero code changes. I now apply this abstraction to any client system where LLM cost or vendor lock-in is a concern.
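Per-tier resolution from environment variables could be sketched as below. The three tiers and the resolved fields (provider, model, temperature) come from the paragraph above; the variable naming scheme LLM_<TIER>_PROVIDER / _MODEL / _TEMPERATURE and the default temperature of 0 are assumptions, not the names llm-client.ts actually reads.

```typescript
type Tier = "high" | "medium" | "low";

interface TierConfig {
  provider: string;
  model: string;
  temperature: number;
}

// ASSUMPTION: the env var naming scheme is hypothetical.
function resolveTier(tier: Tier, env: Record<string, string | undefined>): TierConfig {
  const prefix = `LLM_${tier.toUpperCase()}`;
  const provider = env[`${prefix}_PROVIDER`];
  const model = env[`${prefix}_MODEL`];
  if (!provider || !model) {
    throw new Error(`Missing ${prefix}_PROVIDER or ${prefix}_MODEL`);
  }
  const temperature = Number(env[`${prefix}_TEMPERATURE`] ?? "0");
  if (!Number.isFinite(temperature)) {
    throw new Error(`${prefix}_TEMPERATURE is not a number`);
  }
  return { provider, model, temperature };
}

// Resolve every tier once at start-up so misconfiguration fails fast.
function resolveAllTiers(env: Record<string, string | undefined>): Record<Tier, TierConfig> {
  return {
    high: resolveTier("high", env),
    medium: resolveTier("medium", env),
    low: resolveTier("low", env),
  };
}
```

Pipeline nodes would then ask for a tier rather than a model, which keeps provider choice out of node code entirely.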
Stack rationale
Every stack choice reduces the operational surface. The rejected alternatives are faster to set up but add a dependency or a cross-language boundary whose cost compounds over time.
TypeScript-only monorepo (no Python). The entire system — Express API, LangGraph pipeline, Next.js dashboard — runs on Node.js with strict TypeScript. A single language, a single type system, and a single pnpm workspace means end-to-end type safety without a Python service boundary. The LangGraph TypeScript SDK (@langchain/langgraph) is a first-class API, not a wrapper — this was a deliberate bet that the TS ecosystem for AI pipelines had matured enough to go language-pure. Many of the same architectural bets — TypeScript-native framework, LLM-driven triage, confidence-scored decisions — were applied in a production energy-sector client engagement; the AI-Augmented Playwright Test Pipeline case study shows how these patterns translate under the constraints of a large enterprise QA function.
pgvector over a dedicated vector database. Storing 1024-dimensional Voyage AI embeddings in the same PostgreSQL instance that holds events, clusters, and decisions keeps the operational stack simple: one database, one backup policy, one connection pool. PGVectorStore from @langchain/community handles upsert and HNSW index queries. The retrieval problem — find failure patterns similar to this new failure — is vector similarity search, not graph traversal, so a dedicated vector store adds complexity without benefit.
LangGraph over LangChain expression chains. LangGraph's graph-based execution model maps directly to the pipeline topology (ingest → validate → cluster → summarise → decide). Built-in state management (PipelineState) eliminates custom plumbing for passing context between nodes. Parallel node execution (clusterFailures + analyseDiff concurrently) is a one-line change. LangGraph's PostgreSQL checkpointer is used specifically for the fix sub-graph, where the graph needs to suspend mid-execution while waiting for a CI webhook — a checkpoint-based design that LangChain chains cannot express.
pg-boss over Redis-based queues. Redis adds an operational dependency and a separate persistence boundary. pg-boss runs on the existing PostgreSQL instance, supports advisory locks (critical for PR-level deduplication), and provides dead-letter queues and cron scheduling without a separate scheduler service. The scheduler worker role must never be scaled past one replica — pg-boss does not deduplicate cron-job inserts by handler, so two schedulers fire every job twice. This is a known and documented constraint, not a deficiency.
Current status
The pipeline is fully operational from webhook to action. Phase 3 is the next boundary: moving from a system that explains and notifies to one that proposes and fixes — with the guardrails already designed into the architecture.
Phases 1, 2, 2.5, and 4 are complete; Phase 3 is not yet implemented. Phase 4, the Next.js dashboard, shipped on 2026-05-19. The pipeline is fully operational: webhook ingestion, LangGraph analysis, confidence-scored decisions, PR comments, Slack alerts, Jira tickets, RAG retrieval, and the observability stack are all live.
Phase 3 — the autonomous fix generation engine (app/src/fix/ directory, ts-morph AST operations, verification gate, autonomy level progression) — is planned and architecturally specified but not yet implemented. The fix sub-graph structure, guardrails (test files only, max 50 lines across 5 files, 120-second timeout), and the four autonomy levels (L0 shadow through L3 auto-merge) are fully designed in the architecture document. Implementation is the next phase of active development.
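Since Phase 3 is not yet implemented, the following is only a sketch of how the stated guardrails might be checked before a proposed fix is committed. It reads "max 50 lines across 5 files" as at most 50 changed lines in total over at most 5 files, and treats a test file as anything ending in .spec.ts, .test.ts, or the .tsx equivalents; both readings, and all names, are assumptions.

```typescript
interface FileChange {
  path: string;
  changedLines: number;
}

const GUARDRAILS = {
  maxFiles: 5,
  maxChangedLines: 50,
  verificationTimeoutMs: 120_000, // carried as config; not enforced in this sketch
};

// ASSUMPTION: what counts as a test file is a guess for illustration.
const TEST_FILE = /\.(spec|test)\.tsx?$/;

// Returns a list of human-readable violations; empty means the fix may proceed.
function checkGuardrails(changes: FileChange[]): string[] {
  const violations: string[] = [];

  const nonTest = changes.filter((c) => !TEST_FILE.test(c.path));
  if (nonTest.length > 0) {
    violations.push(`non-test files touched: ${nonTest.map((c) => c.path).join(", ")}`);
  }
  if (changes.length > GUARDRAILS.maxFiles) {
    violations.push(`${changes.length} files changed, limit is ${GUARDRAILS.maxFiles}`);
  }
  const total = changes.reduce((sum, c) => sum + c.changedLines, 0);
  if (total > GUARDRAILS.maxChangedLines) {
    violations.push(`${total} lines changed, limit is ${GUARDRAILS.maxChangedLines}`);
  }
  return violations;
}
```

Returning every violation rather than stopping at the first keeps the rejection explainable in the same way as the decision engine's output.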
The patterns in this system are what I now bring to client engagements. AgentQE Continuum is where they were proven.
Related services