AI-Driven Quality Engineering Architect · Available for new engagements · Australia

SNK Digital
SaaS · Weeks 1–6 pre-launch + post-launch cadence

Performance Test Strategy for a SaaS Product Launch — JMeter + k6 Dual-Tool Architecture

Workload modelling from APM · dual-tool test plan (JMeter + k6) · threshold-gated CI baselines · capacity envelope + recovery validation

JMeter · k6 · TypeScript · Grafana · InfluxDB · BlazeMeter · Azure DevOps

Engagement context

I joined a B2B SaaS product team in the final weeks before a major product launch. The platform was cloud-native on AWS — ECS, RDS, ElastiCache — serving enterprise tenants with a workload that was read-heavy overall but carried a small number of known expensive operations: report generation, bulk data import, and search across large tenant datasets. The team had done no systematic performance testing prior to my engagement. There was an existing JMeter investment in the broader programme — licences, a small pool of perf engineers familiar with the GUI authoring path, and an estate of test plans for adjacent systems — but no perf gate in the product's own CI pipeline. A beta cohort had been running for several weeks, which meant there was real APM telemetry available to anchor a workload model. The brief was to get the programme to launch-readiness: explicit SLOs, a defensible test strategy across all five test types, automated CI gating, and a capacity envelope the ops team could act on. This performance programme was one discipline within a broader multi-programme QA remit; for the full picture of how that portfolio was structured and governed, see the Enterprise QA Leadership — 6-Year Multi-Programme Tenure.

The win

"I designed a dual-tool performance architecture that gave the programme two things it needed and could not get from a single tool: the JMeter layer covered the mixed-protocol workload surface — JDBC direct calls, JMS messaging hops, and LDAP auth flows — where k6 simply cannot reach, and it preserved the institutional knowledge already encoded in the programme's existing test plans. The k6 layer covered the modern HTTP and API surface, lived in the engineering team's TypeScript repository, and integrated directly into the Azure DevOps CI pipeline as a threshold-gated quality gate. The engineering team owned the k6 suite and ran it locally. The perf engineers owned the JMeter estate and operated the distributed runners. Each team used the tool that matched their skill profile, and neither surface was sacrificed to make a single-tool story work."

Workload modelling from APM data

The beta cohort had been generating real traffic for several weeks before my engagement began. Rather than building a synthetic workload model from scratch, I started with the APM data — per-endpoint request rates, time-of-day patterns, and the distribution of tenant sizes across the beta user mix. The modelling pipeline ran in three passes.

Three-pass workload modelling pipeline. Pass 1 extracts endpoint frequency tiers from APM data: high-frequency reads including dashboard load, tenant data fetch, and search; medium-frequency writes; and low-frequency heavy operations including report generation and bulk import. Pass 2 applies business-hours time-of-day shaping from beta traffic curves so k6 ramping profiles match observed peak shape rather than flat synthetic concurrency. Pass 3 isolates report generation and bulk import as dedicated k6 scenario threads so database-tier impact is measurable independently. The output is k6 scenario files per workload tier and JMeter thread groups aligned to the same frequency tiers.

The three-pass model is the most defensible artefact from the engagement. When bottlenecks appeared, the question "is this a real bottleneck or a synthetic load artefact?" had a clear answer because the load was anchored in real beta traffic, not invented concurrency numbers.

The first pass extracted raw request-rate data per endpoint from the APM layer and grouped endpoints into frequency tiers: high-frequency read operations, medium-frequency transactional writes, and low-frequency but expensive batch operations. The second pass applied time-of-day shaping — the beta data showed a clear business-hours peak, which the k6 ramping profiles had to reflect rather than applying flat synthetic concurrency. The third pass identified the known expensive operations and modelled them as a separate scenario thread: report generation and bulk import were given their own k6 scenarios rather than being folded into the main workload mix, so their impact on the shared database tier was measurable in isolation.
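As an illustration of the first pass, the tiering step could be sketched roughly as follows. This is a minimal sketch, not the engagement's actual code: the `EndpointStat` shape, the endpoint paths, the request rates, and the 1000 requests-per-hour cut-off are all hypothetical, and the "known expensive" flag is assumed to be set by the team rather than inferred from APM data.

```typescript
// Hypothetical row exported from the APM tool for one endpoint.
interface EndpointStat {
  endpoint: string;
  requestsPerHour: number; // mean rate over the sampled beta window
  knownExpensive: boolean; // flagged by the team (e.g. report generation)
}

type Tier = "high-frequency" | "medium-frequency" | "low-frequency-heavy";

interface TieredModel {
  tiers: Record<Tier, string[]>;
  mixShare: Record<string, number>; // share of main-mix traffic per endpoint
}

// Groups endpoints into frequency tiers and computes each endpoint's share
// of the main workload mix. Expensive operations are kept out of the mix.
function buildTiers(stats: EndpointStat[], highFrequencyCutoff: number): TieredModel {
  const tiers: Record<Tier, string[]> = {
    "high-frequency": [],
    "medium-frequency": [],
    "low-frequency-heavy": [],
  };
  let mainMixTotal = 0;
  for (const s of stats) {
    if (s.knownExpensive) {
      tiers["low-frequency-heavy"].push(s.endpoint);
      continue;
    }
    const tier: Tier =
      s.requestsPerHour >= highFrequencyCutoff ? "high-frequency" : "medium-frequency";
    tiers[tier].push(s.endpoint);
    mainMixTotal += s.requestsPerHour;
  }
  const mixShare: Record<string, number> = {};
  for (const s of stats) {
    if (!s.knownExpensive && mainMixTotal > 0) {
      mixShare[s.endpoint] = s.requestsPerHour / mainMixTotal;
    }
  }
  return { tiers, mixShare };
}

// Illustrative numbers only, not figures from the engagement.
const betaStats: EndpointStat[] = [
  { endpoint: "/dashboard", requestsPerHour: 6000, knownExpensive: false },
  { endpoint: "/search", requestsPerHour: 3000, knownExpensive: false },
  { endpoint: "/records", requestsPerHour: 900, knownExpensive: false },
  { endpoint: "/reports/generate", requestsPerHour: 40, knownExpensive: true },
  { endpoint: "/imports/bulk", requestsPerHour: 10, knownExpensive: true },
];

const workloadModel = buildTiers(betaStats, 1000);
console.log(workloadModel.tiers);
```

The `mixShare` values are what a scenario script would use to weight requests in the main mix, while the heavy tier is left for separate scenarios.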

Workload Modelling Pipeline

Pass 1: Endpoint frequency tiering (from APM)

  • High-frequency reads: dashboard load, tenant data fetch, search
  • Medium-frequency writes: record create/update, state transitions
  • Low-frequency heavy: report generation, bulk import

Pass 2: Time-of-day shaping

  • Business-hours peak profile extracted from APM request-rate curves
  • k6 ramping profiles match observed peak shape, not flat synthetic load

Pass 3: Expensive-operation isolation

  • Report generation and bulk import modelled as dedicated scenario threads
  • Isolated to measure database-tier impact independently from main workload

Output

  • k6 scenario file per workload tier (main mix, report-gen, bulk import)
  • JMeter thread groups aligned to same frequency tiers for cross-tool consistency
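The second and third passes can be sketched as a function that turns an observed request-rate curve into k6 ramping stages, plus a scenarios object that keeps the expensive operations in their own lanes. This is a sketch under assumptions: the curve values, the peak of 200 virtual users, the five-minute compression per observed hour, and the `exec` function names are hypothetical. The executor names (`ramping-vus`, `constant-vus`) and the scenario fields are standard k6 options; in a real script the object would be exported as `options`, and each `exec` name would match an exported function.

```typescript
interface Stage {
  duration: string; // k6 duration string, e.g. "5m"
  target: number;   // target number of virtual users at the end of the stage
}

// Scales an hourly request-rate curve to a peak VU count, one stage per hour,
// with each observed hour compressed into `minutesPerHour` of test time.
function stagesFromCurve(hourlyRequests: number[], peakVUs: number, minutesPerHour: number): Stage[] {
  const peak = Math.max(...hourlyRequests);
  if (peak <= 0) {
    throw new Error("curve contains no traffic");
  }
  return hourlyRequests.map((rate) => ({
    duration: `${minutesPerHour}m`,
    target: Math.round((rate / peak) * peakVUs),
  }));
}

// Illustrative business-hours curve (requests per hour), not real beta data.
const businessHoursCurve = [120, 480, 900, 1200, 1100, 1200, 950, 600, 240, 60];
const mainMixStages = stagesFromCurve(businessHoursCurve, 200, 5);

// Shape of the k6 `options.scenarios` block. The heavy operations run as
// separate scenarios so their database-tier impact can be read in isolation.
const scenarioOptions = {
  scenarios: {
    main_mix: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: mainMixStages,
      exec: "mainMix", // hypothetical exported function name
    },
    report_generation: {
      executor: "constant-vus",
      vus: 2,
      duration: "50m", // matches 10 stages x 5 minutes
      exec: "reportGeneration", // hypothetical
    },
    bulk_import: {
      executor: "constant-vus",
      vus: 1,
      duration: "50m",
      exec: "bulkImport", // hypothetical
    },
  },
};

console.log(JSON.stringify(scenarioOptions, null, 2));
```

Because each scenario is named, k6 results can be filtered by scenario tag, which is what makes the isolated database-tier reading possible.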

The workload model was the most defensible artefact from the engagement. When results landed and bottlenecks appeared, the first question was the usual one: "is this a real bottleneck or a synthetic load artefact?" Because the model was anchored in real APM data, the answer was straightforward.

Dual-tool architecture (JMeter for the institutional surface, k6 for the engineering gate)

The architectural decision to run both tools in parallel was not a compromise — it was the correct split given the protocol surface and the team structure.

Two-lane tool architecture. JMeter lane: covers HTTP, JDBC, JMS, and LDAP protocols; operated by perf engineers using the GUI authoring path; BlazeMeter for high-concurrency distributed stress runs; Grafana plus InfluxDB for results and historical dashboards; test plan estate shared with adjacent programme systems. k6 lane: covers HTTP and REST API only; TypeScript scripts colocated with application code; engineers run locally before pushing; Azure DevOps threshold-gated CI on PR and post-merge; separate k6 series in shared Grafana dashboard. A shared foundation row at the bottom shows both tools use the same APM-derived workload model, the same SLO thresholds, and both contribute to the capacity envelope.

Each tool covers what the other cannot. k6 cannot speak JDBC, JMS, or LDAP — forcing it across the full protocol surface would have required custom extensions or left those protocols untested. JMeter's distributed infrastructure was already available for high-concurrency stress runs. Using both was faster and more defensible than making a single-tool story work.

The platform's workload surface included JDBC database-direct calls from certain internal services, JMS messaging hops for async processing flows, and LDAP-backed authentication. k6 natively speaks HTTP, WebSocket, and gRPC; it has no built-in support for JDBC, JMS, or LDAP. Forcing k6 across the full surface would have meant either leaving those protocols untested or building custom extensions, and neither was acceptable given the launch timeline. JMeter handled the mixed-protocol surface natively, with a test plan per protocol tier and a shared workload model aligned to the APM-derived frequency analysis.

The k6 layer covered the primary HTTP and REST API surface: the tenant dashboard, the transactional write paths, the search endpoints, and the public-facing API. These were TypeScript k6 scripts, living in the same repository as the application and functional tests. Engineers ran them locally before pushing. CI integration was direct — no separate perf infrastructure needed for the lightweight CI gate.

Dual-Tool Architecture: Protocol and Team Split

JMeter — mixed-protocol + institutional surface

  • Protocol coverage: HTTP + JDBC + JMS + LDAP
  • Operated by perf engineers familiar with GUI authoring path
  • Distributed runners: BlazeMeter for high-concurrency stress runs
  • Grafana + InfluxDB for results and historical trend dashboards
  • Test plan estate: shared with adjacent programme systems

k6 — modern HTTP/API surface + engineering CI gate

  • Protocol coverage: HTTP/REST API, tenant dashboard, transactional writes
  • TypeScript scripts colocated with application code
  • Owned and maintained by the engineering team
  • Azure DevOps integration: threshold-gated CI runs on PR and post-merge
  • Results: InfluxDB + Grafana (shared dashboard, separate k6 series)

Shared foundation

  • Workload model: same APM-derived frequency tiers used in both tools
  • SLOs: p95 latency and error rate thresholds enforced in both layers
  • Capacity envelope: consolidated from JMeter stress results + k6 load results

The institutional argument was equally important. The programme had perf engineers who knew JMeter. Deprecating it to make a single-tool story would have destroyed institutional knowledge for no performance gain on the HTTP surface. I retained JMeter, extended it with InfluxDB/Grafana output and CI scheduling, and added k6 as the engineering layer. Both tools served the programme.

Test type mix (Smoke · Load · Stress · Spike · Soak)

The five test types each answer a different question. Stakeholders often ask for "a load test" when what they actually need is all five, executed in sequence, with different tools appropriate to each type's cadence and audience.

Five test types in a table. Smoke: k6 on every PR gate, answers whether the system works under minimal load. Load: k6 HTTP plus JMeter full protocol, pre-release weekly and nightly post-launch, answers whether it handles expected normal and peak concurrency. Stress: JMeter distributed via BlazeMeter, pre-launch weeks four and five then quarterly, finds the capacity ceiling. Spike: k6, pre-launch week five and after architecture changes, validates recovery from sudden traffic burst. Soak: k6 overnight on self-hosted runner, pre-launch week five then quarterly, checks whether the system holds under sustained load for hours.

Five types, five different questions, five different cadences. If any of the five is missing, the programme has a named blind spot — and the right response is to name it explicitly, not pretend the coverage is complete.

Test type | Question it answers | Tool | Cadence
Smoke | Does the system work under minimal load? | k6 | Per CI build (PR gate)
Load | Can it handle expected normal and peak concurrency? | k6 (HTTP) + JMeter (full protocol) | Pre-release weekly; nightly post-launch
Stress | Where does it break? What is the capacity ceiling? | JMeter (distributed, BlazeMeter) | Pre-launch Weeks 4–5; quarterly post-launch
Spike | Can it recover from a sudden traffic burst? | k6 | Pre-launch Week 5; after architecture changes
Soak | Does it hold under sustained load for hours? | k6 (overnight, self-hosted runner) | Pre-launch Week 5; quarterly post-launch
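The coverage matrix lends itself to a small data structure with an explicit blind-spot check. The sketch below is illustrative: the type and function names are assumptions, and the entries simply restate the table above.

```typescript
type TestType = "smoke" | "load" | "stress" | "spike" | "soak";

interface CoverageEntry {
  type: TestType;
  tool: string;
  cadence: string;
}

const REQUIRED_TYPES: TestType[] = ["smoke", "load", "stress", "spike", "soak"];

// Entries restate the coverage table; nothing here is new data.
const launchCoverage: CoverageEntry[] = [
  { type: "smoke", tool: "k6", cadence: "Per CI build (PR gate)" },
  { type: "load", tool: "k6 (HTTP) + JMeter (full protocol)", cadence: "Pre-release weekly; nightly post-launch" },
  { type: "stress", tool: "JMeter (distributed, BlazeMeter)", cadence: "Pre-launch Weeks 4-5; quarterly post-launch" },
  { type: "spike", tool: "k6", cadence: "Pre-launch Week 5; after architecture changes" },
  { type: "soak", tool: "k6 (overnight, self-hosted runner)", cadence: "Pre-launch Week 5; quarterly post-launch" },
];

// Returns every required test type that has no entry in the matrix,
// so a missing type is named rather than silently absent.
function blindSpots(entries: CoverageEntry[]): TestType[] {
  return REQUIRED_TYPES.filter((t) => !entries.some((e) => e.type === t));
}

console.log(blindSpots(launchCoverage));
```

Running the same check against a matrix with one row removed returns that test type, which is the "named blind spot" the strategy calls for.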

The JMeter suite was the right tool for high-concurrency stress runs because the programme's distributed BlazeMeter infrastructure was already available and the perf engineers knew how to operate it. Running a multi-thousand-virtual-user stress test through k6 Cloud would have required new infrastructure procurement on a tight timeline. Using what already existed was faster and produced more defensible results.

The k6 suite was the right tool for smoke, spike, and soak because it ran in a single container, needed no separate orchestration layer, and produced results that engineers could read in the same pipeline they were already watching. A soak test running overnight in a k6 container on a self-hosted agent in the Azure DevOps agent pool required no BlazeMeter spend and no perf-engineer supervision.

CI integration and threshold gating

The CI architecture was tiered by test type and audience. Not every test type belongs in every pipeline stage — the cost and duration of a soak test makes it unsuitable for a PR gate; the lightweight nature of a smoke test makes it wasteful to run only once a week.

Three pipeline tiers in Azure DevOps. Tier 1 PR gate: k6 smoke on every pull request, all critical endpoints, under two minutes, p95 latency within SLO and zero errors, non-zero exit code blocks the merge. Tier 2 nightly post-merge: k6 load and JMeter load both run at expected peak concurrency, threshold gates on p95 latency and error rate and connection pool metrics, Grafana drift alert against rolling baseline. Tier 3 scheduled on-demand: JMeter stress ramped past peak to find the breaking point, k6 spike for burst recovery, k6 soak overnight for memory and connection-pool stability, output is the capacity envelope and updated runbook thresholds.

The nightly run was not just pass/fail — it was a trend line. A p95 latency drifting upward across four consecutive nightly runs without breaching a single threshold was still a signal, and Grafana made it visible before it became an incident.
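The trend-line idea can be expressed as a small check over recent nightly results. This is a simplified sketch, not the Grafana alert rule used on the engagement: the `NightlyRun` shape, the run labels, and the latency values are hypothetical, and a production rule would likely tolerate some noise rather than require a strictly increasing series.

```typescript
interface NightlyRun {
  runId: string; // hypothetical label for the nightly run
  p95Ms: number; // p95 latency reported by that run, in milliseconds
}

interface DriftResult {
  driftingUp: boolean;     // last `window` runs strictly increasing
  anyBreach: boolean;      // at least one of those runs breached the SLO
  silentRegression: boolean; // drifting upward while every run still passed
}

function checkDrift(runs: NightlyRun[], window: number, sloP95Ms: number): DriftResult {
  if (window < 2 || runs.length < window) {
    return { driftingUp: false, anyBreach: false, silentRegression: false };
  }
  const recent = runs.slice(runs.length - window);
  let driftingUp = true;
  for (let i = 1; i < recent.length; i++) {
    if (recent[i].p95Ms <= recent[i - 1].p95Ms) {
      driftingUp = false;
    }
  }
  const anyBreach = recent.some((r) => r.p95Ms > sloP95Ms);
  return { driftingUp, anyBreach, silentRegression: driftingUp && !anyBreach };
}

// Illustrative values only: every run passes a 500 ms SLO, yet the last
// four runs climb steadily.
const nightlyRuns: NightlyRun[] = [
  { runId: "n1", p95Ms: 410 },
  { runId: "n2", p95Ms: 405 },
  { runId: "n3", p95Ms: 418 },
  { runId: "n4", p95Ms: 431 },
  { runId: "n5", p95Ms: 447 },
  { runId: "n6", p95Ms: 462 },
];

console.log(checkDrift(nightlyRuns, 4, 500));
```

The `silentRegression` flag captures the case described above: no threshold breach, but a signal worth reviewing.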

Azure DevOps Tiered Execution Model

PR gate (every pull request)

  • k6 smoke: minimal concurrency, all critical endpoints
  • Pass criteria: p95 latency within SLO, zero errors
  • Duration: under two minutes — fast enough to block the merge

Post-merge to main (nightly)

  • k6 load: full workload model at expected peak concurrency
  • JMeter load: full protocol surface at expected peak
  • Threshold gates: p95 latency, error rate, connection pool metrics
  • Grafana: results written to InfluxDB; drift from previous baseline flagged automatically

Pre-launch and quarterly (scheduled + on-demand)

  • JMeter stress: concurrency ramped beyond expected peak to find breaking point
  • k6 spike: sudden burst, observe recovery time
  • k6 soak: sustained concurrency overnight, memory and connection-pool stability
  • Output: capacity envelope update; runbook thresholds refreshed

Threshold gating was enforced at two levels. The CI smoke gate used k6's native threshold syntax: the pipeline step returned a non-zero exit code on breach, blocking the merge. For the nightly and scheduled runs, Grafana annotations marked each test run, and a delta comparison against the rolling baseline raised regressions as Grafana alerts for review. The nightly run was therefore a trend line as well as a pass/fail check: a p95 latency that drifted upward across four consecutive nightly runs without a single threshold breach was still a signal, and Grafana made it visible.
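For the first level, a smoke-gate threshold block might look like the sketch below. The metric names `http_req_duration` and `http_req_failed` are k6 built-ins and the expression strings follow k6's threshold syntax; the 500 ms SLO, the single virtual user, the one-minute duration, and the helper names are assumptions for illustration. In a real k6 script the resulting object would be exported as `options`.

```typescript
interface Slo {
  p95LatencyMs: number; // latency SLO at the 95th percentile
  maxErrorRate: number; // 0 means "zero errors", as in the PR gate
}

// Builds the `thresholds` section of a k6 options object from an SLO.
// k6 fails the run, and returns a non-zero exit code, when any expression
// evaluates to false at the end of the test.
function smokeThresholds(slo: Slo): Record<string, string[]> {
  const errorExpr = slo.maxErrorRate === 0 ? "rate==0" : `rate<=${slo.maxErrorRate}`;
  return {
    http_req_duration: [`p(95)<${slo.p95LatencyMs}`],
    http_req_failed: [errorExpr],
  };
}

// Hypothetical SLO values; the real thresholds are not stated in this write-up.
const smokeOptions = {
  vus: 1,
  duration: "1m",
  thresholds: smokeThresholds({ p95LatencyMs: 500, maxErrorRate: 0 }),
};

console.log(JSON.stringify(smokeOptions, null, 2));
```

Deriving the expressions from one `Slo` object keeps the PR gate and the nightly gate pointed at the same numbers, so an SLO change is made in one place.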

What I'd do differently

Two retrospective lessons. Lesson 1 workload model review at end of Week 1: beta cohort users were technically sophisticated early adopters rather than the enterprise tenant population actually targeted. Two high-frequency operations in beta data were low-frequency in real enterprise usage. Fix: build a workload model review into end of Week 1 with the product manager and enterprise design team sense-checking frequency tiers before scenarios are built. Lesson 2 shared metric contract upfront: k6 used http_req_duration with percentile aggregation while JMeter used a custom Grafana Backend Listener schema, requiring manual query alignment for cross-tool comparison in Grafana. Fix: define shared metric names upfront so p95 latency panels show both tools on the same series without translation.

Both lessons are now defaults on every performance engagement. The workload model review is a Week 1 checkpoint. The metric contract is defined before any test script is written.

The workload model was built during the first week, which was the right call. What I underestimated was how much the beta APM data reflected a non-representative user mix — beta cohort users were technically sophisticated early adopters who used the platform differently from the enterprise tenant population the product was actually targeting. By Week 3, when the load test results started showing which endpoints were under pressure, it was clear that two high-frequency operations in the beta data were low-frequency in the real target workload, and two operations that the beta data suggested were rare were actually central to enterprise usage patterns. I adjusted the workload model mid-engagement, which was fine — the process worked — but I would build a workload model review into the end of Week 1 now, explicitly asking the product manager and enterprise design team to sense-check the frequency tiers before building scenarios around them.

The second thing I'd change is the JMeter and k6 result schemas. Both tools wrote to InfluxDB, but I set them up with different metric names for equivalent measurements: k6 used http_req_duration with percentile aggregation, while JMeter used a custom Grafana Backend Listener schema. Cross-tool comparison in Grafana therefore required manual query alignment. I'd define a shared metric contract upfront so that the p95 latency panel shows both tools' data on the same series without translation.
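A shared metric contract could be as small as a lookup table that both result pipelines pass through before writing to the dashboard store. The sketch below is hypothetical throughout: the shared series names, the raw field names on both sides (including the JMeter ones, since the custom schema is not described here), and the unit conversion are assumptions chosen to show the idea.

```typescript
type Tool = "k6" | "jmeter";
type SharedMetric = "latency_p95_ms" | "error_rate";

interface Mapping {
  shared: SharedMetric;
  divisor: number; // converts the raw unit to the shared unit
}

// Raw field names are placeholders for whatever each tool actually emits.
const METRIC_CONTRACT: Record<Tool, Record<string, Mapping>> = {
  k6: {
    "http_req_duration.p95": { shared: "latency_p95_ms", divisor: 1 },
    "http_req_failed.rate": { shared: "error_rate", divisor: 1 },
  },
  jmeter: {
    "pct95_ms": { shared: "latency_p95_ms", divisor: 1 },
    "error_pct": { shared: "error_rate", divisor: 100 }, // percent to fraction
  },
};

interface SharedPoint {
  metric: SharedMetric;
  tool: Tool;
  value: number;
}

// Translates one raw measurement into the shared series. Unmapped names are
// rejected so that schema drift shows up as an error, not a missing panel.
function toSharedPoint(tool: Tool, rawName: string, rawValue: number): SharedPoint {
  const mapping = METRIC_CONTRACT[tool][rawName];
  if (mapping === undefined) {
    throw new Error(`No contract entry for ${tool}:${rawName}`);
  }
  return { metric: mapping.shared, tool, value: rawValue / mapping.divisor };
}

const k6Point = toSharedPoint("k6", "http_req_duration.p95", 412);
const jmeterPoint = toSharedPoint("jmeter", "pct95_ms", 455);
console.log(k6Point, jmeterPoint);
```

With both points on `latency_p95_ms`, a single panel can group by the `tool` field instead of joining two differently named series.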

Architectural patterns I now apply

Five patterns from this engagement I apply as defaults on every performance programme:

Five architectural patterns as cards. Pattern 1 workload model before scripts: no scenario written until frequency tiers and time-of-day shaping are derived from real APM data — synthetic flat load misses actual bottlenecks. Pattern 2 tool follows protocol and team: k6 is not the default everywhere and JMeter is not legacy everywhere — heterogeneous tooling is correct architecture when the programme spans multiple protocol surfaces. Pattern 3 five test types not one: every programme needs smoke, load, stress, spike, and soak — if any is missing it is a named blind spot. Pattern 4 CI gate at the right cadence per type: smoke on every PR, load nightly, stress and spike and soak on scheduled cadence tied to release milestones — over-gating and under-gating are both failure modes. Pattern 5 capacity envelope as a deliverable: stress test output is not pass-fail but the maximum sustained concurrency before the first error wave, handed to ops with a runbook.

These five patterns are the residue of the engagement. They are not theoretical — each one was clarified or confirmed by something that went wrong or right during the six weeks.

  1. Workload model before test scripts. No scenario is written until the frequency tiers and time-of-day shaping are derived from real traffic data. Synthetic flat load is a trap — it produces results that look like a load test but miss the actual bottlenecks. If there is no production or beta APM data, the workload model is the first thing I ask the team to help build, using their best estimate of user behaviour, validated with the product owner.

  2. Tool selection follows protocol surface and team profile, not convention. k6 is not the default everywhere. JMeter is not legacy everywhere. The correct tool is the one that covers the protocol surface without gaps and that the team who will own it post-engagement is able to maintain. Heterogeneous tooling — k6 for the HTTP engineering gate, JMeter for the mixed-protocol perf-engineer surface — is the correct architecture when the programme has both.

  3. Five test types, not one. Every performance programme gets a coverage matrix: smoke, load, stress, spike, soak. The question is not "which one do we run?" but "when does each run, what tool runs it, and who acts on the result?" If any of the five is missing, the programme has a blind spot I name explicitly.

  4. CI gate at the right cadence for each test type. Smoke on every PR. Load nightly. Stress, spike, and soak on a scheduled cadence tied to release milestones and architecture changes. Over-gating (stress on every PR) and under-gating (load only pre-launch) are both failure modes.

  5. Capacity envelope as a deliverable. The stress test's output is not a pass/fail; it is the capacity envelope: the maximum sustained concurrency the system holds before the first error wave, and the concurrency at which errors become unacceptable. That envelope is handed to the ops and SRE team with a runbook: "if traffic crosses X, scale to Y before it becomes an incident." Without the envelope, the performance programme is a test; with it, it is an input to operations.
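The capacity envelope in pattern 5 can be derived mechanically from stepped stress results. The sketch below is illustrative only: the `StressStep` shape, the step values, and the two error-rate limits (1% for the first error wave, 5% for unacceptable) are assumptions, not figures from the engagement.

```typescript
interface StressStep {
  concurrency: number; // virtual users held during the step
  errorRate: number;   // fraction of failed requests in the step (0 to 1)
}

interface CapacityEnvelope {
  maxSustainedConcurrency: number | null;     // highest step before the first error wave
  unacceptableFromConcurrency: number | null; // first step where errors are unacceptable
}

function capacityEnvelope(
  steps: StressStep[],
  firstWaveRate: number,
  unacceptableRate: number
): CapacityEnvelope {
  const sorted = steps.slice().sort((a, b) => a.concurrency - b.concurrency);
  let maxSustained: number | null = null;
  let unacceptableFrom: number | null = null;
  let waveSeen = false;
  for (const step of sorted) {
    if (!waveSeen && step.errorRate < firstWaveRate) {
      maxSustained = step.concurrency; // still clean at this level
    } else {
      waveSeen = true; // later clean steps no longer raise the ceiling
    }
    if (unacceptableFrom === null && step.errorRate >= unacceptableRate) {
      unacceptableFrom = step.concurrency;
    }
  }
  return { maxSustainedConcurrency: maxSustained, unacceptableFromConcurrency: unacceptableFrom };
}

// Illustrative stepped stress results.
const stressSteps: StressStep[] = [
  { concurrency: 500, errorRate: 0 },
  { concurrency: 1000, errorRate: 0 },
  { concurrency: 1500, errorRate: 0.001 },
  { concurrency: 2000, errorRate: 0.004 },
  { concurrency: 2500, errorRate: 0.03 },
  { concurrency: 3000, errorRate: 0.12 },
];

const envelope = capacityEnvelope(stressSteps, 0.01, 0.05);
console.log(envelope);
```

The two numbers map onto the runbook sentence: the first is the "X" that should trigger scaling, and the second marks where an incident is already under way.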

For the JMeter reference architecture — distributed execution, InfluxDB/Grafana integration, mixed-protocol test plans, and the CI scheduling model — see JMeter Performance Testing for Enterprise Programmes.

For the k6 reference architecture — TypeScript scenario structure, threshold gating, tiered CI execution, and the Grafana drift-detection model — see k6 Performance Testing Pattern (Modern).
