Engagement context
I joined a B2B SaaS product team in the final weeks before a major product launch. The platform was cloud-native on AWS — ECS, RDS, ElastiCache — serving enterprise tenants with a workload that was read-heavy overall but carried a small number of known expensive operations: report generation, bulk data import, and search across large tenant datasets. The team had done no systematic performance testing prior to my engagement. There was an existing JMeter investment in the broader programme — licences, a small pool of perf engineers familiar with the GUI authoring path, and an estate of test plans for adjacent systems — but no perf gate in the product's own CI pipeline. A beta cohort had been running for several weeks, which meant there was real APM telemetry available to anchor a workload model. The brief was to get the programme to launch-readiness: explicit SLOs, a defensible test strategy across all five test types, automated CI gating, and a capacity envelope the ops team could act on. This performance programme was one discipline within a broader multi-programme QA remit; for the full picture of how that portfolio was structured and governed, see the Enterprise QA Leadership — 6-Year Multi-Programme Tenure.
The win
"I designed a dual-tool performance architecture that gave the programme two things it needed and could not get from a single tool: the JMeter layer covered the mixed-protocol workload surface — JDBC direct calls, JMS messaging hops, and LDAP auth flows — where k6 simply cannot reach, and it preserved the institutional knowledge already encoded in the programme's existing test plans. The k6 layer covered the modern HTTP and API surface, lived in the engineering team's TypeScript repository, and integrated directly into the Azure DevOps CI pipeline as a threshold-gated quality gate. The engineering team owned the k6 suite and ran it locally. The perf engineers owned the JMeter estate and operated the distributed runners. Each team used the tool that matched their skill profile, and neither surface was sacrificed to make a single-tool story work."
Workload modelling from APM data
The beta cohort had been generating real traffic for several weeks before my engagement began. Rather than building a synthetic workload model from scratch, I started with the APM data — per-endpoint request rates, time-of-day patterns, and the distribution of tenant sizes across the beta user mix. The modelling pipeline ran in three passes.
The three-pass model is the most defensible artefact from the engagement. When bottlenecks appeared, the question "is this a real bottleneck or a synthetic load artefact?" had a clear answer because the load was anchored in real beta traffic, not invented concurrency numbers.
The first pass extracted raw request-rate data per endpoint from the APM layer and grouped endpoints into frequency tiers: high-frequency read operations, medium-frequency transactional writes, and low-frequency but expensive batch operations. The second pass applied time-of-day shaping — the beta data showed a clear business-hours peak, which the k6 ramping profiles had to reflect rather than applying flat synthetic concurrency. The third pass identified the known expensive operations and modelled them as a separate scenario thread: report generation and bulk import were given their own k6 scenarios rather than being folded into the main workload mix, so their impact on the shared database tier was measurable in isolation.
Workload Modelling Pipeline
Pass 1: Endpoint frequency tiering (from APM)
- High-frequency reads: dashboard load, tenant data fetch, search
- Medium-frequency writes: record create/update, state transitions
- Low-frequency heavy: report generation, bulk import
Pass 2: Time-of-day shaping
- Business-hours peak profile extracted from APM request-rate curves
- k6 ramping profiles match observed peak shape, not flat synthetic load
Pass 3: Expensive-operation isolation
- Report generation and bulk import modelled as dedicated scenario threads
- Isolated to measure database-tier impact independently from main workload
Output
- k6 scenario file per workload tier (main mix, report-gen, bulk import)
- JMeter thread groups aligned to same frequency tiers for cross-tool consistency
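The three passes above can be sketched roughly as follows. This is a minimal illustration, not the engagement's actual pipeline: the `EndpointRate` shape, the 100 requests-per-minute cut-off, and the five-minutes-per-observed-hour compression are all assumptions made for the example. The `{ duration, target }` stage shape is the one k6 ramping executors accept.

```typescript
// Hypothetical shape of one per-endpoint row exported from the APM tool.
interface EndpointRate {
  endpoint: string;
  requestsPerMinute: number;
  expensive: boolean; // flagged by hand, e.g. report generation or bulk import
}

type FrequencyTier = "high" | "medium" | "heavy";

// Pass 1 and Pass 3: expensive operations are isolated first, regardless of rate;
// everything else is tiered by observed request rate (cut-off is an assumption).
function tierOf(row: EndpointRate, highCutoff = 100): FrequencyTier {
  if (row.expensive) return "heavy";
  return row.requestsPerMinute >= highCutoff ? "high" : "medium";
}

// Stage shape used by k6 ramping executors.
interface RampStage {
  duration: string;
  target: number;
}

// Pass 2: scale an observed hourly request-rate curve so that its peak maps to
// peakTarget, keeping the shape of the curve instead of a flat load.
function stagesFromCurve(
  hourlyRate: number[],
  peakTarget: number,
  minutesPerHour = 5, // each observed hour is compressed into this many test minutes
): RampStage[] {
  const peak = Math.max(...hourlyRate);
  if (hourlyRate.length === 0 || peak <= 0) return [];
  return hourlyRate.map((rate) => ({
    duration: `${minutesPerHour}m`,
    target: Math.round((rate / peak) * peakTarget),
  }));
}
```

Keeping the tiering and the shaping as separate pure functions means the tiers can be reviewed with the product team without touching the ramp profile, and vice versa.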
The workload model was the most defensible artefact from the engagement. When results land and bottlenecks appear, the first question is always "is this a real bottleneck or a synthetic load artefact?" Because the model was anchored in real APM data, that question had a straightforward answer.
Dual-tool architecture (JMeter for the institutional surface, k6 for the engineering gate)
The architectural decision to run both tools in parallel was not a compromise — it was the correct split given the protocol surface and the team structure.
Each tool covers what the other cannot. k6 cannot speak JDBC, JMS, or LDAP — forcing it across the full protocol surface would have required custom extensions or left those protocols untested. JMeter's distributed infrastructure was already available for high-concurrency stress runs. Using both was faster and more defensible than making a single-tool story work.
The platform's workload surface included JDBC database-direct calls from certain internal services, JMS messaging hops for async processing flows, and LDAP-backed authentication. k6 speaks HTTP, WebSocket, and gRPC. It does not speak JDBC, JMS, or LDAP. Forcing k6 across the full surface would have required either leaving those protocols untested or building custom extensions — neither was acceptable given the launch timeline. JMeter handled the mixed-protocol surface natively, with a test plan per protocol tier and a shared workload model aligned to the APM-derived frequency analysis.
The k6 layer covered the primary HTTP and REST API surface: the tenant dashboard, the transactional write paths, the search endpoints, and the public-facing API. These were TypeScript k6 scripts, living in the same repository as the application and functional tests. Engineers ran them locally before pushing. CI integration was direct — no separate perf infrastructure needed for the lightweight CI gate.
Dual-Tool Architecture: Protocol and Team Split
JMeter — mixed-protocol + institutional surface
- Protocol coverage: HTTP + JDBC + JMS + LDAP
- Operated by perf engineers familiar with GUI authoring path
- Distributed runners: BlazeMeter for high-concurrency stress runs
- Grafana + InfluxDB for results and historical trend dashboards
- Test plan estate: shared with adjacent programme systems
k6 — modern HTTP/API surface + engineering CI gate
- Protocol coverage: HTTP/REST API, tenant dashboard, transactional writes
- TypeScript scripts colocated with application code
- Owned and maintained by the engineering team
- Azure DevOps integration: threshold-gated CI runs on PR and post-merge
- Results: InfluxDB + Grafana (shared dashboard, separate k6 series)
Shared foundation
- Workload model: same APM-derived frequency tiers used in both tools
- SLOs: p95 latency and error rate thresholds enforced in both layers
- Capacity envelope: consolidated from JMeter stress results + k6 load results
The institutional argument was equally important. The programme had perf engineers who knew JMeter. Deprecating it to make a single-tool story would have destroyed institutional knowledge for no performance gain on the HTTP surface. I retained JMeter, extended it with InfluxDB/Grafana output and CI scheduling, and added k6 as the engineering layer. Both tools served the programme.
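The protocol split described in this section reduces to a simple routing rule, sketched here under assumptions. The `WireProtocol` and `toolFor` names are hypothetical; the set of protocols k6 handles without extensions is taken from the text above.

```typescript
type WireProtocol = "http" | "websocket" | "grpc" | "jdbc" | "jms" | "ldap";
type LoadTool = "k6" | "jmeter";

// Protocols k6 speaks without custom extensions, as stated in the text.
const k6Native: ReadonlySet<WireProtocol> = new Set<WireProtocol>([
  "http",
  "websocket",
  "grpc",
]);

// A flow goes to k6 only if every protocol it touches is k6-native;
// any JDBC, JMS, or LDAP hop routes the whole flow to JMeter.
function toolFor(flowProtocols: WireProtocol[]): LoadTool {
  return flowProtocols.every((p) => k6Native.has(p)) ? "k6" : "jmeter";
}
```

The rule deliberately ignores team ownership, which the programme also weighed; it only captures the protocol-coverage half of the decision.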
Test type mix (Smoke · Load · Stress · Spike · Soak)
The five test types each answer a different question. Stakeholders often ask for "a load test" when what they actually need is all five, executed in sequence, with different tools appropriate to each type's cadence and audience.
Five types, five different questions, five different cadences. If any of the five is missing, the programme has a named blind spot — and the right response is to name it explicitly, not pretend the coverage is complete.
| Test type | Question it answers | Tool | Cadence |
|---|---|---|---|
| Smoke | Does the system work under minimal load? | k6 | Per CI build (PR gate) |
| Load | Can it handle expected normal and peak concurrency? | k6 (HTTP) + JMeter (full protocol) | Pre-release weekly; nightly post-launch |
| Stress | Where does it break? What is the capacity ceiling? | JMeter (distributed, BlazeMeter) | Pre-launch Weeks 4–5; quarterly post-launch |
| Spike | Can it recover from a sudden traffic burst? | k6 | Pre-launch Week 5; after architecture changes |
| Soak | Does it hold under sustained load for hours? | k6 (overnight, self-hosted runner) | Pre-launch Week 5; quarterly post-launch |
The JMeter suite was the right tool for high-concurrency stress runs because the programme's distributed BlazeMeter infrastructure was already available and the perf engineers knew how to operate it. Running a multi-thousand-virtual-user stress test through k6 Cloud would have required new infrastructure procurement on a tight timeline. Using what already existed was faster and produced more defensible results.
The k6 suite was the right tool for smoke, spike, and soak because it ran in a single container, needed no separate orchestration layer, and produced results that engineers could read in the same pipeline they were already watching. A soak test running overnight in a k6 container on the Azure DevOps agent pool required no BlazeMeter spend and no perf-engineer supervision.
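As an illustration of how the three k6-owned test types differ in shape, the sketch below expresses each as a k6 scenario definition. The `constant-vus` and `ramping-vus` executors and their fields are standard k6 scenario options; every number (virtual users, durations, burst size) is a placeholder assumption, not the programme's real profile, and `optionsFor` and `peakTarget` are hypothetical helpers.

```typescript
interface RampStage {
  duration: string;
  target: number;
}

type ScenarioConfig =
  | { executor: "constant-vus"; vus: number; duration: string }
  | { executor: "ramping-vus"; startVUs: number; stages: RampStage[] };

type K6TestType = "smoke" | "spike" | "soak";

const scenarioByType: Record<K6TestType, ScenarioConfig> = {
  // Smoke: minimal concurrency, short enough for a PR gate.
  smoke: { executor: "constant-vus", vus: 2, duration: "1m" },
  // Spike: steady baseline, sudden burst, then a recovery window to observe.
  spike: {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      { duration: "30s", target: 10 }, // warm up to baseline
      { duration: "15s", target: 300 }, // burst
      { duration: "2m", target: 300 }, // hold the burst
      { duration: "15s", target: 10 }, // drop back
      { duration: "3m", target: 10 }, // recovery window
    ],
  },
  // Soak: constant moderate load held overnight.
  soak: { executor: "constant-vus", vus: 50, duration: "8h" },
};

// Builds an options-shaped object containing only the requested scenario.
function optionsFor(testType: K6TestType): { scenarios: Record<string, ScenarioConfig> } {
  return { scenarios: { [testType]: scenarioByType[testType] } };
}

// Highest concurrency a scenario reaches, useful for sanity-checking a profile.
function peakTarget(s: ScenarioConfig): number {
  return s.executor === "constant-vus"
    ? s.vus
    : Math.max(s.startVUs, ...s.stages.map((st) => st.target));
}
```

In a real k6 script the result of `optionsFor` would be exported as `options`, with the test type typically read from an environment variable via k6's `__ENV` global.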
CI integration and threshold gating
The CI architecture was tiered by test type and audience. Not every test type belongs in every pipeline stage: the cost and duration of a soak test make it unsuitable for a PR gate, while a smoke test is lightweight enough that running it only once a week wastes it.
Azure DevOps Tiered Execution Model
PR gate (every pull request)
- k6 smoke: minimal concurrency, all critical endpoints
- Pass criteria: p95 latency within SLO, zero errors
- Duration: under two minutes — fast enough to block the merge
Post-merge to main (nightly)
- k6 load: full workload model at expected peak concurrency
- JMeter load: full protocol surface at expected peak
- Threshold gates: p95 latency, error rate, connection pool metrics
- Grafana: results written to InfluxDB; drift from previous baseline flagged automatically
Pre-launch and quarterly (scheduled + on-demand)
- JMeter stress: concurrency ramped beyond expected peak to find breaking point
- k6 spike: sudden burst, observe recovery time
- k6 soak: sustained concurrency overnight, memory and connection-pool stability
- Output: capacity envelope update; runbook thresholds refreshed
Threshold gating was enforced at two levels. The CI smoke gate used k6's native threshold syntax: on a breach the pipeline step returned a non-zero exit code, which blocked the merge. For the nightly and scheduled runs, Grafana annotations marked each test run, and a delta comparison against the rolling baseline raised regressions as Grafana alerts for review. The nightly run was not just pass/fail; it was a trend line. A p95 latency that drifted upward across four consecutive nightly runs without a single threshold breach was still a signal, and Grafana made it visible.
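A minimal sketch of the two gating levels follows. The threshold expressions use k6's threshold syntax on its built-in `http_req_duration` and `http_req_failed` metrics, but the 500 ms limit is a placeholder, not the programme's SLO. The drift and baseline functions are hypothetical stand-ins for what the Grafana alerting did; the text does not say how that comparison was implemented.

```typescript
// Level 1: thresholds block for the smoke gate. In a k6 script this object sits
// under `options.thresholds`; k6 exits non-zero when any expression fails.
const smokeThresholds: Record<string, string[]> = {
  http_req_duration: ["p(95)<500"], // placeholder SLO in milliseconds
  http_req_failed: ["rate==0"], // zero errors
};

// Level 2a: true when each of the last `runs` nightly p95 values is strictly
// higher than the one before it, even if no threshold was breached.
function isDriftingUp(nightlyP95: number[], runs = 4): boolean {
  if (nightlyP95.length < runs) return false;
  const recent = nightlyP95.slice(-runs);
  return recent.every((v, i) => i === 0 || v > (recent[i - 1] ?? v));
}

// Level 2b: relative change of the latest run against the mean of a rolling
// baseline window (0.1 means 10% slower than baseline).
function deltaVsBaseline(baseline: number[], latest: number): number {
  if (baseline.length === 0) return 0;
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  return mean === 0 ? 0 : (latest - mean) / mean;
}
```

For example, nightly p95 values of 300, 310, 325, and 340 ms all pass a 500 ms threshold, yet `isDriftingUp` flags them because each run is slower than the last.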
What I'd do differently
Both lessons are now defaults on every performance engagement. The workload model review is a Week 1 checkpoint. The metric contract is defined before any test script is written.
The workload model was built during the first week, which was the right call. What I underestimated was how much the beta APM data reflected a non-representative user mix — beta cohort users were technically sophisticated early adopters who used the platform differently from the enterprise tenant population the product was actually targeting. By Week 3, when the load test results started showing which endpoints were under pressure, it was clear that two high-frequency operations in the beta data were low-frequency in the real target workload, and two operations that the beta data suggested were rare were actually central to enterprise usage patterns. I adjusted the workload model mid-engagement, which was fine — the process worked — but I would build a workload model review into the end of Week 1 now, explicitly asking the product manager and enterprise design team to sense-check the frequency tiers before building scenarios around them.
The second thing I'd change is the JMeter and k6 result schemas. Both tools wrote to InfluxDB, but I set them up with different metric names for equivalent measurements: k6 used http_req_duration with percentile aggregation, while JMeter used its Backend Listener with a custom schema. Cross-tool comparison in Grafana required manual query alignment. I'd define a shared metric contract upfront so that the p95 latency panel shows both tools' data on the same series without translation.
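One way a shared metric contract could look is sketched below. Only `http_req_duration` comes from the text; the JMeter measurement and field names, the k6 field name, and the canonical name `latency_p95_ms` are placeholders chosen for the example, since the custom schema itself is not described.

```typescript
type SourceTool = "k6" | "jmeter";

// One raw point as read back from the time-series store (hypothetical shape).
interface RawSample {
  tool: SourceTool;
  measurement: string;
  field: string;
  value: number; // both tools are assumed to report milliseconds
}

interface CanonicalSample {
  metric: "latency_p95_ms";
  tool: SourceTool;
  valueMs: number;
}

// The contract: for each tool, which measurement and field carry p95 latency.
// Names other than http_req_duration are illustrative assumptions.
const p95Contract: Record<SourceTool, { measurement: string; field: string }> = {
  k6: { measurement: "http_req_duration", field: "p95" },
  jmeter: { measurement: "jmeter", field: "pct95.0" },
};

// Maps a tool-specific sample onto the canonical series, or returns null when
// the sample is not the p95 latency metric for that tool.
function toCanonical(s: RawSample): CanonicalSample | null {
  const entry = p95Contract[s.tool];
  if (s.measurement !== entry.measurement || s.field !== entry.field) return null;
  return { metric: "latency_p95_ms", tool: s.tool, valueMs: s.value };
}
```

With a mapping like this agreed before any script is written, a single dashboard query on the canonical metric, grouped by tool, replaces the per-tool query alignment.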
Architectural patterns I now apply
Five patterns from this engagement I apply as defaults on every performance programme:
These five patterns are the residue of the engagement. They are not theoretical — each one was clarified or confirmed by something that went wrong or right during the six weeks.
- Workload model before test scripts. No scenario is written until the frequency tiers and time-of-day shaping are derived from real traffic data. Synthetic flat load is a trap: it produces results that look like a load test but miss the actual bottlenecks. If there is no production or beta APM data, the workload model is the first thing I ask the team to help build, using their best estimate of user behaviour, validated with the product owner.
- Tool selection follows protocol surface and team profile, not convention. k6 is not the default everywhere. JMeter is not legacy everywhere. The correct tool is the one that covers the protocol surface without gaps and that the team who will own it post-engagement is able to maintain. Heterogeneous tooling (k6 for the HTTP engineering gate, JMeter for the mixed-protocol perf-engineer surface) is the correct architecture when the programme has both.
- Five test types, not one. Every performance programme gets a coverage matrix: smoke, load, stress, spike, soak. The question is not "which one do we run?" but "when does each run, what tool runs it, and who acts on the result?" If any of the five is missing, the programme has a blind spot I name explicitly.
- CI gate at the right cadence for each test type. Smoke on every PR. Load nightly. Stress, spike, and soak on a scheduled cadence tied to release milestones and architecture changes. Over-gating (stress on every PR) and under-gating (load only pre-launch) are both failure modes.
- Capacity envelope as a deliverable. The stress test's output is not a pass/fail; it is the capacity envelope: the maximum sustained concurrency the system holds before the first error wave, and the concurrency at which errors become unacceptable. That envelope is handed to the ops and SRE team with a runbook: "if traffic crosses X, scale to Y before it becomes an incident." Without the envelope, the performance programme is a test; with it, it is an input to operations.
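The capacity envelope in the last pattern can be derived mechanically from stepped stress results. The sketch below is one possible reading, with assumptions stated: the `StressStep` shape is hypothetical, and the two error-rate cut-offs (0.1% for "first error wave", 5% for "unacceptable") are illustrative defaults, not figures from the engagement.

```typescript
// One plateau of a stepped stress run (hypothetical shape).
interface StressStep {
  concurrency: number;
  errorRate: number; // fraction of failed requests, 0..1
}

interface CapacityEnvelope {
  maxCleanConcurrency: number | null; // highest step before the first error wave
  unacceptableAt: number | null; // first step where errors become unacceptable
}

function capacityEnvelope(
  steps: StressStep[],
  firstWaveRate = 0.001, // assumed definition of "first error wave"
  unacceptableRate = 0.05, // assumed definition of "unacceptable"
): CapacityEnvelope {
  const sorted = [...steps].sort((a, b) => a.concurrency - b.concurrency);
  let maxCleanConcurrency: number | null = null;
  let unacceptableAt: number | null = null;
  let waveSeen = false;
  for (const step of sorted) {
    if (step.errorRate >= firstWaveRate) waveSeen = true;
    // Only steps below the first error wave count as clean capacity.
    if (!waveSeen) maxCleanConcurrency = step.concurrency;
    if (unacceptableAt === null && step.errorRate >= unacceptableRate) {
      unacceptableAt = step.concurrency;
    }
  }
  return { maxCleanConcurrency, unacceptableAt };
}
```

The two numbers map directly onto the runbook sentence: the first is the "X" at which scaling should start, and the second bounds how much headroom remains before an incident.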
For the JMeter reference architecture — distributed execution, InfluxDB/Grafana integration, mixed-protocol test plans, and the CI scheduling model — see JMeter Performance Testing for Enterprise Programmes.
For the k6 reference architecture — TypeScript scenario structure, threshold gating, tiered CI execution, and the Grafana drift-detection model — see k6 Performance Testing Pattern (Modern).
Related services