Load Testing Framework

12-cell matrix harness measures throughput across size, concurrency, and history dimensions.

Before optimising performance you need to measure it. We built a purpose-designed load testing harness that stress-tests the inference stack across three independent dimensions simultaneously.

Test matrix

Dimension	Values	Purpose
Request size	small (128 tok), medium (384 tok), heavy (1024 tok)	Model the full token range
Concurrency	serial, parallel (10 simultaneous)	Stress the inference queue
Context	no history, with history	Measure prompt-length degradation

12 cells × 10 requests each = 120 total requests per run. The harness uses the Qwen tokenizer for accurate token budgeting and records wall time, tokens/second, HTTP status, and full usage breakdown per request.

Key finding

Peak throughput: 33.59 tok/s (small, serial, no history). The Semaphore(1) serialisation in the shim caused a 10× collapse under parallel load — the primary driver of the Ollama migration.