12-cell matrix harness measures throughput across size, concurrency, and history dimensions.
Before optimising performance you need to measure it. We built a purpose-designed load testing harness that stress-tests the inference stack across three independent dimensions simultaneously.
| Dimension | Values | Purpose |
|---|---|---|
| Request size | small (128 tok), medium (384 tok), heavy (1024 tok) | Model the full token range |
| Concurrency | serial, parallel (10 simultaneous) | Stress the inference queue |
| Context | no history, with history | Measure prompt-length degradation |
12 cells × 10 requests each = 120 total requests per run. The harness uses the Qwen tokenizer for accurate token budgeting and records wall time, tokens/second, HTTP status, and full usage breakdown per request.
Peak throughput: 33.59 tok/s (small, serial, no history). The Semaphore(1) serialisation in the shim caused a 10× collapse under parallel load — the primary driver of the Ollama migration.