Rebuilt the on-prem inference stack on native Ollama — request reliability rose 65.8% → 90.0% and serial throughput stabilized at ~44 tok/s across all prompt sizes.
After deploying the load-test harness, we ran a full 12-cell benchmark before and after migrating the on-prem inference stack from the custom MLX shim to native Ollama. The results measured a step-change improvement across reliability, throughput, and consistency.
Successful requests climbed from 79/120 to 108/120. Four test cells that previously failed wholesale — medium_parallel_history, heavy_serial_history, heavy_parallel_no_history, and heavy_parallel_history — now pass. No cells deliver 0/10 after the migration.
Run A serial throughput was erratic — a sawtooth from 0.34 to 33.6 tok/s caused by the shim cold-loading the model on some requests. After the migration, serial throughput stabilized at ~40–45 tok/s regardless of prompt size. Medium-history cells went from 2.3 → 26.6 tok/s (+1055%) and 3.2 → 45.2 tok/s (+1315%). Native Ollama keeps the model resident with a warm KV cache — long-context requests no longer trigger cold reloads.
| Metric | Before (shim) | After (Ollama) |
|---|---|---|
| Request reliability | 65.8% (79/120) | 90.0% (108/120) |
| Cells with 0/10 failures | 4 cells | 0 cells |
| Serial throughput range | 0.34–33.6 tok/s | 26.6–45.2 tok/s |
| Run duration | ~43 min | ~22 min |
OLLAMA_NUM_PARALLEL=4 limits heavy concurrency to 4/10 — excess requests queue past the 120 s timeout. Concurrency control at LiteLLM will shed these with a clean 429 instead.num_ctx (Ollama default: 4096 tokens) — earlier turns are dropped. Setting num_ctx explicitly (≥16K) is a pending correctness fix.