Inference Reliability Leap

Rebuilt the on-prem inference stack on native Ollama — request reliability rose 65.8% → 90.0% and serial throughput stabilized at ~44 tok/s across all prompt sizes.

After deploying the load-test harness, we ran a full 12-cell benchmark before and after migrating the on-prem inference stack from the custom MLX shim to native Ollama. The results measured a step-change improvement across reliability, throughput, and consistency.

Reliability: 65.8% → 90.0%

Successful requests climbed from 79/120 to 108/120. Four test cells that previously failed wholesale — medium_parallel_history, heavy_serial_history, heavy_parallel_no_history, and heavy_parallel_history — now pass. No cells deliver 0/10 after the migration.

Serial throughput: stable at ~44 tok/s

Run A serial throughput was erratic — a sawtooth from 0.34 to 33.6 tok/s caused by the shim cold-loading the model on some requests. After the migration, serial throughput stabilized at ~40–45 tok/s regardless of prompt size. Medium-history cells went from 2.3 → 26.6 tok/s (+1055%) and 3.2 → 45.2 tok/s (+1315%). Native Ollama keeps the model resident with a warm KV cache — long-context requests no longer trigger cold reloads.

Comparison at a glance

Metric	Before (shim)	After (Ollama)
Request reliability	65.8% (79/120)	90.0% (108/120)
Cells with 0/10 failures	4 cells	0 cells
Serial throughput range	0.34–33.6 tok/s	26.6–45.2 tok/s
Run duration	~43 min	~22 min

Remaining work tracked

Heavy parallel cap: OLLAMA_NUM_PARALLEL=4 limits heavy concurrency to 4/10 — excess requests queue past the 120 s timeout. Concurrency control at LiteLLM will shed these with a clean 429 instead.
Context truncation: Multi-turn medium/heavy conversations silently exceed num_ctx (Ollama default: 4096 tokens) — earlier turns are dropped. Setting num_ctx explicitly (≥16K) is a pending correctness fix.