// ACHIEVEMENTS.LOG / MAY 22, 2026
◈ TECHNICAL

Inference Reliability Leap

Rebuilt the on-prem inference stack on native Ollama — request reliability rose 65.8% → 90.0% and serial throughput stabilized at ~44 tok/s across all prompt sizes.

After deploying the load-test harness, we ran a full 12-cell benchmark before and after migrating the on-prem inference stack from the custom MLX shim to native Ollama. The results measured a step-change improvement across reliability, throughput, and consistency.

Reliability: 65.8% → 90.0%

Successful requests climbed from 79/120 to 108/120. Four test cells that previously failed wholesale — medium_parallel_history, heavy_serial_history, heavy_parallel_no_history, and heavy_parallel_history — now pass. No cells deliver 0/10 after the migration.

Serial throughput: stable at ~44 tok/s

Run A serial throughput was erratic — a sawtooth from 0.34 to 33.6 tok/s caused by the shim cold-loading the model on some requests. After the migration, serial throughput stabilized at ~40–45 tok/s regardless of prompt size. Medium-history cells went from 2.3 → 26.6 tok/s (+1055%) and 3.2 → 45.2 tok/s (+1315%). Native Ollama keeps the model resident with a warm KV cache — long-context requests no longer trigger cold reloads.

Comparison at a glance

MetricBefore (shim)After (Ollama)
Request reliability65.8% (79/120)90.0% (108/120)
Cells with 0/10 failures4 cells0 cells
Serial throughput range0.34–33.6 tok/s26.6–45.2 tok/s
Run duration~43 min~22 min

Remaining work tracked