Notes & Disclaimers
Q4_K_M is the sweet spot for model weights. It cuts VRAM usage by ~72% vs F16 with minimal quality loss. On consumer hardware, the jump from Q4 to Q6 rarely justifies the extra VRAM cost unless you are seeing measurable accuracy problems.
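The weight savings above can be sketched as simple arithmetic. This is a minimal estimate, not the tool's actual code: the bits-per-weight values are assumptions, with Q4_K_M set to 4.5 because that is the value implied by the ~72% figure (real GGUF files are often somewhat larger, since some tensors stay at higher precision). The function names are hypothetical.

```python
# Approximate bits per weight for a few GGUF quantizations.
# ASSUMPTION: illustrative values only; measured files differ slightly.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K_M": 4.5}


def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Estimate VRAM for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    # params (in billions) * bits / 8 gives gigabytes directly.
    return params_billion * bits / 8


def savings_vs_f16(quant: str) -> float:
    """Fraction of weight VRAM saved relative to F16."""
    return 1.0 - BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["F16"]


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        size = weight_vram_gb(8, quant)
        saved = savings_vs_f16(quant)
        print(f"{quant:7s} 8B model: {size:5.2f} GB ({saved:.0%} saved vs F16)")
```

Comparing the Q4_K_M and Q6_K rows for your model size shows how much VRAM the higher-precision weights would take away from the context window.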
KV cache at q8_0 is generally safe. Dropping KV precision from f16 to q8_0 halves cache VRAM with barely perceptible quality difference in most tasks. This is the recommended setting when you need to squeeze a larger context window into remaining VRAM.
KV cache at q4_0 is aggressive. It halves cache size again vs q8_0, but the KV cache is more sensitive to quantization than model weights, especially at long contexts. You may notice coherence drift in long agentic sessions. Use it only if you cannot fit your context at q8_0.
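The three KV cache settings can be compared with a rough size formula. This is a sketch under stated assumptions: one K and one V tensor per layer, a cache filled to the full context length, and ggml-style block storage in which q8_0 and q4_0 keep a 2-byte scale per 32 values (so each step is slightly more than an exact halving). The model dimensions in the usage lines are hypothetical.

```python
# Bytes per cached element.
# q8_0: 32 int8 values + 2-byte scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values + 2-byte scale = 18 bytes per 32 elements.
KV_BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, kv_type: str = "f16") -> float:
    """Estimate KV cache size in GB (1 GB = 1e9 bytes) at full context."""
    # Factor of 2: keys and values are both cached for every layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * KV_BYTES_PER_ELEM[kv_type] / 1e9


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads of dimension 128, 32k context.
    for kv_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_gb(32, 8, 128, 32768, kv_type)
        print(f"{kv_type:5s}: {size:.2f} GB")
```

The cache grows linearly with context length, so doubling the context window doubles every figure this prints.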
Multi-GPU (count > 1): VRAM adds up, with model layers split across cards. Ollama distributes layers across GPUs, so the usable total is roughly the sum of each card's VRAM, and CPU overflow is avoided as long as the model fits within that sum. Traffic between GPUs goes over PCIe, which is much slower than a single card's on-board memory bandwidth, so expect fewer tokens/sec than a single card with equivalent VRAM. Still a solid option when the goal is zero CPU offload.
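To see why summed VRAM usually (but not always exactly) translates into fit, here is a simplified placement sketch. It assumes equal-sized layers that cannot be split between cards and fills GPUs in order; it is an illustration only, not Ollama's actual scheduler, and `split_layers` is a hypothetical helper.

```python
def split_layers(n_layers: int, layer_gb: float, gpu_free_gb: list[float]):
    """Place whole layers on GPUs in order.

    Returns (layers_per_gpu, layers_left_for_cpu).
    """
    counts = []
    remaining = n_layers
    for free in gpu_free_gb:
        # Only whole layers fit, so any leftover fraction on a card is wasted.
        fits = min(remaining, int(free // layer_gb))
        counts.append(fits)
        remaining -= fits
    return counts, remaining


if __name__ == "__main__":
    # Hypothetical 80-layer model at 0.5 GB per layer.
    print(split_layers(80, 0.5, [24.0]))        # one card
    print(split_layers(80, 0.5, [24.0, 24.0]))  # two cards
```

Because layers are placed whole, the usable capacity can be slightly below the raw sum of VRAM across cards.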
These are estimates, not exact values. Actual VRAM usage varies by Ollama version, CUDA overhead, runtime allocations, and context fill level. Add a 1–2 GB safety margin in practice. KV cache figures assume standard attention layers and head counts; MoE models may vary slightly.
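The safety margin can be folded directly into a fit check. A minimal sketch, assuming you already have estimates for weights and KV cache in GB; the 1.5 GB default is simply the midpoint of the 1–2 GB range above, and `fits_on_gpu` is a hypothetical name.

```python
def fits_on_gpu(weights_gb: float, kv_cache_gb: float, vram_gb: float,
                margin_gb: float = 1.5):
    """Return (fits, headroom_gb) after reserving a safety margin.

    headroom_gb is negative when the estimate exceeds available VRAM.
    """
    needed = weights_gb + kv_cache_gb + margin_gb
    return needed <= vram_gb, vram_gb - needed


if __name__ == "__main__":
    fits, headroom = fits_on_gpu(weights_gb=14.6, kv_cache_gb=4.3, vram_gb=24.0)
    print(f"fits={fits}, headroom={headroom:.1f} GB")
```

A small positive headroom is a warning sign rather than a guarantee, since the inputs are themselves estimates.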
Any overflow to CPU = major slowdown. Ollama will not crash; it will silently offload layers to system RAM. Offloaded layers run roughly 5–20× slower than layers on the GPU. The goal of this tool is to confirm 100% GPU fit before you start a long agentic session.
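A back-of-envelope model shows why even a small overflow hurts. It assumes every layer costs the same time on the GPU, that an offloaded layer costs that time multiplied by a fixed slowdown factor, and that transfer overhead is ignored; real behavior will differ, and `relative_speed` is a hypothetical helper.

```python
def relative_speed(n_layers: int, cpu_layers: int, cpu_slowdown: float) -> float:
    """Tokens/sec relative to an all-GPU run (1.0 means no slowdown)."""
    if not 0 <= cpu_layers <= n_layers:
        raise ValueError("cpu_layers must be between 0 and n_layers")
    gpu_layers = n_layers - cpu_layers
    # Per-token cost in units of "one GPU layer".
    cost = gpu_layers + cpu_layers * cpu_slowdown
    return n_layers / cost


if __name__ == "__main__":
    # Hypothetical 40-layer model with a 10x slowdown per offloaded layer.
    for cpu_layers in (0, 4, 20, 40):
        speed = relative_speed(40, cpu_layers, 10.0)
        print(f"{cpu_layers:2d} layers on CPU -> {speed:.2f}x of full-GPU speed")
```

Under these assumptions, offloading 4 of 40 layers at a 10× slowdown already cuts throughput to roughly half, because the slow layers dominate the per-token cost.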
MoE models load all weights but compute with a fraction. A model like Gemma 4 26B MoE still requires VRAM for all 26B parameters at your chosen quantization; only about 4B parameters are active in the compute for each token. VRAM footprint is determined by total parameter count, not active parameter count.
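The distinction between total and active parameters can be made concrete with a small sketch. The 26B and 4B figures come from the note above; the 4.5 bits per weight is an assumed illustrative value, and the `MoEModel` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    total_params_b: float    # all experts, in billions of parameters
    active_params_b: float   # parameters used per token, in billions
    bits_per_weight: float   # ASSUMPTION: average for the chosen quantization

    def weight_vram_gb(self) -> float:
        """VRAM for weights: driven by TOTAL parameters, since all are loaded."""
        return self.total_params_b * self.bits_per_weight / 8

    def active_fraction(self) -> float:
        """Share of parameters that take part in computing each token."""
        return self.active_params_b / self.total_params_b


if __name__ == "__main__":
    model = MoEModel(total_params_b=26, active_params_b=4, bits_per_weight=4.5)
    print(f"Weights in VRAM: {model.weight_vram_gb():.3f} GB")
    print(f"Active per token: {model.active_fraction():.0%} of parameters")
```

Sizing from the active count instead would underestimate the footprint by the inverse of the active fraction, which is why the tool uses the total.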