VRAM Calculator — Local LLM Fit Check

Context Length

num_ctx in Ollama Modelfile.

Ollama's default context window is small (2048 tokens in older releases, 4096 in more recent ones), which is far too low for agentic work. Set this in your Modelfile with PARAMETER num_ctx <value>.
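As a rough illustration of the PARAMETER line, the sketch below renders a two-line Modelfile as text. The helper name render_modelfile, the base model tag and the 32768 value are placeholders chosen for the example, not recommendations from this page.

```python
def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that sets the context length for a base model."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    # FROM names the base model; PARAMETER num_ctx sets the context window.
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


# Placeholder model tag and context length, for illustration only.
modelfile_text = render_modelfile("llama3.1:8b", 32768)
print(modelfile_text)
```

Saved to a file named Modelfile, text like this can be built into a local model with ollama create <name> -f Modelfile.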

KV cache VRAM grows linearly with context length. Longer context = more VRAM reserved, less headroom for model weights.
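The linear growth can be seen in the usual back-of-the-envelope formula for a transformer KV cache: one K and one V tensor per layer, each holding n_kv_heads × head_dim values per token. The sketch below assumes that formula and uses illustrative architecture numbers (32 layers, 8 KV heads, head dimension 128); it is not necessarily the exact calculation this tool performs.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_element=2.0):
    """Estimate KV cache size in bytes for a full context of num_ctx tokens."""
    # The factor 2 accounts for storing both keys and values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_element


# Illustrative architecture (an assumption, not read from any real model file).
for ctx in (8192, 16384, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, num_ctx=ctx)
    print(f"{ctx:>6} tokens -> {size / GIB:.1f} GiB at f16")
```

With these assumed dimensions the f16 cache comes to 1 GiB at 8192 tokens and doubles each time the context doubles.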

KV Cache Precision

OLLAMA_KV_CACHE_TYPE environment variable.

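Set before launching Ollama:
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

Default is f16. Dropping to q8_0 cuts cache memory roughly in half with minimal quality impact. q4_0 roughly halves it again but can degrade long-context coherence. A quantized cache type only takes effect when flash attention is enabled, hence OLLAMA_FLASH_ATTENTION=1 above.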
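To see why "half" is an approximation, the sketch below compares storage per cached value, assuming the cache types use the ggml block layouts of the same names (32 values per block plus one 2-byte scale). The calculator itself may simply use 1/2 and 1/4.

```python
# Bytes per stored value, under the assumed ggml block layouts.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 one-byte values + one 2-byte scale per block
    "q4_0": 18 / 32,  # 32 four-bit values (16 bytes) + one 2-byte scale per block
}


def cache_size_relative_to_f16(cache_type: str) -> float:
    """Return the cache size for cache_type as a fraction of the f16 size."""
    return BYTES_PER_ELEMENT[cache_type] / BYTES_PER_ELEMENT["f16"]


for name in BYTES_PER_ELEMENT:
    print(f"{name}: {cache_size_relative_to_f16(name):.1%} of f16")
```

Under these assumptions q8_0 lands slightly above one half of f16 and q4_0 slightly above one quarter, because each block carries its own scale.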

Result

VRAM bar: Model weights, KV cache, Free headroom, Overflow (CPU)

Model Weights: at selected quant
KV Cache: at selected context
Total Required: model + cache
Headroom / Overflow: remaining VRAM

Generated output: Modelfile, Environment (if KV cache ≠ f16)

Notes & Disclaimers

Q4_K_M is the sweet spot for model weights. It cuts VRAM usage by ~72% vs F16 with minimal quality loss. On consumer hardware, the jump from Q4 to Q6 rarely justifies the extra VRAM cost unless you are seeing measurable accuracy problems.
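A quick way to sanity-check the savings figure is to estimate weight size as parameters × bits per weight. The sketch below assumes about 4.5 bits per weight for Q4_K_M, chosen only because it reproduces the ~72% figure; real GGUF files are often somewhat larger, and the 8B parameter count is just an example.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Estimate weight size in GB: 1e9 params * bits / 8 bits-per-byte / 1e9."""
    return params_billion * bits_per_weight / 8


def savings_vs_f16(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


Q4_K_M_BITS = 4.5  # assumed effective bits per weight, not a measured value

print(f"8B at F16:    {weight_gb(8, 16):.1f} GB")
print(f"8B at Q4_K_M: {weight_gb(8, Q4_K_M_BITS):.1f} GB")
print(f"Savings:      {savings_vs_f16(Q4_K_M_BITS):.0%}")
```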

KV cache at q8_0 is generally safe. Dropping KV precision from f16 to q8_0 roughly halves cache VRAM with a barely perceptible quality difference in most tasks. This is the recommended setting when you need to squeeze a larger context window into remaining VRAM.

KV cache at q4_0 is aggressive. It roughly halves cache size again vs q8_0, but the KV cache is more sensitive to quantization than model weights, especially at long contexts. You may notice coherence drift in long agentic sessions. Use only if you cannot fit your context at q8_0.

Multi-GPU (count > 1): VRAM adds up, but the model is split across cards. Ollama distributes model layers across GPUs, so the usable total is approximately the sum of the cards' VRAM, and CPU overflow is avoided when the model fits in that total. Transfers between GPUs over PCIe are much slower than memory access within a single card, so expect reduced tokens/sec vs a single card with equivalent VRAM. Still a solid option when the goal is zero CPU offload.

These are estimates, not exact values. Actual VRAM usage varies by Ollama version, CUDA overhead, runtime allocations, and context fill level. Add a 1–2 GB safety margin in practice. KV cache figures assume attention layers at standard head counts — MoE models may vary slightly.

Any overflow to CPU = major slowdown. Ollama will not crash — it will silently offload layers to system RAM. Performance drops 5–20× for offloaded layers. The goal of this tool is to confirm 100% GPU fit before you start a long agentic session.
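The fit check itself reduces to simple arithmetic: total required is weights plus cache, and whatever exceeds available VRAM spills to system RAM. The sketch below is one way to express that, assuming VRAM simply sums across GPUs and using a 1.5 GB safety margin picked from the 1–2 GB range mentioned above. The names FitResult and check_fit are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class FitResult:
    total_required_gb: float  # model weights + KV cache
    headroom_gb: float        # VRAM left after weights, cache and margin
    overflow_gb: float        # amount that would be offloaded to system RAM
    fits: bool                # True only for a 100% GPU fit


def check_fit(weights_gb, kv_cache_gb, vram_per_gpu_gb, gpu_count=1,
              safety_margin_gb=1.5):
    """Compare estimated memory needs against total VRAM across all GPUs."""
    total_vram = vram_per_gpu_gb * gpu_count
    total_required = weights_gb + kv_cache_gb
    spare = total_vram - total_required - safety_margin_gb
    return FitResult(
        total_required_gb=total_required,
        headroom_gb=max(spare, 0.0),
        overflow_gb=max(-spare, 0.0),
        fits=spare >= 0,
    )


print(check_fit(weights_gb=4.5, kv_cache_gb=4.0, vram_per_gpu_gb=12))
print(check_fit(weights_gb=16.0, kv_cache_gb=4.0, vram_per_gpu_gb=12))
print(check_fit(weights_gb=16.0, kv_cache_gb=4.0, vram_per_gpu_gb=12, gpu_count=2))
```

In this example the second configuration overflows on a single 12 GB card but fits once a second card is added, which matches the multi-GPU note.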

MoE models load all weights but compute with only a fraction. A model like Gemma 4 26B MoE still requires VRAM for all 26B parameters at your chosen quantization; only about 4B parameters are active in the compute for each token. VRAM footprint is determined by total parameter count, not active parameter count.
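To make the distinction concrete, the sketch below sizes a 26B-total, 4B-active model from its total parameter count. The class name MoEModel and the 4.5 bits-per-weight figure are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    total_params_billion: float
    active_params_billion: float

    def weight_vram_gb(self, bits_per_weight: float) -> float:
        """Weights in GB; every expert must be resident, so use the total."""
        return self.total_params_billion * bits_per_weight / 8

    def active_fraction(self) -> float:
        """Share of parameters used per token (affects speed, not memory)."""
        return self.active_params_billion / self.total_params_billion


model = MoEModel(total_params_billion=26, active_params_billion=4)
bits = 4.5  # assumed effective bits per weight for a 4-bit quant

print(f"VRAM for weights: {model.weight_vram_gb(bits):.1f} GB")
print(f"Active per token: {model.active_fraction():.0%} of parameters")
```

Sizing from the 4B active count instead would underestimate the weight footprint by the same ratio, roughly 6.5× here.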