Ollama · Local Inference

VRAM Fit Check

Calculate whether a quantized LLM + KV cache fits entirely on your GPU — no CPU overflow.

Your GPU
GPU Model
GPU VRAM
GB
Model
Preset Model
Parameters (B)
Model Weight Quantization
Context Window & KV Cache
Context Length
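Corresponds to num_ctx in the Ollama Modelfile.

Ollama's default context length is small (2048 tokens historically; the exact value depends on your Ollama version) and far too low for agentic work. Set it in your Modelfile with PARAMETER num_ctx <value>.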

KV cache VRAM grows linearly with context length. Longer context = more VRAM reserved, less headroom for model weights.
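
The linear growth can be sketched with the usual back-of-the-envelope formula: one key and one value vector per layer per token. This is an illustrative estimate, not necessarily the exact formula this calculator uses, and the model shape below (32 layers, 8 KV heads, head dimension 128) is a hypothetical example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Approximate KV cache size: K and V vectors for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(context_len=ctx, **shape) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.2f} GiB at f16")
```

Doubling num_ctx doubles the estimate, which is why a large context window can consume as much VRAM as a sizeable share of the model weights.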
KV Cache Precision
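Set through the OLLAMA_KV_CACHE_TYPE environment variable.

Set it before launching Ollama. Quantized cache types also require Flash Attention to be enabled:
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

Default is f16. Dropping to q8_0 cuts cache memory roughly in half with minimal quality impact. q4_0 roughly halves it again (about a quarter of f16) but can degrade long-context coherence.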
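
To see where "half" and "a quarter" come from, here is a small sketch of bytes per cached element. It assumes the GGML block layouts for q8_0 and q4_0 (32 values plus one 2-byte scale per block), so the savings land slightly short of exact halving.

```python
# Bytes per cached element, assuming GGML block layouts:
#   q8_0: 32 int8 values + a 2-byte scale  -> 34 bytes per 32 elements
#   q4_0: 32 4-bit values + a 2-byte scale -> 18 bytes per 32 elements
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def relative_cache_size(cache_type):
    """Cache size for the given type as a fraction of the f16 cache size."""
    return BYTES_PER_ELEM[cache_type] / BYTES_PER_ELEM["f16"]

for cache_type in BYTES_PER_ELEM:
    print(f"{cache_type}: {relative_cache_size(cache_type):.0%} of f16")
```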
Model weights
KV cache
Free headroom
Model Weights (at selected quant)
KV Cache (at selected context)
Total Required (model + cache)
Headroom / Overflow (remaining VRAM)
Modelfile
Environment (if KV cache ≠ f16)
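
As a rough illustration of these two outputs, the sketch below renders a minimal Modelfile and the matching launch command. The function names and the base model name are hypothetical; FROM, PARAMETER num_ctx, and OLLAMA_KV_CACHE_TYPE are the only Ollama-specific pieces. OLLAMA_FLASH_ATTENTION=1 is included on the assumption, taken from Ollama's documentation, that quantized KV cache types need Flash Attention enabled.

```python
def render_modelfile(base_model, num_ctx):
    """Return Modelfile text that pins the context length for a base model."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"

def render_env(kv_cache_type):
    """Return a launch command, or an empty string when the default f16 is used."""
    if kv_cache_type == "f16":
        return ""
    return (
        "OLLAMA_FLASH_ATTENTION=1 "
        f"OLLAMA_KV_CACHE_TYPE={kv_cache_type} ollama serve"
    )

# "my-base-model" is a placeholder, not a real model tag.
print(render_modelfile("my-base-model", 16384))
print(render_env("q8_0"))
```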
Q4_K_M is the sweet spot for model weights. It cuts VRAM usage by ~72% vs F16 with minimal quality loss. On consumer hardware, the jump from Q4 to Q6 rarely justifies the extra VRAM cost unless you are seeing measurable accuracy problems.
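
The arithmetic behind that figure can be sketched as parameters times bits per weight. The bits-per-weight values are assumptions: 4.5 for Q4_K_M is chosen to match the ~72% figure, and real Q4_K_M files are usually somewhat larger because some tensors are kept at higher precision.

```python
# Approximate effective bits per weight; the Q4_K_M value is an assumption.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K_M": 4.5}

def weights_gb(params_billion, quant):
    """Estimated weight size in GB (1 GB = 1e9 bytes) for a parameter count in billions."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

def reduction_vs_f16(quant):
    """Fraction of VRAM saved relative to F16 weights."""
    return 1 - BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["F16"]

for quant in BITS_PER_WEIGHT:
    print(f"8B at {quant}: ~{weights_gb(8, quant):.1f} GB "
          f"({reduction_vs_f16(quant):.0%} smaller than F16)")
```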
KV cache at q8_0 is generally safe. Dropping KV precision from f16 to q8_0 halves cache VRAM with barely perceptible quality difference in most tasks. This is the recommended setting when you need to squeeze a larger context window into remaining VRAM.
KV cache at q4_0 is aggressive. Halves cache size again vs q8_0, but the KV cache is more sensitive to quantization than model weights — especially at long contexts. You may notice coherence drift in long agentic sessions. Use only if you cannot fit your context at q8_0.
Multi-GPU (count > 1): VRAM adds up, but tensors are split across cards. Ollama distributes model layers across GPUs, so the usable total is the sum of all cards and a model that overflows one card can still stay entirely on GPU. Inter-GPU bandwidth (PCIe) is much slower than intra-GPU memory bandwidth, so expect fewer tokens/sec than on a single card with equivalent VRAM. Still a solid option when the goal is zero CPU offload.
These are estimates, not exact values. Actual VRAM usage varies by Ollama version, CUDA overhead, runtime allocations, and context fill level. Add a 1–2 GB safety margin in practice. KV cache figures assume attention layers at standard head counts; MoE models, and models that use grouped-query attention with fewer KV heads, may differ.
Any overflow to CPU = major slowdown. Ollama will not crash — it will silently offload layers to system RAM. Performance drops 5–20× for offloaded layers. The goal of this tool is to confirm 100% GPU fit before you start a long agentic session.
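
The fit check itself reduces to a subtraction. The sketch below is a minimal version under stated assumptions: all sizes are in GB, multiple GPUs are treated as one pool, and the 1.5 GB default margin is simply a midpoint of the 1–2 GB range suggested above. The function and field names are hypothetical.

```python
def fit_check(vram_gb_per_gpu, gpu_count, weights_gb, kv_cache_gb, margin_gb=1.5):
    """Compare required VRAM (weights + KV cache + margin) against total GPU VRAM."""
    total_vram = vram_gb_per_gpu * gpu_count
    required = weights_gb + kv_cache_gb + margin_gb
    headroom = total_vram - required  # negative means overflow to CPU
    return {"required_gb": required, "headroom_gb": headroom, "fits": headroom >= 0}

# Example inputs are illustrative, not measurements.
print(fit_check(24, 1, weights_gb=14.6, kv_cache_gb=4.0))
print(fit_check(12, 1, weights_gb=14.6, kv_cache_gb=4.0))
```

A negative headroom_gb is the case to avoid: it means some layers would be offloaded to system RAM.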
MoE models load all weights but compute with only a fraction. A model like Gemma 4 26B MoE still requires VRAM for all 26B parameters at your chosen quantization, even though only 4B parameters are active per token. VRAM footprint is determined by total parameter count, not active parameter count.
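
A quick sketch of why the distinction matters, using the 26B total and 4B active figures from the example above and an assumed 4.5 bits per weight:

```python
def weights_gb(params_billion, bits_per_weight):
    """Estimated weight size in GB for a parameter count given in billions."""
    return params_billion * bits_per_weight / 8

TOTAL_B, ACTIVE_B = 26, 4   # total vs active parameters, in billions
BPW = 4.5                   # assumed effective bits per weight

print(f"VRAM needed for weights: ~{weights_gb(TOTAL_B, BPW):.1f} GB")
print(f"Wrong estimate from active params: ~{weights_gb(ACTIVE_B, BPW):.1f} GB")
```

Sizing from the active count would underestimate the weight footprint several times over.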