How the estimates work
Local LLM generation is dominated by memory bandwidth: to produce each token, the hardware streams the model's active weights (and the growing KV cache) out of memory. But raw bandwidth isn't the whole story — there's also a fixed per-token overhead that puts a ceiling on speed, which is why a chip with twice the bandwidth isn't twice as fast. We model the time for each token directly:
tokens/sec = 1 ÷ (weight_read_time + kv_cache_read_time + fixed_overhead)
weight_read_time = active_model_size ÷ (memory_bandwidth × efficiency)
kv_cache_read_time = kv_cache_size ÷ (memory_bandwidth × efficiency)
- Active model size = active parameters × bits-per-weight ÷ 8. For mixture-of-experts models only a fraction of parameters run per token, so they are faster than their download size suggests.
- KV cache grows with your context length, so longer contexts mean slower generation: the context selector affects the speed estimate, not just whether a model fits. By default we assume an fp16 cache (2 bytes per element); the KV-cache selector models Q8_0 (≈½) and Q4_0 (≈¼) quantization, a runtime setting (e.g. llama.cpp --cache-type-k/-v) that shrinks the cache and speeds up long-context generation at a small cost in quality. It can't be read from a model on Hugging Face because it's your choice at inference time, so we let you pick it here.
- Efficiency and fixed overhead are calibrated per hardware class (Apple unified memory, discrete GPU, CPU).
- Concurrent streams (serving several requests at once) share one copy of the weights but each needs its own KV cache, so memory use grows with the stream count and large models may stop fitting. Because the weight read is amortized across the batch, the speed we show is per stream — aggregate throughput is roughly the stream count times that, with diminishing returns as batched decode becomes compute-bound. Real batching efficiency varies by runtime (vLLM is built for it; llama.cpp less so), so treat the multi-stream numbers as a first-order guide.
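As a rough illustration, the per-token formula above can be written out in a few lines of Python. This is a minimal sketch, not our production code: the function name and every input value (parameter count, bits per weight, cache size, bandwidth, efficiency, overhead) are hypothetical numbers chosen only to show the shape of the calculation.

```python
def tokens_per_second(active_params, bits_per_weight, kv_cache_bytes,
                      bandwidth_gb_s, efficiency, fixed_overhead_s):
    """Decode speed from the per-token time model described above."""
    # Active model size = active parameters x bits-per-weight / 8 (bytes)
    active_model_bytes = active_params * bits_per_weight / 8
    # Effective bandwidth in bytes per second
    effective_bw = bandwidth_gb_s * 1e9 * efficiency
    weight_read_time = active_model_bytes / effective_bw
    kv_cache_read_time = kv_cache_bytes / effective_bw
    return 1.0 / (weight_read_time + kv_cache_read_time + fixed_overhead_s)


# Hypothetical inputs: 8B dense model at ~4.5 bits/weight, 1 GB KV cache,
# efficiency 0.7, 2 ms fixed overhead. None of these are calibrated values.
base = tokens_per_second(8e9, 4.5, 1e9, 400, 0.7, 0.002)
doubled = tokens_per_second(8e9, 4.5, 1e9, 800, 0.7, 0.002)
print(f"{base:.1f} tok/s at 400 GB/s, {doubled:.1f} tok/s at 800 GB/s")
```

With these made-up inputs, doubling the bandwidth raises the speed by less than 2×, because the fixed overhead term does not shrink with bandwidth.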
Time to first token is a separate story. Before any token is generated the model must read your whole prompt (“prefill”), which is compute-bound (≈ 2 FLOPs per active parameter per token), not bandwidth-bound like generation. So a chip with great memory bandwidth but modest compute can stream tokens quickly yet still take many seconds to start at long context. We estimate prefill from each device's approximate fp16 compute and show the time to first token at your selected context length. These compute figures aren't benchmark-calibrated, so treat the time-to-first-token as a rough ballpark.
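The prefill estimate follows directly from the ≈ 2 FLOPs per active parameter per token rule. The sketch below assumes the device sustains its full quoted fp16 compute, which real hardware rarely does; the function name and the example figures (8B active parameters, a 32,000-token prompt, 30 TFLOPS) are hypothetical.

```python
def time_to_first_token(active_params, prompt_tokens, compute_tflops):
    """Rough prefill time in seconds, assuming compute-bound prefill."""
    # About 2 FLOPs per active parameter for every prompt token
    flops_needed = 2 * active_params * prompt_tokens
    # Assumption: the quoted fp16 compute is fully utilized
    return flops_needed / (compute_tflops * 1e12)


# Hypothetical example: 8B active parameters, 32,000-token prompt, 30 TFLOPS
ttft = time_to_first_token(8e9, 32_000, 30)
print(f"about {ttft:.0f} s before the first token")
```

The estimate scales linearly with prompt length, so halving the context roughly halves the wait.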
Number format (NVFP4, INT8/FP8, and quantization) matters for prefill but barely for decode, and the reason is the same compute-bound versus bandwidth-bound split. Decode is bandwidth-bound, so what counts is how many bits each weight occupies: a 4-bit weight streams in the same time whether it's an integer K-quant (Q4) or NVIDIA's NVFP4, a hardware 4-bit float on Blackwell. NVFP4's advantage over a plain 4-bit integer is quality per bit (a shared scale and a floating-point mantissa preserve more of the model), not raw decode speed, so it doesn't move the tokens/sec we show, which are calibrated to llama.cpp/Ollama integer quants.
Prefill is compute-bound, and that's where a card's tensor-core formats decide throughput. Each NVIDIA generation added a lower-precision tensor path that roughly doubles peak math: INT8 on Ampere (A100), FP8 on Ada and Hopper (L40S, H100/H200), and FP4/NVFP4 on Blackwell (RTX Pro 6000, B200, GB300). The prefill compute figures for the workstation and datacenter cards below already reflect that tensor-core advantage, which is why they sit well ahead of a consumer card at the same memory bandwidth. NVFP4 itself is consumed today by TensorRT-LLM and vLLM, not GGUF/llama.cpp, so on this site it shows up as faster prefill on Blackwell rather than a new decode number.
Whether a model fits is decided by the total weights at a given quantization, plus the KV cache for your chosen context length, plus runtime overhead — compared against the usable portion of your VRAM or unified memory.
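The fit check can be sketched as a simple sum compared against usable memory. The KV-cache formula used here (2 × layers × KV heads × head dimension × context × bytes per element) is the standard sizing for a transformer cache; the usable fraction, the runtime overhead, the model shape, and the function names are all hypothetical placeholders, not the values this site uses.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2.0, streams=1):
    """KV cache size; each stream keeps its own cache."""
    # Factor 2: one K and one V tensor per layer
    per_stream = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return per_stream * streams


def fits(total_params, bits_per_weight, kv_bytes, memory_bytes,
         usable_fraction=0.9, runtime_overhead_bytes=1e9):
    """True if weights + KV cache + overhead fit in usable memory."""
    weight_bytes = total_params * bits_per_weight / 8
    needed = weight_bytes + kv_bytes + runtime_overhead_bytes
    return needed <= memory_bytes * usable_fraction


# Hypothetical shape: 32 layers, 8 KV heads, head dim 128, 8192-token context
kv_one = kv_cache_bytes(32, 8, 128, 8192)              # fp16 cache, 1 stream
kv_four = kv_cache_bytes(32, 8, 128, 8192, streams=4)  # 4 concurrent streams

fits_one = fits(8e9, 4.5, kv_one, 8e9)    # 8B model on an 8 GB device
fits_four = fits(8e9, 4.5, kv_four, 8e9)  # same model, 4 streams
print(fits_one, fits_four)
```

Passing bytes_per_elem=1.0 or 0.5 approximates the Q8_0 and Q4_0 cache settings, and raising streams shows how concurrency can push a model that fits on its own past the memory limit.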
On a discrete GPU, a model that overflows VRAM isn't necessarily out of reach: llama.cpp can keep some layers on the GPU and offload the rest to system RAM, which runs — slowly, because the RAM-resident weights are read at a fraction of VRAM bandwidth each token. When that applies we show the offload speed and how much spills to RAM, assuming a typical desktop with ~64 GB of DDR5. Apple unified memory has no VRAM/RAM split, so offload doesn't apply there.
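To see why offload runs slowly, it helps to split the per-token read into a VRAM part and a RAM part. This is a simplified sketch for a dense model: it ignores efficiency factors, the KV cache, and PCIe transfer costs, and the function name and all figures (a 40 GB model, a 22 GB VRAM budget, 1000 GB/s VRAM, 80 GB/s system RAM) are hypothetical.

```python
def offload_tokens_per_second(model_bytes, vram_budget_bytes,
                              vram_bw_gb_s, ram_bw_gb_s,
                              fixed_overhead_s=0.0):
    """Return (tokens/sec, bytes spilled to system RAM)."""
    in_vram = min(model_bytes, vram_budget_bytes)
    in_ram = model_bytes - in_vram
    # Every token reads both portions, each at its own bandwidth
    time_per_token = (in_vram / (vram_bw_gb_s * 1e9)
                      + in_ram / (ram_bw_gb_s * 1e9)
                      + fixed_overhead_s)
    return 1.0 / time_per_token, in_ram


# Hypothetical: 40 GB of weights, 22 GB available in VRAM
speed, spilled = offload_tokens_per_second(40e9, 22e9, 1000, 80)
print(f"{speed:.1f} tok/s with {spilled / 1e9:.0f} GB in system RAM")
```

In this example the RAM-resident portion is less than half the model, yet it accounts for roughly ten times more of the per-token time than the VRAM portion.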
Calibration
The speed model is fitted against real measured token-generation benchmarks from the llama.cpp benchmark threads, XiongjieDai's GPU-Benchmarks-on-LLM-Inference, and LocalScore. The Apple-silicon fit explains 98% of the variance in the measured data; the discrete-GPU fit, 90%.
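One way such a fit can be done, shown here only as an illustration, is to note that the per-token time is linear in the ideal read time: time = (bytes ÷ bandwidth) ÷ efficiency + fixed_overhead. An ordinary least-squares line then gives efficiency from the slope and overhead from the intercept, and R² gives the share of variance explained. Whether our fit uses exactly this procedure is not stated above, so treat the approach as an assumption; the sample data is synthetic, generated from the model itself, and is not a real benchmark.

```python
def fit_efficiency_and_overhead(samples):
    """samples: (bytes_read, bandwidth_bytes_per_s, measured_tok_s) tuples.

    Returns (efficiency, fixed_overhead_s, r_squared).
    """
    xs = [b / bw for b, bw, _ in samples]      # ideal read time per token
    ys = [1.0 / tps for _, _, tps in samples]  # measured time per token
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 / slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic round-trip check: data generated with efficiency 0.7 and
# 2 ms overhead (made-up values), then recovered by the fit.
bandwidth = 400e9
synthetic = [(b, bandwidth, 1.0 / (b / (bandwidth * 0.7) + 0.002))
             for b in (2e9, 4.5e9, 9e9, 20e9)]
eff, overhead, r2 = fit_efficiency_and_overhead(synthetic)
print(f"efficiency={eff:.3f}, overhead={overhead * 1000:.2f} ms, R^2={r2:.3f}")
```

Real measurements are noisy, so a fit on actual benchmark data would land below a perfect R².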
As the crowdsourced reports below accumulate, we periodically re-fit the same constants against the accepted submissions and update them when the data warrants — so every benchmark you contribute directly sharpens the estimates everyone sees.
These are estimates, shown as ranges. They're calibrated to the mainstream llama.cpp / Ollama setup, which is the default; the runtime selector adjusts the estimate for faster backends — MLX on Apple silicon, vLLM or ExLlamaV2 on discrete GPUs — using approximate per-runtime factors (themselves refined over time by the crowdsourced reports, which record the engine used). Real numbers also vary with OS, thermal state, and build. The goal is a reliable ballpark for every machine, not a benchmark. CPU-only estimates are not yet benchmark-calibrated and are rougher.
Measured speeds
Alongside our estimates, we show crowdsourced measured speeds when people report them. On the contribute page anyone can paste the raw timing output from their own llama.cpp or Ollama run; we parse the tokens-per-second from it (never a self-typed number), sanity-check it against the estimate, and store it anonymously. Once a given device, model, and quantization has at least three accepted reports, cards and the chart show the median of them, with a count of how many back it — so no single submission can move the number. A measured median is a real number from real hardware, so trust it over the estimate when both are present — the estimate is the prediction, the measured value is the ground truth filling it in. Submissions are rate-limited and gated by a lightweight proof-of-work check (no third-party CAPTCHA); we keep only the one parsed benchmark line, not your full paste.
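The three-report rule and the median can be sketched in a few lines. The function and constant names are hypothetical, and the input is assumed to be the already-accepted tokens-per-second values for one device, model, and quantization.

```python
from statistics import median

MIN_REPORTS = 3  # threshold described above


def measured_speed(accepted_reports):
    """Return (median_tok_s, report_count), or None below the threshold."""
    if len(accepted_reports) < MIN_REPORTS:
        return None
    return median(accepted_reports), len(accepted_reports)


# Hypothetical reports: one outlier barely affects the median
print(measured_speed([41.2, 43.0]))        # too few reports
print(measured_speed([41.2, 43.0, 95.0]))  # median of three
```

Using the median rather than the mean is what keeps a single unusually high or low submission from moving the displayed number.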
Capability score
The capability score (0–100) lets you pick the strongest model you can actually run. Its grounding varies by model, and we say which on every card:
- Benchmark-anchored — where a model has a public LMArena Elo, the score is anchored to it and the card shows the Elo. These are grounded in a real, independent benchmark.
- Editorial estimate — brand-new open models often have no clean, machine-readable public benchmark yet, so their score is our estimate from published results and size class, clearly labeled as such. A score upgrades to benchmark-anchored automatically once the model is rated.
Scores reflect full-precision weights; heavy quantization (e.g. Q4) may run a few percent weaker on math and reasoning. Quality data credit: LMArena (CC BY 4.0).