llama.cpp vs Ollama
The local-inference engine vs the wrapper built on it.
First, the naming: “llama” here means llama.cpp (the inference engine), not Meta’s Llama model family. Ollama’s name is a play on it, which is fitting because Ollama is built on top of llama.cpp — they’re not really competitors so much as different layers of the same stack.
llama.cpp is the low-level C/C++ inference engine (built on the GGML tensor library) that
actually runs GGUF models. It's a library plus CLI tools (llama-cli, llama-server) and ships
the GPU backends (CUDA, Metal, Vulkan, among others) for offloading layers. It gives you the
most control and the least overhead, but you manage everything yourself: find the right .gguf,
pick the quantization, set context length and GPU-offload flags, wire up the chat template.
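To make "you manage everything yourself" concrete, here is a minimal sketch of the choices llama-server expects you to spell out. The helper name, the model path, and the default values are hypothetical, chosen only for illustration; the flags are the short forms for model path, context size, GPU layers, and port.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=4096, gpu_layers=99, port=8080):
    """Return the argv list for launching llama-server with explicit settings.

    The defaults are illustrative assumptions, not recommendations:
    gpu_layers=99 is a common way of asking for "as many layers as exist".
    """
    return [
        "llama-server",
        "-m", model_path,          # path to the .gguf file you picked
        "-c", str(ctx_size),       # context length in tokens
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
        "--port", str(port),       # HTTP port for the built-in server
    ]


# Hypothetical model path; substitute a GGUF file you have downloaded.
cmd = build_llama_server_cmd("models/example-q4_k_m.gguf", ctx_size=8192)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd) would start the server, assuming llama-server is on your PATH. Each argument is a decision that Ollama would otherwise make for you.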
Ollama wraps that engine in developer ergonomics. It adds a model registry
(ollama pull llama3), automatic download and caching, sensible default parameters and
templates baked into a Modelfile, and a persistent background server exposing a REST API
(and an OpenAI-compatible endpoint). You trade some low-level control for not having to
think about any of the plumbing.
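As a rough illustration of that local API, the sketch below calls Ollama's /api/generate endpoint using only the Python standard library. It assumes the Ollama server is running on its default port (11434) and that the llama3 model has already been pulled; the function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, model="llama3", base_url="http://localhost:11434"):
    """Build (but do not send) a non-streaming request to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3"):
    """Send the request and return the generated text.

    Requires a running Ollama server; with "stream": False the reply is a
    single JSON object whose "response" field holds the completion.
    """
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs the server to be up):
#   print(generate("Why is the sky blue?"))
```

There is no model path, quantization, or chat template anywhere in the request: the model name is resolved through the registry, and the Modelfile supplies the rest.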
Rule of thumb: reach for Ollama when you want a model running in one command and a clean local API; drop down to llama.cpp when you need bleeding-edge features, custom build flags, or to squeeze out maximum performance. (LM Studio is a third option — a GUI app over the same llama.cpp core.)