Attention
How tokens weigh each other (query · key · value) — the heart of the transformer.
Attention is the mechanism at the heart of the transformer: it lets a model decide,
for each token, how much every other token should influence it — so meaning can flow
directly between distant words instead of being passed along step by step. Each token is
projected into three vectors: a query, a key, and a value. A token’s query is
compared (dot product) against every token’s key to produce relevance scores; those scores
are scaled, passed through a softmax to become weights that sum to 1, and used to take a
weighted average of the value vectors. That’s scaled dot-product attention, often
written softmax(QKᵀ / √dₖ)·V. In practice models run several of these in parallel —
multi-head attention — so different heads can specialize in different kinds of
relationships (syntax, coreference, position). When the queries, keys, and values all come
from the same sequence it’s self-attention (a token attending to its own context); when
queries come from one sequence and keys/values from another it’s cross-attention (e.g.
a decoder attending to an encoder’s output). The catch is cost: comparing every token to
every other is O(n²) in sequence length, which is why long context windows are expensive
and why so much research targets cheaper attention variants. The Q/K/V projections are just
tensor matrix multiplies — and they’re exactly the weights LoRA usually adapts when
fine-tuning.