Transformer

The self-attention architecture behind modern LLMs.

Transformer is the neural-network design behind virtually all modern large language models (and much more). Introduced in the 2017 paper “Attention Is All You Need,” its key idea is self-attention : for each word in a sequence, the model weighs how much every other word should influence it, so it can connect distant words directly instead of marching through them one at a time like older “recurrent” networks did. Because those weighings happen in parallel across the whole sequence, transformers train efficiently on GPUs and scale up to enormous models and datasets, the practical reason they won out. They come in three flavors: encoder-only (e.g. BERT, good at understanding and classifying text), decoder-only (e.g. the GPT family, good at generating text), and encoder-decoder (e.g. T5, good at tasks like translation that turn one sequence into another). It all runs on tensor operations stacked into repeated attention and feed-forward layers.