LoRA

Low-rank adapters; parameter-efficient fine-tuning.

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method: instead of updating all of a model’s weights — billions of parameters, expensive in compute and memory — you freeze the original weights and train a small pair of low-rank matrices that are added alongside them. The trick rests on the observation that the change a fine-tune makes to a big weight matrix is approximately low-rank, so it can be factored as A · B where A and B are skinny (a chosen rank r, often 8–64) and together hold a tiny fraction of the original parameter count. At inference the product is added to (or merged back into) the frozen weights, so there’s no extra latency once merged. The payoffs: you can fine-tune a large model on a single consumer GPU, and the resulting LoRA adapter is a small file (megabytes, not gigabytes) that you swap in and out on top of one base model. QLoRA goes further by quantizing the frozen base to 4-bit (see GGUF for the related quantization idea) while training the adapter in higher precision, cutting memory more still. LoRA is most commonly applied to the attention projection matrices of a transformer.