Definition
The attention mechanism allows a model to assign varying importance to different input tokens when computing each output token. Multi-head self-attention runs this process in parallel across multiple learned subspaces, capturing diverse contextual relationships.
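The "multiple learned subspaces" idea can be made concrete with array shapes: a model-width vector is projected and then split across heads, each head seeing its own slice. The sizes below (model width 16, 4 heads) and the random projection matrix are illustrative stand-ins, not values from any particular model.

```python
import numpy as np

# Hypothetical sizes: model width 16 split across 4 heads of width 4 each.
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))       # token embeddings

# In a trained model W_q is learned; here it is a random stand-in.
W_q = rng.normal(size=(d_model, d_model))
q = x @ W_q                                   # (seq_len, d_model)

# Split the projection into per-head subspaces: one Query matrix per head.
q_heads = q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
print(q_heads.shape)  # (4, 6, 4)
```

The same split is applied to Keys and Values, and each head runs attention independently before the outputs are concatenated back to model width.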
Why It Matters
Attention is what allows transformers to understand long-range dependencies in language — connecting a pronoun at the end of a paragraph to its referent at the beginning.
How It Works
Each token produces three vectors: Query, Key, and Value. Attention scores are computed as dot products of Queries against all Keys, scaled by the square root of the key dimension, and softmaxed into weights. The output is a weighted sum of the Value vectors, telling the model which other tokens to “pay attention to.”
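The three steps above can be sketched in a few lines of NumPy. This is a minimal single-sequence, single-head version; the sequence length and dimensions are arbitrary, and a real implementation would add batching, masking, and learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: Q, K have shape (seq_len, d_k), V (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Query-Key dot products, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over Keys per token
    return weights @ V                              # weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per input token
```

Each row of `weights` sums to 1, so every output token is a convex combination of the Value vectors.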
Applications
Core to all transformer-based models. Also used in cross-attention layers for multimodal tasks where a model must attend to image patches when generating text.
Limitations
Compute and memory costs grow quadratically with sequence length, since every token attends to every other. Very long documents therefore require architectural modifications (sliding window attention, linear attention variants).
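As an illustration of one such modification, a sliding-window mask restricts each token to a fixed-size neighborhood, so the number of score entries grows linearly with sequence length rather than quadratically. This is a toy sketch of the masking idea, not any specific model's implementation.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: token i may attend only to tokens within `window` positions."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, 1)
print(mask.astype(int))
# Each row has at most 2*window + 1 ones, so a masked attention pass
# touches O(seq_len * window) entries instead of O(seq_len**2).
```

Linear-attention variants take a different route, reordering the computation so the full score matrix is never materialized.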
Related Terms
Transformer Architecture · Context Window · Large Language Model