Attention Head
A single attention computation within multi-head attention. Each head learns its own query, key, and value projections and independently computes attention scores, allowing different heads to specialize in different types of relationships.
Why It Matters
Understanding attention heads helps interpret how transformers process information and enables techniques like head pruning for model optimization.
Example
In a 12-head attention layer, one head might specialize in tracking subject-verb relationships, another in coreference resolution, and another in positional patterns.
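The computation each head performs can be sketched with scaled dot-product attention. This is a minimal NumPy illustration, not any particular library's API; the projection matrices here are random stand-ins for learned weights, and the shapes (`seq_len=4`, `d_model=8`, `d_head=2`) are chosen only for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    """One attention head: project the input, score, weight, and mix values."""
    Q = x @ W_q                           # queries (seq_len, d_head)
    K = x @ W_k                           # keys    (seq_len, d_head)
    V = x @ W_v                           # values  (seq_len, d_head)
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)    # similarity of every pair (seq_len, seq_len)
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V                    # weighted mix of values (seq_len, d_head)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))   # toy input: 4 token embeddings
out = attention_head(x,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)))
print(out.shape)  # (4, 2)
```

Because each head has its own `W_q`, `W_k`, `W_v`, the scores it produces differ from every other head's, which is what lets one head track subject-verb links while another tracks position.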
Think of it like...
Like members of a jury each paying attention to different aspects of a case — one focuses on evidence, another on testimony, another on motive — then they combine insights.
Related Terms
Multi-Head Attention
An extension of attention where multiple attention mechanisms (heads) run in parallel, each learning to focus on different types of relationships in the data. The outputs are then combined.
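Combining heads is then just running each head's projection in parallel, concatenating the results, and applying one output projection. A minimal sketch, again with random matrices standing in for learned weights and an assumed split of `d_model` evenly across heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads, W_o):
    """Run each head independently, concatenate their outputs, project back."""
    outs = []
    for W_q, W_k, W_v in heads:           # each head has its own projections
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        outs.append(w @ V)                # (seq_len, d_head) per head
    # Concatenate along the feature axis, then mix heads with W_o.
    return np.concatenate(outs, axis=-1) @ W_o   # (seq_len, d_model)

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads               # common convention: split d_model evenly
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
y = multi_head_attention(rng.normal(size=(seq_len, d_model)), heads, W_o)
print(y.shape)  # (4, 8)
```

The output projection `W_o` is what "combines" the heads: it lets the model mix the specialized views back into a single representation of the original width.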
Self-Attention
A mechanism where each element in a sequence attends to all other elements to compute a representation, determining how much focus to place on each part of the input. It is the core innovation of the transformer.
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.