Sparse Attention
A variant of the attention mechanism in which each token attends only to a subset of the other tokens rather than all of them, reducing computational cost from O(n²) in sequence length n to O(n√n) or O(n log n), depending on the sparsity pattern.
Why It Matters
Sparse attention enables processing much longer sequences than standard attention, making it feasible to handle entire books or codebases.
Example
Instead of every token attending to every other token in a 100K-token document, each token attends to its nearby tokens plus a small set of globally shared tokens that every position can see.
Think of it like...
Like a large team meeting where, instead of everyone talking with everyone else, each person communicates with their immediate teammates plus a few key liaisons. Far fewer conversations, yet the important information still reaches everyone.
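The local-plus-global pattern described above can be sketched as a boolean mask over token pairs. This is a minimal illustration in NumPy, not any particular library's implementation; the function name and parameters (`window`, `n_global`) are chosen here for clarity.

```python
import numpy as np

def sparse_attention_mask(seq_len, window, n_global):
    """Boolean mask: True where attention is allowed.

    Each token attends to tokens within `window` positions of itself
    (local attention), plus the first `n_global` tokens, which every
    token can read and which can read every token (global attention).
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)
    # Local band: positions within `window` of each other
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = True
    # Global tokens: full rows and columns for the first n_global positions
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = sparse_attention_mask(seq_len=16, window=2, n_global=2)
# The number of allowed pairs grows roughly linearly with seq_len,
# rather than quadratically as in dense attention.
print(mask.sum(), "of", mask.size, "pairs attended")
```

With a fixed window and a fixed number of global tokens, each row of the mask has O(1) allowed entries, which is where the sub-quadratic cost comes from.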
Related Terms
Attention Mechanism
A component in neural networks that allows the model to focus on the most relevant parts of the input when producing each part of the output. It assigns different weights to different input elements based on their relevance.
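The weighting described above is usually computed as scaled dot-product attention. A minimal NumPy sketch (single head, no masking, no learned projections):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: weight each value by the
    dot-product similarity of its key to the query."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # relevance of each key to each query
    # Softmax over keys (subtract the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V             # relevance-weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query
```

Each output row is a convex combination of the value vectors, with weights that sum to 1 across the input positions.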
Flash Attention
An optimized implementation of the attention mechanism that reduces memory usage and increases speed by tiling the computation and avoiding materializing the full attention matrix in memory.
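The key idea, processing keys block by block while rescaling partial softmax sums so the full n×n score matrix never exists in memory, can be sketched with the online-softmax trick that FlashAttention builds on. This is a simplified single-head NumPy illustration, not the actual fused-kernel implementation:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Compute softmax(Q K^T / sqrt(d)) @ V one key block at a time,
    never materializing the full n x n attention matrix."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)   # running per-row maximum of the scores
    l = np.zeros(n)           # running softmax denominator per row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)          # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)          # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])     # unnormalized block weights
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]               # normalize at the very end
```

The result matches dense attention exactly; only the order of computation changes, trading the O(n²) score matrix for O(n · block) working memory per step.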
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.