Artificial Intelligence

Sparse Attention

A variant of attention where each token only attends to a subset of other tokens rather than all of them, reducing computational cost from O(n²) to O(n√n) or O(n log n).
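The scale of that saving is easy to see with back-of-envelope arithmetic; the sketch below counts attention pairs for a 100K-token sequence (the exponents match the complexity classes above, though real implementations also carry constant factors):

```python
import math

n = 100_000  # sequence length in tokens

dense = n * n                        # full attention: every token pair
sqrt_sparse = int(n * math.sqrt(n))  # O(n*sqrt(n)) patterns (e.g. strided)
log_sparse = int(n * math.log2(n))   # O(n log n) patterns

# dense attention computes roughly 316x more pairs than an
# O(n*sqrt(n)) pattern at this sequence length.
ratio = dense // sqrt_sparse
```

At 100K tokens, full attention requires 10 billion pairwise scores, while the sparse variants need tens of millions.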

Why It Matters

Sparse attention enables processing much longer sequences than standard attention, making it feasible to handle entire books or codebases.

Example

Instead of every token attending to every other token in a 100K-token document, each token attends only to nearby tokens plus a small set of globally shared tokens.
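This local-plus-global pattern can be expressed as a boolean attention mask. The sketch below is a minimal illustration (parameter names like `window` and `n_global` are ours, not from any particular library); it resembles the masks used by models such as Longformer and BigBird:

```python
import numpy as np

def sparse_attention_mask(n, window=2, n_global=2):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    Combines a local sliding window with a few globally shared tokens.
    Illustrative sketch, not a specific library's API.
    """
    mask = np.zeros((n, n), dtype=bool)
    idx = np.arange(n)
    # Local band: each token attends to neighbors within `window` positions.
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = True
    # Global tokens: the first n_global tokens attend everywhere
    # and are attended to by every other token.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = sparse_attention_mask(n=16, window=2, n_global=2)
density = mask.mean()  # fraction of (i, j) pairs actually computed
```

As `n` grows, the density of this mask shrinks toward zero, because each row allows only O(window + n_global) entries instead of n.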

Think of it like...

Like a team meeting where instead of everyone talking to everyone, each person communicates with their immediate team plus a few key liaisons — much more efficient.

Related Terms