Artificial Intelligence

Flash Attention

An optimized implementation of the attention mechanism that reduces memory usage and increases speed by tiling the computation into blocks, so the full n×n attention matrix is never materialized in memory.
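The tiling idea can be sketched in plain NumPy: process the keys and values in blocks, keeping only a running softmax maximum and normalizer per query row instead of the full score matrix. This is an illustrative sketch of the online-softmax recurrence, not the real fused GPU kernel (which also tiles the queries and avoids writing intermediates to slow memory); the function names and block size are chosen for the example.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Reference implementation: materializes the full n x n score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=64):
    # Flash-Attention-style loop: visit K/V one block at a time, keeping
    # only running softmax statistics (row max m, normalizer l) so no
    # n x n matrix is ever stored -- memory is O(n * block), not O(n^2).
    n, d = Q.shape
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                  # only an n x block tile
        m_new = np.maximum(m, S.max(axis=-1))
        correction = np.exp(m - m_new)        # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=-1)
        O = O * correction[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because the running statistics exactly rescale earlier partial sums, the tiled loop returns the same result as the reference version up to floating-point error.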

Why It Matters

Flash Attention made long-context models practical by reducing attention's memory cost from O(n²) to O(n) in sequence length and speeding up training by 2-4x. It is now standard in LLM training.

Example

Training a model with a 128K context window would require roughly 1TB of memory for the attention scores with standard attention, but only about 16GB with Flash Attention.
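A back-of-envelope check shows where a number on that scale comes from. The head count and precision below are assumptions for illustration, not figures from the source:

```python
# Memory to hold the full attention score matrix at 128K context,
# assuming fp16 scores and 32 heads in a single layer (both assumed).
n = 128 * 1024        # sequence length in tokens
bytes_per_score = 2   # fp16
heads = 32            # assumed head count
full_matrix = n * n * bytes_per_score * heads  # one layer's score matrices
print(full_matrix / 1024**4, "TiB")            # 1.0 TiB for one layer alone
```

With tiling, only block-sized score tiles plus O(n) running statistics are ever held, which is why the working set drops by orders of magnitude.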

Think of it like...

Like a smart chef who prepares ingredients in small batches instead of laying everything out at once — same final dish, but uses far less counter space.

Related Terms