KV Cache
Key-Value Cache — a mechanism that stores previously computed attention key and value vectors during autoregressive generation, avoiding redundant computation for tokens already processed.
Why It Matters
KV caching is essential for efficient LLM inference. Without it, generating each new token would require recomputing the attention keys and values for every previous token from scratch, so per-token cost would grow with sequence length.
Example
After generating 100 tokens, the KV cache holds the key-value pairs for all 100 tokens, so generating token 101 only requires computing attention for the new token against the cached entries.
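The example above can be sketched numerically. This is a minimal NumPy illustration (not any library's actual implementation): the projection matrices, dimensions, and random hidden states are all hypothetical, and it shows only that attention over cached keys/values for the first 100 tokens plus the new token's key/value gives the same result as recomputing everything.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # head dimension (illustrative)
seq = rng.normal(size=(101, d))    # hidden states for 101 tokens (hypothetical)

# Hypothetical query/key/value projection weights.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Without caching: recompute keys and values for all 101 tokens.
K_full = seq @ Wk
V_full = seq @ Wv

# With caching: K/V for the first 100 tokens were stored earlier;
# only the new token's key, value, and query are computed now.
K_cache = seq[:100] @ Wk
V_cache = seq[:100] @ Wv
K_cached = np.vstack([K_cache, seq[100] @ Wk])
V_cached = np.vstack([V_cache, seq[100] @ Wv])

q = seq[100] @ Wq
out_full = attend(q, K_full, V_full)
out_cached = attend(q, K_cached, V_cached)
assert np.allclose(out_full, out_cached)  # identical output, far less recomputation
```

The cached path performs one key/value projection per step instead of 101, which is where the savings come from in real inference stacks.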
Think of it like...
Like taking notes during a conversation — instead of replaying the entire conversation to answer each new question, you just reference your notes.
Related Terms
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
Context Window
The maximum amount of text (measured in tokens) that a language model can process in a single interaction. It includes both the input prompt and the generated output. Larger context windows allow models to handle longer documents.
Transformer
A neural network architecture introduced in 2017 that uses self-attention mechanisms to process sequential data in parallel rather than sequentially. Transformers are the foundation of modern LLMs like GPT, Claude, and Gemini.