Prompt Caching
A technique that stores and reuses the processed form of frequently used prompt prefixes, avoiding redundant computation. It speeds up inference and reduces costs for repeated prompts.
Why It Matters
Prompt caching can reduce API costs by 50-90% for applications with long, shared system prompts — like RAG systems that include the same context repeatedly.
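The savings claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses illustrative numbers only (a made-up per-token price and an assumed 90% discount on cached input tokens, not any provider's real rates) to show how a long shared prefix drives the reduction into that range:

```python
# Back-of-the-envelope savings from prompt caching.
# Prices and discount are illustrative assumptions, not real provider rates.

def request_cost(prompt_tokens, cached_tokens,
                 price_per_token=3e-6, cached_discount=0.1):
    """Cost of one request when `cached_tokens` of the prompt hit the cache."""
    uncached = prompt_tokens - cached_tokens
    return uncached * price_per_token + cached_tokens * price_per_token * cached_discount

# A 2,000-token shared system prompt plus a 100-token user message,
# over 1,000 calls: the first call is a cache miss, the rest are hits.
cold = request_cost(2100, cached_tokens=0)
warm = request_cost(2100, cached_tokens=2000)
savings = 1 - (cold + 999 * warm) / (1000 * cold)
print(f"approximate cost reduction: {savings:.0%}")  # prints ~86%
```

With these assumed numbers the reduction lands around 86%; shorter user messages relative to the shared prefix push it higher, which is why long-system-prompt workloads benefit most.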
Example
A customer support bot with a 2,000-token system prompt: the prefix is cached after the first call, so subsequent calls skip reprocessing those tokens, saving time and money.
Think of it like...
Like keeping your frequently used ingredients pre-chopped in the fridge — you do not re-prepare them every time you cook, saving time on every meal.
Related Terms
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
System Prompt
Hidden instructions provided to an LLM that define its behavior, personality, constraints, and capabilities for a conversation. System prompts set the rules of engagement before the user interacts.
API
Application Programming Interface — a set of rules and protocols that allow different software applications to communicate with each other. In AI, APIs let developers integrate AI capabilities into their applications.