Speculative Decoding
A technique that uses a small, fast model to draft multiple tokens ahead, then has the large model verify all of them in a single parallel forward pass. It speeds up inference without changing output quality, because any draft token the large model would not have produced is rejected and replaced.
Why It Matters
Speculative decoding can speed up LLM inference by 2-3x with no quality loss — one of the most impactful serving optimizations available.
Example
A 1B-parameter draft model quickly generates 10 candidate tokens, then the 70B main model verifies all 10 in one forward pass, which is much faster than having the 70B model generate those 10 tokens one at a time.
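The draft-then-verify loop above can be sketched in a few lines. This is a toy illustration under assumptions: `draft_model` and `target_model` are hypothetical stand-in functions over integer sequences, not a real library API, and verification uses simple greedy matching rather than the full rejection-sampling rule used for sampled decoding. In a real system the k verifications run as one batched forward pass of the large model; here a plain loop shows the accept/reject logic.

```python
def draft_model(tokens):
    # Stand-in for the small, fast model: predicts the next integer in a
    # counting sequence, but errs every 4th token to simulate an
    # imperfect draft. (Hypothetical; not a real model.)
    nxt = tokens[-1] + 1
    return nxt + 1 if nxt % 4 == 0 else nxt

def target_model(tokens):
    # Stand-in for the large model: always predicts the true next integer.
    return tokens[-1] + 1

def speculative_step(tokens, k=10):
    """Draft k tokens cheaply, then verify them against the target model.

    Returns the tokens accepted this step. Output is identical to
    running the target model alone; the speedup comes from verifying
    many draft tokens per large-model pass.
    """
    # Phase 1: the draft model autoregressively proposes k tokens.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # Phase 2: the target model checks each draft token in order.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_model(ctx)   # what the big model would emit
        if t == expected:
            accepted.append(t)         # match: the token was "free"
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: take the target's
            break                      # token and discard the rest
    return accepted

seq = [0]
while len(seq) < 12:
    seq.extend(speculative_step(seq, k=10))
print(seq[:12])  # identical to what the target model alone would produce
```

Note that even though this draft model is wrong every 4th token, each step still accepts several tokens per large-model "pass", which is where the 2-3x speedup comes from in practice.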
Think of it like...
Like a junior associate drafting a contract for a senior partner to review — the senior only needs to check and approve rather than write from scratch.
Related Terms
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
Model Serving
The infrastructure and process of deploying trained ML models to production where they can receive requests and return predictions in real time. It includes scaling, load balancing, and version management.