Latency
The time delay between sending a request to an AI model and receiving the response. In ML systems, latency includes data preprocessing, model inference, and network transmission time.
Why It Matters
Latency determines user experience — a chatbot with 10-second response times feels broken, while one with 200ms feels instant. It is a critical production metric.
Example
An LLM API call might take 800 ms to return the first token (time-to-first-token, or TTFT) and 3 seconds to generate the complete response. Both numbers matter: TTFT governs how responsive the app feels, while total latency governs how long the user waits for the full answer.
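A minimal sketch of measuring both latencies, using a simulated streaming response in place of a real API (the delays and the `fake_stream` generator are hypothetical stand-ins):

```python
import time

def fake_stream(n_tokens=5, first_delay=0.8, per_token=0.1):
    # Hypothetical stand-in for a streaming LLM API call:
    # one long pause before the first token, then steady token emission.
    time.sleep(first_delay)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token)

start = time.perf_counter()
first_token_at = None
for token in fake_stream():
    if first_token_at is None:
        # Time-to-first-token: elapsed time until the first chunk arrives.
        first_token_at = time.perf_counter() - start
# Total latency: elapsed time until the full response is generated.
total = time.perf_counter() - start

print(f"TTFT: {first_token_at:.2f}s, total: {total:.2f}s")
```

The same two-timestamp pattern works with any real streaming client: record the clock before the request, again at the first chunk, and again after the stream closes.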
Think of it like...
Like the wait time at a restaurant — from when you place your order to when food arrives. Some dishes (complex queries) naturally take longer than others.
Related Terms
Throughput
The number of requests or predictions a model can process in a given time period. High throughput means the system can serve many users simultaneously.
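The relationship between the two metrics can be sketched with hypothetical numbers: at a fixed per-request latency, throughput scales with how many requests the system handles in parallel.

```python
# Hypothetical numbers illustrating throughput vs. latency.
latency_s = 0.2          # each request takes 200 ms to serve
concurrent_workers = 8   # requests handled in parallel

# Each worker completes 1 / 0.2 = 5 requests per second,
# so 8 workers sustain 40 requests per second overall.
throughput_rps = concurrent_workers / latency_s
print(throughput_rps)  # → 40.0
```

Note the trade-off this implies: batching requests together usually raises throughput but can add queueing delay, worsening per-request latency.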
Model Serving
The infrastructure and process of deploying trained ML models to production where they can receive requests and return predictions in real time. It includes scaling, load balancing, and version management.
Inference
The process of using a trained model to make predictions on new, previously unseen data. Inference is what happens when an AI model is deployed and actively serving results to users.
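A toy sketch of what inference means in code: the weights below are hypothetical, as if produced by an earlier training run, and prediction is just applying them to a new input.

```python
# Hypothetical parameters "learned" during a prior training run.
weights = [0.8, -0.3]
bias = 0.1

def predict(features):
    # Inference for a linear classifier: weighted sum plus bias,
    # thresholded into a class label. No learning happens here.
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

# A new, previously unseen data point.
print(predict([1.0, 0.5]))  # score = 0.8 - 0.15 + 0.1 = 0.75 → class 1
```

Real deployed models are far larger, but the shape is the same: fixed parameters, new input, a forward pass, an output.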