Artificial Intelligence

Model Serving

The infrastructure and process of deploying trained ML models to production, where they can receive requests and return predictions in real time. It also covers scaling, load balancing, and version management.

Why It Matters

Model serving determines the user experience — latency, reliability, and cost. A perfectly trained model is worthless if it cannot be served efficiently.

Example

Using a platform like AWS SageMaker or a framework like vLLM to host an LLM that handles thousands of concurrent user requests with sub-second response times.
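To make the idea concrete, here is a minimal sketch of a model-serving endpoint using only Python's standard library. The `predict` function is a hypothetical stand-in for a real model; production systems (SageMaker, vLLM, etc.) add batching, autoscaling, and GPU inference on top of this same request/response pattern.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model. A real server would load
# model weights once at startup, not reimplement logic per request.
def predict(features):
    return {"prediction": sum(features)}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read the JSON request body and run inference.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = predict(payload["features"])
        # Return the prediction as JSON.
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def serve(port=8080):
    # Run the server on a background thread so the caller isn't blocked.
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client would POST `{"features": [1, 2, 3]}` to `/predict` and get back a JSON prediction. Everything a serving platform adds — load balancing across replicas, request batching, model versioning — exists to run this loop at scale.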

Think of it like...

Like running a restaurant kitchen — you need to efficiently take orders, prepare dishes (run inference), and serve them quickly without the kitchen backing up.

Related Terms