Quantization
The process of reducing the numerical precision of a model's weights (e.g., from 32-bit floating point to 8-bit or 4-bit integers), making the model smaller and faster in exchange for a small trade-off in accuracy.
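The core idea can be sketched in a few lines. Below is a minimal, illustrative example of per-tensor symmetric int8 quantization in NumPy; production toolkits (e.g., GPTQ, AWQ, bitsandbytes) use more sophisticated schemes such as per-channel or block-wise scales:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with one shared scale (symmetric)."""
    scale = np.max(np.abs(w)) / 127.0          # largest weight maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # toy "weight tensor"
q, scale = quantize_int8(w)

print(w.nbytes, q.nbytes)                      # 4000 vs 1000 bytes: 4x smaller
err = np.max(np.abs(dequantize(q, scale) - w)) # rounding error is bounded
```

The storage shrinks by 4x (float32 to int8), and the worst-case rounding error per weight is at most half the scale, which is why accuracy loss stays small.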
Why It Matters
Quantization can reduce model size by 4-8x (e.g., FP32 to INT8 or INT4) and speed up inference, making it possible to run large models on edge devices, phones, and consumer hardware.
Example
Converting a 70B parameter model from FP16 (140GB) to INT4 (35GB) so it can run on a single GPU, with only minimal accuracy loss.
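The memory figures in the example follow directly from bytes per weight (weights only, ignoring activations and KV cache):

```python
# Back-of-envelope memory math for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # FP16 = 2 bytes/weight  -> 140 GB
int4_gb = params * 0.5 / 1e9  # INT4 = 0.5 bytes/weight -> 35 GB

print(fp16_gb, int4_gb)
```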
Think of it like...
Like converting a high-resolution photo to a smaller file — you lose some fine detail but the image is still perfectly usable and takes much less space.
Related Terms
QLoRA
Quantized Low-Rank Adaptation — combines LoRA with quantization to further reduce memory requirements for fine-tuning. It quantizes the base model to 4-bit precision while training LoRA adapters in higher precision.
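A toy NumPy sketch of the QLoRA idea, under simplifying assumptions: the frozen base weight is stored in crude 4-bit form (real QLoRA uses the NF4 data type with block-wise scales), while the small LoRA factors A and B stay in float and are the only trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                         # hidden size and LoRA rank (r << d)

# Frozen base weight, quantized to the symmetric 4-bit range [-7, 7].
W = rng.normal(size=(d, d)).astype(np.float32)
scale = np.max(np.abs(W)) / 7.0
Wq = np.clip(np.round(W / scale), -7, 7).astype(np.int8)  # conceptually 4-bit

# Trainable LoRA adapters kept in full precision. B starts at zero so the
# adapted model initially matches the quantized base model.
A = (rng.normal(size=(r, d)) * 0.01).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    base = (Wq.astype(np.float32) * scale) @ x  # dequantize on the fly
    return base + B @ (A @ x)                   # add the low-rank update

x = rng.normal(size=d).astype(np.float32)
y = forward(x)
```

Only A and B (2 * r * d values here) receive gradients during fine-tuning, which is why memory requirements drop so sharply compared with updating the full d * d base weight.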
Pruning
A model compression technique that removes unnecessary or redundant weights, neurons, or layers from a trained neural network. Like pruning a plant, it removes parts that are not contributing to overall health.
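As an illustration, unstructured magnitude pruning (one common variant) simply zeroes out the smallest-magnitude weights; real pipelines usually fine-tune afterwards to recover accuracy:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(w.size * sparsity)                    # number of weights to drop
    threshold = np.sort(np.abs(w), axis=None)[k]  # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100)).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.5)

print(np.mean(pruned == 0))  # about half the weights are now zero
```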
Edge Inference
Running AI models directly on local devices (phones, IoT sensors, cameras) rather than sending data to the cloud. This reduces latency, preserves privacy, and works without internet connectivity.