Quantization
The process of reducing the numerical precision of a model's weights (e.g., from 32-bit floating point to 8-bit or 4-bit integers), making the model smaller and faster in exchange for a small trade-off in accuracy.
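The core idea can be sketched in a few lines. Below is a minimal, illustrative example of per-tensor symmetric int8 quantization in NumPy; production toolkits (e.g., GPTQ, AWQ, bitsandbytes) use more sophisticated schemes such as per-channel or block-wise scales:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 with one shared scale (symmetric)."""
    scale = np.max(np.abs(w)) / 127.0          # largest weight maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # toy "weight tensor"
q, scale = quantize_int8(w)

print(w.nbytes, q.nbytes)                      # 4000 vs 1000 bytes: 4x smaller
err = np.max(np.abs(dequantize(q, scale) - w)) # rounding error is bounded
```

The storage shrinks by 4x (float32 to int8), and the worst-case rounding error per weight is at most half the scale, which is why accuracy loss stays small.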
Why It Matters
Quantization can reduce model size by 4-8x (e.g., FP32 to INT8 or INT4) and speed up inference, making it possible to run large models on edge devices, phones, and consumer hardware.
Example
Converting a 70B parameter model from FP16 (140GB) to INT4 (35GB) so it can run on a single GPU, with only minimal accuracy loss.
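The memory figures in the example follow directly from bytes per weight (weights only, ignoring activations and KV cache):

```python
# Back-of-envelope memory math for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # FP16 = 2 bytes/weight  -> 140 GB
int4_gb = params * 0.5 / 1e9  # INT4 = 0.5 bytes/weight -> 35 GB

print(fp16_gb, int4_gb)
```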
Think of it like...
Like converting a high-resolution photo to a smaller file — you lose some fine detail but the image is still perfectly usable and takes much less space.
Related Terms
QLoRA
Quantized Low-Rank Adaptation — combines LoRA with quantization to further reduce memory requirements for fine-tuning. It quantizes the base model to 4-bit precision while training LoRA adapters in higher precision.
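A toy NumPy sketch of the QLoRA idea, under simplifying assumptions: the frozen base weight is stored in crude 4-bit form (real QLoRA uses the NF4 data type with block-wise scales), while the small LoRA factors A and B stay in float and are the only trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                         # hidden size and LoRA rank (r << d)

# Frozen base weight, quantized to the symmetric 4-bit range [-7, 7].
W = rng.normal(size=(d, d)).astype(np.float32)
scale = np.max(np.abs(W)) / 7.0
Wq = np.clip(np.round(W / scale), -7, 7).astype(np.int8)  # conceptually 4-bit

# Trainable LoRA adapters kept in full precision. B starts at zero so the
# adapted model initially matches the quantized base model.
A = (rng.normal(size=(r, d)) * 0.01).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def forward(x: np.ndarray) -> np.ndarray:
    base = (Wq.astype(np.float32) * scale) @ x  # dequantize on the fly
    return base + B @ (A @ x)                   # add the low-rank update

x = rng.normal(size=d).astype(np.float32)
y = forward(x)
```

Only A and B (2 * r * d values here) receive gradients during fine-tuning, which is why memory requirements drop so sharply compared with updating the full d * d base weight.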
Pruning
A model compression technique that removes unnecessary or redundant weights, neurons, or layers from a trained neural network. Like pruning a plant, it removes parts that are not contributing to overall health.
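As an illustration, unstructured magnitude pruning (one common variant) simply zeroes out the smallest-magnitude weights; real pipelines usually fine-tune afterwards to recover accuracy:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(w.size * sparsity)                    # number of weights to drop
    threshold = np.sort(np.abs(w), axis=None)[k]  # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100)).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.5)

print(np.mean(pruned == 0))  # about half the weights are now zero
```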
Edge Inference
Running AI models directly on local devices (phones, IoT sensors, cameras) rather than sending data to the cloud. This reduces latency, preserves privacy, and works without internet connectivity.