QLoRA
Quantized Low-Rank Adaptation — combines LoRA with quantization to further reduce the memory required for fine-tuning. The frozen base model is quantized to 4-bit precision (typically 4-bit NormalFloat, NF4), while small LoRA adapters are trained in higher precision; gradients flow through the quantized weights but only the adapters are updated.
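The idea can be sketched in plain numpy — this is a toy illustration, not the real QLoRA implementation (which uses NF4 with blockwise quantization constants); the sizes, rank, and `alpha` below are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8       # toy layer size and LoRA rank

# Frozen base weight, quantized to 4-bit via absmax quantization.
W = rng.normal(size=(d_in, d_out)).astype(np.float32)
scale = np.abs(W).max() / 7                                 # signed 4-bit range: -8..7
W_q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)   # codes fit in 4 bits

def dequantize(W_q, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return W_q.astype(np.float32) * scale

# Trainable LoRA adapters, kept in full precision.
alpha = 16
A = rng.normal(scale=0.01, size=(d_in, r)).astype(np.float32)
B = np.zeros((r, d_out), dtype=np.float32)   # B starts at zero, so the
                                             # adapter is initially a no-op

def forward(x):
    # Base path uses the dequantized 4-bit weights (frozen); during
    # training, gradients would update only A and B.
    return x @ dequantize(W_q, scale) + (alpha / r) * (x @ A @ B)

x = rng.normal(size=(2, d_in)).astype(np.float32)
y = forward(x)
print(y.shape)  # (2, 64)
```

Because `B` is initialized to zero, the adapted layer starts out identical to the quantized base layer — training then gradually learns the low-rank correction.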
Why It Matters
QLoRA enables fine-tuning of massive models on a single consumer GPU, democratizing access to custom LLMs for individuals and small organizations.
Example
Fine-tuning a 65B-parameter model on a single 48 GB GPU: the base model is held in 4-bit precision while LoRA adapters supply the trainable parameters.
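A back-of-the-envelope check shows why the 4-bit step is what makes this fit (weights only — optimizer state, activations, and quantization constants add overhead on top):

```python
# Rough memory for the weights of a 65B-parameter model.
params = 65e9

fp16_gb = params * 2 / 1024**3      # 2 bytes per parameter
int4_gb = params * 0.5 / 1024**3    # 4 bits = 0.5 bytes per parameter

print(f"65B weights in fp16:  {fp16_gb:.0f} GB")   # ~121 GB, far over 48 GB
print(f"65B weights in 4-bit: {int4_gb:.0f} GB")   # ~30 GB, fits on one GPU
```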
Think of it like...
Like compressing a huge reference library into pocket-sized summaries and only keeping full-size versions of the chapters you are actively editing.
Related Terms
LoRA
Low-Rank Adaptation — a parameter-efficient fine-tuning technique that freezes the original model weights and adds small trainable matrices to each layer. It dramatically reduces the compute and memory needed for fine-tuning.
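The parameter savings come from factoring the update into two thin matrices. A quick count, using an illustrative hidden dimension and a typical rank:

```python
# Hypothetical sizes: d is the hidden dimension of one d x d weight
# matrix; r is the LoRA rank (commonly 4-64).
d = 4096
r = 8

full_params = d * d               # trainable params if W were tuned directly
lora_params = d * r + r * d       # A (d x r) plus B (r x d)

print(full_params, lora_params)               # 16777216 65536
print(f"{lora_params / full_params:.2%}")     # adapters ~0.39% of the matrix
```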
Quantization
The process of reducing the precision of a model's numerical weights (e.g., from 32-bit to 8-bit or 4-bit), making the model smaller and faster while accepting a small trade-off in accuracy.
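A minimal sketch of one common scheme, absmax quantization to signed 8-bit integers — real quantizers typically work per block or per channel rather than over the whole tensor:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=1000).astype(np.float32)

# Map [-max|w|, max|w|] onto the signed 8-bit range [-127, 127].
scale = np.abs(w).max() / 127
w_q = np.round(w / scale).astype(np.int8)    # 1 byte each instead of 4
w_hat = w_q.astype(np.float32) * scale       # dequantize for use

max_err = np.abs(w - w_hat).max()
print(f"worst-case error: {max_err:.5f} (bounded by scale/2 = {scale/2:.5f})")
```

The rounding error is bounded by half the scale, which is the "small trade-off in accuracy" mentioned above; fewer bits mean a coarser grid and a larger bound.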
Fine-Tuning
The process of taking a pre-trained model and training it further on a smaller, domain-specific dataset, adjusting its weights to specialize its behavior for a particular task or domain.
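In miniature, fine-tuning is just continued gradient descent from pre-trained weights rather than from a random start — here on a toy linear model with made-up weights and data:

```python
import numpy as np

rng = np.random.default_rng(2)

# "Pre-trained" weights for a one-layer linear model (hypothetical values).
w = np.array([1.0, -0.5])

# Small domain-specific dataset whose true relationship differs slightly.
X = rng.normal(size=(64, 2))
w_target = np.array([1.2, -0.3])
y = X @ w_target

# Fine-tuning: continue gradient descent on the new data.
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)   # mean-squared-error gradient
    w -= lr * grad

print(np.round(w, 2))  # close to [1.2, -0.3]
```

Starting near a good solution is what makes fine-tuning cheap: far fewer steps and far less data than training from scratch.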