Choosing your dataset
Depending on your selected algorithm or scheme, you may also require a dataset. Many quantization algorithms, such as GPTQ, AWQ, SmoothQuant, and AutoRound, require a calibration dataset to analyze activation patterns and optimize weight transformations. This dataset helps the algorithm identify which weights and activations are most critical to preserve during compression. LLM Compressor also supports many datasets from the Hugging Face Datasets library, making it easy to find a suitable dataset for calibration.
Algorithms requiring datasets
- AWQ
- GPTQ
- AutoRound
- SmoothQuant
Info
RTN (Round-to-Nearest) quantization is data-free and can compress models without any calibration dataset. However, calibration-based methods typically achieve better accuracy recovery, especially at lower bit-widths.
Schemes requiring datasets
Quantization schemes where activations are quantized non-dynamically (i.e the scales to quantize the activations are not determined during inference time) will also require a dataset.
These Include: - NVFP4: Data is required to calibrate the activation scales, allowing quantization of the activatins to FP4 during inference - Static-Per Tensor Activation Quantization: Commonly used with FP8 and INT8 weight quantization, if you are targeting a static-per tensor scheme for activation quantization, data is required to calibrate a single scale which enables quantization of the activations to 8 bits during inference
Key considerations
When selecting a calibration dataset, consider the following factors:
Domain alignment
The calibration dataset should be representative of your target use case. For general-purpose language models, common choices include:
- General text datasets: WikiText or C4 for broad language understanding
- Instruction-tuning data: UltraChat for instruction-following models
- Domain-specific data: E.g. code datasets for coding models
Some popular datasets include:
| Dataset | Best for | Description |
|---|---|---|
ultrachat-200k | Instruction-following models | High-quality conversational data for chat and assistant models |
open-platypus | General instruction models | Diverse instruction-following examples |
wikitext-2-raw-v1 | General language models | Clean Wikipedia text for broad language understanding |
c4 | General pre-training | Large-scale web text for general-purpose models |
Dataset size
Most calibration algorithms work well with relatively small datasets:
- Typical range: 128-512 samples is sufficient for most models
- Trade-off: More samples improve representation but increase compression time
- Recommendation: Start with 128-256 samples; increase if accuracy recovery is insufficient
LLM Compressor makes it easy to use popular calibration datasets from Hugging Face by providing access to processed datasets. Users only need to pass in a string with the dataset name or a supported dataset. Begin with a standard dataset like ultrachat-200k, then iterate if needed. Alternatively, you can use your own custom dataset and pass in the dataset object into LLM Compressor for calibration as well.
LLM Compressor provides easy access to popular calibration datasets from Hugging Face. For supported datasets, simply pass the dataset name as a string (e.g., ultrachat-200k). Start with a standard dataset and iterate as needed. You can also use custom datasets by passing a dataset object directly to LLM Compressor for calibration.