llmcompressor.modifiers.autoround.base
Classes:

- AutoRoundModifier – Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf.
AutoRoundModifier
Bases: Modifier, QuantizationMixin
Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf. This modifier uses a signed gradient descent (SignSGD) optimizer and a block-wise loss to optimize rounding values and weight clipping in a small number of steps.
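As a rough illustration of the SignSGD update rule (a toy pure-Python sketch, not the optimizer used by llmcompressor): each parameter moves by a fixed step in the direction opposite its gradient's sign, regardless of the gradient's magnitude.

```python
import math

def signsgd_step(params, grads, lr=0.01):
    """One SignSGD update: each parameter moves by a fixed step `lr`
    in the direction opposite the sign of its gradient."""
    return [p - lr * math.copysign(1.0, g) if g != 0 else p
            for p, g in zip(params, grads)]

# Toy example: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x = [1.0]
for _ in range(100):
    x = signsgd_step(x, [2.0 * x[0]], lr=0.01)
```

Because the step size is fixed, SignSGD converges to within roughly `lr` of the optimum in a bounded number of steps, which is what makes the few-step tuning above practical.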
Sample yaml:

```yaml
test_stage:
  modifiers:
    AutoRoundModifier:
      iters: 200
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
```
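The weights scheme above (4-bit symmetric integers with one scale shared per group of 128 values) can be illustrated with a minimal pure-Python sketch. The helper below is illustrative only, not part of the llmcompressor API, and uses a small group size for readability:

```python
def quantize_symmetric_group(weights, num_bits=4, group_size=4):
    """Sketch of symmetric, group-wise integer quantization.

    Each contiguous group of `group_size` weights shares one scale,
    chosen so the group's largest |w| maps to the largest positive
    integer level. Values are rounded to the nearest level.
    """
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 7 for int4
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid zero scale
        scales.append(scale)
        q.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return q, scales

w = [0.1, -0.7, 0.35, 0.02, 1.4, -1.4, 0.0, 0.7]
q, scales = quantize_symmetric_group(w, num_bits=4, group_size=4)
# Dequantize: each value is its integer level times its group's scale.
deq = [qi * scales[i // 4] for i, qi in enumerate(q)]
```

Note that round-to-nearest bounds the per-weight error by half a scale; AutoRound's contribution is learning which direction to round borderline values so the layer's *output* error, not the per-weight error, is minimized.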
Lifecycle:

- on_initialize
    - apply config to model
- on_start
    - add input capture hooks to decoding layers
- on_sequential_epoch_end
    - apply_autoround
    - post_autoround_cleanup
- on_finalize
    - remove_hooks()
    - model.apply(freeze_module_quantization)
Parameters:

- config_groups – dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized.
- targets – list of layer names to quantize if a scheme is provided. Defaults to Linear layers.
- ignore – optional list of module class names or submodule names not to quantize, even if they match a target in config_groups. Defaults to an empty list.
- scheme – a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level.
- sequential_targets – class names of decoding layers to tune sequentially. If None, targets are inferred via get_no_split_params() to respect no-split constraints for large models. Defaults to None.
- iters – number of tuning iterations per block (decoding layer). Higher values typically improve accuracy at the cost of longer tuning time. Defaults to 200.
- enable_torch_compile – whether to enable torch.compile to accelerate the tuning loop. Disable if your environment or model encounters compilation issues. Defaults to True.
- batch_size – calibration/tuning batch size used by AutoRound when optimizing rounding/clipping parameters. Larger values can improve stability but require more memory. Defaults to 8.
- device_ids – optional device map string for layer dispatch during tuning. Examples: "0,1" for cuda:0 and cuda:1, or "auto" to use all available GPUs. When None, no dispatching occurs and the model remains on its current device. Defaults to None.
Methods:

- apply_autoround – Applies AutoRound quantization tuning on the current decoding layer.
- on_end – Finish calibrating by removing observers and calibration hooks.
- on_finalize – Disable the quantization observers used by the AutoRound algorithm.
- on_initialize – Initialize the model state for quantization and calibration.
- start_calibration – Register activation calibration hooks and enable quantization as we calibrate.
apply_autoround
Applies AutoRound quantization tuning on the current decoding layer.
The tuning logic is as follows:

```
for iter in range(iters):
    quant_output = forward(layer, cached_inputs)
    loss = mse_loss(quant_output, original_output)
    loss.backward()
    optimizer.step()
    if loss < best_loss:
        best_params = update_params(layer)
```
For more details, please refer to the AutoRound repository: https://github.com/intel/auto-round/
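The loop above can be approximated with a runnable toy in pure Python. This sketch learns a single shared rounding offset with a straight-through gradient estimate and a SignSGD step; the actual implementation learns per-weight rounding values with PyTorch autograd (see the AutoRound repository), so everything here is illustrative only:

```python
def tune_rounding(weights, inputs, scale=0.25, iters=200, lr=0.01):
    """Toy AutoRound-style loop: learn one shared rounding offset `v`
    with SignSGD so the quantized "layer" (elementwise products here)
    matches the full-precision outputs, keeping the best result seen."""
    v, best_v, best_loss = 0.0, 0.0, float("inf")
    n = len(weights)
    for _ in range(iters):
        loss = grad = 0.0
        for w, x in zip(weights, inputs):
            qw = scale * round(w / scale + v)  # hard rounding in the forward pass
            err = qw * x - w * x               # quant output vs original output
            loss += err * err / n
            # Straight-through estimate: treat round() as identity in backward.
            grad += 2.0 * err * scale * x / n
        if loss < best_loss:
            best_loss, best_v = loss, v
        # SignSGD step; keep v within one rounding interval.
        v -= lr if grad > 0 else (-lr if grad < 0 else 0.0)
        v = max(-0.5, min(0.5, v))
    return best_v, best_loss

best_v, best_loss = tune_rounding([0.30, -0.42, 0.11, 0.27],
                                  [1.0, 0.5, -1.2, 0.8])
```

Because the best parameters are snapshotted whenever the loss improves, the tuned result is never worse than plain round-to-nearest (the `v = 0` starting point), mirroring the `best_params` bookkeeping in the pseudocode above.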
Source code in llmcompressor/modifiers/autoround/base.py
on_end
Finish calibrating by removing observers and calibration hooks
Source code in llmcompressor/modifiers/autoround/base.py
on_finalize
Disable the quantization observers used by the AutoRound algorithm.
Parameters:

- state (State) – session state storing input model and calibration data
Source code in llmcompressor/modifiers/autoround/base.py
on_initialize
Initialize the model state for quantization and calibration.
Parameters:

- state (State) – session state storing input model and calibration data
Source code in llmcompressor/modifiers/autoround/base.py
start_calibration
Register activation calibration hooks and enable quantization as we calibrate
Parameters:

- model (Module) – model to prepare for calibration
Source code in llmcompressor/modifiers/autoround/base.py
suspend_offloading
Temporarily suspend offloading, allowing AutoRound to take over device movement.