llmcompressor.entrypoints.oneshot

Oneshot compression entrypoint for post-training model optimization.

Provides the main oneshot compression entry point for applying quantization, pruning, and other compression techniques to pre-trained models without additional training. Supports calibration-based compression with various pipeline configurations for efficient model optimization.

Classes:

- `Oneshot` – Class responsible for carrying out one-shot calibration on a pretrained model.

Functions:

- `oneshot` – Performs oneshot calibration on a model.
Oneshot
Class responsible for carrying out one-shot calibration on a pretrained model.
This class handles the entire lifecycle of one-shot calibration, including preprocessing (model and tokenizer/processor initialization), model optimization (quantization or sparsification), and postprocessing (saving outputs). The instructions for model optimization can be specified by using a recipe.
- Input Keyword Arguments:
  `kwargs` are parsed into:
  - `model_args`: Arguments for loading and configuring a pretrained model (e.g., `AutoModelForCausalLM`).
  - `dataset_args`: Arguments for dataset-related configurations, such as calibration dataloaders.
  - `recipe_args`: Arguments for defining and configuring recipes that specify optimization actions.

  Parsers are defined in `src/llmcompressor/args/`.
- Lifecycle Overview: The oneshot calibration lifecycle consists of three steps:
  1. Preprocessing:
     - Instantiates a pretrained model and tokenizer/processor.
     - Ensures input and output embedding layers are untied if they share tensors.
     - Patches the model to include additional functionality for saving with quantization configurations.
  2. Oneshot Calibration:
     - Optimizes the model using a global `CompressionSession` and applies recipe-defined modifiers (e.g., `GPTQModifier`, `SparseGPTModifier`).
  3. Postprocessing:
     - Saves the model, tokenizer/processor, and configuration to the specified `output_dir`.
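The "untie shared embeddings" preprocessing step above can be illustrated with a toy, torch-free sketch. Plain Python objects stand in for the shared weight tensors; the function name and dict layout are illustrative, not llmcompressor internals:

```python
def untie_embeddings(model: dict) -> dict:
    """If input and output embeddings share one object, give the output its own copy."""
    if model["input_embed"] is model["output_embed"]:
        model["output_embed"] = list(model["input_embed"])
    return model

tied = {"input_embed": [0.1, 0.2, 0.3]}
tied["output_embed"] = tied["input_embed"]          # tied: both keys reference one object
untied = untie_embeddings(tied)
print(untied["input_embed"] is untied["output_embed"])  # False
```

After untying, the two embeddings hold equal values but separate storage, so quantizing one no longer silently alters the other.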
- Usage:
  Methods:
  - `__init__(**kwargs)`: Initializes the Oneshot object by parsing input arguments, performing preprocessing, and setting instance attributes.
  - `__call__(**kwargs)`: Performs the one-shot calibration process by preparing a calibration dataloader, applying recipe modifiers to the model, and executing postprocessing steps.
  - `save()`: Saves the calibrated model and tokenizer/processor to the specified `output_dir`. Supports saving in compressed formats based on model arguments.
  - `apply_recipe_modifiers(calibration_dataloader, **kwargs)`: Applies lifecycle actions (e.g., `initialize`, `finalize`) using modifiers defined in the recipe. Each action is executed via the global `CompressionSession`.
Initializes the Oneshot class with the provided arguments.

Parses the input keyword arguments into `model_args`, `dataset_args`, and `recipe_args`. Performs preprocessing to initialize the model and tokenizer/processor.

Parameters:

- `model_args` – ModelArguments parameters, responsible for controlling model loading and saving logic.
- `dataset_args` – DatasetArguments parameters, responsible for controlling dataset loading, preprocessing, and dataloader creation.
- `recipe_args` – RecipeArguments parameters, responsible for containing recipe-related parameters.
- `output_dir` – Path to save the output model after carrying out oneshot.
- `log_dir` (`str | None`, default: `None`) – Path to save logs during the oneshot run. Nothing is logged to file if None.
Methods:

- `apply_recipe_modifiers` – Applies recipe modifiers to the model during the lifecycle.

Source code in llmcompressor/entrypoints/oneshot.py
apply_recipe_modifiers
apply_recipe_modifiers(
    calibration_dataloader: DataLoader | None,
    recipe_stage: str | None = None,
)
Applies recipe modifiers to the model during the lifecycle.

The modifiers are defined in the recipe and executed via lifecycle actions (`initialize`, `finalize`) through the global `CompressionSession`.
Source code in llmcompressor/entrypoints/oneshot.py
oneshot
oneshot(
    model: str | PreTrainedModel,
    config_name: str | None = None,
    tokenizer: str | PreTrainedTokenizerBase | None = None,
    processor: str | ProcessorMixin | None = None,
    use_auth_token: bool = False,
    precision: str = "auto",
    tie_word_embeddings: bool = True,
    trust_remote_code_model: bool = False,
    save_compressed: bool = True,
    model_revision: str = "main",
    recipe: str | list[str] | None = None,
    recipe_args: list[str] | None = None,
    clear_sparse_session: bool = False,
    stage: str | None = None,
    dataset: str | Dataset | DatasetDict | None = None,
    dataset_config_name: str | None = None,
    dataset_path: str | None = None,
    splits: str | list[str] | dict[str, str] | None = None,
    batch_size: int = 1,
    data_collator: str | Callable = "truncation",
    num_calibration_samples: int = 512,
    shuffle_calibration_samples: bool = True,
    max_seq_length: int = 384,
    pad_to_max_length: bool = True,
    text_column: str = "text",
    concatenate_data: bool = False,
    streaming: bool = False,
    overwrite_cache: bool = False,
    preprocessing_num_workers: int | None = None,
    dataloader_num_workers: int = 0,
    min_tokens_per_module: float | None = None,
    moe_calibrate_all_experts: bool = True,
    pipeline: str | None = "independent",
    tracing_ignore: list[str] = [
        "_update_causal_mask",
        "create_causal_mask",
        "_update_mamba_mask",
        "make_causal_mask",
        "get_causal_mask",
        "mask_interface",
        "mask_function",
        "_prepare_4d_causal_attention_mask",
        "_prepare_fsmt_decoder_inputs",
        "_prepare_4d_causal_attention_mask_with_cache_position",
        "_update_linear_attn_mask",
        "project_per_layer_inputs",
    ],
    sequential_targets: list[str] | None = None,
    sequential_offload_device: str = "cpu",
    quantization_aware_calibration: bool = True,
    sequential_prefetch: bool = False,
    output_dir: str | None = None,
    log_dir: str | None = None,
    **kwargs,
) -> PreTrainedModel
Performs oneshot calibration on a model.
Model arguments

Parameters:

- `model` (`str | PreTrainedModel`) – A pretrained model identifier from huggingface.co/models or a path to a local model. Required parameter.
- `distill_teacher` – Teacher model (a trained text generation model) for distillation.
- `config_name` (`str | None`, default: `None`) – Pretrained config name or path, if not the same as model_name.
- `tokenizer` (`str | PreTrainedTokenizerBase | None`, default: `None`) – Pretrained tokenizer name or path, if not the same as model_name.
- `processor` (`str | ProcessorMixin | None`, default: `None`) – Pretrained processor name or path, if not the same as model_name.
- `use_auth_token` (`bool`, default: `False`) – Whether to use a Hugging Face auth token for private models.
- `precision` (`str`, default: `'auto'`) – Precision to cast model weights to; defaults to auto.
- `tie_word_embeddings` (`bool`, default: `True`) – Whether the model's input and output word embeddings should be left tied if possible. False means always untie.
- `trust_remote_code_model` (`bool`, default: `False`) – Whether to allow custom models to execute their own modeling files.
- `save_compressed` (`bool`, default: `True`) – Whether to compress sparse models during save.
- `model_revision` (`str`, default: `'main'`) – The specific model version to use (can be a branch name, tag, or commit id).
Recipe arguments

- `recipe` (`str | list[str] | None`, default: `None`) – Path to an LLM Compressor recipe, or a list of paths to multiple LLM Compressor recipes.
- `recipe_args` (`list[str] | None`, default: `None`) – List of recipe arguments to evaluate, in the format "key1=value1", "key2=value2".
- `clear_sparse_session` (`bool`, default: `False`) – Whether to clear CompressionSession/CompressionLifecycle data between runs.
- `stage` (`str | None`, default: `None`) – The stage of the recipe to use for oneshot.
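The "key1=value1" override format accepted by `recipe_args` can be sketched as a simple split-on-first-equals parse. This is a hypothetical helper for illustration only; the real parsing lives inside LLM Compressor's argument handling:

```python
def parse_overrides(pairs: list[str]) -> dict[str, str]:
    """Split each "key=value" string on the first '=' and collect into a dict."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        overrides[key.strip()] = value.strip()
    return overrides

print(parse_overrides(["dampening_frac=0.1", "targets=Linear"]))
# {'dampening_frac': '0.1', 'targets': 'Linear'}
```

Splitting on the first `=` only means values themselves may contain `=` characters without breaking the format.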
Dataset arguments

- `dataset` (`str | Dataset | DatasetDict | None`, default: `None`) – The name of the dataset to use (via the datasets library).
- `dataset_config_name` (`str | None`, default: `None`) – The configuration name of the dataset to use.
- `dataset_path` (`str | None`, default: `None`) – Path to a custom dataset. Supports json, csv, dvc.
- `splits` (`str | list[str] | dict[str, str] | None`, default: `None`) – Optional percentages of each split to download.
- `batch_size` (`int`, default: `1`) – Calibration dataset batch size. During calibration, LLM Compressor disables lm_head output computations to reduce memory usage from large calibration batch sizes. Large batch sizes may result in excess padding or truncation, depending on the data_collator.
- `data_collator` (`str | Callable`, default: `'truncation'`) – The function used to form a batch from the dataset. Can also specify 'truncation' or 'padding' to truncate or pad non-uniform sequence lengths in a batch. Defaults to 'truncation'.
- `num_calibration_samples` (`int`, default: `512`) – Number of samples to use for one-shot calibration.
- `shuffle_calibration_samples` (`bool`, default: `True`) – Whether to shuffle the dataset before calibration.
- `max_seq_length` (`int`, default: `384`) – Maximum total input sequence length after tokenization.
- `pad_to_max_length` (`bool`, default: `True`) – Whether to pad all samples to `max_seq_length`.
- `text_column` (`str`, default: `'text'`) – Key to use as the `text` input to the tokenizer/processor.
- `concatenate_data` (`bool`, default: `False`) – Whether to concatenate datapoints to fill max_seq_length.
- `streaming` (`bool`, default: `False`) – True to stream data from a cloud dataset.
- `overwrite_cache` (`bool`, default: `False`) – Whether to overwrite the cached preprocessed datasets.
- `preprocessing_num_workers` (`int | None`, default: `None`) – Number of processes for dataset preprocessing.
- `dataloader_num_workers` (`int`, default: `0`) – Number of worker processes for data loading. Default is 0 (safe for low CPU/GPU memory). Set to 2 or more for faster calibration if you have sufficient RAM. Custom data collators may not work with multiprocessing.
- `min_tokens_per_module` (`float | None`, default: `None`) – Minimum percentage of tokens per module; relevant for MoE models.
- `moe_calibrate_all_experts` (`bool`, default: `True`) – Whether to calibrate all experts during MoE model calibration. When True, all experts see all tokens during calibration, ensuring proper quantization statistics. When False, only routed experts are used. Only relevant for MoE models.
- `pipeline` (`str | None`, default: `'independent'`) – Calibration pipeline used to calibrate the model. Options: ['basic', 'datafree', 'sequential', 'independent'].
- `tracing_ignore` (`list[str]`) – List of functions to ignore during tracing, each either {module}.{method_name} or {function_name}. Defaults to a list of common attention-mask helpers (see the signature above).
- `sequential_targets` (`list[str] | None`, default: `None`) – List of layer targets for the sequential pipeline. This is typically a single DecoderLayer. Not specifying this argument will cause the sequential pipeline to default to the `no_split_params` specified by the HF model definition.
- `sequential_offload_device` (`str`, default: `'cpu'`) – Device used to offload intermediate activations between sequential layers. It is recommended to use `cuda:1` if using more than one GPU. Default is cpu.
- `quantization_aware_calibration` (`bool`, default: `True`) – Whether to enable quantization-aware calibration in the sequential pipeline. When True, quantization is applied during the forward pass in calibration; when False, it is disabled.
- `sequential_prefetch` (`bool`, default: `False`) – When using the sequential pipeline, prefetch the next batch in a background thread to overlap onload with forward. Default False; set True for faster calibration when GPU memory allows.
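The idea behind the sequential pipeline and its offload device can be shown with a toy, torch-free sketch: lambdas stand in for decoder layers, and a plain list stands in for activations parked on the offload device between layers. Names are illustrative, not llmcompressor internals:

```python
def sequential_calibrate(layers, batches):
    """Run one layer at a time over every calibration batch, so only a single
    layer's weights need to be resident at once; each layer's outputs are
    stashed (conceptually offloaded to CPU) and fed to the next layer."""
    activations = list(batches)            # inputs to the first layer
    for layer in layers:                   # one layer "onloaded" at a time
        activations = [layer(x) for x in activations]  # outputs go to offload
    return activations

layers = [lambda x: x + 1, lambda x: x * 2]
print(sequential_calibrate(layers, [1, 2, 3]))  # [4, 6, 8]
```

Contrast with the basic pipeline, which would push each batch through the whole stack at once and thus keep every layer resident simultaneously.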
Miscellaneous arguments

- `output_dir` (`str | None`, default: `None`) – Path to save the output model after calibration. Nothing is saved if None.
- `log_dir` (`str | None`, default: `None`) – Path to save logs during the oneshot run. Nothing is logged to file if None.

Returns:

- `PreTrainedModel` – The calibrated PreTrainedModel.
Source code in llmcompressor/entrypoints/oneshot.py
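A minimal end-to-end usage sketch follows. The model id, calibration dataset, and W4A16 scheme are illustrative choices in the style of LLM Compressor's published examples, not requirements; running this downloads a model and benefits from a GPU:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Recipe: quantize all Linear layers to 4-bit weights, leaving lm_head intact.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any HF id or local path
    dataset="open_platypus",                     # calibration dataset
    recipe=recipe,
    max_seq_length=384,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-W4A16",           # model saved here afterwards
)
```

A recipe may equally be given as a YAML path; the keyword arguments shown are parsed into the `model_args`, `dataset_args`, and `recipe_args` groups documented above.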