llmcompressor.utils
General utility functions used throughout LLM Compressor.
Modules:
- dev
- dist
- helpers – General utility helper functions.
- metric_logging – Utility functions for metrics logging and GPU memory monitoring.
- pytorch
- transformers
Functions:
- DisableQuantization – Disable quantization during forward passes after applying a quantization config.
- calibration_forward_context – Context in which all calibration forward passes should occur.
- disable_cache – Temporarily disable the key-value cache for transformer models.
- disable_hf_kernels – In transformers>=4.50.0, some module forward methods may be replaced by calls to hf hub kernels.
- disable_lm_head – Disable the lm_head of a model by moving it to the meta device.
- dispatch_for_generation – Dispatch a model for autoregressive generation.
- eval_context – Disable pytorch training mode for the given module.
- get_embeddings – Returns input and output embeddings of a model.
- greedy_bin_packing – Distribute items across bins using a greedy bin-packing heuristic.
- import_from_path – Import the module and the name of the function/class separated by ":".
- is_package_available – A helper function to check if a package is available.
- patch_transformers_logger_level – Context under which the transformers logger's level is modified.
- skip_weights_download – Context manager under which models are initialized without having to download the model weight files.
- targets_embeddings – Returns True if the given targets target the word embeddings of the model.
- untie_word_embeddings – Untie word embeddings, if possible.
- wait_for_comms – Block until all pending async distributed operations complete.
DisableQuantization
Disable quantization during forward passes after applying a quantization config
Source code in llmcompressor/utils/helpers.py
calibration_forward_context
Context in which all calibration forward passes should occur.
- Remove gradient calculations
- Disable the KV cache
- Disable train mode and enable eval mode
- Disable hf kernels which could bypass hooks
- Disable lm head (input and weights can still be calibrated, output will be meta)
Source code in llmcompressor/utils/helpers.py
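As a rough illustration of how a composite context like this behaves, here is a minimal sketch built on contextlib.ExitStack. The `_toggle` helper and the `flags` dict are hypothetical stand-ins for the real sub-contexts (gradient disabling, KV-cache disabling, eval mode, etc.), not LLM Compressor APIs:

```python
import contextlib


@contextlib.contextmanager
def _toggle(flags, name):
    # Stand-in for one calibration sub-context: flip a flag off, restore on exit
    flags[name] = False
    try:
        yield
    finally:
        flags[name] = True


@contextlib.contextmanager
def calibration_forward_context(flags):
    # Compose all calibration-time contexts into one, mirroring the list above
    with contextlib.ExitStack() as stack:
        for name in flags:
            stack.enter_context(_toggle(flags, name))
        yield


flags = {"grad": True, "kv_cache": True, "train_mode": True, "hf_kernels": True, "lm_head": True}
with calibration_forward_context(flags):
    inside = dict(flags)  # every behavior is disabled while inside

assert not any(inside.values())
assert all(flags.values())  # everything is restored on exit
```

The ExitStack pattern guarantees that every sub-context is unwound in reverse order even if the forward pass raises.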
disable_cache
Temporarily disable the key-value cache for transformer models. Used to prevent excess memory use in one-shot cases where the model only performs the prefill phase and not the generation phase.
Example:
    model = AutoModel.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    input = torch.randint(0, 32, size=(1, 32))
    with disable_cache(model):
        output = model(input)
Source code in llmcompressor/utils/helpers.py
disable_hf_kernels
In transformers>=4.50.0, some module forward methods may be replaced by calls to hf hub kernels. This has the potential to bypass hooks added by LLM Compressor
Source code in llmcompressor/utils/helpers.py
disable_lm_head
Disable the lm_head of a model by moving it to the meta device. This function does not untie parameters, and restores proper model loading upon exit.
Source code in llmcompressor/utils/helpers.py
dispatch_for_generation
Dispatch a model for autoregressive generation. This means that modules are dispatched evenly across available devices and kept onloaded if possible.
Parameters:
- model – model to dispatch
- hint_batch_size – reserve memory for the batch size of inputs
- hint_batch_seq_len – reserve memory for the sequence length of inputs
- hint_model_dtype – reserve memory for the model's dtype. Will be inferred from the model if none is provided
- hint_extra_memory – extra memory reserved for model serving
- no_split_modules – names of module classes which should not be split across multiple devices
Returns:
- PreTrainedModel – dispatched model
Source code in llmcompressor/utils/dev.py
eval_context
Disable pytorch training mode for the given module
Source code in llmcompressor/utils/helpers.py
get_embeddings
Returns input and output embeddings of a model. If get_input_embeddings/ get_output_embeddings is not implemented on the model, then None will be returned instead.
Parameters:
- model (PreTrainedModel) – model to get embeddings from
Returns:
- tuple[Module | None, Module | None] – tuple containing the embedding modules, or None
Source code in llmcompressor/utils/transformers.py
greedy_bin_packing
greedy_bin_packing(
items: list[T],
num_bins: int,
item_weight_fn: Callable[[T], float] = lambda x: 1,
) -> tuple[list[T], list[list[T]], dict[T, int]]
Distribute items across bins using a greedy bin-packing heuristic.
Items are sorted by weight in descending order, then each item is assigned to the bin with the smallest current total weight. This approximates an even distribution of weight across bins.
Parameters:
- items (list[T]) – items to distribute. Sorted in-place by descending weight.
- num_bins (int) – number of bins to distribute items across.
- item_weight_fn (Callable[[T], float], default: lambda x: 1) – callable that returns the weight of an item. Defaults to uniform weight of 1.
Returns:
- tuple[list[T], list[list[T]], dict[T, int]] – a 3-tuple of:
    - items: the input list, now sorted by descending weight.
    - bin_to_items: list of length num_bins where each element is the list of items assigned to that bin.
    - item_to_bin: mapping from each item to its assigned bin index.
Source code in llmcompressor/utils/dist.py
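The heuristic described above (sort heaviest-first, place each item in the currently lightest bin) can be sketched in a few lines. This is an illustrative reimplementation matching the documented signature, not the library's source:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def greedy_bin_packing(
    items: list[T],
    num_bins: int,
    item_weight_fn: Callable[[T], float] = lambda x: 1,
) -> tuple[list[T], list[list[T]], dict[T, int]]:
    # Sort heaviest-first (in place), then assign each item to the bin
    # with the smallest current total weight
    items.sort(key=item_weight_fn, reverse=True)
    bin_to_items: list[list[T]] = [[] for _ in range(num_bins)]
    bin_weights = [0.0] * num_bins
    item_to_bin: dict[T, int] = {}
    for item in items:
        idx = bin_weights.index(min(bin_weights))
        bin_to_items[idx].append(item)
        bin_weights[idx] += item_weight_fn(item)
        item_to_bin[item] = idx
    return items, bin_to_items, item_to_bin


# Four weighted items across two bins: weights end up as 6 vs 5
sorted_items, bins, mapping = greedy_bin_packing([1, 5, 2, 3], 2, item_weight_fn=float)
```

With uniform weights (the default `item_weight_fn`), this degenerates to round-robin-like balancing by item count.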
import_from_path
Import the module and the name of the function/class separated by ":".
Examples:
    path = "/path/to/file.py:func_or_class_name"
    path = "/path/to/file:focn"
    path = "path.to.file:focn"
Parameters:
- path (str) – path including the file path and object name
Source code in llmcompressor/utils/helpers.py
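A simplified sketch of the dotted-module-path case is shown below; the real helper in llmcompressor.utils.helpers also accepts file paths, which this sketch omits:

```python
import importlib


def import_from_path(path: str):
    # Split "module.path:object_name" on the last colon, import the module,
    # and look up the named function or class on it
    module_path, _, obj_name = path.rpartition(":")
    module = importlib.import_module(module_path)
    return getattr(module, obj_name)


# e.g. resolve the stdlib json.loads function by string
loads = import_from_path("json:loads")
```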
is_package_available
is_package_available(
package_name: str, return_version: bool = False
) -> Union[Tuple[bool, str], bool]
A helper function to check if a package is available and optionally return its version. This function enforces a check that the package is available and is not just a directory/file with the same name as the package.
inspired from: https://github.com/huggingface/transformers/blob/965cf677695dd363285831afca8cf479cf0c600c/src/transformers/utils/import_utils.py#L41
Parameters:
- package_name (str) – The package name to check for
- return_version (bool, default: False) – True to return the version of the package if available
Returns:
- Union[Tuple[bool, str], bool] – True if the package is available, False otherwise; or a tuple of (bool, version) if return_version is True
Source code in llmcompressor/utils/helpers.py
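An illustrative sketch in the spirit of the transformers helper it cites: checking for distribution metadata guards against a same-named local directory shadowing the package. Details (namespace packages, stdlib modules) may differ from the real implementation:

```python
import importlib.metadata
import importlib.util


def is_package_available(package_name: str, return_version: bool = False):
    # Importable at all?
    exists = importlib.util.find_spec(package_name) is not None
    version = "N/A"
    if exists:
        try:
            version = importlib.metadata.version(package_name)
        except importlib.metadata.PackageNotFoundError:
            # Importable but not installed as a distribution, e.g. a
            # local directory that merely shares the package's name
            exists = False
    return (exists, version) if return_version else exists
```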
patch_transformers_logger_level
Context under which the transformers logger's level is modified
This can be used with skip_weights_download to squelch warnings related to missing parameters in the checkpoint
Parameters:
- level (int, default: ERROR) – new logging level for the transformers logger. Logs whose level is below this level will not be logged
Source code in llmcompressor/utils/dev.py
skip_weights_download
Context manager under which models are initialized without having to download the model weight files. This differs from init_empty_weights in that weights are allocated on their assigned devices with random values, as opposed to being on the meta device.
Parameters:
- model_class (Type[PreTrainedModel], default: AutoModelForCausalLM) – class to patch, defaults to AutoModelForCausalLM
Source code in llmcompressor/utils/dev.py
targets_embeddings
targets_embeddings(
model: PreTrainedModel,
targets: NamedModules,
check_input: bool = True,
check_output: bool = True,
) -> bool
Returns True if the given targets target the word embeddings of the model
Parameters:
- model (PreTrainedModel) – model containing word embeddings
- targets (NamedModules) – named modules to check
- check_input (bool, default: True) – whether to check if input embeddings are targeted
- check_output (bool, default: True) – whether to check if output embeddings are targeted
Returns:
- bool – True if embeddings are targeted, False otherwise
Source code in llmcompressor/utils/transformers.py
untie_word_embeddings
Untie word embeddings, if possible. This function raises a warning if embeddings cannot be found in the model definition.
The model config will be updated to reflect that embeddings are now untied.
Parameters:
- model (PreTrainedModel) – transformers model containing word embeddings
Source code in llmcompressor/utils/transformers.py
wait_for_comms
Block until all pending async distributed operations complete.
Calls wait() on each work handle, then clears the list in-place so it can be reused for the next batch of operations.
Parameters:
- pending_comms (list[Work]) – mutable list of async communication handles (returned by dist.reduce, dist.broadcast, etc. with async_op=True). The list is cleared after all operations have completed.
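The wait-then-clear contract can be sketched with a stand-in handle class; `_FakeWork` below is hypothetical, whereas real handles come from torch.distributed collectives called with async_op=True:

```python
def wait_for_comms(pending_comms):
    # Block on every handle, then clear the list in place so the same
    # list object can be reused for the next batch of async operations
    for work in pending_comms:
        work.wait()
    pending_comms.clear()


class _FakeWork:
    # Stand-in for a torch.distributed Work handle
    def __init__(self):
        self.completed = False

    def wait(self):
        self.completed = True


handles = [_FakeWork(), _FakeWork()]
finished = list(handles)  # keep references so we can inspect them after
wait_for_comms(handles)
```

Clearing in place (rather than rebinding the name) matters because callers may hold their own reference to the same list.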