llmcompressor.observers.helpers

Helper functions for observer token counting and analysis.

Provides utility functions for analyzing observer statistics and token counts across model modules. Used for monitoring compression effects and understanding model behavior during quantization and pruning operations.

Functions:

flatten_for_calibration

flatten_for_calibration(
    value: Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[Tensor] = None,
) -> torch.Tensor

Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:

(num_observations, *qparam_shape, group_size)

For block quantization, value will be zero-padded if it is not evenly divisible by block_size, so as not to distort the calculated qparams and to be compatible with vllm block-wise kernels that do not require even divisibility.
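The padding step above can be sketched in plain torch. This is a hypothetical illustration (the function name `pad_to_block_size` is not part of the library); it zero-pads a 2-D weight so both dims become multiples of `block_size`:

```python
import torch
import torch.nn.functional as F

def pad_to_block_size(weight: torch.Tensor, block_size: int) -> torch.Tensor:
    # Hypothetical sketch: zero-pad each dim up to the next multiple
    # of block_size so qparams can be computed per full block.
    rows, cols = weight.shape
    pad_rows = (-rows) % block_size
    pad_cols = (-cols) % block_size
    # F.pad pads the last dim first: (left, right, top, bottom)
    return F.pad(weight, (0, pad_cols, 0, pad_rows))

w = torch.randn(6, 10)
padded = pad_to_block_size(w, block_size=4)
assert padded.shape == (8, 12)          # padded up from (6, 10)
assert torch.equal(padded[:6, :10], w)  # original values preserved
```

Because the pad values are zero, min/max observers see the original dynamic range plus at most a zero, which keeps the computed qparams from being distorted by ragged trailing blocks.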

The first dim is the number of observations (usually the batch size times the number of tokens), the middle dims match the shape of the scales (`qparam_shape`), and the last dim is the number of elements being quantized per group.
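As a concrete illustration of that shape, consider a group-wise weight strategy: a weight has a single observation, the scales have shape `(out_features, in_features // group_size)`, and each group contributes `group_size` elements. A minimal sketch (the helper name here is hypothetical, not the library's internal `_flatten_weight`):

```python
import torch

def flatten_weight_for_group_calibration(
    weight: torch.Tensor, group_size: int
) -> torch.Tensor:
    # Sketch of the group-strategy case, assuming even divisibility:
    # result shape is (num_observations, *qparam_shape, group_size)
    # = (1, out_features, in_features // group_size, group_size)
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    return weight.reshape(1, out_features, in_features // group_size, group_size)

w = torch.randn(8, 16)
flat = flatten_weight_for_group_calibration(w, group_size=4)
assert flat.shape == (1, 8, 4, 4)
```

An observer can then reduce over the first and last dims to get one scale/zero-point per group.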

Parameters:

  • value

    (Tensor) –

    value being flattened

  • base_name

    (str) –

    weight, input, output, q/k/v. Used to characterize the value as being a weight, activation, or attention state

  • args

    (QuantizationArgs) –

    quantization args for determining how the value is flattened

  • g_idx

    (Optional[Tensor], default: None) –

    optional g_idx for weight activation ordering

Returns:

  • Tensor

    value which has been reshaped for calibration

Source code in llmcompressor/observers/helpers.py
def flatten_for_calibration(
    value: torch.Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Reshapes the value according to the quantization strategy for the purposes of
    scale/zp calibration. The value after flattening has the following shape:

    `(num_observations, *qparam_shape, group_size)`

    For block quantization, value will be zero-padded if it is not evenly
    divisible by block_size, so as not to distort the calculated qparams and to be
    compatible with vllm block-wise kernels that do not require even divisibility.

    The first dim is the number of observations (usually the batch size times number of
    tokens), the middle dims are the dimension of the scales, and the last dim is the
    number of elements being quantized per group.

    :param value: value being flattened
    :param base_name: weight, input, output, q/k/v. Used to characterize the value as
        being a weight, activation, or attention state
    :param args: quantization args for determining how the value is flattened
    :param g_idx: optional g_idx for weight activation ordering
    :return: value which has been reshaped for calibration
    """
    if base_name == "weight":
        return _flatten_weight(value, args, g_idx)
    elif base_name in ("input", "output"):
        return _flatten_activation(value, args)
    elif base_name in ("q", "k", "v"):
        return _flatten_attention(value, args)
    else:
        raise ValueError(f"Unknown quantization base name: {base_name}")
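For the activation branches ("input"/"output"), each token is typically treated as one observation. A hedged sketch of what a per-tensor activation flattening could look like (the function name is hypothetical; the library's internal `_flatten_activation` also handles other strategies):

```python
import torch

def flatten_activation_per_tensor(value: torch.Tensor) -> torch.Tensor:
    # Assumed per-tensor case: each token is one observation, the scale
    # is a single scalar (qparam_shape == (1,)), and the "group" spans
    # the full hidden dim, giving (batch * tokens, 1, hidden).
    batch, tokens, hidden = value.shape
    return value.reshape(batch * tokens, 1, hidden)

x = torch.randn(2, 5, 16)
flat = flatten_activation_per_tensor(x)
assert flat.shape == (10, 1, 16)
```

With this layout, a min/max observer can compute `flat.amin(dim=(0, 2))` and `flat.amax(dim=(0, 2))` to obtain the per-tensor calibration range, regardless of strategy-specific details.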