llmcompressor.observers

Framework for monitoring and analyzing model behavior during compression.

Provides observers for tracking tensor statistics, activation ranges, and model behavior during compression workflows. Includes min-max observers, MSE observers, and helper utilities for quantization and other compression techniques.

Classes:

  • MemorylessMinMaxObserver

    Compute quantization parameters by taking the min/max of the observed value

  • MinMaxObserver

    Compute quantization parameters by taking the moving average of all min/max values

  • MovingAverageMSEObserver

    Compute quantization parameters by finding the optimal min/max values which minimize the mean squared quantization error

  • MovingAverageObserverBase

    Compute quantization parameters by taking the moving average of min/max values

  • Observer

    Base class for observers which compute quantization parameters given observations

  • StaticMinMaxObserver

    Compute quantization parameters by taking the min/max of all observed values

Functions:

  • flatten_for_calibration

    Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration

MemorylessMinMaxObserver

MemorylessMinMaxObserver(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: Observer

Compute quantization parameters by taking the min/max of the observed value

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__()
    self.module = ref(module) if module is not None else None
    self.base_name = base_name
    self.args = args

    # populate observer kwargs
    self.args.observer_kwargs = self.args.observer_kwargs or {}
    self.args.observer_kwargs.update(observer_kwargs)

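Downstream, the observed min/max values are converted into a scale and zero point (that conversion lives in compressed-tensors, not in the observer itself). As a hedged illustration of the standard asymmetric affine mapping, with illustrative names that are not part of the library's API:

```python
def minmax_to_scale_zp(min_val: float, max_val: float, num_bits: int = 8):
    """Map an observed min/max range to an asymmetric scale/zero-point pair.

    Standard affine quantization: the integer range [qmin, qmax] is stretched
    to cover [min_val, max_val], with zero included so that real 0.0
    quantizes to an integer exactly.
    """
    qmin, qmax = 0, 2**num_bits - 1
    # include zero in the range so 0.0 maps to an integer exactly
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point
```

For a symmetric range like [-1, 1] with 8 bits this yields a scale of 2/255 and a zero point near the middle of the integer range; for a purely positive range the zero point sits at qmin.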
MinMaxObserver

MinMaxObserver(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: MovingAverageObserverBase

Compute quantization parameters by taking the moving average of all min/max values

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Source code in llmcompressor/observers/moving_base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__(base_name, args, module, **observer_kwargs)
    self.avg_constant = self.args.observer_kwargs.get("averaging_constant", 0.01)

    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

MovingAverageMSEObserver

MovingAverageMSEObserver(*args, **kwargs)

Bases: MovingAverageObserverBase

Compute quantization parameters by finding the optimal min/max values which minimize the mean of quantization error squared.

mse_quant_error := mean((x - fake_quant(x))**2)
global_scale <- min[min_vals, max_vals, global_scale](mse_quant_error(x))
scale, zp <- min[min_vals, max_vals](mse_quant_error(x, global_scale))

Parameters:

  • base_name

    str used to name the observer attribute

  • args

    quantization args used to calibrate and quantize the observed value

  • module

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

    maxshrink: maximum shrink amount (in “grid steps”). The number of search steps is int(maxshrink * grid)

    patience: number of consecutive search steps without improvement before early stopping

    grid: resolution of the shrink search. Larger values give finer granularity in shrink factors

    norm: exponent used when computing the error. norm = 2 approximates MSE

    global_scale: precomputed global scale to use for quantization. Ignored if optimize_global_scale is True

    optimize_global_scale: If True, recompute global_scale from the candidate min/max during each step of the search

Source code in llmcompressor/observers/mse.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    observer_kwargs = self.args.observer_kwargs
    self.maxshrink = observer_kwargs.get("maxshrink", 0.20)
    self.patience = observer_kwargs.get("patience", 5)
    self.grid = observer_kwargs.get("grid", 100.0)
    self.norm = observer_kwargs.get("norm", 2.4)

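The search these kwargs control can be sketched in plain Python. This is a simplified scalar version under assumed defaults, not the library's vectorized implementation: each step shrinks the candidate min/max range by 1/grid, fake-quantizes, and keeps the candidate minimizing the mean |error|**norm, stopping early after `patience` steps without improvement:

```python
def mse_shrink_search(x, maxshrink=0.20, grid=100.0, norm=2.4, patience=5,
                      num_bits=8):
    """Toy scalar sketch of the MSE-style min/max search: try shrunken
    ranges [min*p, max*p] and keep the one minimizing mean(|x - fq(x)|**norm)."""
    qmin, qmax = 0, 2**num_bits - 1
    base_min, base_max = min(x), max(x)

    def fake_quant_err(lo, hi):
        scale = (hi - lo) / (qmax - qmin) or 1e-12
        zp = round(qmin - lo / scale)
        err = 0.0
        for v in x:
            q = min(max(round(v / scale) + zp, qmin), qmax)  # quantize + clamp
            err += abs(v - (q - zp) * scale) ** norm          # dequant error
        return err / len(x)

    best_err, best, misses = fake_quant_err(base_min, base_max), (base_min, base_max), 0
    # number of search steps is int(maxshrink * grid), as documented above
    for i in range(1, int(maxshrink * grid) + 1):
        p = 1.0 - i / grid
        err = fake_quant_err(base_min * p, base_max * p)
        if err < best_err:
            best_err, best, misses = err, (base_min * p, base_max * p), 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best
```

Shrinking trades clipping error on outliers against finer resolution for the bulk of the distribution, which is why the optimal range can be tighter than the observed min/max.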
MovingAverageObserverBase

MovingAverageObserverBase(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: Observer

Compute quantization parameters by taking the moving average of min/max values

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Methods:

  • get_current_global_min_max

    Calculate the min and max value of the observed value (without moving average) for the purposes of global scale calculation

  • get_current_min_max

    Calculate the min and max value of the observed value (without moving average)

  • get_global_min_max

    Calculate moving average of min and max values from observed value for the purposes of global scale calculation

  • get_min_max

    Calculate moving average of min and max values from observed value

Source code in llmcompressor/observers/moving_base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__(base_name, args, module, **observer_kwargs)
    self.avg_constant = self.args.observer_kwargs.get("averaging_constant", 0.01)

    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

get_current_global_min_max abstractmethod

get_current_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate the min and max value of the observed value (without moving average) for the purposes of global scale calculation

Source code in llmcompressor/observers/moving_base.py
@abstractmethod
def get_current_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate the min and max value of the observed value (without moving average)
    for the purposes of global scale calculation
    """
    raise NotImplementedError()

get_current_min_max abstractmethod

get_current_min_max(observed: Tensor) -> MinMaxTuple

Calculate the min and max value of the observed value (without moving average)

Source code in llmcompressor/observers/moving_base.py
@abstractmethod
def get_current_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate the min and max value of the observed value (without moving average)
    """
    raise NotImplementedError()

get_global_min_max

get_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate moving average of min and max values from observed value for the purposes of global scale calculation

Parameters:

  • observed

    (Tensor) –

    value being observed whose shape is (num_observations, 1, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (1, )

Source code in llmcompressor/observers/moving_base.py
def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate moving average of min and max values from observed value
    for the purposes of global scale calculation

    :param observed: value being observed whose shape is
        (num_observations, 1, group_size)
    :return: minimum value and maximum value whose shapes are (1, )
    """
    min_vals, max_vals = self.get_current_global_min_max(observed)

    if self.past_global_min_vals is not None and self.avg_constant != 1.0:
        # FUTURE: consider scaling by num observations (first dim)
        #         rather than reducing by first dim
        min_vals = self._lerp(
            self.past_global_min_vals, min_vals, self.avg_constant
        )
        max_vals = self._lerp(
            self.past_global_max_vals, max_vals, self.avg_constant
        )

    self.past_global_min_vals = min_vals
    self.past_global_max_vals = max_vals

    return min_vals, max_vals

get_min_max

get_min_max(observed: Tensor) -> MinMaxTuple

Calculate moving average of min and max values from observed value

Parameters:

  • observed

    (Tensor) –

    value being observed whose shape is (num_observations, *qparam_shape, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (*qparam_shape, )

Source code in llmcompressor/observers/moving_base.py
def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate moving average of min and max values from observed value

    :param observed: value being observed whose shape is
        (num_observations, *qparam_shape, group_size)
    :return: minimum value and maximum value whose shapes are (*qparam_shape, )
    """
    min_vals, max_vals = self.get_current_min_max(observed)

    if self.past_min_vals is not None and self.avg_constant != 1.0:
        # FUTURE: consider scaling by num observations (first dim)
        #         rather than reducing by first dim
        min_vals = self._lerp(self.past_min_vals, min_vals, self.avg_constant)
        max_vals = self._lerp(self.past_max_vals, max_vals, self.avg_constant)

    self.past_min_vals = min_vals
    self.past_max_vals = max_vals

    return min_vals, max_vals
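Both moving-average updates reduce to the same linear interpolation between the past statistics and the current observation. A minimal scalar sketch of `_lerp` and how it accumulates over a stream of observations (the real method operates on tensors):

```python
def lerp(past: float, current: float, avg_constant: float) -> float:
    """Exponential-moving-average step: past + avg_constant * (current - past).

    avg_constant near 0 changes slowly (long memory); avg_constant == 1.0
    ignores history entirely, which is why the observers skip the lerp in
    that case.
    """
    return past + avg_constant * (current - past)


def running_min(observations, avg_constant=0.01):
    """Fold a stream of per-batch minima into one smoothed minimum,
    mirroring how past_min_vals is seeded on the first call then lerped."""
    past = None
    for current in observations:
        past = current if past is None else lerp(past, current, avg_constant)
    return past
```

With the default averaging_constant of 0.01, a single outlier batch moves the tracked min/max by only 1% of the gap, so calibration statistics are dominated by typical batches.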

Observer

Observer(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: InternalModule, RegistryMixin

Base class for observers which compute quantization parameters given observations of weights, activations, or attention states.

Example:

module = ...
observer = Observer.load_from_registry(observer, base_name="weight", args=...)
module.global_scale = observer.get_global_scale(module.weight)
scales, zero_points = observer(module.weight)

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Methods:

  • forward

    Calculate updated scales and zero points from observed value

  • get_global_min_max

    Calculate min and max values from observed value for the purposes of global scale calculation

  • get_global_scale

    Calculate updated global scale from observed value

  • get_min_max

    Calculate min and max values from observed value

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__()
    self.module = ref(module) if module is not None else None
    self.base_name = base_name
    self.args = args

    # populate observer kwargs
    self.args.observer_kwargs = self.args.observer_kwargs or {}
    self.args.observer_kwargs.update(observer_kwargs)

forward

forward(observed: Tensor) -> ScaleZpTuple

Calculate updated scales and zero points from observed value (weight, activation, or attention state).

Parameters:

  • observed

    (Tensor) –

    value being observed

Returns:

  • ScaleZpTuple

    calibrated scale and zero point

Source code in llmcompressor/observers/base.py
@torch.no_grad
def forward(self, observed: torch.Tensor) -> ScaleZpTuple:
    """
    Calculate updated scales and zero points from observed value
    (weight, activation, or attention state).

    :param observed: value being observed
    :return: calibrated scale and zero point
    """
    scales, zero_points, _min, _max = self._forward_with_minmax(observed)
    return (scales, zero_points)

get_global_min_max abstractmethod

get_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate min and max values from observed value for the purposes of global scale calculation

Parameters:

  • observed

    (Tensor) –

    value of shape (num_observations, 1, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (1, )

Source code in llmcompressor/observers/base.py
@abstractmethod
def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate min and max values from observed value for the purposes of
    global scale calculation

    :param observed: value of shape (num_observations, 1, group_size)
    :return: minimum value and maximum value whose shapes are (1, )
    """
    raise NotImplementedError()

get_global_scale

get_global_scale(observed: Tensor) -> torch.Tensor

Calculate updated global scale from observed value (weight, activation, or attention state).

Parameters:

  • observed

    (Tensor) –

    value being observed

Returns:

  • Tensor

    calibrated global parameter

Source code in llmcompressor/observers/base.py
@torch.no_grad
def get_global_scale(self, observed: torch.Tensor) -> torch.Tensor:
    """
    Calculate updated global scale from observed value
    (weight, activation, or attention state).

    :param observed: value being observed
    :return: calibrated global parameter
    """
    global_scale, _min, _max = self._get_global_scale_with_minmax(observed)
    return global_scale

get_min_max abstractmethod

get_min_max(observed: Tensor) -> MinMaxTuple

Calculate min and max values from observed value

Parameters:

  • observed

    (Tensor) –

    value of shape (num_observations, *qparam_shape, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (*qparam_shape, )

Source code in llmcompressor/observers/base.py
@abstractmethod
def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate min and max values from observed value

    :param observed: value of shape (num_observations, *qparam_shape, group_size)
    :return: minimum value and maximum value whose shapes are (*qparam_shape, )
    """
    raise NotImplementedError()

StaticMinMaxObserver

StaticMinMaxObserver(*args, **kwargs)

Bases: Observer

Compute quantization parameters by taking the min/max of all observed values

Parameters:

  • base_name

    str used to name the observer attribute

  • args

    quantization args used to calibrate and quantize the observed value

  • module

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Source code in llmcompressor/observers/min_max.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

flatten_for_calibration

flatten_for_calibration(
    value: Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[Tensor] = None,
) -> torch.Tensor

Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:

(num_observations, *qparam_shape, group_size)

For block quantization, value will be zero-padded if it is not evenly divisible by block_size, so as not to distort the calculated qparams and to be compatible with vllm block-wise kernels that do not require even divisibility.

The first dim is the number of observations (usually the batch size times number of tokens), the middle dims are the dimension of the scales, and the last dim is the number of elements being quantized per group.

Parameters:

  • value

    (Tensor) –

    value being flattened

  • base_name

    (str) –

    weight, input, output, q/k/v. Used to characterize the value as being a weight, activation, or attention state

  • args

    (QuantizationArgs) –

    quantization args for determining how the value is flattened

  • g_idx

    (Optional[Tensor], default: None ) –

    optional g_idx tensor for weight activation ordering

Returns:

  • Tensor

    value which has been reshaped for calibration

Source code in llmcompressor/observers/helpers.py
def flatten_for_calibration(
    value: torch.Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Reshapes the value according to the quantization strategy for the purposes of
    scale/zp calibration. The value after flattening has the following shape:

    `(num_observations, *qparam_shape, group_size)`

    For block quantization, value will be zero-padded if it is not evenly
    divisible by block_size, so as not to distort the calculated qparams and to be
    compatible with vllm block-wise kernels that do not require even divisibility.

    The first dim is the number of observations (usually the batch size times number of
    tokens), the middle dims are the dimension of the scales, and the last dim is the
    number of elements being quantized per group.

    :param value: value being flattened
    :param base_name: weight, input, output, q/k/v. Used to characterize the value as
        being a weight, activation, or attention state
    :param args: quantization args for determining how the value is flattened
    :param g_idx: optional gidx for weight activation ordering
    :return: value which has been reshaped for calibration
    """
    if base_name == "weight":
        return _flatten_weight(value, args, g_idx)
    elif base_name in ("input", "output"):
        return _flatten_activation(value, args)
    elif base_name in ("q", "k", "v"):
        return _flatten_attention(value, args)
    else:
        raise ValueError(f"Unknown quantization base name: {base_name}")
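For a concrete feel of the target shape, here is a hedged pure-Python sketch of group-wise flattening for a 2-D weight (nested lists stand in for tensors; the real helper also handles activations, attention states, g_idx reordering, and block zero-padding):

```python
def flatten_weight_for_groups(weight, group_size):
    """Sketch of group-wise weight flattening. For a (rows, cols) weight
    with per-group quantization, qparam_shape is (rows, cols // group_size),
    so the flattened result has shape

        (num_observations=1, rows, cols // group_size, group_size)

    matching the documented (num_observations, *qparam_shape, group_size).
    Assumes cols is evenly divisible by group_size.
    """
    cols = len(weight[0])
    assert cols % group_size == 0, "sketch does not zero-pad uneven groups"
    return [[
        [row[g:g + group_size] for g in range(0, cols, group_size)]
        for row in weight
    ]]
```

A weight is a single observation (num_observations = 1), whereas for activations the leading dimension typically grows with batch size times sequence length.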