llmcompressor.observers

Framework for monitoring and analyzing model behavior during compression.

Provides observers for tracking tensor statistics, activation ranges, and model behavior during compression workflows. Includes min-max observers, MSE observers, and helper utilities for quantization and other compression techniques.

Classes:

  • MemorylessMinMaxObserver

    Compute quantization parameters by taking the min/max of the observed value

  • MinMaxObserver

    Compute quantization parameters by taking the moving average of all min/max values

  • MovingAverageMSEObserver

    Compute quantization parameters by finding the optimal min/max values which minimize the mean squared quantization error

  • MovingAverageObserverBase

    Compute quantization parameters by taking the moving average of min/max values

  • Observer

    Base class for observers which compute quantization parameters given observations

  • StaticMinMaxObserver

    Compute quantization parameters by taking the min/max of all observed values

Functions:

  • flatten_for_calibration

    Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration

MemorylessMinMaxObserver

MemorylessMinMaxObserver(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: Observer

Compute quantization parameters by taking the min/max of the observed value

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__()
    self.module = ref(module) if module is not None else None
    self.base_name = base_name
    self.args = args

    # populate observer kwargs
    self.args.observer_kwargs = self.args.observer_kwargs or {}
    self.args.observer_kwargs.update(observer_kwargs)

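Downstream, the observed min/max values are converted into a scale and zero point (that conversion lives in compressed-tensors, not in the observer itself). As a hedged illustration of the standard asymmetric affine mapping, with illustrative names that are not part of the library's API:

```python
def minmax_to_scale_zp(min_val: float, max_val: float, num_bits: int = 8):
    """Map an observed min/max range to an asymmetric scale/zero-point pair.

    Standard affine quantization: the integer range [qmin, qmax] is stretched
    to cover [min_val, max_val], with zero included so that real 0.0
    quantizes to an integer exactly.
    """
    qmin, qmax = 0, 2**num_bits - 1
    # include zero in the range so 0.0 maps to an integer exactly
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point
```

For a symmetric range like [-1, 1] with 8 bits this yields a scale of 2/255 and a zero point near the middle of the integer range; for a purely positive range the zero point sits at qmin.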
MinMaxObserver

MinMaxObserver(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: MovingAverageObserverBase

Compute quantization parameters by taking the moving average of all min/max values

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Source code in llmcompressor/observers/moving_base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__(base_name, args, module, **observer_kwargs)
    self.avg_constant = self.args.observer_kwargs.get("averaging_constant", 0.01)

    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

MovingAverageMSEObserver

MovingAverageMSEObserver(*args, **kwargs)

Bases: MovingAverageObserverBase

Compute quantization parameters by finding the optimal min/max values which minimize the mean of quantization error squared.

mse_quant_error := mean((x - fake_quant(x))**2)
global_scale <- min[min_vals, max_vals, global_scale](mse_quant_error(x))
scale, zp <- min[min_vals, max_vals](mse_quant_error(x, global_scale))

Parameters:

  • base_name

    str used to name the observer attribute

  • args

    quantization args used to calibrate and quantize the observed value

  • module

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

    maxshrink: maximum shrink amount (in “grid steps”). The number of search steps is int(maxshrink * grid)

    patience: number of consecutive search steps without improvement before early stopping

    grid: resolution of the shrink search. Larger values give finer granularity in shrink factors

    norm: exponent used when computing the error. norm = 2 approximates MSE

    global_scale: precomputed global scale to use for quantization. Ignored if optimize_global_scale is True

    optimize_global_scale: If True, recompute global_scale from the candidate min/max during each step of the search

Source code in llmcompressor/observers/mse.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    observer_kwargs = self.args.observer_kwargs
    self.maxshrink = observer_kwargs.get("maxshrink", 0.20)
    self.patience = observer_kwargs.get("patience", 5)
    self.grid = observer_kwargs.get("grid", 100.0)
    self.norm = observer_kwargs.get("norm", 2.4)

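The search these kwargs control can be sketched in plain Python. This is a simplified scalar version under assumed defaults, not the library's vectorized implementation: each step shrinks the candidate min/max range by 1/grid, fake-quantizes, and keeps the candidate minimizing the mean |error|**norm, stopping early after `patience` steps without improvement:

```python
def mse_shrink_search(x, maxshrink=0.20, grid=100.0, norm=2.4, patience=5,
                      num_bits=8):
    """Toy scalar sketch of the MSE-style min/max search: try shrunken
    ranges [min*p, max*p] and keep the one minimizing mean(|x - fq(x)|**norm)."""
    qmin, qmax = 0, 2**num_bits - 1
    base_min, base_max = min(x), max(x)

    def fake_quant_err(lo, hi):
        scale = (hi - lo) / (qmax - qmin) or 1e-12
        zp = round(qmin - lo / scale)
        err = 0.0
        for v in x:
            q = min(max(round(v / scale) + zp, qmin), qmax)  # quantize + clamp
            err += abs(v - (q - zp) * scale) ** norm          # dequant error
        return err / len(x)

    best_err, best, misses = fake_quant_err(base_min, base_max), (base_min, base_max), 0
    # number of search steps is int(maxshrink * grid), as documented above
    for i in range(1, int(maxshrink * grid) + 1):
        p = 1.0 - i / grid
        err = fake_quant_err(base_min * p, base_max * p)
        if err < best_err:
            best_err, best, misses = err, (base_min * p, base_max * p), 0
        else:
            misses += 1
            if misses >= patience:
                break
    return best
```

Shrinking trades clipping error on outliers against finer resolution for the bulk of the distribution, which is why the optimal range can be tighter than the observed min/max.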
MovingAverageObserverBase

MovingAverageObserverBase(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: Observer

Compute quantization parameters by taking the moving average of min/max values

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Methods:

  • get_current_global_min_max

    Calculate the min and max value of the observed value (without moving average) for the purposes of global scale calculation

  • get_current_min_max

    Calculate the min and max value of the observed value (without moving average)

  • get_global_min_max

    Calculate moving average of min and max values from observed value for the purposes of global scale calculation

  • get_min_max

    Calculate moving average of min and max values from observed value

Source code in llmcompressor/observers/moving_base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__(base_name, args, module, **observer_kwargs)
    self.avg_constant = self.args.observer_kwargs.get("averaging_constant", 0.01)

    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

get_current_global_min_max abstractmethod

get_current_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate the min and max value of the observed value (without moving average) for the purposes of global scale calculation

Source code in llmcompressor/observers/moving_base.py
@abstractmethod
def get_current_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate the min and max value of the observed value (without moving average)
    for the purposes of global scale calculation
    """
    raise NotImplementedError()

get_current_min_max abstractmethod

get_current_min_max(observed: Tensor) -> MinMaxTuple

Calculate the min and max value of the observed value (without moving average)

Source code in llmcompressor/observers/moving_base.py
@abstractmethod
def get_current_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate the min and max value of the observed value (without moving average)
    """
    raise NotImplementedError()

get_global_min_max

get_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate moving average of min and max values from observed value for the purposes of global scale calculation

Parameters:

  • observed

    (Tensor) –

    value being observed whose shape is (num_observations, 1, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (1, )

Source code in llmcompressor/observers/moving_base.py
def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate moving average of min and max values from observed value
    for the purposes of global scale calculation

    :param observed: value being observed whose shape is
        (num_observations, 1, group_size)
    :return: minimum value and maximum value whose shapes are (1, )
    """
    min_vals, max_vals = self.get_current_global_min_max(observed)

    if self.past_global_min_vals is not None and self.avg_constant != 1.0:
        # FUTURE: consider scaling by num observations (first dim)
        #         rather than reducing by first dim
        min_vals = self._lerp(
            self.past_global_min_vals, min_vals, self.avg_constant
        )
        max_vals = self._lerp(
            self.past_global_max_vals, max_vals, self.avg_constant
        )

    self.past_global_min_vals = min_vals
    self.past_global_max_vals = max_vals

    return min_vals, max_vals

get_min_max

get_min_max(observed: Tensor) -> MinMaxTuple

Calculate moving average of min and max values from observed value

Parameters:

  • observed

    (Tensor) –

    value being observed whose shape is (num_observations, *qparam_shape, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (*qparam_shape, )

Source code in llmcompressor/observers/moving_base.py
def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate moving average of min and max values from observed value

    :param observed: value being observed whose shape is
        (num_observations, *qparam_shape, group_size)
    :return: minimum value and maximum value whose shapes are (*qparam_shape, )
    """
    min_vals, max_vals = self.get_current_min_max(observed)

    if self.past_min_vals is not None and self.avg_constant != 1.0:
        # FUTURE: consider scaling by num observations (first dim)
        #         rather than reducing by first dim
        min_vals = self._lerp(self.past_min_vals, min_vals, self.avg_constant)
        max_vals = self._lerp(self.past_max_vals, max_vals, self.avg_constant)

    self.past_min_vals = min_vals
    self.past_max_vals = max_vals

    return min_vals, max_vals
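Both moving-average updates reduce to the same linear interpolation between the past statistics and the current observation. A minimal scalar sketch of `_lerp` and how it accumulates over a stream of observations (the real method operates on tensors):

```python
def lerp(past: float, current: float, avg_constant: float) -> float:
    """Exponential-moving-average step: past + avg_constant * (current - past).

    avg_constant near 0 changes slowly (long memory); avg_constant == 1.0
    ignores history entirely, which is why the observers skip the lerp in
    that case.
    """
    return past + avg_constant * (current - past)


def running_min(observations, avg_constant=0.01):
    """Fold a stream of per-batch minima into one smoothed minimum,
    mirroring how past_min_vals is seeded on the first call then lerped."""
    past = None
    for current in observations:
        past = current if past is None else lerp(past, current, avg_constant)
    return past
```

With the default averaging_constant of 0.01, a single outlier batch moves the tracked min/max by only 1% of the gap, so calibration statistics are dominated by typical batches.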

Observer

Observer(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: InternalModule, RegistryMixin

Base class for observers which compute quantization parameters given observations of weights, activations, or attention states.

Example:

module = ...
observer = Observer.load_from_registry(observer, base_name="weight", args=...)
module.global_scale = observer.get_global_scale(module.weight)
scales, zero_points = observer(module.weight)

Parameters:

  • base_name

    (str) –

    str used to name the observer attribute

  • args

    (QuantizationArgs) –

    quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None ) –

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Methods:

  • forward

    Calculate updated scales and zero points from observed value

  • get_global_min_max

    Calculate min and max values from observed value for the purposes of global scale calculation

  • get_global_scale

    Calculate updated global scale from observed value

  • get_min_max

    Calculate min and max values from observed value

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__()
    self.module = ref(module) if module is not None else None
    self.base_name = base_name
    self.args = args

    # populate observer kwargs
    self.args.observer_kwargs = self.args.observer_kwargs or {}
    self.args.observer_kwargs.update(observer_kwargs)

forward

forward(observed: Tensor) -> ScaleZpTuple

Calculate updated scales and zero points from observed value (weight, activation, or attention state).

Parameters:

  • observed

    (Tensor) –

    value being observed

Returns:

  • ScaleZpTuple

    calibrated scale and zero point

Source code in llmcompressor/observers/base.py
@torch.no_grad
def forward(self, observed: torch.Tensor) -> ScaleZpTuple:
    """
    Calculate updated scales and zero points from observed value
    (weight, activation, or attention state).

    :param observed: value being observed
    :return: calibrated scale and zero point
    """
    scales, zero_points, _min, _max = self._forward_with_minmax(observed)
    return (scales, zero_points)

get_global_min_max abstractmethod

get_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate min and max values from observed value for the purposes of global scale calculation

Parameters:

  • observed

    (Tensor) –

    value of shape (num_observations, 1, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (1, )

Source code in llmcompressor/observers/base.py
@abstractmethod
def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate min and max values from observed value for the purposes of
    global scale calculation

    :param observed: value of shape (num_observations, 1, group_size)
    :return: minimum value and maximum value whose shapes are (1, )
    """
    raise NotImplementedError()

get_global_scale

get_global_scale(observed: Tensor) -> torch.Tensor

Calculate updated global scale from observed value (weight, activation, or attention state).

Parameters:

  • observed

    (Tensor) –

    value being observed

Returns:

  • Tensor

    calibrated global parameter

Source code in llmcompressor/observers/base.py
@torch.no_grad
def get_global_scale(self, observed: torch.Tensor) -> torch.Tensor:
    """
    Calculate updated global scale from observed value
    (weight, activation, or attention state).

    :param observed: value being observed
    :return: calibrated global parameter
    """
    global_scale, _min, _max = self._get_global_scale_with_minmax(observed)
    return global_scale

get_min_max abstractmethod

get_min_max(observed: Tensor) -> MinMaxTuple

Calculate min and max values from observed value

Parameters:

  • observed

    (Tensor) –

    value of shape (num_observations, *qparam_shape, group_size)

Returns:

  • MinMaxTuple

    minimum value and maximum value whose shapes are (*qparam_shape, )

Source code in llmcompressor/observers/base.py
@abstractmethod
def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate min and max values from observed value

    :param observed: value of shape (num_observations, *qparam_shape, group_size)
    :return: minimum value and maximum value whose shapes are (*qparam_shape, )
    """
    raise NotImplementedError()

StaticMinMaxObserver

StaticMinMaxObserver(*args, **kwargs)

Bases: Observer

Compute quantization parameters by taking the min/max of all observed values

Parameters:

  • base_name

    str used to name the observer attribute

  • args

    quantization args used to calibrate and quantize the observed value

  • module

    optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx

  • **observer_kwargs

    keyword arguments for observer initialization

Source code in llmcompressor/observers/min_max.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

flatten_for_calibration

flatten_for_calibration(
    value: Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[Tensor] = None,
) -> torch.Tensor

Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:

(num_observations, *qparam_shape, group_size)

For block quantization, value will be zero-padded if it is not evenly divisible by block_size, so as not to distort the calculated qparams and to be compatible with vllm block-wise kernels that do not require even divisibility.

The first dim is the number of observations (usually the batch size times number of tokens), the middle dims are the dimension of the scales, and the last dim is the number of elements being quantized per group.

Parameters:

  • value

    (Tensor) –

    value being flattened

  • base_name

    (str) –

    weight, input, output, q/k/v. Used to characterize the value as being a weight, activation, or attention state

  • args

    (QuantizationArgs) –

    quantization args for determining how the value is flattened

  • g_idx

    (Optional[Tensor], default: None ) –

    optional g_idx tensor for weight activation ordering

Returns:

  • Tensor

    value which has been reshaped for calibration

Source code in llmcompressor/observers/helpers.py
def flatten_for_calibration(
    value: torch.Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Reshapes the value according to the quantization strategy for the purposes of
    scale/zp calibration. The value after flattening has the following shape:

    `(num_observations, *qparam_shape, group_size)`

    For block quantization, value will be zero-padded if it is not evenly
    divisible by block_size, so as not to distort the calculated qparams and to be
    compatible with vllm block-wise kernels that do not require even divisibility.

    The first dim is the number of observations (usually the batch size times number of
    tokens), the middle dims are the dimension of the scales, and the last dim is the
    number of elements being quantized per group.

    :param value: value being flattened
    :param base_name: weight, input, output, q/k/v. Used to characterize the value as
        being a weight, activation, or attention state
    :param args: quantization args for determining how the value is flattened
    :param g_idx: optional gidx for weight activation ordering
    :return: value which has been reshaped for calibration
    """
    if base_name == "weight":
        return _flatten_weight(value, args, g_idx)
    elif base_name in ("input", "output"):
        return _flatten_activation(value, args)
    elif base_name in ("q", "k", "v"):
        return _flatten_attention(value, args)
    else:
        raise ValueError(f"Unknown quantization base name: {base_name}")
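For a concrete feel of the target shape, here is a hedged pure-Python sketch of group-wise flattening for a 2-D weight (nested lists stand in for tensors; the real helper also handles activations, attention states, g_idx reordering, and block zero-padding):

```python
def flatten_weight_for_groups(weight, group_size):
    """Sketch of group-wise weight flattening. For a (rows, cols) weight
    with per-group quantization, qparam_shape is (rows, cols // group_size),
    so the flattened result has shape

        (num_observations=1, rows, cols // group_size, group_size)

    matching the documented (num_observations, *qparam_shape, group_size).
    Assumes cols is evenly divisible by group_size.
    """
    cols = len(weight[0])
    assert cols % group_size == 0, "sketch does not zero-pad uneven groups"
    return [[
        [row[g:g + group_size] for g in range(0, cols, group_size)]
        for row in weight
    ]]
```

A weight is a single observation (num_observations = 1), whereas for activations the leading dimension typically grows with batch size times sequence length.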