llmcompressor.observers

Framework for monitoring and analyzing model behavior during compression.

Provides observers for tracking tensor statistics, activation ranges, and model behavior during compression workflows. Includes min-max observers, MSE observers, and helper utilities for quantization and other compression techniques.

Modules:

- base
- helpers – Helper functions for observer token counting and analysis.
- min_max
- moving_base
- mse

Classes:

- MemorylessMinMaxObserver – Compute quantization parameters by taking the min/max of the observed value
- MinMaxObserver – Compute quantization parameters by taking the moving average of all min/max values
- MovingAverageMSEObserver – Compute quantization parameters by finding the optimal min/max values which minimize the mean squared quantization error
- MovingAverageObserverBase – Compute quantization parameters by taking the moving average of min/max values
- Observer – Base class for observers which compute quantization parameters given observations of weights, activations, or attention states
- StaticMinMaxObserver – Compute quantization parameters by taking the min/max of all observed values

Functions:

- flatten_for_calibration – Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration
MemorylessMinMaxObserver
MemorylessMinMaxObserver(
base_name: str,
args: QuantizationArgs,
module: Optional[Module] = None,
**observer_kwargs,
)
Bases: Observer
Compute quantization parameters by taking the min/max of the observed value
Parameters:

- base_name (str) – str used to name the observer attribute
- args (QuantizationArgs) – quantization args used to calibrate and quantize the observed value
- module (Optional[Module], default: None) – optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx
- **observer_kwargs – keyword arguments for observer initialization
Source code in llmcompressor/observers/base.py
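Conceptually, the memoryless strategy maps the currently observed range directly onto the quantized range, with no state carried between observations. A minimal pure-Python sketch of that idea (not the library's implementation; `min_max_qparams`, `qmin`, and `qmax` are illustrative names):

```python
# Illustrative sketch: asymmetric int8 quantization parameters from a
# single observation's min/max, the core idea behind
# MemorylessMinMaxObserver. Not the library's actual code.

def min_max_qparams(values, qmin=-128, qmax=127):
    """Return (scale, zero_point) covering [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    # Widen the range to include zero so zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

scale, zp = min_max_qparams([-1.0, 0.5, 2.0])
```

Each call depends only on the current tensor, which is what makes the observer "memoryless".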
MinMaxObserver
MinMaxObserver(
base_name: str,
args: QuantizationArgs,
module: Optional[Module] = None,
**observer_kwargs,
)
Bases: MovingAverageObserverBase
Compute quantization parameters by taking the moving average of all min/max values
Parameters:

- base_name (str) – str used to name the observer attribute
- args (QuantizationArgs) – quantization args used to calibrate and quantize the observed value
- module (Optional[Module], default: None) – optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx
- **observer_kwargs – keyword arguments for observer initialization
Source code in llmcompressor/observers/moving_base.py
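The moving-average variant smooths the observed range across batches rather than tracking only the latest one. A hedged sketch of the update rule, where `alpha` is an assumed averaging constant (the library's actual hyperparameter name and default may differ):

```python
# Illustrative sketch of an exponential moving average over per-batch
# min/max values, the idea behind MinMaxObserver. Not the library's code.

def update_running_min_max(running, observed_min, observed_max, alpha=0.01):
    """Blend the new batch's min/max into the running estimate."""
    if running is None:
        return (observed_min, observed_max)  # first observation seeds the state
    run_min, run_max = running
    return (
        run_min + alpha * (observed_min - run_min),
        run_max + alpha * (observed_max - run_max),
    )

state = None
for batch in ([-1.0, 2.0], [-3.0, 1.0]):
    state = update_running_min_max(state, min(batch), max(batch))
```

With a small `alpha`, a single outlier batch shifts the calibrated range only slightly instead of dominating it.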
MovingAverageMSEObserver
Bases: MovingAverageObserverBase
Compute quantization parameters by finding the optimal min/max values which minimize the mean of quantization error squared.
mse_quant_error := mean((x - fake_quant(x))**2)
global_scale <- min[min_vals, max_vals, global_scale](mse_quant_error(x))
scale, zp <- min[min_vals, max_vals](mse_quant_error(x, global_scale))
Parameters:

- base_name – str used to name the observer attribute
- args – quantization args used to calibrate and quantize the observed value
- module – optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx
- **observer_kwargs – keyword arguments for observer initialization
- maxshrink – maximum shrink amount (in "grid steps"). The number of search steps is int(maxshrink * grid)
- patience – number of consecutive search steps without improvement before early stopping
- grid – resolution of the shrink search. Larger values give finer granularity in shrink factors
- norm – exponent used when computing the error. norm = 2 approximates MSE
- global_scale – precomputed global scale to use for quantization. Ignored if optimize_global_scale is True
- optimize_global_scale – If True, recompute global_scale from the candidate min/max during each step of the search
Source code in llmcompressor/observers/mse.py
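The shrink search described above can be sketched in plain Python. This is an illustrative toy, not the library's code: `fake_quant` stands in for real int8 fake quantization, and the parameter names mirror the docstring (`maxshrink`, `grid`, `patience`, `norm`):

```python
# Illustrative sketch of the MSE shrink search: try progressively smaller
# candidate ranges [p * min, p * max] and keep the one that minimizes
# mean(|x - fake_quant(x)| ** norm). Not the library's actual code.

def fake_quant(xs, lo, hi, qmin=-128, qmax=127):
    """Quantize to int8 over [lo, hi], clamp, then dequantize."""
    scale = (hi - lo) / (qmax - qmin) or 1e-9
    zp = round(qmin - lo / scale)
    return [
        (min(max(round(x / scale) + zp, qmin), qmax) - zp) * scale for x in xs
    ]

def mse_search(xs, maxshrink=0.8, grid=100, patience=5, norm=2.0):
    lo, hi = min(xs), max(xs)
    best = (float("inf"), lo, hi)
    bad_steps = 0
    for i in range(int(maxshrink * grid)):  # number of search steps
        p = 1 - i / grid  # shrink factor for this grid step
        q = fake_quant(xs, p * lo, p * hi)
        err = sum(abs(x - y) ** norm for x, y in zip(xs, q)) / len(xs)
        if err < best[0]:
            best, bad_steps = (err, p * lo, p * hi), 0
        else:
            bad_steps += 1
            if bad_steps >= patience:
                break  # early stopping after `patience` steps without improvement
    return best[1], best[2]

xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 10.0]
lo, hi = mse_search(xs)
```

Shrinking the range trades clipping error on outliers against finer resolution for the bulk of the distribution; the search picks whichever trade-off minimizes the chosen norm.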
MovingAverageObserverBase
MovingAverageObserverBase(
base_name: str,
args: QuantizationArgs,
module: Optional[Module] = None,
**observer_kwargs,
)
Bases: Observer
Compute quantization parameters by taking the moving average of min/max values
Parameters:

- base_name (str) – str used to name the observer attribute
- args (QuantizationArgs) – quantization args used to calibrate and quantize the observed value
- module (Optional[Module], default: None) – optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx
- **observer_kwargs – keyword arguments for observer initialization

Methods:

- get_current_global_min_max – Calculate the min and max value of the observed value (without moving average)
- get_current_min_max – Calculate the min and max value of the observed value (without moving average)
- get_global_min_max – Calculate moving average of min and max values from observed value
- get_min_max – Calculate moving average of min and max values from observed value
Source code in llmcompressor/observers/moving_base.py
get_current_global_min_max abstractmethod
Calculate the min and max value of the observed value (without moving average) for the purposes of global scale calculation
Source code in llmcompressor/observers/moving_base.py
get_current_min_max abstractmethod
Calculate the min and max value of the observed value (without moving average)
get_global_min_max
Calculate moving average of min and max values from observed value for the purposes of global scale calculation
Parameters:

- observed (Tensor) – value being observed whose shape is (num_observations, 1, group_size)

Returns:

- MinMaxTuple – minimum value and maximum value whose shapes are (1, )
Source code in llmcompressor/observers/moving_base.py
get_min_max
Calculate moving average of min and max values from observed value
Parameters:

- observed (Tensor) – value being observed whose shape is (num_observations, *qparam_shape, group_size)

Returns:

- MinMaxTuple – minimum value and maximum value whose shapes are (*qparam_shape, )
Source code in llmcompressor/observers/moving_base.py
Observer
Observer(
base_name: str,
args: QuantizationArgs,
module: Optional[Module] = None,
**observer_kwargs,
)
Bases: InternalModule, RegistryMixin
Base class for observers which compute quantization parameters given observations of weights, activations, or attention states.
Example:
module = ...
observer = Observer.load_from_registry(observer, base_name="weight", args=...)
module.global_scale = observer.get_global_scale(module.weight)
scales, zero_points = observer(module.weight)
Parameters:

- base_name (str) – str used to name the observer attribute
- args (QuantizationArgs) – quantization args used to calibrate and quantize the observed value
- module (Optional[Module], default: None) – optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx
- **observer_kwargs – keyword arguments for observer initialization

Methods:

- forward – Calculate updated scales and zero points from observed value
- get_global_min_max – Calculate min and max values from observed value for the purposes of global scale calculation
- get_global_scale – Calculate updated global scale from observed value
- get_min_max – Calculate min and max values from observed value
Source code in llmcompressor/observers/base.py
forward
Calculate updated scales and zero points from observed value (weight, activation, or attention state).
Parameters:

- observed (Tensor) – value being observed

Returns:

- ScaleZpTuple – calibrated scale and zero point
Source code in llmcompressor/observers/base.py
get_global_min_max abstractmethod
Calculate min and max values from observed value for the purposes of global scale calculation
Parameters:

- observed (Tensor) – value of shape (num_observations, 1, group_size)

Returns:

- MinMaxTuple – minimum value and maximum value whose shapes are (1, )
Source code in llmcompressor/observers/base.py
get_global_scale
Calculate updated global scale from observed value (weight, activation, or attention state).
Parameters:

- observed (Tensor) – value being observed

Returns:

- Tensor – calibrated global parameter
Source code in llmcompressor/observers/base.py
get_min_max abstractmethod
Calculate min and max values from observed value
Parameters:

- observed (Tensor) – value of shape (num_observations, *qparam_shape, group_size)

Returns:

- MinMaxTuple – minimum value and maximum value whose shapes are (*qparam_shape, )
Source code in llmcompressor/observers/base.py
StaticMinMaxObserver
Bases: Observer
Compute quantization parameters by taking the min/max of all observed values
Parameters:

- base_name – str used to name the observer attribute
- args – quantization args used to calibrate and quantize the observed value
- module – optional module with attached quantization parameters. This argument is required to utilize existing qparams such as global_scale or g_idx
- **observer_kwargs – keyword arguments for observer initialization
Source code in llmcompressor/observers/min_max.py
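In contrast to the memoryless and moving-average variants, the static strategy accumulates a single range over every observed value, so the range can only widen as calibration proceeds. A small illustrative sketch (function and variable names are hypothetical, not the library's):

```python
# Illustrative sketch of the accumulate-over-all-batches min/max strategy
# behind StaticMinMaxObserver. Not the library's actual code.

def accumulate_min_max(running, batch):
    """Widen the running (min, max) to cover the new batch."""
    lo, hi = min(batch), max(batch)
    if running is None:
        return (lo, hi)
    return (min(running[0], lo), max(running[1], hi))

state = None
for batch in ([0.5, 1.0], [-2.0, 0.1], [0.0, 0.3]):
    state = accumulate_min_max(state, batch)
# state now spans the global min/max across every batch observed so far
```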
flatten_for_calibration
flatten_for_calibration(
value: Tensor,
base_name: str,
args: QuantizationArgs,
g_idx: Optional[Tensor] = None,
) -> torch.Tensor
Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:
(num_observations, *qparam_shape, group_size)
For block quantization, value will be zero-padded if it is not evenly divisible by block_size, so as not to distort the calculated qparams and to be compatible with vllm block-wise kernels that do not require even divisibility.
The first dim is the number of observations (usually the batch size times number of tokens), the middle dims are the dimension of the scales, and the last dim is the number of elements being quantized per group.
Parameters:

- value (Tensor) – value being flattened
- base_name (str) – weight, input, output, q/k/v. Used to characterize the value as being a weight, activation, or attention state
- args (QuantizationArgs) – quantization args for determining how the value is flattened
- g_idx (Optional[Tensor], default: None) – optional g_idx for weight activation ordering

Returns:

- Tensor – value which has been reshaped for calibration
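To make the target shape concrete, here is a hypothetical pure-Python sketch of group-wise flattening for a 2-D weight. The real function also handles activations, attention states, zero-padding, and g_idx reordering; `flatten_groups` is an illustrative name:

```python
# Illustrative sketch: reshape a (rows, cols) weight into
# (num_observations, *qparam_shape, group_size) so each scale sees exactly
# its group of elements. Here qparam_shape is (rows, num_groups).
# Not the library's actual code.

def flatten_groups(weight, group_size):
    rows, cols = len(weight), len(weight[0])
    assert cols % group_size == 0, "pad first if not evenly divisible"
    num_groups = cols // group_size
    # One scale per row per group of `group_size` columns.
    flattened = [
        [row[g * group_size : (g + 1) * group_size] for g in range(num_groups)]
        for row in weight
    ]
    return [flattened]  # leading num_observations dim is 1 for a weight

w = [[1, 2, 3, 4], [5, 6, 7, 8]]
out = flatten_groups(w, group_size=2)
```

The resulting nesting has shape (1, 2, 2, 2): one observation, a (rows, num_groups) qparam grid, and `group_size` elements per group, matching the (num_observations, *qparam_shape, group_size) contract above.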