llmcompressor.observers.helpers
Helper functions for observer token counting and analysis.
Provides utility functions for analyzing observer statistics and token counts across model modules. Used for monitoring compression effects and understanding model behavior during quantization and pruning operations.
Functions:

- flatten_for_calibration – Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration.
flatten_for_calibration
flatten_for_calibration(
value: Tensor,
base_name: str,
args: QuantizationArgs,
g_idx: Optional[Tensor] = None,
) -> torch.Tensor
Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The value after flattening has the following shape:
(num_observations, *qparam_shape, group_size)
For block quantization, value will be zero-padded if it is not evenly divisible by block_size, so as not to distort the calculated qparams and to be compatible with vllm block-wise kernels that do not require even divisibility.
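The amount of zero-padding needed for block quantization follows from simple ceiling arithmetic. A minimal sketch (the `pad_to_multiple` helper is hypothetical, not part of the library):

```python
import math

def pad_to_multiple(length: int, block_size: int) -> int:
    # Hypothetical helper: number of zeros appended so that `length`
    # becomes evenly divisible by `block_size`.
    return math.ceil(length / block_size) * block_size - length

# A dimension of 11008 with block_size 128 is already divisible,
# so no padding is needed.
print(pad_to_multiple(11008, 128))  # -> 0

# A dimension of 11016 needs 120 zeros to reach the next
# multiple of 128 (11136).
print(pad_to_multiple(11016, 128))  # -> 120
```

Because the padded elements are zeros, they do not distort the min/max statistics used to calibrate scales and zero-points.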
The first dim is the number of observations (usually the batch size times number of tokens), the middle dims are the dimension of the scales, and the last dim is the number of elements being quantized per group.
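The shape contract above can be illustrated with plain tensor reshaping. This is a hedged sketch of group-wise weight flattening under assumed shapes, not the library's implementation:

```python
import torch

# Assume a (rows, cols) weight quantized group-wise with group_size 128.
# Flattening to (num_observations, *qparam_shape, group_size) gives
# (1, rows, cols // group_size, group_size): one observation for a
# weight, a (rows, cols // group_size) scale shape, and 128 elements
# quantized per group.
weight = torch.randn(64, 256)
group_size = 128

flat = weight.reshape(1, 64, 256 // group_size, group_size)
print(flat.shape)  # torch.Size([1, 64, 2, 128])

# An observer can then reduce over the first and last dims to
# produce one scale/zero-point per (row, group) pair.
group_max = flat.abs().amax(dim=(0, -1))
print(group_max.shape)  # torch.Size([64, 2])
```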
Parameters:

- value (Tensor) – The value being flattened.
- base_name (str) – One of "weight", "input", "output", or "q"/"k"/"v"; used to characterize the value as a weight, activation, or attention state.
- args (QuantizationArgs) – Quantization args determining how the value is flattened.
- g_idx (Optional[Tensor], default: None) – Optional g_idx for weight activation ordering.

Returns:

- Tensor – The value, reshaped for calibration.
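For activations, the observation dimension typically corresponds to tokens. A hedged sketch under assumed shapes (tensor-wise quantization of an input activation; the exact per-strategy layout is an assumption, not taken from the library):

```python
import torch

# Assume an input activation of shape (batch, seq_len, hidden) quantized
# tensor-wise. Flattening to (num_observations, *qparam_shape, group_size)
# would give (batch * seq_len, 1, hidden): one observation per token,
# a single scale (qparam shape (1,)), and all hidden elements in one group.
batch, seq_len, hidden = 2, 8, 16
x = torch.randn(batch, seq_len, hidden)

flat = x.reshape(batch * seq_len, 1, hidden)
print(flat.shape)  # torch.Size([16, 1, 16])
```

The observer can then accumulate statistics across the observation dim over successive calibration batches.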