llmcompressor.modeling.llama4

Classes:

SequentialLlama4TextMoe

SequentialLlama4TextMoe(
    original: Llama4TextMoe,
    config: Llama4Config,
    calibrate_all_experts: bool = True,
)

Bases: MoECalibrationModule

Calibration version of Llama4TextMoe that unpacks experts for sequential processing.

This module:

1. Unpacks the packed expert weights (3D -> 2D) for calibration
2. Optionally sends all tokens to all experts during calibration
3. Stays in unpacked form (permanent) for vLLM compatibility
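To illustrate step 2, here is a minimal sketch of the difference between normal top-k MoE routing and the `calibrate_all_experts=True` behavior. Shapes and variable names are illustrative assumptions, not the actual internals of `Llama4TextMoe`:

```python
import numpy as np

# Hypothetical dimensions; the real module reads these from Llama4TextConfig.
num_experts, top_k, hidden_dim = 4, 2, 8
rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, hidden_dim))          # (seq_len, hidden_dim)
router_logits = rng.standard_normal((5, num_experts))  # router score per token/expert

# Normal routing: each token is dispatched only to its top-k experts.
topk_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]

calibrate_all_experts = True
for expert in range(num_experts):
    if calibrate_all_experts:
        # Every expert sees every token, so calibration hooks observe
        # activation statistics for all experts, not just the routed ones.
        expert_input = tokens
    else:
        # Inference path: only tokens routed to this expert reach it.
        mask = (topk_idx == expert).any(axis=-1)
        expert_input = tokens[mask]
    # ... run expert_input through this expert's MLP and record stats ...
```

Routing weights from the router are still applied to the output as usual; calibrating all experts only changes which tokens each expert *observes* during statistics collection.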

Source code in llmcompressor/modeling/llama4.py
def __init__(
    self,
    original: Llama4TextMoe,
    config: Llama4Config,
    calibrate_all_experts: bool = True,
):
    super().__init__()
    # Extract text config from multimodal config
    text_config: Llama4TextConfig = config.get_text_config()
    self.top_k = text_config.num_experts_per_tok
    self.hidden_dim = text_config.hidden_size
    self.num_experts = text_config.num_local_experts

    self.experts = SequentialLlama4TextExperts(text_config, original.experts)
    self.router = original.router
    self.shared_expert = original.shared_expert
    self.calibrate_all_experts = calibrate_all_experts
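The unpacking done by `SequentialLlama4TextExperts` (step 1 above) can be sketched as follows. The packed layout and names here are assumptions for illustration: Llama4 stores expert weights stacked along the leading dimension, and unpacking turns each slice into an ordinary 2D weight matrix that per-layer quantization hooks can calibrate independently:

```python
import numpy as np

# Hypothetical packed layout: (num_experts, hidden_dim, intermediate_dim).
num_experts, hidden_dim, intermediate_dim = 4, 8, 16
packed = np.random.default_rng(0).standard_normal(
    (num_experts, hidden_dim, intermediate_dim)
)

# Unpack 3D -> 2D: one weight matrix per expert, so each expert behaves
# like a standalone linear layer during sequential calibration.
unpacked = [packed[i] for i in range(num_experts)]
```

Because the module stays in this unpacked form permanently, the saved checkpoint exposes per-expert linear layers, which is the layout vLLM expects when loading the compressed model.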