# llmcompressor.modeling.gpt_oss

Classes:

- `LinearExpert` – One MoE expert with separate gate / up / down projections.
- `LinearExperts` – Container of multiple `LinearExpert` modules, driven by `router_indices` / `routing_weights`.

Functions:

- `convert_model_for_quantization_gptoss` – In-place conversion of a GPT-OSS model.
- `find_experts` – Locate GPT-OSS MoE expert modules under `model.model.layers[*].mlp.experts`.
## LinearExpert

Bases: `Module`

One MoE expert with separate gate / up / down projections.

This mirrors the GPT-OSS expert behavior:

    gate = clamp(gate_proj(x))
    up   = clamp(up_proj(x))
    glu  = gate * sigmoid(alpha * gate)
    y    = down_proj((up + 1) * glu)
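The expert computation above can be sketched as a standalone module. This is an illustrative reconstruction, not the library's implementation: the class name is hypothetical, the defaults for `alpha` and `limit` are borrowed from the `LinearExperts` signature, and the exact clamp ranges (gate clamped from above, up clamped symmetrically) are assumptions to verify against `llmcompressor/modeling/gpt_oss.py`.

```python
import torch
import torch.nn as nn


class LinearExpertSketch(nn.Module):
    """Minimal sketch of one MoE expert following the formulas above."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 alpha: float = 1.702, limit: float = 7.0):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.alpha = alpha
        self.limit = limit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # gate = clamp(gate_proj(x)); clamp range is an assumption
        gate = self.gate_proj(x).clamp(max=self.limit)
        # up = clamp(up_proj(x)); clamp range is an assumption
        up = self.up_proj(x).clamp(min=-self.limit, max=self.limit)
        # glu = gate * sigmoid(alpha * gate)
        glu = gate * torch.sigmoid(self.alpha * gate)
        # y = down_proj((up + 1) * glu)
        return self.down_proj((up + 1) * glu)
```

Because the three projections are plain `nn.Linear` layers, each one carries an ordinary `weight`/`bias` pair that weight-quantization passes can target directly.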
Source code in llmcompressor/modeling/gpt_oss.py
## LinearExperts

    LinearExperts(
        hidden_size: int,
        intermediate_size: int,
        num_experts: int,
        alpha: float = 1.702,
        limit: float = 7.0,
    )

Bases: `Module`

Container of multiple `LinearExpert` modules, driven by `router_indices` / `routing_weights`.

This is the "separate gate/up" layout. It is meant to replace the original GPT-OSS experts submodule.
Methods:

- `copy_from_fused_weights` – De-interleave fused gate_up weights/bias and copy into separate gate/up experts.
- `forward` – Implements the MoE computation using the router outputs.
### copy_from_fused_weights

    copy_from_fused_weights(
        legacy_gate_up_W: Tensor,
        legacy_gate_up_b: Tensor,
        legacy_down_W: Tensor,
        legacy_down_b: Tensor,
    ) -> None

De-interleave fused gate_up weights/bias and copy them into the separate gate/up experts.
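A minimal sketch of the de-interleaving step, assuming the fused `gate_up` tensor interleaves the two projections along its last dimension (even columns feed the gate, odd columns feed the up projection). The function name is hypothetical and the layout is an assumption to check against the actual `gpt_oss.py` source.

```python
import torch


def deinterleave_gate_up(fused_w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a fused gate_up weight into separate gate and up weights.

    Assumes an interleaved layout: even columns -> gate, odd columns -> up.
    """
    # Strided slices along the last dimension undo the interleaving.
    return fused_w[..., ::2], fused_w[..., 1::2]
```

The same slicing applies to the fused bias; in the real method the resulting tensors are copied into the per-expert `nn.Linear` parameters.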
### forward

    forward(
        hidden_states: Tensor,
        router_indices: Optional[Tensor] = None,
        routing_weights: Optional[Tensor] = None,
    ) -> torch.Tensor

Implements the MoE computation using the router outputs.

This is compatible with the GPT-OSS MoE call pattern:

    experts(hidden_states, router_indices, routing_weights)
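The combine step behind that call pattern can be illustrated token by token. This is a deliberately naive sketch (the real forward is vectorized); the function name is hypothetical, and `experts` here is any sequence of callables standing in for the expert modules.

```python
import torch


def moe_forward_sketch(hidden_states, experts, router_indices, routing_weights):
    """Naive MoE combine: each token goes to its top-k experts and the
    expert outputs are blended with the router's weights."""
    out = torch.zeros_like(hidden_states)
    for t in range(hidden_states.shape[0]):
        for k, e in enumerate(router_indices[t].tolist()):
            # routing_weights[t, k] weights the k-th chosen expert's output.
            out[t] += routing_weights[t, k] * experts[e](hidden_states[t])
    return out
```

With two experts that scale their input by 2 and 3 and equal routing weights of 0.5, each token's output is the average of the two scaled results.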
## convert_model_for_quantization_gptoss

In-place conversion of a GPT-OSS model:

- Finds all fused MoE expert blocks (those with `gate_up_proj`/`down_proj`).
- Replaces them with `LinearExperts` modules that expose plain `nn.Linear` parameters (`gate_proj`, `up_proj`, `down_proj`), which play nicely with LLM Compressor W4A8 quantization.
## find_experts

Locate GPT-OSS MoE expert modules under `model.model.layers[*].mlp.experts`.