Mistral Large 3 FP8 Example
Code Walkthrough
Prerequisite: Script
# NOTE: Please run the following script before using `model_free_ptq`.
#
# This script reindexes the safetensors files of a model so that all fused
# modules (gate_up, qkv) end up in the same safetensors file. This is required
# by `model_free_ptq` for microscale schemes (NVFP4A16, MXFP4A16).
llmcompressor.reindex_fused_weights \
    mistralai/Mistral-Large-3-675B-Instruct-2512-BF16 \
    Mistral-Large-3-675B-Instruct-2512-BF16-reindexed \
    --num_workers=10
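Optionally, you can sanity-check the reindexed checkpoint before quantizing. The sketch below is illustrative only: it assumes the reindexed directory contains a safetensors index file and that fused-group tensors follow Mistral-style names (layers.N.attention.w{q,k,v}*, layers.N.feed_forward.w1/w3); adjust the index filename and the patterns to whatever your checkpoint actually uses.

import json
import re
from collections import defaultdict
from pathlib import Path

REINDEX_DIR = "Mistral-Large-3-675B-Instruct-2512-BF16-reindexed"

# The fused-group patterns below are assumptions about tensor naming;
# adapt them to the names in your checkpoint.
FUSED_GROUPS = {
    "qkv": re.compile(r"layers\.(\d+)\..*\.w[qkv]"),
    "gate_up": re.compile(r"layers\.(\d+)\..*\.w[13]\."),
}

# Assumes a "*.safetensors.index.json" file exists in the reindexed directory.
index_file = next(Path(REINDEX_DIR).glob("*.safetensors.index.json"))
weight_map = json.loads(index_file.read_text())["weight_map"]

# For every (layer, fused group), collect the set of shard files its component
# tensors live in; after reindexing, each set should contain exactly one file.
shards = defaultdict(set)
for name, shard in weight_map.items():
    for group, pattern in FUSED_GROUPS.items():
        match = pattern.search(name)
        if match:
            shards[(match.group(1), group)].add(shard)

split = {key: files for key, files in shards.items() if len(files) > 1}
print("fused groups split across shards:", len(split))
for (layer, group), files in sorted(split.items()):
    print(f"layer {layer} {group}: {sorted(files)}")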
Let's walk through the main steps of the quantization process:
1. Load model
2. Apply quantization
3. Modify ignore list
1. Load Model
from llmcompressor import model_free_ptq
MODEL_ID = "mistralai/Mistral-Large-3-675B-Instruct-2512-BF16"

# Directory produced by the reindexing script above, and the directory
# the quantized checkpoint will be saved to.
REINDEX_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-reindexed"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-BLOCK"
2. Apply Quantization
model_free_ptq(
    REINDEX_DIR,
    SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=[
        "tok_embeddings",  # embeddings
        "re:patch_merger.*",  # patch merger
        "re:vision_encoder.*",  # vision tower
        "re:vision_language_adapter.*",  # vision adapter
        "re:.*wkv_a_with_mqa$",  # shape not divisible by the quantization block size
        "re:.*wq_a$",  # fused with wkv_a_with_mqa
        "re:.*gate$",  # gate layers
        "output",  # lm head
    ],
    max_workers=10,
    device="cuda:0",
)
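As a quick check after quantization, you can inspect the dtypes recorded in the saved safetensors headers without loading any weights. The sketch below is a rough verification helper, not part of llmcompressor: it assumes FP8 weights are stored with the F8_E4M3 safetensors dtype and that ignored modules remain BF16.

import json
import struct
from collections import Counter
from pathlib import Path

SAVE_DIR = "Mistral-Large-3-675B-Instruct-2512-BF16-FP8-BLOCK"

def read_header(path: Path) -> dict:
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header describing every tensor in the shard.
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

dtype_counts = Counter()
for shard in sorted(Path(SAVE_DIR).glob("*.safetensors")):
    for name, info in read_header(shard).items():
        if name == "__metadata__":
            continue
        dtype_counts[info["dtype"]] += 1

# Expect many F8_E4M3 weight tensors plus their scale tensors,
# and BF16 tensors for the ignored modules.
print(dtype_counts)

If the counts look wrong (for example, no F8_E4M3 tensors at all), re-check the scheme and ignore list before serving the model.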
3. Modify Ignore List
vLLM uses different weight names from those in the Hugging Face Transformers model. To reflect this, update the ignore list in params.json to the following: