Llama4 FP8 Example
Code Walkthrough
Let's walk through the main steps of the quantization process:
1. Load model
2. Configure quantization algorithm and scheme
3. Apply quantization
4. Confirm generations of the quantized model look sane
5. Save to disk in compressed-tensors format
1. Load Model
Load the model using AutoModelForCausalLM:
from compressed_tensors.offload import dispatch_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
2. Configure the Quantization Algorithm and Scheme
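recipe = QuantizationModifier(
    # Quantize every Linear layer that is not excluded below.
    targets="Linear",
    scheme="FP8_BLOCK",
    # Modules left unquantized. Entries prefixed with "re:" are regular
    # expressions over module names; the last entry is a module class name.
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model.*",
        "re:.*multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)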
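To see which layers the `ignore` list leaves for quantization, the pattern matching can be sketched with the standard `re` module. This is an illustration only: it assumes that `re:` entries are matched against the full module name from its start (as `re.match` does) and that other entries must match exactly. The module names and the `is_ignored` helper are hypothetical, not taken from the model or the library.

```python
import re

# Hypothetical module names, chosen only to illustrate the matching.
module_names = [
    "lm_head",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.feed_forward.router",
    "model.layers.0.feed_forward.experts.3.down_proj",
    "vision_model.encoder.layers.0.mlp.fc1",
    "multi_modal_projector.linear_1",
]

# The regex entries from the recipe above.
ignore_patterns = [
    "re:.*lm_head",
    "re:.*self_attn",
    "re:.*router",
    "re:.*vision_model.*",
    "re:.*multi_modal_projector.*",
]


def is_ignored(name, patterns):
    """Return True if `name` matches any ignore entry (assumed semantics)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Strip the "re:" prefix and match from the start of the name.
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


kept = [name for name in module_names if not is_ignored(name, ignore_patterns)]
print(kept)
```

Under these assumptions only the expert projection remains as a quantization target; the output head, attention, router, and vision modules are all skipped.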
3. Apply Quantization
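This step applies the recipe to the loaded model. The following is a minimal sketch, assuming the `oneshot` entry point imported in step 1 is called with only the model and the recipe. No calibration dataset is passed, on the assumption that the `FP8_BLOCK` scheme can be applied without one. The wrapper `apply_quantization` is a hypothetical helper, used here only to defer the import until the step actually runs.

```python
def apply_quantization(model, recipe):
    """Apply a quantization recipe to `model` in place and return it."""
    # Deferred import, so defining this helper does not require llmcompressor.
    from llmcompressor import oneshot

    # One-shot pass: only the model and the recipe are supplied.
    oneshot(model=model, recipe=recipe)
    return model
```

In the walkthrough's script this corresponds to calling `apply_quantization(model, recipe)` with the objects created in steps 1 and 2, or equivalently calling `oneshot(model=model, recipe=recipe)` directly.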
4. Confirm Generations of the Quantized Model Look Sane
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")