Distributed Oneshot

As an experimental feature, LLM Compressor supports distributed oneshot for the purpose of greatly speeding up the runtime of model calibration and compression. For more information on implementation, see [RFC][Performance Refactor][Distributed] Sequential Onloading with Data-Parallel Calibration and Weight-Parallel Optimization as well as [GPTQ][ddp] enabling DDP for GPTQ.

Usage

In order to convert a script meant for single-threaded compression into one of distributed compression, please make the following changes:

1. Initialize the Distributed Context

In order to utilize the torch.distributed module, each rank must initialize the distributed module and assign itself a separate GPU device. This can be done by calling the init_dist utility provided by compressed_tensors.

from compressed_tensors.offload import init_dist

init_dist()

2. Modify Model Loading

In order to prevent separate processes from loading the model multiple times and creating excess work/memory usage, we must load our model using the load_offloaded_model context. For more information, see Model Loading.

Before:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto"
)

After:

from compressed_tensors.offload import load_offloaded_model

with load_offloaded_model():
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        dtype="auto",
        device_map="auto_offload",
    )

3. Modify Dataset Loading

In order to prevent separate processes loading the entire dataset and creating excess work/memory usage, we must partition our dataset into disjoint sets. For a dataset of N samples and R ranks, each rank only loads N/R samples.

ds = load_dataset(
    DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"
)

from llmcompressor.datasets.utils import get_rank_partition

ds = load_dataset(
    DATASET_ID, split=get_rank_partition(DATASET_SPLIT, NUM_CALIBRATION_SAMPLES)
)

4. Call your script with `torchrun`

Now, your script is ready to run using distributed processes. To start, simply run your script using python3 -m torchrun --nproc_per_node=2 YOUR_EXAMPLE.py to run with two GPU devices. For a complete example script, see llama_ddp_example.py. The below table shows results and speedups as of LLM Compressor v0.10.0, future changes will bring these numbers closer to linear speedups.

model_id	world_size	max_time	max_memory	save_time	flex_extract	eval_time
Meta-Llama-3-8B-Instruct	1	745.03	5.82	19.57	0.7066	95.28
Meta-Llama-3-8B-Instruct	2	372.20	5.57	49.10	0.7089	95.24
Meta-Llama-3-8B-Instruct	4	264.07	5.82	52.50	0.7180	96.74
Qwen3-30B-A3B	1	14207.53	6.56	748.23	0.8704	209.93
Qwen3-30B-A3B	2	7018.25	6.36	696.65	0.8810	205.89
Qwen3-30B-A3B	4	3694.46	6.36	723.05	0.8832	217.62

Distributed Oneshot