llmcompressor.args.dataset_arguments

Dataset argument classes for LLM compression workflows.

This module defines dataclass-based argument containers for configuring dataset loading, preprocessing, and calibration parameters across different dataset sources and processing pipelines. It supports several input sources, including HuggingFace datasets, custom JSON/CSV files, and DVC-managed datasets.

Classes:

CustomDatasetArguments – Arguments for calibration using custom datasets

DVCDatasetArguments – Arguments for calibration using DVC

DatasetArguments – Arguments pertaining to what data we are going to use for calibration

CustomDatasetArguments `dataclass`

CustomDatasetArguments(
    dvc_data_repository: str | None = None,
    dataset_path: str | None = None,
    text_column: str = "text",
    remove_columns: None | str | list[str] = None,
    preprocessing_func: None | str | Callable = None,
    batch_size: int = 1,
    data_collator: str | Callable = "truncation",
)

Bases: DVCDatasetArguments

Arguments for calibration using custom datasets
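
Because this class extends DVCDatasetArguments, it carries the inherited `dvc_data_repository` field ahead of its own fields. The following is a minimal sketch of that layout using stand-in classes, not the library's own definitions: the class names `DVCArgsSketch` and `CustomArgsSketch` and the file name `calibration.json` are hypothetical, while the field names and defaults are copied from the signature above.

```python
from dataclasses import dataclass, fields
from typing import Callable, List, Optional, Union


@dataclass
class DVCArgsSketch:
    # Hypothetical stand-in for DVCDatasetArguments.
    dvc_data_repository: Optional[str] = None


@dataclass
class CustomArgsSketch(DVCArgsSketch):
    # Hypothetical stand-in for CustomDatasetArguments.
    # Field names and defaults mirror the documented signature.
    dataset_path: Optional[str] = None
    text_column: str = "text"
    remove_columns: Union[None, str, List[str]] = None
    preprocessing_func: Union[None, str, Callable] = None
    batch_size: int = 1
    data_collator: Union[str, Callable] = "truncation"


# "calibration.json" and "prompt" are made-up example values.
args = CustomArgsSketch(dataset_path="calibration.json", text_column="prompt")

# Base-class fields come first, followed by the subclass fields.
field_names = [f.name for f in fields(args)]
print(field_names)
```

Fields that are not passed explicitly keep the defaults shown in the signature, so `args.batch_size` is `1` and `args.data_collator` is `"truncation"` here.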

DVCDatasetArguments `dataclass`

DVCDatasetArguments(dvc_data_repository: str | None = None)

Arguments for calibration using DVC

DatasetArguments `dataclass`

DatasetArguments(
    dvc_data_repository: str | None = None,
    dataset_path: str | None = None,
    text_column: str = "text",
    remove_columns: None | str | list[str] = None,
    preprocessing_func: None | str | Callable = None,
    batch_size: int = 1,
    data_collator: str | Callable = "truncation",
    dataset: str | None = None,
    dataset_config_name: str | None = None,
    max_seq_length: int = 384,
    concatenate_data: bool = False,
    raw_kwargs: dict = dict(),
    splits: None | str | list[str] | dict[str, str] = None,
    num_calibration_samples: int | None = 512,
    shuffle_calibration_samples: bool = True,
    streaming: bool | None = False,
    overwrite_cache: bool = False,
    preprocessing_num_workers: int | None = None,
    pad_to_max_length: bool = True,
    min_tokens_per_module: float | None = None,
    moe_calibrate_all_experts: bool = True,
    pipeline: str | None = "independent",
    tracing_ignore: list[str] = (
        lambda: [
            "_update_causal_mask",
            "create_causal_mask",
            "_update_mamba_mask",
            "make_causal_mask",
            "get_causal_mask",
            "mask_interface",
            "mask_function",
            "_prepare_4d_causal_attention_mask",
            "_prepare_fsmt_decoder_inputs",
            "_prepare_4d_causal_attention_mask_with_cache_position",
            "_update_linear_attn_mask",
            "project_per_layer_inputs",
        ]
    )(),
    sequential_targets: list[str] | None = None,
    sequential_offload_device: str = "cpu",
    quantization_aware_calibration: bool = True,
    use_loss_mask: bool = False,
    dataloader_num_workers: int = 0,
    sequential_prefetch: bool = False,
)

Bases: CustomDatasetArguments

Arguments pertaining to what data we are going to use for calibration
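
The signature renders `raw_kwargs` and `tracing_ignore` with call-expression defaults (`dict()` and `(lambda: [...])()`). Python dataclasses reject plain `list` and `dict` defaults, so values like these are normally declared with `field(default_factory=...)`, which builds a fresh object for every instance. The sketch below assumes that is how these two fields are declared; `DefaultsSketch` is a hypothetical two-field stand-in, and the appended name and dictionary key are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DefaultsSketch:
    # Hypothetical stand-in; the real class has many more fields.
    raw_kwargs: dict = field(default_factory=dict)
    tracing_ignore: List[str] = field(
        # Shortened list; the documented default has more entries.
        default_factory=lambda: ["_update_causal_mask", "create_causal_mask"]
    )


a = DefaultsSketch()
b = DefaultsSketch()

# Mutating one instance's defaults does not leak into another instance.
a.tracing_ignore.append("my_custom_mask_fn")  # made-up function name
a.raw_kwargs["example_key"] = 1  # made-up key

print(a.tracing_ignore)
print(b.tracing_ignore)
```

Under this assumption, extending `tracing_ignore` on one arguments object leaves other objects with the original default list.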

With HfArgumentParser, this class can be turned into argparse arguments so that its fields can be specified on the command line.
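
To illustrate the idea of deriving command-line flags from dataclass fields, here is a minimal sketch that uses only the standard library. It is a stand-in for what HfArgumentParser does, not its implementation: `CalibrationArgsSketch`, `FIELD_CONVERTERS`, `build_parser`, and `parse_into` are hypothetical names, the class holds only a small subset of the documented fields, and `my_dataset` is a made-up value.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CalibrationArgsSketch:
    # Hypothetical subset of DatasetArguments; defaults copied from the signature.
    dataset: Optional[str] = None
    max_seq_length: int = 384
    num_calibration_samples: Optional[int] = 512
    batch_size: int = 1


# Converter that argparse applies to each flag's string value.
FIELD_CONVERTERS = {
    "dataset": str,
    "max_seq_length": int,
    "num_calibration_samples": int,
    "batch_size": int,
}


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field, reusing the field's default."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(
            f"--{f.name}", type=FIELD_CONVERTERS[f.name], default=f.default
        )
    return parser


def parse_into(cls, argv):
    """Parse argv and rebuild the dataclass from the resulting namespace."""
    namespace = build_parser(cls).parse_args(argv)
    return cls(**vars(namespace))


args = parse_into(
    CalibrationArgsSketch,
    ["--dataset", "my_dataset", "--num_calibration_samples", "256"],
)
print(args)
```

With `transformers` installed, the analogous call on the real class would be along the lines of `HfArgumentParser(DatasetArguments).parse_args_into_dataclasses()`, which returns a tuple of dataclass instances; flags that are omitted fall back to the defaults listed in the signature.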
