llmcompressor.args.dataset_arguments
Dataset argument classes for LLM compression workflows.
This module defines dataclass-based argument containers for configuring dataset loading, preprocessing, and calibration parameters across different dataset sources and processing pipelines. It supports several input sources, including HuggingFace datasets, custom JSON/CSV files, and DVC-managed datasets.
Classes:
- CustomDatasetArguments – Arguments for calibration using custom datasets
- DVCDatasetArguments – Arguments for calibration using DVC
- DatasetArguments – Arguments pertaining to what data we are going to use for calibration
CustomDatasetArguments dataclass
Arguments for calibration using custom datasets
DVCDatasetArguments dataclass
Arguments for calibration using DVC
DatasetArguments dataclass
DatasetArguments(
    dvc_data_repository: str | None = None,
    dataset_path: str | None = None,
    text_column: str = "text",
    remove_columns: None | str | list[str] = None,
    preprocessing_func: None | str | Callable = None,
    batch_size: int = 1,
    data_collator: str | Callable = "truncation",
    dataset: str | None = None,
    dataset_config_name: str | None = None,
    max_seq_length: int = 384,
    concatenate_data: bool = False,
    raw_kwargs: dict = dict(),
    splits: None | str | list[str] | dict[str, str] = None,
    num_calibration_samples: int | None = 512,
    shuffle_calibration_samples: bool = True,
    streaming: bool | None = False,
    overwrite_cache: bool = False,
    preprocessing_num_workers: int | None = None,
    pad_to_max_length: bool = True,
    min_tokens_per_module: float | None = None,
    moe_calibrate_all_experts: bool = True,
    pipeline: str | None = "independent",
    tracing_ignore: list[str] = (
        lambda: [
            "_update_causal_mask",
            "create_causal_mask",
            "_update_mamba_mask",
            "make_causal_mask",
            "get_causal_mask",
            "mask_interface",
            "mask_function",
            "_prepare_4d_causal_attention_mask",
            "_prepare_fsmt_decoder_inputs",
            "_prepare_4d_causal_attention_mask_with_cache_position",
            "_update_linear_attn_mask",
            "project_per_layer_inputs",
        ]
    )(),
    sequential_targets: list[str] | None = None,
    sequential_offload_device: str = "cpu",
    quantization_aware_calibration: bool = True,
    use_loss_mask: bool = False,
    dataloader_num_workers: int = 0,
    sequential_prefetch: bool = False,
)
Bases: CustomDatasetArguments
Arguments pertaining to what data we are going to use for calibration.
With HfArgumentParser, this class can be turned into argparse arguments, so each field can be specified on the command line.
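The mapping from dataclass fields to command-line flags can be illustrated with a minimal standard-library sketch. It does not use HfArgumentParser or llmcompressor: `CalibrationArgsSketch`, `build_parser`, and `parse_into` are hypothetical names introduced here, the dataclass copies only a few fields and defaults from the signature above, and the dataset name `my_dataset` is a placeholder. The type inference (take the type of the default, fall back to `str` when the default is `None`) is a simplification assumed for this example, and boolean fields are deliberately left out because argparse does not convert them safely with `type=bool`.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CalibrationArgsSketch:
    """Hypothetical stand-in; a few fields copied from DatasetArguments."""

    dataset: Optional[str] = None
    max_seq_length: int = 384
    num_calibration_samples: Optional[int] = 512
    batch_size: int = 1
    pipeline: Optional[str] = "independent"


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field, reusing the field default."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        # Simplified type inference (an assumption of this sketch):
        # use the default's type, or str when the default is None.
        arg_type = str if f.default is None else type(f.default)
        parser.add_argument(f"--{f.name}", type=arg_type, default=f.default)
    return parser


def parse_into(cls, argv):
    """Parse argv and return a populated dataclass instance."""
    namespace = build_parser(cls).parse_args(argv)
    return cls(**vars(namespace))


# Flags that are not passed keep the dataclass defaults.
args = parse_into(
    CalibrationArgsSketch,
    ["--dataset", "my_dataset", "--max_seq_length", "2048"],
)
print(args)
```

With transformers installed, the corresponding real call would presumably be `HfArgumentParser(DatasetArguments).parse_args_into_dataclasses()`, which returns a tuple of dataclass instances; that usage is inferred from the description above rather than verified against this module.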