auto-round Command: Examples, Options, and Usage

Common Examples

Quantize a model

auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"

Use the best recipe (highest accuracy, slowest)

auto-round-best --model [model_id] --scheme "[W4A16]"

Use the light recipe (fastest)

auto-round-light --model [model_id] --scheme "[W4A16]"

Quantize to 4 bits and export to several formats at once

auto-round --model [model_id] --bits 4 --group_size 128 --format "[auto_round,auto_awq,auto_gptq]" --output_dir [path/to/output]

Quantize without tuning (calibration-free RTN mode)

auto-round --model [model_id] --bits 4 --iters 0

Quantize across multiple GPUs

auto-round --model [model_id] --device_map "[0,1,2,3]"

Evaluate a quantized model

auto-round --model [path/to/quantized] --eval --tasks [mmlu,lambada_openai]
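
The examples above can be chained into a small script. The following is a minimal dry-run sketch, not an official workflow: it uses only flags shown above, prints the two commands by default, and runs them only when RUN=1 is set. The model, the output path, and the RUN variable are assumptions chosen for illustration. The exact folder that auto-round writes under --output_dir may differ, so adjust QUANTIZED accordingly.

```shell
#!/bin/sh
# Dry-run sketch: quantize a model, then evaluate the result.
# MODEL, OUT_DIR, QUANTIZED and RUN are illustrative, not tool defaults.
MODEL="Qwen/Qwen3-0.6B"
OUT_DIR="./quantized"
# Assumption: the exported model sits directly in OUT_DIR. The tool may
# create a subdirectory instead; point QUANTIZED at the actual folder.
QUANTIZED="$OUT_DIR"

# Step 1: quantize to W4A16 and export in auto_round format.
quantize_cmd="auto-round --model $MODEL --scheme W4A16 --format auto_round --output_dir $OUT_DIR"
# Step 2: evaluate the exported model on two lm-eval-harness tasks.
eval_cmd="auto-round --model $QUANTIZED --eval --tasks mmlu,lambada_openai"

for cmd in "$quantize_cmd" "$eval_cmd"; do
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$cmd" || exit 1   # stop if a step fails
    else
        printf '%s\n' "$cmd"     # default: only show what would run
    fi
done
```

Printing first makes it easy to review both commands before committing to a long quantization run.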

Description

auto-round is a weight-only post-training quantization (PTQ) toolkit for LLMs and VLMs, developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, which preserves accuracy at very low bit widths (down to 2 bits) with little calibration time. The toolkit runs on CPU, Intel GPU (XPU), HPU, and CUDA back-ends and exports to several popular quantization formats, including auto_round, auto_awq, auto_gptq, and gguf, so quantized models can be used with Transformers, vLLM, SGLang, or llm-compressor without re-quantization. Three recipes are provided: auto-round (the default balance of speed and accuracy), auto-round-best (highest accuracy, 4–5× slower than the default), and auto-round-light (fastest, a 2–3× speedup over the default).
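
Choosing among the three recipes can be reduced to a small lookup. The sketch below is only an illustration of that trade-off: the priority labels (accuracy, speed, balanced) and the pick_recipe helper are hypothetical names, not part of the tool, and the script prints the resulting command instead of running it.

```shell
#!/bin/sh
# pick_recipe PRIORITY -> prints the command name for that priority.
# The priority labels are made up for this sketch.
pick_recipe() {
    case "$1" in
        accuracy) printf '%s\n' "auto-round-best" ;;   # highest accuracy, slowest
        speed)    printf '%s\n' "auto-round-light" ;;  # fastest
        *)        printf '%s\n' "auto-round" ;;        # default balance
    esac
}

recipe=$(pick_recipe "${1:-balanced}")
# Show the command that would be run for the chosen recipe.
printf '%s --model %s --scheme %s\n' "$recipe" "Qwen/Qwen3-0.6B" "W4A16"
```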

Parameters

--model _MODEL_: Model identifier or local path (e.g. _Qwen/Qwen3-0.6B_).
--scheme _SCHEME_: Quantization scheme such as _W4A16_, _W2A16_, _W8A16_.
--bits _N_: Weight bit width: 2, 3, 4, or 8.
--group_size _N_: Quantization group size (e.g. 32, 64, 128).
--format _FORMAT_: Export format(s), comma-separated: _auto_round_, _auto_gptq_, _auto_awq_, _gguf:q4_k_m_, etc.
--output_dir _PATH_: Directory where the quantized model is written.
--dataset _SPEC_: Calibration data, given as a local path or a Hugging Face dataset name. Supports the suffixes _name:num=N_, _:concat=True_, and _:apply_chat_template_, as well as comma-separated lists of datasets.
--iters _N_: Number of tuning iterations (_0_ selects round-to-nearest (RTN) without tuning; default _200_; up to _1000_ for best accuracy).
--bs _N_: Batch size (default 8).
--seqlen _N_: Calibration sequence length (default 2048).
--nsamples _N_: Number of calibration samples (default 128; up to 512 for best accuracy).
--lr _RATE_: Learning rate.
--device_map _SPEC_: GPU assignment, e.g. _auto_ or _0,1,2,3_.
--low_gpu_mem_usage: Reduce VRAM usage at the cost of longer quantization time.
--enable_torch_compile: Use torch.compile (requires PyTorch 2.6+).
--quant_lm_head: Also quantize the language-model head (auto_round format only).
--adam: Use the AdamW optimizer instead of signed gradient descent.
--eval: Evaluate the model after quantization.
--eval_backend _BACKEND_: Evaluation engine: _vllm_, or Hugging Face (the default).
--tasks _LIST_: Comma-separated lm-eval-harness tasks (e.g. _mmlu,lambada_openai_).
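
The --dataset syntax and the tuning flags above combine as in the following sketch. It is a dry run that only prints the command. The join_datasets helper is hypothetical, and the dataset names and sample counts are placeholders that are not guaranteed to be supported by your installed version.

```shell
#!/bin/sh
# join_datasets NAME:COUNT ... -> prints "NAME:num=COUNT,NAME:num=COUNT,..."
# Hypothetical helper for this sketch; it only builds the --dataset string.
join_datasets() {
    spec=""
    for entry in "$@"; do
        name=${entry%:*}     # text before the last colon
        count=${entry##*:}   # text after the last colon
        spec="${spec:+$spec,}${name}:num=${count}"
    done
    printf '%s\n' "$spec"
}

# Dataset names below are placeholders for whatever calibration sets you use.
DATASET=$(join_datasets "NeelNanda/pile-10k:256" "mbpp:128")

# Longer tuning (--iters 1000) trades time for accuracy; other values are the defaults.
tune_cmd="auto-round --model Qwen/Qwen3-0.6B --bits 4 --group_size 128 --dataset $DATASET --iters 1000 --seqlen 2048 --bs 8"
printf '%s\n' "$tune_cmd"
```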

FAQ

What is the auto-round command used for?

auto-round quantizes the weights of LLMs and VLMs after training (weight-only PTQ), down to 2 bits, and exports the result in formats such as auto_round, auto_awq, auto_gptq, and gguf. It is developed by Intel and ships three recipes: auto-round, auto-round-best, and auto-round-light.

How do I run a basic auto-round example?

Run `auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"` in a terminal, replacing the bracketed placeholders with your own model, quantization scheme, and export format.

What does --model _MODEL_ do in auto-round?

It specifies the model to quantize, given as a model identifier or a local path (e.g. _Qwen/Qwen3-0.6B_).
