Linux command

auto-round command

After copying, replace file names, directories, or parameters as needed.

Common examples

Quantize a model

auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"

Use the best recipe

auto-round-best --model [model_id] --scheme "[W4A16]"

Use the light recipe

auto-round-light --model [model_id] --scheme "[W4A16]"

Quantize to 4-bit and export to several formats

auto-round --model [model_id] --bits 4 --group_size 128 --format "[auto_round,auto_awq,auto_gptq]" --output_dir [path/to/output]

Calibration-free

auto-round --model [model_id] --bits 4 --iters 0

Multi-GPU

auto-round --model [model_id] --device_map "[0,1,2,3]"

Evaluate

auto-round --model [path/to/quantized] --eval --tasks [mmlu,lambada_openai]
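
The quantize and evaluate examples above can be chained into one script. The following is a minimal sketch, not an official workflow: the default model, the `./quantized` directory, and the `DRY_RUN`, `QUANT_PATH`, and `run` names are assumptions made for illustration, and it assumes `--scheme`, `--format`, and `--output_dir` can be combined in one call. By default it only prints the commands, so it is safe to try before committing to a long job.

```shell
#!/bin/sh
set -eu

# Hypothetical defaults for illustration; override them from the environment.
MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
OUT_DIR="${OUT_DIR:-./quantized}"
# Assumption: the exported model can be loaded from OUT_DIR itself.
# If the tool writes into a subdirectory, point QUANT_PATH there instead.
QUANT_PATH="${QUANT_PATH:-$OUT_DIR}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = also run them

run() {
  # Print the command line, then execute it only when DRY_RUN=0.
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Step 1: quantize with the W4A16 scheme and export in auto_round format.
run auto-round --model "$MODEL" --scheme "W4A16" --format "auto_round" --output_dir "$OUT_DIR"

# Step 2: evaluate the quantized model on lm-eval-harness tasks.
run auto-round --model "$QUANT_PATH" --eval --tasks mmlu,lambada_openai
```

Set `DRY_RUN=0` once the printed commands look right for your system.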

Description

auto-round is a weight-only post-training quantization (PTQ) toolkit for LLMs and VLMs, developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, which preserves accuracy at very low bit widths (down to 2 bits) with little calibration time. It runs on CPU, Intel GPU (XPU), HPU, and CUDA back ends and exports to several common formats, including auto_round, auto_awq, auto_gptq, and gguf, so the quantized model can be used with Transformers, vLLM, SGLang, or llm-compressor without re-quantization. Three recipes are provided: auto-round (the default balance of speed and accuracy), auto-round-best (highest accuracy, 4–5× slower than the default), and auto-round-light (fastest, 2–3× faster than the default).
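
To compare the three recipes on the same model, one option is to loop over the recipe commands and give each its own output directory. This is a sketch under assumptions: `OUT_ROOT`, `DRY_RUN`, and `quantize_with` are hypothetical names, and it assumes the recipe variants accept `--output_dir` in the same way as `auto-round`.

```shell
#!/bin/sh
set -eu

MODEL="${MODEL:-Qwen/Qwen3-0.6B}"   # hypothetical default model
OUT_ROOT="${OUT_ROOT:-./recipes}"   # hypothetical output root
DRY_RUN="${DRY_RUN:-1}"             # 1 = print only, 0 = also run

quantize_with() {
  # $1 is the recipe executable; each recipe gets its own output directory.
  set -- "$1" --model "$MODEL" --scheme "W4A16" --output_dir "$OUT_ROOT/$1"
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Fastest recipe first, slowest last.
for recipe in auto-round-light auto-round auto-round-best; do
  quantize_with "$recipe"
done
```

Running the fastest recipe first gives an early result to inspect while the slower ones are still tuning.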

Parameters

--model _MODEL_
Model identifier or local path (e.g. _Qwen/Qwen3-0.6B_).
--scheme _SCHEME_
Quantization scheme such as _W4A16_, _W2A16_, _W8A16_.
--bits _N_
Weight bit width: 2, 3, 4, or 8.
--group_size _N_
Quantization group size (e.g. 32, 64, 128).
--format _FORMAT_
Export format(s), comma-separated: _auto_round_, _auto_gptq_, _auto_awq_, _gguf:q4_k_m_, etc.
--output_dir _PATH_
Directory where the quantized model is written.
--dataset _SPEC_
Calibration data (local path or HuggingFace dataset). Supports _name:num=N_, _:concat=True_, _:apply_chat_template_, and comma-separated lists.
--iters _N_
Tuning iterations (_0_ for round-to-nearest (RTN) without tuning, default _200_, up to _1000_ for best accuracy).
--bs _N_
Batch size (default 8).
--seqlen _N_
Calibration sequence length (default 2048).
--nsamples _N_
Number of calibration samples (default 128, up to 512 for best accuracy).
--lr _RATE_
Learning rate.
--device_map _SPEC_
GPU assignment, e.g. _auto_ or _0,1,2,3_.
--low_gpu_mem_usage
Reduce VRAM usage at the cost of longer run time.
--enable_torch_compile
Use torch.compile (requires PyTorch 2.6+).
--quant_lm_head
Also quantize the language-model head (auto_round format only).
--adam
Use the AdamW optimizer instead of signed gradient descent.
--eval
Evaluate the model after quantization.
--eval_backend _BACKEND_
Evaluation engine: _vllm_ or Hugging Face (the default).
--tasks _LIST_
Comma-separated lm-eval-harness tasks (e.g. _mmlu,lambada_openai_).
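
Several of the parameters above are usually set together. The sketch below assembles one command from them, writing out the documented defaults for `--iters`, `--nsamples`, `--seqlen`, and `--bs` so they are easy to change. The environment variable names (`DATASET`, `LOW_MEM`, `DRY_RUN`) and the `quantize` function are hypothetical, and the script assumes these flags can be combined freely. It prints the command by default instead of running it.

```shell
#!/bin/sh
set -eu

MODEL="${MODEL:-Qwen/Qwen3-0.6B}"   # hypothetical default
OUT_DIR="${OUT_DIR:-./quantized}"   # hypothetical default
DATASET="${DATASET:-}"              # optional spec, e.g. of the form name:num=N
LOW_MEM="${LOW_MEM:-0}"             # 1 adds --low_gpu_mem_usage
DRY_RUN="${DRY_RUN:-1}"             # 1 = print only, 0 = also run

quantize() {
  # Weight settings: 4-bit weights with group size 128.
  set -- auto-round --model "$MODEL" --bits 4 --group_size 128
  # Tuning budget, spelled out with the defaults listed above.
  set -- "$@" --iters 200 --nsamples 128 --seqlen 2048 --bs 8
  # Optional flags are appended only when requested.
  if [ -n "$DATASET" ]; then set -- "$@" --dataset "$DATASET"; fi
  if [ "$LOW_MEM" = "1" ]; then set -- "$@" --low_gpu_mem_usage; fi
  set -- "$@" --format "auto_round" --output_dir "$OUT_DIR"

  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

quantize
```

Saved under a hypothetical name such as `quantize.sh`, it could be invoked as `DATASET="[dataset_name]:num=256" LOW_MEM=1 sh quantize.sh` to preview the resulting command.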
