auto-round command
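After copying a command, replace the bracketed placeholders (file names, directories, or parameters) as needed.
Common examples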
Quantize a model
auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"
Use the best recipe
auto-round-best --model [model_id] --scheme "[W4A16]"
Use the light recipe
auto-round-light --model [model_id] --scheme "[W4A16]"
Quantize to 4-bit and export to several formats
auto-round --model [model_id] --bits 4 --group_size 128 --format "[auto_round,auto_awq,auto_gptq]" --output_dir [path/to/output]
Quantize without calibration (RTN)
auto-round --model [model_id] --bits 4 --iters 0
Quantize across multiple GPUs
auto-round --model [model_id] --device_map "[0,1,2,3]"
Evaluate a quantized model
auto-round --model [path/to/quantized] --eval --tasks [mmlu,lambada_openai]
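The three recipe commands above differ only in the executable name, so a small wrapper can pick one by name. The following is a minimal dry-run sketch in POSIX shell: the helper names `recipe_bin` and `quant_cmd` are hypothetical (they are not part of auto-round), and the script only prints the command instead of running it.

```sh
#!/bin/sh
# Hypothetical helper: map a recipe name to the executable shown in the examples.
recipe_bin() {
    case "$1" in
        default) echo "auto-round" ;;
        best)    echo "auto-round-best" ;;
        light)   echo "auto-round-light" ;;
        *)       echo "unknown recipe: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (do not run) the command for RECIPE MODEL SCHEME.
quant_cmd() {
    bin=$(recipe_bin "$1") || return 1
    printf '%s --model %s --scheme %s\n' "$bin" "$2" "$3"
}

# Dry run with the example model and scheme from this page.
quant_cmd light Qwen/Qwen3-0.6B W4A16
```

Printing first makes it easy to review the command; once it looks right, the output can be piped to `sh` to execute it.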
Description
auto-round is a weight-only post-training quantization (PTQ) toolkit for LLMs and VLMs, developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, which keeps accuracy high at very low bit widths (down to 2 bits) while needing little calibration time. It runs on CPU, Intel GPU (XPU), HPU, and CUDA back ends and exports to several quantization formats, including auto_round, auto_awq, auto_gptq, and gguf, so quantized models can be loaded by Transformers, vLLM, SGLang, or llm-compressor without re-quantization. Three recipes are provided: auto-round (the default trade-off between speed and accuracy), auto-round-best (highest accuracy, 4–5× slower than the default), and auto-round-light (fastest, a 2–3× speedup over the default).
Parameters
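To make "weight rounding" concrete, here is a toy sketch of plain round-to-nearest (RTN) quantization, the untuned baseline that the tuning described above improves on. It is a generic asymmetric min-max scheme written for illustration, not auto-round's actual implementation; the function name `rtn_quantize` and the sample weights are assumptions. Each group of weights gets its own scale and zero point, and every weight is snapped to the nearest of 2^bits integer codes.

```sh
#!/bin/sh
# Toy group-wise round-to-nearest (RTN) quantizer; reads one weight per line.
# Usage: rtn_quantize BITS GROUP_SIZE
rtn_quantize() {
    awk -v bits="$1" -v gs="$2" '
        # Round half away from zero (awk has no built-in round).
        function rnd(x) { return (x >= 0) ? int(x + 0.5) : -int(-x + 0.5) }

        # Quantize the buffered group and print "code dequantized_value".
        function emit_group(    i, lo, hi, qmax, scale, zp, q) {
            if (n == 0) return
            lo = w[1]; hi = w[1]
            for (i = 2; i <= n; i++) {
                if (w[i] < lo) lo = w[i]
                if (w[i] > hi) hi = w[i]
            }
            qmax  = 2 ^ bits - 1                        # largest integer code
            scale = (hi > lo) ? (hi - lo) / qmax : 1    # step size for this group
            zp    = rnd(-lo / scale)                    # code that represents 0.0
            for (i = 1; i <= n; i++) {
                q = rnd(w[i] / scale) + zp
                if (q < 0) q = 0                        # clip to the code range
                if (q > qmax) q = qmax
                printf "%d %.4f\n", q, (q - zp) * scale
            }
            n = 0
        }

        { w[++n] = $1 + 0; if (n == gs) emit_group() }
        END { emit_group() }
    '
}

# Six sample weights, 2-bit codes, groups of three.
printf '%s\n' -1.0 0.2 2.0 0.5 1.0 2.0 | rtn_quantize 2 3
```

With 2 bits and groups of three, the first group has a step size of 1.0, so the weight 0.2 is reconstructed as 0.0. That per-weight rounding error is what a tuned rounding and clipping scheme tries to reduce.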
- --model _MODEL_: Model identifier or local path (e.g. _Qwen/Qwen3-0.6B_).
- --scheme _SCHEME_: Quantization scheme such as _W4A16_, _W2A16_, or _W8A16_.
- --bits _N_: Weight bit width: 2, 3, 4, or 8.
- --group_size _N_: Quantization group size (e.g. 32, 64, 128).
- --format _FORMAT_: Export format(s), comma-separated: _auto_round_, _auto_gptq_, _auto_awq_, _gguf:q4_k_m_, etc.
- --output_dir _PATH_: Directory where the quantized model is written.
- --dataset _SPEC_: Calibration data (local path or Hugging Face dataset). Supports _name:num=N_, _:concat=True_, _:apply_chat_template_, and comma-separated lists.
- --iters _N_: Tuning iterations (_0_ for RTN, default _200_, up to _1000_ for best accuracy).
- --bs _N_: Batch size (default 8).
- --seqlen _N_: Calibration sequence length (default 2048).
- --nsamples _N_: Number of calibration samples (default 128, up to 512 for best accuracy).
- --lr _RATE_: Learning rate.
- --device_map _SPEC_: GPU assignment, e.g. _auto_ or _0,1,2,3_.
- --low_gpu_mem_usage: Reduce VRAM usage at the cost of longer run time.
- --enable_torch_compile: Use torch.compile (requires PyTorch 2.6+).
- --quant_lm_head: Also quantize the language-model head (auto_round format only).
- --adam: Use the AdamW optimizer instead of signed gradient descent.
- --eval: Evaluate the model after quantization.
- --eval_backend _BACKEND_: Evaluation engine: _vllm_, or the default Hugging Face backend.
- --tasks _LIST_: Comma-separated lm-eval-harness tasks (e.g. _mmlu,lambada_openai_).
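The flags above combine naturally into a sweep. As a sketch, the script below prints one RTN command (`--iters 0`) per supported bit width, using only flags listed on this page. The function name `sweep_cmds`, the output root `./quantized`, and the `w<bits>-rtn` directory naming are hypothetical choices for illustration.

```sh
#!/bin/sh
# Dry run: print one auto-round command per bit width instead of executing it.
MODEL="Qwen/Qwen3-0.6B"   # example model ID from this page
OUT_ROOT="./quantized"    # hypothetical output root

sweep_cmds() {
    for bits in 2 3 4 8; do
        printf 'auto-round --model %s --bits %s --group_size 128 --iters 0 --output_dir %s/w%s-rtn\n' \
            "$MODEL" "$bits" "$OUT_ROOT" "$bits"
    done
}

sweep_cmds
```

Removing `--iters 0` from the printed commands would fall back to the default tuning iterations instead of RTN.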
FAQ
How do I run a basic auto-round example?
Run `auto-round --model [Qwen/Qwen3-0.6B] --scheme "[W4A16]" --format "[auto_round]"` in a terminal, replacing the bracketed values with your own model, scheme, and export format.