llama.cpp Command: Examples, Options, and Usage

Common examples

Run interactive chat

./main -m [model.gguf] -i

Generate text from a prompt

./main -m [model.gguf] -p "[Your prompt here]"

Set context size

./main -m [model.gguf] -c [4096] -p "[prompt]"

Use multiple threads

./main -m [model.gguf] -t [8] -p "[prompt]"

Start the HTTP server

./server -m [model.gguf] --port [8080]

Quantize a model

./quantize [model.gguf] [output.gguf] [q4_0]
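
Once a server has been started with `./server -m [model.gguf] --port [8080]`, it can be queried over HTTP. The sketch below assumes the server's `/completion` endpoint and its `prompt` and `n_predict` JSON fields; verify both against the documentation for your build. The helper names `payload` and `query` are hypothetical, and the port matches the example value used here.

```shell
#!/bin/sh
# Assumed port: must match the --port value passed to ./server.
PORT=8080

# Build a minimal JSON request body.
# Note: the prompt is inserted as-is, so it must not contain double quotes
# or backslashes (this sketch does no JSON escaping).
payload() {
  printf '{"prompt": "%s", "n_predict": %d}\n' "$1" "$2"
}

# Send the request to the assumed /completion endpoint.
# Requires curl and a running server; nothing is sent until query is called.
query() {
  curl -s "http://localhost:$PORT/completion" \
    -H "Content-Type: application/json" \
    -d "$(payload "$1" "$2")"
}
```

With the server running, `query "Hello" 32` sends the request and prints the server's response; `payload "Hello" 32` alone prints the request body for inspection.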

Description

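llama.cpp is a C/C++ implementation of inference for Meta's LLaMA models and other large language models, designed to run efficiently on CPUs and GPUs. It supports several quantization formats, which reduce memory use enough to run LLMs on consumer hardware. The project includes tools for model conversion, quantization, and serving. Newer builds name the same tools llama-cli, llama-server, and llama-quantize; the examples here use the older main, server, and quantize names.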

Parameters

-m _model_: Path to the GGUF model file to load.
-p _prompt_: Input prompt to generate from.
-i: Run in interactive mode.
-c _size_: Context size, in tokens.
-t _threads_: Number of CPU threads to use.
-n _tokens_: Maximum number of tokens to generate.
--temp _temp_: Sampling temperature; higher values give more varied output.
-ngl _layers_: Number of model layers to offload to the GPU.
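
The flags above can be combined in a single invocation. The sketch below assembles such a command and prints it rather than running it, so the values can be reviewed first. Every value (model path, context size, thread count, token limit, temperature, layer count) is a hypothetical placeholder, and `llama_cmd` is a helper name invented for this example.

```shell
#!/bin/sh
# Hypothetical values; replace each with one suited to your model and hardware.
MODEL="model.gguf"   # path to a GGUF file (assumed name)
CTX=4096             # -c: context size in tokens
THREADS=8            # -t: CPU threads
N_PREDICT=128        # -n: maximum tokens to generate
TEMP=0.7             # --temp: sampling temperature
NGL=32               # -ngl: layers to offload to the GPU

# Print the full command line for a given prompt (dry run, nothing is executed).
llama_cmd() {
  echo ./main -m "$MODEL" -c "$CTX" -t "$THREADS" -n "$N_PREDICT" \
    --temp "$TEMP" -ngl "$NGL" -p "\"$1\""
}

llama_cmd "Explain GGUF in one sentence."
```

To run the command instead of printing it, drop the `echo` and the escaped quotes around `$1`, leaving `-p "$1"`.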

FAQ

What is the llama.cpp command used for?

It runs LLaMA and other GGUF-format language models locally on a CPU or GPU, and it includes tools for converting, quantizing, and serving those models.

How do I run a basic llama.cpp example?

Run `./main -m [model.gguf] -i` in a terminal, replacing `[model.gguf]` with the path to your GGUF model file. This starts an interactive chat session.

What does -m _model_ do in llama.cpp?

It sets the path to the GGUF model file that llama.cpp loads.
