Linux command

torchrun command

After copying a command, replace the bracketed file names, directories, or parameters as needed.

Common examples

Run distributed training

torchrun --standalone --nproc_per_node=4 [train.py]

Run multi-node training

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=[node_rank] --rdzv_endpoint=[master_ip:29500] [train.py]

Run with specific rendezvous backend

torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=[master_ip:29500] [train.py]

Run with fault tolerance

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=[node_rank] --max_restarts=3 --rdzv_endpoint=[master_ip:29500] [train.py]

Run single GPU training

torchrun --standalone --nproc_per_node=1 [train.py]
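
The commands above differ only in a handful of flags, so they can be assembled programmatically when a job script needs to launch them. The following is a minimal sketch, not part of PyTorch: `build_torchrun_cmd` and its parameters are hypothetical names, and it assumes the multi-node case uses the c10d backend as in the third example.

```python
import shlex
from typing import List, Optional


def build_torchrun_cmd(
    script: str,
    nproc_per_node: int,
    nnodes: int = 1,
    rdzv_endpoint: Optional[str] = None,
    max_restarts: int = 0,
) -> List[str]:
    """Assemble a torchrun argument list (hypothetical helper, not a PyTorch API)."""
    cmd = ["torchrun"]
    if nnodes == 1 and rdzv_endpoint is None:
        # Single node, no external rendezvous: standalone mode.
        cmd.append("--standalone")
    else:
        # Multi-node: every node must be given the same endpoint.
        cmd.append(f"--nnodes={nnodes}")
        cmd.append("--rdzv_backend=c10d")
        if rdzv_endpoint is not None:
            cmd.append(f"--rdzv_endpoint={rdzv_endpoint}")
    cmd.append(f"--nproc_per_node={nproc_per_node}")
    if max_restarts > 0:
        cmd.append(f"--max_restarts={max_restarts}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line for the single-node, 4-process case.
    print(shlex.join(build_torchrun_cmd("train.py", nproc_per_node=4)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting problems with paths that contain spaces.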

Description

torchrun is PyTorch's distributed training launcher and replaces the deprecated torch.distributed.launch. It spawns one or more worker processes on each node and sets up the distributed environment that each worker needs. The launcher does not depend on the training strategy: it works with Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), tensor parallelism, and hybrid approaches. For each worker it sets environment variables such as RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT, which the training script reads to join the process group.

For single-node multi-GPU training, use --standalone. For multi-node training, run torchrun on every node with the same rendezvous endpoint and the same --rdzv_id so that the nodes can find each other. torchrun also supports elastic training, where the node count may vary within a range, and fault tolerance, where the worker group is restarted automatically when a worker fails.
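
To illustrate how a training script might consume the environment variables that torchrun sets, here is a minimal standard-library sketch. `DistEnv`, `read_dist_env`, and `is_main_process` are hypothetical names, and the fallback values (a single process on localhost, port 29500) are assumptions chosen so the script also runs without torchrun.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int          # global rank of this worker
    local_rank: int    # rank of this worker on its own node
    world_size: int    # total number of workers in the job
    master_addr: str
    master_port: int


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the variables torchrun exports; defaults are this sketch's assumption."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


def is_main_process(dist_env: DistEnv) -> bool:
    # By convention, global rank 0 handles logging and checkpoint writes.
    return dist_env.rank == 0


if __name__ == "__main__":
    info = read_dist_env()
    print(f"worker {info.rank}/{info.world_size} (local rank {info.local_rank})")
```

In a real script, `local_rank` is typically used to select the GPU (for example with `torch.cuda.set_device`) before calling `torch.distributed.init_process_group`.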

Parameters

--nnodes _min_:_max_ or _N_
Number of nodes participating in training. Can be a range for elastic training.
--nproc_per_node _N_
Number of worker processes to spawn on each node, typically one per GPU. Also accepts gpu, cpu, or auto in place of a number.
--standalone
Single-node mode without external rendezvous. Sets up local rendezvous automatically.
--rdzv_backend _backend_
Rendezvous backend: static (default), c10d, etcd, or etcd-v2.
--rdzv_endpoint _host:port_
Rendezvous endpoint address. For c10d, the master node's IP and port.
--rdzv_id _id_
User-defined ID for the rendezvous group. All nodes must use the same ID.
--max_restarts _N_
Maximum number of worker group restarts on failure. Default is 0.
--node_rank _N_
Rank of this node (for static rendezvous).
--master_addr _addr_
Master node address (legacy, use --rdzv_endpoint instead).
--master_port _port_
Master node port (legacy, use --rdzv_endpoint instead).
--local-addr _addr_
Address of the local node. If omitted, torchrun looks up the local node's address and falls back to the machine's fully qualified domain name.
--redirects _N_
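Redirect each worker's standard streams to log files in the log directory. Values: 0 (none), 1 (stdout), 2 (stderr), 3 (both). A single value such as 3 applies to all workers; a per-rank mapping such as 0:1,1:2 redirects stdout for local rank 0 and stderr for local rank 1.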
--tee _N_
Tee stdout/stderr to both console and log files. Same format as --redirects.
--log-dir _dir_
Directory for log files when using --redirects or --tee.
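
The two accepted forms of --nnodes can be made concrete with a small parser. This is a sketch of the format described above, not torchrun's own implementation; `parse_nnodes` and `world_size_range` are hypothetical names, and the world-size calculation assumes every node runs the same number of processes.

```python
from typing import Tuple


def parse_nnodes(spec: str) -> Tuple[int, int]:
    """Parse "N" or "min:max" into a (min_nodes, max_nodes) pair."""
    parts = spec.split(":")
    if len(parts) == 1:
        # Fixed node count: minimum and maximum are the same.
        lo = hi = int(parts[0])
    elif len(parts) == 2:
        # Elastic range: the job may run with anywhere from min to max nodes.
        lo, hi = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or min:max, got {spec!r}")
    if not 0 < lo <= hi:
        raise ValueError(f"invalid node range: {spec!r}")
    return lo, hi


def world_size_range(nnodes: str, nproc_per_node: int) -> Tuple[int, int]:
    """Smallest and largest world size, assuming identical nodes."""
    lo, hi = parse_nnodes(nnodes)
    return lo * nproc_per_node, hi * nproc_per_node


if __name__ == "__main__":
    print(world_size_range("1:4", nproc_per_node=4))
```

With a range, a training script cannot hard-code the world size and should read WORLD_SIZE at startup instead.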
