torchrun Command: Examples, Options, and Usage

Common examples

Run distributed training on a single node with 4 processes

torchrun --standalone --nproc_per_node=4 [train.py]

Run multi-node training (run on every node, with a different --node_rank on each, starting at 0)

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=[0] --rdzv_endpoint=[master_ip:29500] [train.py]

Run with a specific rendezvous backend

torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=[master_ip:29500] [train.py]

Run with fault tolerance (restart the worker group up to 3 times on failure)

torchrun --nnodes=2 --nproc_per_node=4 --max_restarts=3 --rdzv_backend=c10d --rdzv_endpoint=[master_ip:29500] [train.py]

Run single-GPU training

torchrun --standalone --nproc_per_node=1 [train.py]
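
The fault-tolerance example only helps if the training script can resume. When a worker fails and restarts remain, torchrun stops and restarts the whole worker group, so the script runs again from the top. The following is a minimal sketch of a resumable loop, not part of torchrun itself: the function names, the JSON checkpoint file, and the fail_at parameter are hypothetical, and a real script would more likely save model and optimizer state with torch.save.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_checkpoint(path, state):
    """Write to a temp file and rename it, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path, total_epochs, fail_at=None):
    """Hypothetical training loop that resumes after the last completed epoch."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            # Stand-in for a real worker crash (assumption, for illustration only).
            raise RuntimeError("simulated worker failure")
        # ... one epoch of real training would go here ...
        state["epoch"] = epoch + 1
        # Only global rank 0 writes; RANK is set by torchrun ("0" if run directly).
        if os.environ.get("RANK", "0") == "0":
            save_checkpoint(path, state)
    return state["epoch"]
```

After a restart, train() picks up at the epoch recorded in the checkpoint instead of epoch 0.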

Description

torchrun is PyTorch's distributed training launcher and replaces the deprecated torch.distributed.launch. Each invocation spawns the worker processes for one node (typically one per GPU) and sets up the distributed environment for them; for multi-node jobs, run the same command on every node. The launcher itself is agnostic to the parallelism strategy, so it works with scripts that use Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), tensor parallelism, or hybrid approaches. It automatically sets environment variables such as RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT, which the training script reads to initialize distributed communication.

For single-node multi-GPU training, use --standalone mode. For multi-node training, all nodes must specify the same rendezvous endpoint, which is where they coordinate. The launcher also supports elastic training, where the number of nodes can vary within a range, and fault tolerance, where the worker group is restarted automatically when a worker fails.
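
As an illustration of how a training script can consume the variables listed above, here is a minimal sketch. The DistEnv class, the read_dist_env helper, and the fallback values used when the script is run without torchrun are assumptions made for this example, not part of PyTorch.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    rank: int         # global rank of this process
    local_rank: int   # rank of this process within its node
    world_size: int   # total number of processes in the job
    master_addr: str
    master_port: int


def read_dist_env(env=None):
    """Read the variables torchrun sets; the defaults are assumed single-process fallbacks."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


if __name__ == "__main__":
    dist_env = read_dist_env()
    print(f"rank {dist_env.rank} of {dist_env.world_size}, local rank {dist_env.local_rank}")
```

In a real script, local_rank is typically used to select the GPU (for example with torch.cuda.set_device), and torch.distributed.init_process_group picks up the remaining variables from the environment.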

Options

--nnodes _min:max_ or _N_: Number of nodes participating in training. Can be a range for elastic training.
--nproc_per_node _N_: Number of processes to spawn per node. Typically equals the number of GPUs on the node. Also accepts gpu, cpu, or auto.
--standalone: Single-node mode without external rendezvous. Sets up local rendezvous automatically.
--rdzv_backend _backend_: Rendezvous backend: static (default), c10d, etcd, or etcd-v2.
--rdzv_endpoint _host:port_: Rendezvous endpoint address. For c10d, the master node's IP and port.
--rdzv_id _id_: User-defined ID for the rendezvous group. All nodes must use the same ID.
--max_restarts _N_: Maximum number of worker group restarts on failure. Default is 0.
--node_rank _N_: Rank of this node, starting at 0 (used with static rendezvous).
--master_addr _addr_: Address of the rank 0 node. Used only with static rendezvous; prefer --rdzv_endpoint.
--master_port _port_: Port on the rank 0 node. Used only with static rendezvous; prefer --rdzv_endpoint.
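--local-addr _addr_: Address of the local node to use for connections. If omitted, the address is looked up automatically, falling back to the machine's fully qualified domain name.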
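--redirects _spec_: Redirect worker output streams to log files in the log directory. Stream values are 1 (stdout), 2 (stderr), and 3 (both). A single value applies to all workers, e.g. 3; a per-worker list such as 0:1,1:2 redirects stdout for local rank 0 and stderr for local rank 1.
--tee _spec_: Write stdout/stderr to both the console and log files. Same format as --redirects.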
--log-dir _dir_: Directory for log files when using --redirects or --tee.
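
To make the relationship between --nnodes, --nproc_per_node, and --node_rank concrete, the sketch below computes the world size and the global rank of a worker. It assumes a fixed number of nodes that all use the same --nproc_per_node and explicit node ranks, as with static rendezvous; with elastic backends, node ranks are assigned during rendezvous. The function names are hypothetical.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of worker processes in the job."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a worker, assuming every node runs nproc_per_node workers."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # Matches the multi-node examples: 2 nodes with 4 processes each.
    print(world_size(2, 4))      # total workers
    print(global_rank(1, 4, 2))  # third worker on the second node
```

With 2 nodes and 4 processes per node there are 8 workers, and the worker with local rank 2 on node 1 has global rank 6.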

FAQ

What is the torchrun command used for?

torchrun launches PyTorch distributed training jobs. It starts the worker processes on each node (typically one per GPU) and sets the environment variables they need to find each other, such as RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT. It replaces the deprecated torch.distributed.launch and adds elastic training and automatic restarts on worker failure.

How do I run a basic torchrun example?

Run `torchrun --standalone --nproc_per_node=4 [train.py]` in a terminal, replacing [train.py] with the path to your training script and 4 with the number of GPUs on the machine.

What does --nnodes do in torchrun?

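It sets the number of nodes participating in training. A single value _N_ fixes the node count; a range _min:max_ enables elastic training, where the job can run with any number of nodes within that range.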
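
The two accepted forms can be illustrated with a small parser. This is a sketch of the value format only, not torchrun's own implementation, and parse_nnodes is a hypothetical name.

```python
def parse_nnodes(spec):
    """Turn an --nnodes value ("N" or "min:max") into a (min_nodes, max_nodes) pair."""
    parts = str(spec).split(":")
    if len(parts) == 1:
        low = high = int(parts[0])
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])
    else:
        raise ValueError(f"expected N or min:max, got {spec!r}")
    if not 0 < low <= high:
        raise ValueError(f"need 0 < min <= max, got {spec!r}")
    return low, high
```

A fixed count such as 2 yields the pair (2, 2), while 1:4 yields (1, 4).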
