torchrun command
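After copying a command, replace the file names, directories, or parameters (shown in square brackets) as needed.
Common examples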
Run distributed training
torchrun --standalone --nproc_per_node=4 [train.py]
Run multi-node training (run on each node, giving each a distinct --node_rank starting at 0)
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=[0] --rdzv_endpoint=[master_ip:29500] [train.py]
Run with specific rendezvous backend
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=[master_ip:29500] [train.py]
Run with fault tolerance
torchrun --nnodes=2 --nproc_per_node=4 --max_restarts=3 --rdzv_id=[job_id] --rdzv_backend=c10d --rdzv_endpoint=[master_ip:29500] [train.py]
Run single GPU training
torchrun --standalone --nproc_per_node=1 [train.py]
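The process counts implied by the examples above can be worked out with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the contiguous rank layout assumes a fixed node count with the same --nproc_per_node on every node. Under elastic rendezvous the launcher assigns ranks, so a script should read RANK from the environment rather than compute it.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of worker processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank under a contiguous per-node layout (an assumption)."""
    return node_rank * nproc_per_node + local_rank


print(world_size(2, 4))      # 8 workers for --nnodes=2 --nproc_per_node=4
print(global_rank(1, 4, 2))  # 6: third worker on the second node
```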
Description
torchrun is PyTorch's distributed training launcher; it replaces the deprecated torch.distributed.launch. It spawns multiple processes across GPUs and nodes and sets up the distributed environment for training neural networks at scale. Scripts launched this way can use various distributed strategies, including Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), tensor parallelism, and hybrid approaches. torchrun automatically sets environment variables such as RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT for distributed communication. For single-node multi-GPU training, use --standalone mode. For multi-node training, all nodes must specify the same rendezvous endpoint, which is where they coordinate. The launcher supports elastic training with dynamic node counts and fault tolerance with automatic restarts when workers fail.
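As a minimal sketch of how a training script might consume the environment variables listed above, the helper below reads them using only the standard library. The names `DistEnv` and `read_dist_env` are illustrative and not part of PyTorch, and the fallback values for a launch without torchrun are an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Distributed settings that torchrun exports to each worker process."""
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(environ=None):
    """Read the variables set by torchrun.

    The defaults (one process on localhost:29500) are an assumption so the
    script also runs when started with plain `python train.py`.
    """
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


if __name__ == "__main__":
    cfg = read_dist_env()
    print(f"worker {cfg.rank} of {cfg.world_size} (local rank {cfg.local_rank})")
```

In a real training script, the local rank is typically used to select the GPU for the process before the process group is initialized.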
Parameters
- --nnodes _min_:_max_ or _N_
- Number of nodes participating in training. Can be a range for elastic training.
- --nproc_per_node _N_
- Number of processes to spawn per node. Typically equals the number of GPUs.
- --standalone
- Single-node mode without external rendezvous. Sets up local rendezvous automatically.
- --rdzv_backend _backend_
- Rendezvous backend: static (default), c10d (recommended for multi-node and elastic jobs), etcd, or etcd-v2.
- --rdzv_endpoint _host:port_
- Rendezvous endpoint address. For c10d, the master node's IP and port.
- --rdzv_id _id_
- User-defined ID for the rendezvous group. All nodes must use the same ID.
- --max_restarts _N_
- Maximum number of worker group restarts on failure. Default is 0.
- --node_rank _N_
- Rank of this node (for static rendezvous).
- --master_addr _addr_
- Master node address (legacy, use --rdzv_endpoint instead).
- --master_port _port_
- Master node port (legacy, use --rdzv_endpoint instead).
- --local-addr _addr_
- Address of the local node. If omitted, torchrun looks up the node's address itself, falling back to the machine's fully qualified domain name.
- --redirects _N_
- Redirect worker output streams to log files. Stream values: 1 = stdout, 2 = stderr, 3 = both. A single value (e.g. 3) applies to all workers; 0:1,1:2 redirects stdout for local rank 0 and stderr for local rank 1.
- --tee _N_
- Tee stdout/stderr to both console and log files. Same format as --redirects.
- --log-dir _dir_
- Directory for log files when using --redirects or --tee.
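To make the --redirects and --tee value format concrete, here is a small parser sketch. It is a hypothetical helper written for illustration, not torchrun's own code; `Stream` and `parse_stream_spec` are invented names, and the numeric values follow the stream numbering described above.

```python
from enum import IntEnum


class Stream(IntEnum):
    """Stream selector values accepted by --redirects and --tee."""
    NONE = 0
    OUT = 1  # stdout
    ERR = 2  # stderr
    ALL = 3  # stdout and stderr


def parse_stream_spec(spec):
    """Parse '3' (applies to all workers) or '0:1,1:2' (per local rank).

    Returns a single Stream for the global form, or a dict mapping
    local rank -> Stream for the per-rank form. Values outside 0..3
    raise ValueError.
    """
    if ":" not in spec:
        return Stream(int(spec))
    mapping = {}
    for item in spec.split(","):
        local_rank, value = item.split(":")
        mapping[int(local_rank)] = Stream(int(value))
    return mapping


print(parse_stream_spec("3"))
print(parse_stream_spec("0:1,1:2"))
```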
FAQ
What is the torchrun command used for?
torchrun launches PyTorch distributed training jobs, replacing the deprecated torch.distributed.launch. It starts one or more worker processes per node, sets the environment variables they need to communicate (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT), and can restart workers automatically when they fail.
How do I run a basic torchrun example?
Run `torchrun --standalone --nproc_per_node=4 [train.py]` in a terminal, replacing the bracketed script name and adjusting the process count to match the GPUs on your machine.
What does --nnodes _min_:_max_ or _N_ do in torchrun?
Number of nodes participating in training. Can be a range for elastic training.
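The two accepted forms of the --nnodes value can be illustrated with a short parser sketch. `parse_nnodes` is a hypothetical helper, not torchrun's implementation, and the validation rules (at least one node, minimum not above maximum) are assumptions made for the example.

```python
def parse_nnodes(value):
    """Parse 'N' or 'min:max' into a (min_nodes, max_nodes) tuple."""
    parts = value.split(":")
    if len(parts) == 1:
        low = high = int(parts[0])  # fixed-size job
    elif len(parts) == 2:
        low, high = int(parts[0]), int(parts[1])  # elastic range
    else:
        raise ValueError(f"expected N or min:max, got {value!r}")
    if not 1 <= low <= high:
        raise ValueError(f"invalid node range: {value!r}")
    return low, high


print(parse_nnodes("2"))    # (2, 2)
print(parse_nnodes("1:4"))  # (1, 4)
```

A fixed value means the job waits for exactly that many nodes, while a range lets the job run with any node count between the two bounds.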