Linux command
dcgmi command
Replace the bracketed placeholders (group ID, GPU ID, group name) with values for your system.
Common examples
List discovered GPUs
dcgmi discovery -l
Check the health of a GPU group
dcgmi health -g [group_id] -c
Run diagnostics at a given level (1 is the quickest)
dcgmi diag -r [level]
Stream real-time field values (150 = GPU temperature, 155 = power usage)
dcgmi dmon -e 150,155
Create a GPU group
dcgmi group -c [group_name]
Add a GPU to a group
dcgmi group -g [group_id] -a [gpu_id]
Show the topology of a GPU group
dcgmi topo -g [group_id]
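The individual commands above are often chained. The following is a minimal sketch, assuming the group already exists and that enabling all health watches is acceptable on your system. The function name `check_group` and the `DCGMI_BIN` variable are hypothetical names introduced for this example and are not part of dcgmi.

```shell
# Sketch: health check plus a diagnostic run for one existing DCGM group.
# DCGMI_BIN and check_group are illustrative names, not part of dcgmi.
DCGMI_BIN="${DCGMI_BIN:-dcgmi}"

check_group() {
    local group_id="$1"
    local diag_level="${2:-1}"   # defaults to level 1, the quickest run

    # Enable all health watches for the group, then run the check.
    "$DCGMI_BIN" health -g "$group_id" -s a || return 1
    "$DCGMI_BIN" health -g "$group_id" -c || return 1

    # Run the diagnostic at the requested level on the same group.
    "$DCGMI_BIN" diag -g "$group_id" -r "$diag_level"
}

# Usage (assumes a group with ID 2 exists):
#   check_group 2 1
```

Pointing `DCGMI_BIN` at `echo` prints the commands instead of running them, which is a convenient way to review the sequence before touching real hardware.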
Description
dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). It provides monitoring, management, and diagnostics for NVIDIA GPUs in data center and HPC environments. Administrators use it to check GPU health, run diagnostics, track performance metrics, and organize GPUs into groups so that health checks, policies, and statistics can be applied to a whole group at once. Job schedulers and cluster management systems can call it to automate GPU management. DCGM tracks a large set of GPU metrics, including temperature, power, memory usage, and error counts, and its diagnostic subsystem can surface hardware problems before they cause failures, which supports proactive maintenance.
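As an illustration of metric tracking, the sketch below samples a few fields a fixed number of times with `dcgmi dmon`. It assumes field IDs 150 (GPU temperature) and 155 (power usage); the function name `sample_gpu_fields` and the `DCGMI_BIN` variable are hypothetical names used only for this example.

```shell
# Sketch: sample selected DCGM fields a fixed number of times.
# DCGMI_BIN and sample_gpu_fields are illustrative names, not part of dcgmi.
DCGMI_BIN="${DCGMI_BIN:-dcgmi}"

sample_gpu_fields() {
    local fields="${1:-150,155}"   # 150 = GPU temperature, 155 = power usage
    local count="${2:-5}"          # number of samples to print
    local delay_ms="${3:-1000}"    # delay between samples, in milliseconds

    "$DCGMI_BIN" dmon -e "$fields" -c "$count" -d "$delay_ms"
}

# Usage: three samples of temperature and power, half a second apart.
#   sample_gpu_fields 150,155 3 500
```

A bounded sample count keeps the command suitable for scripts and cron jobs, where an open-ended stream would never return.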
Parameters
- `discovery -l`: List the GPUs that DCGM has discovered.
- `health -g GROUP -c`: Check the health of a GPU group.
- `diag -r LEVEL`: Run diagnostics at the given level (1 is the quickest, 4 the most extensive).
- `dmon -e FIELD_IDS`: Stream the selected field values at a fixed interval.
- `group -c NAME`: Create a named GPU group.
- `topo -g GROUP`: Show the interconnect topology of a GPU group.
- `fieldgroup -c NAME`: Create a named field group for metric collection.
- `modules -l`: List the available DCGM modules and their status.
- `policy -g GROUP`: View or set policy conditions for a GPU group.
- `stats -j JOB_ID`: Display job-level GPU statistics.
- `--host HOST:PORT`: Connect to a remote DCGM host daemon (default: localhost:5555).
- `--help`: Display help information.
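The `stats` subcommand is normally used as a small workflow around a job, not as a single call. The following is a rough sketch of that workflow, assuming the group already exists and the job ID is unused. The function name `run_with_job_stats` and the `DCGMI_BIN` variable are hypothetical names introduced here.

```shell
# Sketch: record job-level GPU statistics around a workload.
# DCGMI_BIN and run_with_job_stats are illustrative names, not part of dcgmi.
DCGMI_BIN="${DCGMI_BIN:-dcgmi}"

run_with_job_stats() {
    local group_id="$1"
    local job_id="$2"
    shift 2                       # the remaining arguments are the workload

    # Enable stats watches on the group, then start recording for this job.
    "$DCGMI_BIN" stats -g "$group_id" -e || return 1
    "$DCGMI_BIN" stats -g "$group_id" -s "$job_id" || return 1

    # Run the workload and remember its exit status.
    "$@"
    local status=$?

    # Stop recording and print the job summary, even if the workload failed.
    "$DCGMI_BIN" stats -x "$job_id"
    "$DCGMI_BIN" stats -j "$job_id"
    return "$status"
}

# Usage (group 2 and ./train.sh are placeholders):
#   run_with_job_stats 2 job42 ./train.sh
```

Returning the workload's own exit status lets a scheduler prolog or epilog script still see whether the job itself succeeded.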
FAQ
What is the dcgmi command used for?
dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). Administrators use it to monitor GPU health, run diagnostics, track performance metrics, and manage GPU groups in data center and HPC environments.
How do I run a basic dcgmi example?
Run `dcgmi discovery -l` in a terminal to list the GPUs that DCGM can see. To query a remote DCGM host daemon, add `--host` with the host and port.
What does discovery -l do in dcgmi?
It lists the GPUs that DCGM has discovered.