Linux command

dcgmi command

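After copying an example, replace the bracketed placeholders (group IDs, GPU IDs, names, or other parameters) with your own values.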

Common examples

List discovered GPUs

dcgmi discovery -l

Check GPU health for a group

dcgmi health -g [group_id] -c

Run diagnostics

dcgmi diag -r [1]

Stream real-time field values

dcgmi dmon -e [field_ids]

Create a GPU group

dcgmi group -c [group_name]

Add GPU to group

dcgmi group -g [group_id] -a [gpu_id]

Show GPU topology

dcgmi topo -g [group_id]
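
The group and health examples above can be chained into one step. The following is a minimal sketch, not an official workflow: the function name `add_and_check` and the `DCGMI` override variable are hypothetical, and it assumes the group already exists (created with `dcgmi group -c`) and that health watches should be enabled for all subsystems with `-s a` before checking.

```shell
#!/bin/sh
# DCGMI is a hypothetical override so the function can be dry-run with a stub.
DCGMI="${DCGMI:-dcgmi}"

# Add each listed GPU to an existing group, enable health watches,
# then run a health check on the group.
# usage: add_and_check GROUP_ID GPU_ID...
add_and_check() {
    group_id="$1"
    shift
    for gpu_id in "$@"; do
        # Stop at the first GPU that cannot be added.
        "$DCGMI" group -g "$group_id" -a "$gpu_id" || return 1
    done
    # Enable watches for all health subsystems (assumed prerequisite).
    "$DCGMI" health -g "$group_id" -s a || return 1
    # Report the current health of the group.
    "$DCGMI" health -g "$group_id" -c
}
```

For example, `add_and_check 2 0 1` would add GPUs 0 and 1 to group 2 and then check its health. Setting `DCGMI=echo` first prints the commands instead of running them.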

Description

dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). It provides monitoring, management, and diagnostic capabilities for NVIDIA GPUs in data center and HPC environments. The tool enables administrators to monitor GPU health, run diagnostics, track performance metrics, and manage GPU groups for policy enforcement. It integrates with job schedulers and cluster management systems for automated GPU management. DCGM tracks hundreds of GPU metrics including temperature, power, memory usage, and error counts. The diagnostic subsystem can detect hardware issues before they cause failures, supporting proactive maintenance.
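
As an illustration of the scheduler integration mentioned above, a job prolog could run a quick diagnostic before GPUs are handed to a job. This is only a sketch under stated assumptions: `gpu_preflight` and the `DCGMI` override are hypothetical names, and it assumes that `dcgmi diag` exits with a non-zero status when a test fails, which should be verified against the installed DCGM version.

```shell
#!/bin/sh
# DCGMI is a hypothetical override so the function can be dry-run with a stub.
DCGMI="${DCGMI:-dcgmi}"

# Run a diagnostic before a job starts; a non-zero return means "do not schedule".
# usage: gpu_preflight [LEVEL]   (LEVEL defaults to 1, the quickest run)
gpu_preflight() {
    level="${1:-1}"
    case "$level" in
        1|2|3|4) ;;
        *)
            echo "gpu_preflight: level must be 1-4, got '$level'" >&2
            return 2
            ;;
    esac
    # Assumption: a failed diagnostic makes dcgmi exit non-zero.
    if ! "$DCGMI" diag -r "$level"; then
        echo "gpu_preflight: diagnostics failed at level $level" >&2
        return 1
    fi
}
```

A prolog script might call `gpu_preflight 1 || exit 1` so that a node with failing GPUs is rejected before the job runs.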

Parameters

discovery -l
List discovered GPUs.
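health -g _GROUP_ -c
Check the health of a GPU group.
diag -r _LEVEL_
Run diagnostics at the given level (1-4; higher levels run longer, more thorough tests).
dmon -e _FIELD_IDS_
Stream the values of the listed field IDs (comma-separated) at a regular interval.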
group -c _NAME_
Create named GPU group.
topo -g _GROUP_
Show interconnect topology.
fieldgroup -c _NAME_
Create a named field group for metric collection.
modules -l
List available DCGM modules and their status.
policy -g _GROUP_
View (`--get`) or set (`--set`) policy conditions for a GPU group.
stats -j _JOB_ID_
Display job-level GPU statistics.
--host _HOST_:_PORT_
Connect to a remote DCGM host daemon (default: localhost:5555).
--help
Display help information.
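
The `--host` option makes it possible to query several nodes from one machine. The sketch below loops over a list of targets and runs `discovery -l` against each; `discover_hosts` and the `DCGMI` override are hypothetical names, and it assumes a DCGM host engine is reachable on every target.

```shell
#!/bin/sh
# DCGMI is a hypothetical override so the function can be dry-run with a stub.
DCGMI="${DCGMI:-dcgmi}"

# List GPUs on each remote host engine; succeed only if every host answered.
# usage: discover_hosts HOST[:PORT]...
discover_hosts() {
    failed=0
    for target in "$@"; do
        echo "== $target =="
        # Count hosts that could not be queried, but keep going.
        "$DCGMI" discovery -l --host "$target" || failed=$((failed + 1))
    done
    [ "$failed" -eq 0 ]
}
```

For example, `discover_hosts node1 node2:5555` queries two nodes in turn; the host names here are placeholders.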

FAQ

What is the dcgmi command used for?

dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). Administrators use it to monitor GPU health, run diagnostics, track performance metrics, and manage GPU groups in data center and HPC environments.

How do I run a basic dcgmi example?

Run `dcgmi discovery -l` in a terminal to list the GPUs that DCGM can see. For other subcommands, substitute your own group IDs, GPU IDs, or `--host` target.

What does discovery -l do in dcgmi?

It lists the GPUs that DCGM has discovered.