dcgmi Command: Examples, Options, and Usage

Common Examples

List discovered GPUs

dcgmi discovery -l

Check the health of a GPU group

dcgmi health -g [group_id] -c

Run diagnostics

dcgmi diag -r [level]

Show real-time stats

dcgmi dmon -e [field_ids]

Create a GPU group

dcgmi group -c [group_name]

Add a GPU to a group

dcgmi group -g [group_id] -a [gpu_id]

Show GPU topology

dcgmi topo -g [group_id]
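
The group commands above are usually chained: create a group, note the ID that DCGM assigns, then pass that ID to later commands. The sketch below is one way to do that in a shell script. It assumes the success message printed by `dcgmi group -c` ends with "group ID of <number>", which should be checked against your DCGM version. The function names `parse_group_id` and `setup_group` are hypothetical helpers, not part of dcgmi.

```shell
#!/bin/sh
# Extract the numeric group ID from the text printed by `dcgmi group -c`.
# Assumption: the success message contains "group ID of <number>".
parse_group_id() {
    sed -n 's/.*group ID of \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical wrapper: create a group, add one GPU, then show its topology.
# Nothing runs until the function is called.
setup_group() {
    name=$1
    gpu_id=$2
    gid=$(dcgmi group -c "$name" | parse_group_id)
    if [ -z "$gid" ]; then
        echo "could not determine group ID for $name" >&2
        return 1
    fi
    dcgmi group -g "$gid" -a "$gpu_id" || return 1
    dcgmi topo -g "$gid"
}
```

On a host with DCGM installed, `setup_group my_group 0` would create the group, add GPU 0, and print the group's topology.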

Description

dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). It provides monitoring, management, and diagnostic capabilities for NVIDIA GPUs in data center and HPC environments. The tool enables administrators to monitor GPU health, run diagnostics, track performance metrics, and manage GPU groups for policy enforcement. It integrates with job schedulers and cluster management systems for automated GPU management. DCGM tracks hundreds of GPU metrics including temperature, power, memory usage, and error counts. The diagnostic subsystem can detect hardware issues before they cause failures, supporting proactive maintenance.
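
As a rough illustration of the scheduler integration described above, a job prolog could run a quick diagnostic and decide whether the node should accept work. This is a minimal sketch, not a documented integration: `gpu_gate` is a hypothetical helper, and it assumes the diagnostic command exits with a non-zero status when a test fails, which should be confirmed for your DCGM version.

```shell
#!/bin/sh
# Hypothetical pre-job gate for a scheduler prolog.
# Runs the command given as arguments and prints "ok" if it exits with
# status 0, or "drain" otherwise.
# Assumption: the diagnostic command signals failure via its exit status.
gpu_gate() {
    if "$@" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "drain"
    fi
}

# Intended use on a GPU node (not executed here):
#   state=$(gpu_gate dcgmi diag -r 1)
```

Passing the command as arguments keeps the gate independent of the diagnostic level, so the same function can wrap a longer run during maintenance windows.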

Options

discovery -l: List discovered GPUs.
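health -g _GROUP_ -c: Check the health of a GPU group.
diag -r _LEVEL_: Run diagnostics at level 1 (quickest) through 4 (most extensive).
dmon: Stream field values for GPUs at a fixed interval (fields are selected with -e or -f).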
group -c _NAME_: Create named GPU group.
topo -g _GROUP_: Show interconnect topology.
fieldgroup -c _NAME_: Create a named field group for metric collection.
modules -l: List available DCGM modules and their status.
policy -g _GROUP_: View or set GPU policy conditions.
stats -j _JOB_ID_: Display job-level GPU statistics.
--host _HOST_:_PORT_: Connect to a remote DCGM host daemon (default: localhost:5555).
--help: Display help information.
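
The options above can be combined, for example to run a diagnostic against a remote host engine. The sketch below only builds and prints the command line so it can be reviewed before it is run. `diag_cmd` is a hypothetical helper and `gpu-node01` is a placeholder host name. The level range and default address come from the option list above.

```shell
#!/bin/sh
# Hypothetical helper: validate the diagnostic level and print the
# dcgmi command line for a given host:port target.
diag_cmd() {
    level=$1
    target=${2:-localhost:5555}   # default taken from the --host description
    case $level in
        1|2|3|4) ;;
        *)
            echo "diag level must be 1-4, got: $level" >&2
            return 2
            ;;
    esac
    echo "dcgmi diag -r $level --host $target"
}

# Print the command for a level 2 run on a placeholder remote node.
diag_cmd 2 gpu-node01:5555
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without DCGM. Replacing `echo` with direct execution is a one-line change once the output looks right.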

FAQ

What is the dcgmi command used for?

dcgmi is the command-line interface for NVIDIA's Data Center GPU Manager (DCGM). Administrators use it to monitor GPU health, run diagnostics, track performance metrics, and manage GPU groups in data center and HPC environments.

How do I run a basic dcgmi example?

Run `dcgmi discovery -l` in a terminal to list the GPUs that DCGM can see. To query another machine, add `--host` with that machine's address and port.

What does discovery -l do in dcgmi?

List discovered GPUs.
