DCGM setup
InferCost reads real-time GPU power draw from the NVIDIA DCGM exporter. This page covers how to install it, how InferCost reports its health, and what each failure mode means.
The one-line summary
If you already run the NVIDIA GPU Operator, DCGM Exporter is already installed — just point InferCost at its service:
helm upgrade infercost infercost/infercost \
  --namespace infercost-system \
  --set dcgm.endpoint=http://nvidia-dcgm-exporter.gpu-operator-resources.svc:9400/metrics
Installing DCGM Exporter standalone
If you're not using the GPU Operator (e.g. a homelab, a one-off GPU node, or a managed cluster where the operator isn't appropriate), install DCGM Exporter directly:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-telemetry --create-namespace
Then point InferCost at the resulting service. The exact DNS name depends on your cluster, but it typically looks like http://dcgm-exporter.gpu-telemetry.svc:9400/metrics.
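Before pointing InferCost at the service, it can help to confirm that the exporter actually emits power samples. The sketch below is a small shell filter that reads Prometheus text from stdin and summarizes the `DCGM_FI_DEV_POWER_USAGE` gauge, which is part of the DCGM exporter's default metric set. The function name `dcgm_power_summary` is hypothetical, and whether InferCost reads this exact metric is an assumption, not something this page states.

```shell
# Summarize DCGM power samples from Prometheus text on stdin.
# Assumes each sample line ends with the value (no trailing timestamp).
dcgm_power_summary() {
  awk '
    # Match sample lines only; "# HELP" and "# TYPE" comment lines are skipped.
    /^DCGM_FI_DEV_POWER_USAGE[{ ]/ { n++; total += $NF }
    END {
      if (n == 0) { print "no DCGM_FI_DEV_POWER_USAGE samples found"; exit 1 }
      printf "%d GPU sample(s), %.1f W total\n", n, total
    }
  '
}

# Usage sketch (needs cluster access; service name taken from the install above):
#   kubectl -n gpu-telemetry port-forward svc/dcgm-exporter 9400:9400 &
#   curl -s http://localhost:9400/metrics | dcgm_power_summary
```

A non-zero exit status means the scrape worked but carried no power samples, which is worth resolving before looking at InferCost itself.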
The DCGMReachable condition
Every CostProfile gets a DCGMReachable status condition on every reconcile. This is the "why is my dashboard flat?" answer you can get with a single kubectl describe:
$ kubectl describe costprofile my-gpu-node | grep -A3 DCGMReachable
Type: DCGMReachable
Status: False
Reason: DCGMNotConfigured
Message: DCGM endpoint not set; using TDP fallback (360W). Install the DCGM exporter for real-time power — see https://infercost.ai/docs/dcgm.
The four diagnostic states
| Status / Reason | What it means | What to do |
|---|---|---|
| True / DCGMHealthy | Real-time readings matched the node selector. Power in the dashboard is live. | Nothing; you're done. |
| Unknown / DCGMNoReadings | DCGM scrape succeeded but no readings matched the CostProfile's nodeSelector. Most common cause is a typo. | Confirm the hostname label matches a real node. Falls back to TDP so dashboards keep working. |
| False / DCGMScrapeError | Endpoint set but the controller can't reach it. Network, DNS, or RBAC. | kubectl logs -n infercost-system deploy/infercost-controller-manager; the error message has the URL it tried. |
| False / DCGMNotConfigured | No --dcgm-endpoint configured. Costs use the TDP fallback only. | Set dcgm.endpoint in Helm values. TDP is ballpark; real-time power swings 30-40% under load. |
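The table above can be turned into a small lookup for scripts or runbooks. This is a minimal sketch: the function name `dcgm_next_step` is hypothetical, and the commented `kubectl` query assumes the condition is stored under the conventional `.status.conditions` list of the CostProfile, which this page does not confirm.

```shell
# Map a DCGMReachable reason to the suggested next step from the table.
dcgm_next_step() {
  case "$1" in
    DCGMHealthy)       echo "ok: power readings are live" ;;
    DCGMNoReadings)    echo "check that the CostProfile nodeSelector matches a real node" ;;
    DCGMScrapeError)   echo "check controller logs for the URL it tried (network, DNS, or RBAC)" ;;
    DCGMNotConfigured) echo "set dcgm.endpoint in Helm values" ;;
    *)                 echo "unknown reason: $1"; return 1 ;;
  esac
}

# Usage sketch (needs cluster access; the field path is an assumption):
#   reason=$(kubectl get costprofile my-gpu-node \
#     -o jsonpath='{.status.conditions[?(@.type=="DCGMReachable")].reason}')
#   dcgm_next_step "$reason"
```

Returning a non-zero status for unrecognized reasons keeps the helper safe to use in scripts if new reasons are added later.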
Why TDP fallback is not enough long-term
Thermal Design Power (TDP) is the manufacturer's worst-case spec. Real power draw during inference is rarely that high for sustained periods: H100s idle at ~75W and peak at 700W. A cost model that bills every hour at a 700W TDP overstates electricity cost at idle by roughly 9x (700 / 75), and it is only accurate while the GPU actually runs near its power limit.
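To make the size of the gap concrete, the arithmetic can be sketched in a few lines of shell. The wattages come from the paragraph above; the 720-hour month and the 0.12 per kWh electricity price are illustrative assumptions, and the function names are hypothetical.

```shell
# Electricity cost for a constant average draw.
# $1 = average watts, $2 = hours, $3 = price per kWh (assumed value in the examples)
energy_cost() {
  awk -v w="$1" -v h="$2" -v p="$3" 'BEGIN { printf "%.2f\n", w * h / 1000 * p }'
}

# How many times a TDP-based estimate exceeds a measured draw.
# $1 = TDP watts, $2 = measured watts
overstatement_ratio() {
  awk -v tdp="$1" -v real="$2" 'BEGIN { printf "%.1f\n", tdp / real }'
}

energy_cost 700 720 0.12      # one month billed at a 700W TDP
energy_cost 75 720 0.12       # the same month at ~75W idle
overstatement_ratio 700 75    # TDP estimate relative to idle draw
```

Under these assumptions a GPU that sits idle all month would be billed 60.48 at TDP against 6.48 at measured draw, which is the kind of difference that matters in a cost comparison.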
For a quick proof-of-concept or lab, TDP is fine. For any cost comparison you'd show to a CFO, get DCGM running. The DCGMReachable condition is specifically designed to tell you when you're still on TDP so you don't quote a TDP-derived number unintentionally.
Apple Silicon (Metal)
DCGM is NVIDIA-only. Apple Silicon nodes (M-series Mac Studios used as homelab boxes) fall back to TDP, and the DCGMReachable condition reports False / DCGMNotConfigured. A Metal-aware power exporter is on the roadmap; until it ships, the TDP in your CostProfile is the best available signal.