DCGM setup

InferCost reads real-time GPU power draw from the NVIDIA DCGM exporter. This page covers how to install the exporter, how InferCost reports whether it can reach it, and what each failure mode means.

The one-line summary

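If you already run the NVIDIA GPU Operator, DCGM Exporter is usually already installed. Just point InferCost at its service, adjusting the namespace if your operator deploys its components somewhere other than gpu-operator-resources: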

helm upgrade infercost infercost/infercost \
  --namespace infercost-system \
  --set dcgm.endpoint=http://nvidia-dcgm-exporter.gpu-operator-resources.svc:9400/metrics

Installing DCGM Exporter standalone

If you're not using the GPU Operator (e.g. a homelab, a one-off GPU node, or a managed cluster where the operator isn't appropriate), install DCGM Exporter directly:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-telemetry --create-namespace

Then point InferCost at the resulting service. The exact DNS name depends on your Helm release name and namespace; with the command above it typically looks like http://dcgm-exporter.gpu-telemetry.svc:9400/metrics.
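
Before pointing InferCost at an endpoint, it can help to confirm that the endpoint actually serves power readings. The following is a minimal sketch, not part of InferCost: it parses the Prometheus text format and picks out DCGM_FI_DEV_POWER_USAGE, the exporter's per-GPU power gauge in watts. The function names are hypothetical, and the parser handles only the simple `name{labels} value` lines needed for this check.

```python
import re
import urllib.request

# DCGM exporter's per-GPU power draw gauge, reported in watts.
POWER_METRIC = "DCGM_FI_DEV_POWER_USAGE"

# One sample line looks like: metric_name{label="value",...} number
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_power_readings(metrics_text):
    """Return a list of (labels, watts) for every power sample in the text."""
    readings = []
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        match = SAMPLE_RE.match(line)
        if match is None or match.group(1) != POWER_METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        readings.append((labels, float(match.group(3))))
    return readings


def fetch_power_readings(url, timeout=5.0):
    """Scrape the exporter once and return its power readings."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_power_readings(response.read().decode("utf-8"))
```

Run from a pod inside the cluster, `fetch_power_readings("http://dcgm-exporter.gpu-telemetry.svc:9400/metrics")` should return one entry per GPU. An empty list means the endpoint answered but exposed no power samples.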

The DCGMReachable condition

Every CostProfile gets a DCGMReachable status condition on every reconcile. This is the "why is my dashboard flat?" answer you can get with a single kubectl describe:

$ kubectl describe costprofile my-gpu-node | grep -A3 DCGMReachable
    Type:     DCGMReachable
    Status:   False
    Reason:   DCGMNotConfigured
    Message:  DCGM endpoint not set; using TDP fallback (360W). Install the DCGM exporter for real-time power — see https://infercost.ai/docs/dcgm.
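
To run the same check across every CostProfile at once, the condition can be read from `kubectl get ... -o json`. This is a rough sketch under assumptions: it presumes the conditions live at `status.conditions` with lowercase `type`, `status`, and `reason` keys (the usual Kubernetes convention), and the helper names and the `ConditionMissing` placeholder are hypothetical, not InferCost identifiers.

```python
import json
import subprocess


def dcgm_condition(cost_profile):
    """Return the DCGMReachable condition dict of one CostProfile, or None."""
    conditions = cost_profile.get("status", {}).get("conditions", [])
    for condition in conditions:
        if condition.get("type") == "DCGMReachable":
            return condition
    return None


def profiles_without_live_power(cost_profile_list):
    """Return (name, reason) for every profile whose DCGMReachable is not True."""
    flagged = []
    for item in cost_profile_list.get("items", []):
        condition = dcgm_condition(item)
        if condition is None:
            # Placeholder reason for profiles that have not been reconciled yet.
            flagged.append((item["metadata"]["name"], "ConditionMissing"))
        elif condition.get("status") != "True":
            flagged.append((item["metadata"]["name"], condition.get("reason", "")))
    return flagged


def load_cost_profiles():
    """Fetch all CostProfiles through kubectl (needs cluster access)."""
    output = subprocess.run(
        ["kubectl", "get", "costprofile", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(output)
```

Calling `profiles_without_live_power(load_cost_profiles())` lists every profile that is not currently reporting live DCGM power, together with the reason to look up in the table below.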

The four diagnostic states

| Status / Reason | What it means | What to do |
| --- | --- | --- |
| `True / DCGMHealthy` | Real-time readings matched the node selector. Power in the dashboard is live. | Nothing; you're done. |
| `Unknown / DCGMNoReadings` | The DCGM scrape succeeded, but no readings matched the CostProfile's `nodeSelector`. The most common cause is a typo in the hostname label. Costs fall back to TDP so dashboards keep working. | Confirm the hostname label in `nodeSelector` matches a real node. |
| `False / DCGMScrapeError` | The endpoint is set, but the controller can't reach it. Usually a network, DNS, or RBAC problem. | Run `kubectl logs -n infercost-system deploy/infercost-controller-manager`; the error message includes the URL it tried. |
| `False / DCGMNotConfigured` | No `--dcgm-endpoint` configured. Costs use the TDP fallback only. | Set `dcgm.endpoint` in Helm values. TDP is a ballpark figure; real-time power swings 30-40% under load. |
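
The four states can be read as a simple decision order: configuration first, then reachability, then matching readings. The sketch below is only an illustration of that reading of the table, not InferCost's controller code; the function name and its parameters are hypothetical.

```python
def dcgm_reachable(endpoint, scrape_error, matching_readings):
    """Map scrape results to a (status, reason) pair as described in the table.

    endpoint:          configured DCGM URL, or an empty string / None if unset
    scrape_error:      error message from the last scrape, or None on success
    matching_readings: number of readings that matched the nodeSelector
    """
    if not endpoint:
        return ("False", "DCGMNotConfigured")
    if scrape_error is not None:
        return ("False", "DCGMScrapeError")
    if matching_readings == 0:
        return ("Unknown", "DCGMNoReadings")
    return ("True", "DCGMHealthy")
```

Reading the table in this order also gives a troubleshooting sequence: check that the endpoint is set, then that it is reachable, then that the selector matches a node.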

Why TDP fallback is not enough long-term

Thermal Design Power is the manufacturer's rated maximum for sustained load. Real power draw during inference is rarely that high for sustained periods: H100s idle at ~75W and peak at 700W. A cost model that charges a 700W TDP around the clock will overstate electricity cost at idle by roughly 9x (700W / 75W). A TDP value set below the card's actual peak errs in the other direction and understates cost during sustained batch runs.
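
To put numbers on the gap, here is a small worked example using the H100 figures above (75W idle, 700W peak). The electricity rate of $0.15 per kWh is an assumed placeholder; substitute your own.

```python
def hourly_cost(watts, price_per_kwh):
    """Electricity cost of drawing `watts` continuously for one hour."""
    return watts / 1000.0 * price_per_kwh


PRICE_PER_KWH = 0.15  # assumed rate in USD, for illustration only
IDLE_WATTS = 75.0     # H100 idle draw quoted above
TDP_WATTS = 700.0     # H100 peak draw quoted above

tdp_cost = hourly_cost(TDP_WATTS, PRICE_PER_KWH)
idle_cost = hourly_cost(IDLE_WATTS, PRICE_PER_KWH)
overstatement = tdp_cost / idle_cost  # equals 700 / 75, whatever the rate

print(f"TDP-based cost per hour:  ${tdp_cost:.4f}")
print(f"Idle cost per hour:       ${idle_cost:.4f}")
print(f"Overstatement at idle:    {overstatement:.1f}x")
```

The ratio does not depend on the electricity price, only on how far the actual draw sits below the TDP figure, which is why a mostly idle GPU looks far more expensive under the fallback than it really is.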

For a quick proof-of-concept or lab, TDP is fine. For any cost comparison you'd show to a CFO, get DCGM running. The DCGMReachable condition is specifically designed to tell you when you're still on TDP so you don't quote a TDP-derived number unintentionally.

Apple Silicon (Metal)

DCGM is NVIDIA-only. Apple Silicon nodes (M-series Mac Studios used as homelab boxes) fall back to TDP — the DCGMReachable condition reports False / DCGMNotConfigured. A Metal-aware power exporter is on the roadmap; until it ships, the TDP in your CostProfile is the best available signal.