Quickstart

Five minutes, one Helm command, one YAML file. At the end you'll have live per-token cost data for every inference pod in your cluster.

Prerequisites

  • A Kubernetes cluster with at least one node running GPU inference workloads
  • DCGM Exporter installed (or see DCGM setup for a five-minute install)
  • Inference pods exposing Prometheus metrics on a known port, such as llama.cpp or vLLM (a quick check is shown after this list)
  • helm and kubectl configured against your cluster
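
If you're not sure whether your inference pods actually serve metrics, a quick port-forward will tell you. The port below is an assumption (vLLM serves /metrics on its API port, 8000 by default; llama.cpp needs its --metrics flag enabled and typically listens on 8080), so substitute whatever your server uses:

# assumes the server listens on port 8000 and exposes /metrics (vLLM default);
# adjust the port for llama.cpp or a custom setup
kubectl port-forward pod/<your-inference-pod> 8000:8000 &
curl -s localhost:8000/metrics | head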

1. Install the Helm chart

helm repo add infercost https://defilantech.github.io/infercost
helm repo update

helm install infercost infercost/infercost \
  --namespace infercost-system \
  --create-namespace

On a cluster running kube-prometheus-stack, the chart automatically creates a PodMonitor so Prometheus discovers InferCost without extra config. Without prometheus-operator the chart skips that resource silently.
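
To confirm Prometheus discovery is wired up, you can list the PodMonitor the chart created (this requires the prometheus-operator CRDs; the command assumes the chart places the resource in its own namespace):

# list PodMonitors created alongside InferCost
kubectl get podmonitors -n infercost-system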

2. Declare your hardware

One CostProfile per node or GPU pool. The sample library ships ready-to-use manifests for H100, A100, L40S, RTX 4090/5090/5060 Ti, A6000, and Apple M2 Ultra — copy the one closest to your setup and tweak the purchasePriceUSD.

# costprofile.yaml — using a 2x RTX 5060 Ti lab setup as the example
apiVersion: finops.infercost.ai/v1alpha1
kind: CostProfile
metadata:
  name: my-gpu-node
spec:
  hardware:
    gpuModel: "NVIDIA GeForce RTX 5060 Ti"
    gpuCount: 2
    purchasePriceUSD: 960          # what you actually paid
    amortizationYears: 4
    tdpWatts: 180                  # per-GPU fallback when DCGM unreachable
  electricity:
    ratePerKWh: 0.08
    pueFactor: 1.0
  nodeSelector:
    kubernetes.io/hostname: gpu-node-01

kubectl apply -f costprofile.yaml
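
As a rough sanity check on what this profile produces, you can approximate the hourly rate by hand. This is only a sketch: it assumes straight-line amortization and full TDP draw, whereas the controller uses live DCGM power readings when available, so its figure will usually be lower:

# straight-line amortization over 4 years (8766 h/yr) plus electricity at full
# TDP for both GPUs -- not the controller's exact formula, just a ballpark
echo "scale=4; 960/(4*8766) + (2*180/1000)*0.08*1.0" | bc
# ≈ 0.056 $/hour at full TDP; live power readings bring this down when the GPUs are idle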

3. Tell InferCost where DCGM is

If DCGM is already discoverable at the GKE or GPU Operator default location, there's nothing to do. Otherwise, point the controller at it via Helm values:

helm upgrade infercost infercost/infercost \
  --namespace infercost-system \
  --set dcgm.endpoint=http://nvidia-dcgm-exporter.gpu-operator-resources.svc:9400/metrics

When DCGM is missing or unreachable, CostProfile.status.conditions surfaces a DCGMReachable=False condition with an actionable message — so you know exactly why a dashboard is flat instead of guessing. See DCGM setup for all four diagnostic states.
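
You can read that condition straight from the CostProfile status, for example:

# pull just the DCGMReachable condition message from the CostProfile status
kubectl get costprofile my-gpu-node \
  -o jsonpath='{.status.conditions[?(@.type=="DCGMReachable")].message}'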

4. See your costs

$ kubectl get costprofiles
NAME           GPU MODEL                     GPUs   $/HR     POWER (W)   AGE
my-gpu-node    NVIDIA GeForce RTX 5060 Ti    2      $0.0531  86W         2m

$ kubectl get usagereports
NAME             PERIOD       COST ($)   $/MTOK   INPUT TOKENS   OUTPUT TOKENS   AGE
daily-rollup     2026-04-20   $0.8941    $0.4192  1,200,000      800,000         5m

Or from the CLI:

$ infercost status
INFRASTRUCTURE COSTS
PROFILE         GPU MODEL                     GPUs  $/HOUR   POWER    AGE
my-gpu-node     NVIDIA GeForce RTX 5060 Ti    2     $0.0531  86W      2m

$ infercost compare --namespace engineering
  PROVIDER    MODEL              CLOUD COST   SAVINGS
  Anthropic   claude-sonnet-4-6  $15.60       $14.70 (94%)
  OpenAI      gpt-5.4            $15.00       $14.10 (94%)

5. Open the Grafana dashboard

curl -LO https://raw.githubusercontent.com/defilantech/infercost/main/config/grafana/infercost-dashboard.json
kubectl create configmap infercost-dashboard \
  --from-file=infercost-dashboard.json \
  -n monitoring
kubectl label configmap infercost-dashboard grafana_dashboard=1 -n monitoring

The Grafana sidecar auto-loads the dashboard within a minute. You will see cost-per-token, hourly cost, GPU efficiency, and the cloud-equivalent comparison panels populate as soon as Prometheus scrapes its first data point.
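
If the dashboard doesn't appear, the sidecar logs usually say why. The deployment name below assumes a kube-prometheus-stack release called prometheus; adjust it for your install:

# the dashboard sidecar logs every ConfigMap it loads
kubectl logs deploy/prometheus-grafana -c grafana-sc-dashboard -n monitoring | grep -i infercost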

Next steps