Kubernetes-native AI FinOps

Know the true cost of AI inference on your hardware

Cloud FinOps tools see GPUs but not tokens. LLM gateways see tokens but assume on-prem is free. InferCost computes what nobody else does: real cost-per-token from hardware amortization, electricity, and actual GPU power draw.

infercost / shadowstack-rtx5060ti (live)

Hourly Cost: $0.053 (amort + electricity)
Monthly: $38 (projected at current rate)
Cost/MTok: $0.41 (under active load)
GPU Power: 84.7W (2x RTX 5060 Ti)

Savings vs Cloud APIs

Claude Opus 4.6: 94%
GPT-5.4: 89%
Claude Sonnet 4.6: 90%
Gemini 2.5 Pro: 84%
Claude Haiku 4.5: 69%
Gemini Flash-Lite: n/a (cloud cheaper)

Real data from a homelab running Qwen3-32B on 2x RTX 5060 Ti
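The Cost/MTok figure is just the hourly rate divided by token throughput. A minimal sketch of the arithmetic (the ~36 tok/s throughput is an assumption back-solved from the numbers above, not a measured value):

```python
def cost_per_mtok(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost per million tokens: hourly hardware cost spread over hourly token output."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# An assumed ~36 tok/s sustained throughput reproduces the ~$0.41/MTok shown above
print(round(cost_per_mtok(0.053, 36), 2))  # → 0.41
```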

The problem

Nobody computes the true cost of on-prem inference

Organizations are making million-dollar hardware decisions with zero visibility into true unit economics. The FinOps Foundation's own working group explicitly acknowledges that on-premises AI cost is out of scope.

Capability              OpenCost   Kubecost   LiteLLM   Langfuse   InferCost
Token-level tracking
Per-user attribution
On-prem hardware cost
GPU amortization
Electricity + PUE
Cloud comparison
Kubernetes-native
Open source

How it works

Three steps. Five minutes.

No database. No UI to host. One controller pod that plugs into infrastructure you already run.

1

Declare your hardware

# costprofile.yaml
spec:
  hardware:
    gpuModel: "RTX 5060 Ti"
    gpuCount: 2
    purchasePriceUSD: 960
    amortizationYears: 3
  electricity:
    ratePerKWh: 0.08
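These fields feed a simple cost model: straight-line amortization of the purchase price plus electricity at the measured draw. A sketch of that arithmetic, assuming this two-term formula (not InferCost's documented implementation):

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def hourly_cost(purchase_usd: float, amort_years: float,
                gpu_watts: float, rate_per_kwh: float) -> float:
    """Straight-line hardware amortization plus electricity at the measured draw."""
    amortization = purchase_usd / (amort_years * HOURS_PER_YEAR)
    electricity = (gpu_watts / 1000) * rate_per_kwh
    return amortization + electricity

# CostProfile values above, with the 84.7 W draw from the live dashboard
print(round(hourly_cost(960, 3, 84.7, 0.08), 3))  # → 0.043
```

This two-term model gives about $0.043/hr; the dashboard's $0.053/hr suggests the real calculation also counts factors this sketch omits, such as PUE or whole-node power.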

2

Deploy the operator

$ helm install infercost \
    infercost/infercost \
    --set dcgm.endpoint=auto

✓ CRDs installed
✓ Controller running
✓ Metrics flowing

3

See your true costs

$ infercost status

INFRASTRUCTURE COSTS
  shadowstack   RTX 5060 Ti   2   $0.053/hr

SAVINGS vs CLOUD
  Opus 4.6      $9.20 saved (94%)
  GPT-5.4       $5.22 saved (89%)
  Flash-Lite    cloud 280% cheaper
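The savings column is a straight price comparison between the local cost-per-token and a provider's list price. A sketch (the $6.80/MTok cloud price here is a hypothetical illustration, not any provider's actual rate):

```python
def savings_vs_cloud(local_per_mtok: float, cloud_per_mtok: float) -> float:
    """Fraction saved by serving locally; negative means the cloud API is cheaper."""
    return (cloud_per_mtok - local_per_mtok) / cloud_per_mtok

# Hypothetical $6.80/MTok cloud price against the $0.41/MTok local figure
print(f"{savings_vs_cloud(0.41, 6.80):.0%}")  # → 94%
```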

Architecture

One pod. Zero dependencies.

InferCost plugs into your existing Prometheus and Grafana stack. No new databases, no UI to host, no infrastructure to manage.

Data Sources

DCGM Exporter: GPU power draw (watts)
llama.cpp: token counts per pod
CostProfile CRD: hardware economics
LiteLLM PG: per-user attribution (optional)
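Reading power draw from the DCGM exporter amounts to scraping its Prometheus text endpoint for the `DCGM_FI_DEV_POWER_USAGE` gauge. A naive parsing sketch (the label set in the sample is illustrative; this is not InferCost's actual scraper):

```python
def parse_power_watts(exposition: str) -> dict[str, float]:
    """Pull per-GPU power readings (watts) out of DCGM exporter's text output,
    keyed by the `gpu` label. Naive string parsing, for illustration only."""
    readings = {}
    for line in exposition.splitlines():
        if line.startswith("DCGM_FI_DEV_POWER_USAGE{"):
            labels, value = line.rsplit(" ", 1)   # sample value follows the last space
            gpu = labels.split('gpu="')[1].split('"')[0]
            readings[gpu] = float(value)
    return readings

sample = """\
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",modelName="NVIDIA RTX 5060 Ti"} 43.1
DCGM_FI_DEV_POWER_USAGE{gpu="1",modelName="NVIDIA RTX 5060 Ti"} 41.6
"""
print(round(sum(parse_power_watts(sample).values()), 1))  # → 84.7
```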

InferCost Controller (single pod)

GPU Power Scraper · Token Counter · Cost Calculator · Attribution Engine · Cloud Comparator · Report Writer

Outputs

Prometheus metrics: any monitoring tool
REST API: custom integrations
Grafana dashboard: pre-built, ships as JSON
UsageReport CRDs: kubectl, GitOps
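Of the controller's internal stages, the Attribution Engine is the easiest to picture: split a node's hourly cost across pods in proportion to tokens served. A simplified sketch under that assumption (pod names are hypothetical; the real controller's weighting is not documented here):

```python
def attribute_cost(hourly_cost_usd: float,
                   tokens_by_pod: dict[str, int]) -> dict[str, float]:
    """Split one node's hourly cost across pods, weighted by tokens served."""
    total = sum(tokens_by_pod.values())
    if total == 0:
        return {pod: 0.0 for pod in tokens_by_pod}  # idle node: nothing to attribute
    return {pod: hourly_cost_usd * n / total for pod, n in tokens_by_pod.items()}

# Hypothetical pods; team-a served 3x the tokens, so it carries 3/4 of the cost
shares = attribute_cost(0.053, {"team-a": 90_000, "team-b": 30_000})
print({pod: f"${usd:.4f}" for pod, usd in shares.items()})
```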

Roadmap

From visibility to enforcement

Each tier builds on the last. Start with cost visibility, grow into budget enforcement and optimization.

Observe (Live): cost-per-token, GPU power, efficiency
Report (Live): per-team, per-model, cloud comparison
Alert (Coming Soon): budget thresholds, anomaly detection
Enforce (Coming Soon): rate-limit over-budget teams
Optimize (Planned): model switching, scale-down scheduling
Comply (Planned): audit export, EU AI Act, FOCUS spec
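The Alert tier's budget thresholds reduce to a projection check: extrapolate month-to-date spend linearly and compare against the budget. A sketch with hypothetical numbers (the $38 budget mirrors the monthly figure above):

```python
def projected_over_budget(spend_to_date: float, day_of_month: int,
                          days_in_month: int, budget_usd: float) -> bool:
    """True when the linear month-end projection of spend exceeds the budget."""
    projected = spend_to_date / day_of_month * days_in_month
    return projected > budget_usd

# $15 spent by day 10 projects to $45 by day 30, so a $38 budget should alert
print(projected_over_budget(15.0, 10, 30, 38.0))  # → True
```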

OpenTelemetry GenAI: metric conventions
FOCUS Spec: compatible export
OpenCost: complementary
Apache 2.0: open source

Get early access

InferCost is in active development. Join the list to be notified when we launch.