Like DC current, tokens can’t “go the distance” AI agents require

When Thomas Edison built the first electrical systems, they worked—but they didn’t scale: Direct current (DC) couldn’t travel long distances.

Nikola Tesla solved that with alternating current (AC). Same electricity—different system—and suddenly power could go the distance.

AI cost is hitting the same wall. Tokens can’t “go the distance” agentic AI requires.

Like DC, tokens worked—until they didn’t.

Here's the difference between a traditional AI workflow and an agentic workflow:

Workflow A – simple prompt/response

  1. User prompt: “Summarize this document”
     Tokens generated: 25 | Compute: single inference pass
  2. Model produces summary
     Tokens generated: 200 | Compute: single inference pass

  Total tokens: 225
  Actual work: 1 inference pass

Workflow B – agentic workflow

  1. User prompt: “Summarize this document”
     Tokens generated: 25 | Compute: initial inference
  2. Agent retrieves 5 documents for context
     Tokens generated: 0 | Compute: embedding + vector search + ranking
  3. Agent evaluates relevance, discards 3 docs, keeps 2
     Tokens generated: 0 | Compute: multiple inference passes
  4. Agent attempts a first summary draft, determines it is insufficient
     Tokens generated: 0 | Compute: inference pass
  5. Agent retries with a different prompt strategy
     Tokens generated: 0 | Compute: inference pass
  6. Agent calls an external metadata tool
     Tokens generated: 0 | Compute: API call + processing
  7. Agent produces the final summary
     Tokens generated: 200 | Compute: final inference pass

  Total tokens: 225
  Actual work: 5-7 inference passes + retrieval + tool use

What billing sees vs. what actually happened

  Workflow A: Tokens billed: 225 | Inference passes: 1   | Retrieval ops: 0        | Tool/API calls: 0  | Actual compute: low
  Workflow B: Tokens billed: 225 | Inference passes: 5-7 | Retrieval ops: multiple | Tool/API calls: 1+ | Actual compute: significantly higher
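The gap is easy to see in code. Tallying the agentic workflow's steps (with illustrative, assumed per-step counts — e.g., treating "multiple inference passes" as 2) shows that a token meter captures the 225 tokens but none of the other work:

```python
# Tally tokens vs. compute for the agentic workflow (Workflow B).
# Per-step counts are illustrative assumptions, not measured values.
steps = [
    # (description,                        tokens, inference_passes, retrieval_ops, api_calls)
    ("user prompt",                            25, 1, 0, 0),
    ("retrieve 5 documents",                    0, 0, 5, 0),
    ("evaluate relevance, keep 2 of 5",         0, 2, 0, 0),
    ("first draft, judged insufficient",        0, 1, 0, 0),
    ("retry with new prompt strategy",          0, 1, 0, 0),
    ("call external metadata tool",             0, 0, 0, 1),
    ("final summary",                         200, 1, 0, 0),
]

tokens     = sum(s[1] for s in steps)
passes     = sum(s[2] for s in steps)
retrievals = sum(s[3] for s in steps)
api_calls  = sum(s[4] for s in steps)

print(f"Tokens billed:    {tokens}")      # 225 -- all a token meter sees
print(f"Inference passes: {passes}")      # 6
print(f"Retrieval ops:    {retrievals}")  # 5
print(f"Tool/API calls:   {api_calls}")   # 1
```

Both workflows bill 225 tokens; only the tally of non-token operations distinguishes them.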

AI’s two-part problem and the two-step solution

1. Lack of accuracy
Tokens count words. They don’t measure compute.

And agentic AI performs retrieval, tool use, and iteration—work that often never generates tokens. So part of the work—and cost—is never captured.

“We are seeing agent workflows expand compute (iteration, retries, tool use) without a proportional increase in tokens” – LLM CEO

That means providers are effectively delivering unmetered compute.

Margin compression is confirmed by 10-Qs filed by 25 public cloud companies and LLM providers.

2. Lack of control
Tokens are only counted after execution.

So cost is incurred before it can be measured—let alone controlled.
Finance can’t estimate, budget, or govern spend in advance.

The two-step solution

  1. Accuracy requires measuring work based on compute, not tokens.
  2. Control requires estimating cost before execution, so spend can be approved, throttled, or denied before inference runs, and Finance can estimate, budget, and govern it in advance.

The standard

Any system that actually controls AI cost must:

  • Estimate work before execution
  • Govern execution at the point of use
  • Measure compute—not text

If it doesn’t do those three things, cost cannot be predicted or controlled.
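A minimal sketch of such a gate, assuming hypothetical per-operation compute estimates and a hypothetical compute price (none of these figures come from a real rate card):

```python
# Sketch of a pre-execution cost gate: estimate work, then govern it
# before inference runs. All names and prices are illustrative assumptions.
from dataclasses import dataclass

PRICE_PER_TFLOP = 0.00005  # assumed $ per teraFLOP, analogous to $/kWh


@dataclass
class Plan:
    """A declared execution plan, submitted before any compute is spent."""
    inference_passes: int
    retrieval_ops: int
    tool_calls: int


def estimate_tflops(plan: Plan) -> float:
    # Rough per-operation compute costs (assumptions), like a utility rate card.
    return (plan.inference_passes * 2.0
            + plan.retrieval_ops * 0.1
            + plan.tool_calls * 0.05)


def govern(plan: Plan, budget_usd: float) -> bool:
    """Approve or deny BEFORE execution: cost is known in advance."""
    est_cost = estimate_tflops(plan) * PRICE_PER_TFLOP
    return est_cost <= budget_usd


plan = Plan(inference_passes=6, retrieval_ops=5, tool_calls=1)
cost = estimate_tflops(plan) * PRICE_PER_TFLOP
print(f"Estimated cost: ${cost:.6f}")
print("Approved" if govern(plan, budget_usd=0.01) else "Denied")
```

The key property is the order of operations: the estimate and the approval happen before any inference runs, which is exactly what token counting cannot do.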

The analogy

Inference isn’t SaaS. Like electricity, it behaves like a utility, with a meter that is constantly running and incurring expense.

Tokens merely count words – they don't count or measure compute.

We don’t bill electricity by counting electrons.
We meter energy in kWh—a unit tied to actual consumption.

So to price and control AI, we don’t need to count every unit of work.
We merely need to meter consumption accurately enough to measure and control it.

Compute is what actually drives cost.

At the lowest level, that work is measured in FLOPs.
But like electricity, FLOPs don’t need to be counted directly—they can be metered.

Like kWh, metering at the level of FLOPs captures the energy consumed—and the cost.
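As with an electricity bill (kWh × rate), a compute bill needs only an aggregate meter reading and a rate. A toy calculation, with both numbers assumed for illustration:

```python
# Billing compute like electricity: aggregate reading x rate,
# not per-token counts. Both figures below are assumed for illustration.
metered_pflops = 3.2    # petaFLOPs consumed by a workload (assumed meter reading)
rate_per_pflop = 0.04   # assumed $ per petaFLOP, the "kWh rate" of compute

bill = metered_pflops * rate_per_pflop
print(f"Compute bill: ${bill:.2f}")  # 3.2 PFLOPs * $0.04 = $0.128 -> $0.13
```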



Contact us

© 2025 BrassTacksDesign, LLC