Tokens don’t measure what matters
Elementary school teachers use a system called Lexile to determine the reading level of texts.
It uses word length and sentence length as a proxy for difficulty.
AI does the same thing.
It uses token count as a proxy for compute.
Different domain. Same mistake.
Both systems rely on proxies that don’t actually measure what matters.
The Lexile problem
In Lexile:
word length + sentence length ≠ reading level
You can see the failure immediately.
Using the Lexile system, The Catcher in the Rye is rated at roughly a fourth-grade level — because the words are short and the sentences are simple.
But “prostitution” shows up on page two. No fourth grader should be reading that.
Now look at Ernest Hemingway. Hemingway is famous for short, simple, declarative sentences.
Lexile rates much of his work at a third-grade level.
That’s obviously wrong.
Now compare that to William Faulkner. Faulkner writes in long, dense, multi-page sentences.
Lexile rates him at a postgraduate level.
Also wrong.
Because in reality:
- The Catcher in the Rye
- Hemingway
- Faulkner
…are all taught to 10th graders.
The proxy fails because it measures structure, not meaning.
The token problem
AI makes the exact same mistake.
In AI:
token count ≠ amount of compute
Two requests can generate the same number of tokens:
- one trivial
- one deeply complex
But the compute required can be radically different.
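A toy sketch makes the mismatch concrete. Here a naive whitespace "tokenizer" stands in for a real BPE tokenizer (an assumption for illustration only): two prompts count as exactly the same number of tokens, yet the work needed to answer them is nowhere near equal.

```python
# Toy sketch: a naive whitespace "tokenizer" stands in for a real
# BPE tokenizer (an assumption for illustration only).
def token_count(text: str) -> int:
    return len(text.split())

trivial = "What is two plus two"
hard = "Is the Goldbach conjecture true"

# Identical token counts...
assert token_count(trivial) == token_count(hard) == 5

# ...but token count alone cannot tell these two requests apart,
# even though the reasoning they demand is radically different.
print(token_count(trivial), token_count(hard))  # 5 5
```

The unit sees five pieces of text either way; everything that distinguishes the two requests is invisible to it.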
Tokens don’t understand:
- reasoning depth
- branching paths
- retries
- tool usage
- background execution
They just count pieces of text.
So they are:
- semantically blind — same tokens ≠ same work
- operationally blind — large parts of agentic execution never become tokens at all
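The operational blindness can be sketched with a hypothetical agent loop (all names and numbers here are invented for illustration): retries and tool calls burn real compute, but only the final answer ever surfaces as billable output tokens.

```python
# Hypothetical agent loop (all names invented for illustration).
# Only the final answer becomes visible output tokens; the retries
# and tool calls below consume compute that token counts never see.
def run_agent(query: str) -> tuple[str, dict]:
    work = {"model_calls": 0, "tool_calls": 0, "retries": 0}
    answer = ""

    for attempt in range(3):          # hidden retries
        work["model_calls"] += 1
        work["tool_calls"] += 2       # e.g. search + calculator; emits no tokens
        if attempt == 2:              # pretend the third attempt succeeds
            answer = "final answer"
            break
        work["retries"] += 1

    return answer, work

answer, work = run_agent("Is God real?")
print(answer)   # one short visible output...
print(work)     # ...backed by invisible work:
# {'model_calls': 3, 'tool_calls': 6, 'retries': 2}
```

A token-based meter sees only the two words of the answer; the three model calls, six tool calls, and two retries behind it never show up in the bill.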
Same failure pattern
Lexile fails because it confuses surface structure with cognitive difficulty.
Tokens fail because they confuse text output with compute work.
In both cases, the system measures what is easy to count, not what actually matters.
The takeaway
Bad proxies create false confidence.
Lexile tells you a book is “easy” when it isn’t.
Tokens tell you compute is “controlled” when it isn’t.
And in both cases, the error compounds at scale.
Because once you build decisions, pricing, and governance on top of a broken unit…everything downstream inherits the mistake.
Consider these two scenarios:
A guy talks to AI for thirty minutes about his girlfriend. He goes on and on…
- How she seems distant.
- How she is slow to respond to texts.
- How she is mysteriously unavailable.
The system dutifully transcribes every word, responds empathetically and consumes a massive number of tokens — all while avoiding the four words a human would scream immediately: SHE’S CHEATING ON YOU!
Now consider a three-word query:
“Is God real?”
Few questions demand more reasoning, context, philosophy and depth. Yet under token-based billing, that interaction may never recover the cost of compute.
And in both cases, look at the asymmetry between input, output and effort. There is no correlation between number of tokens — either in or out — and compute.
To put it in historical context again, using tokens to measure compute is like measuring electricity in horsepower: a borrowed unit that stopped making sense once machines replaced horses.
Tokens are the horsepower of AI.
Compute is the kilowatt-hour.
Tokens don’t fail because they’re imprecise.
They fail because they’re the wrong unit entirely.
And no amount of refinement fixes a fundamentally bad proxy.
– Published on Saturday, March 28, 2026