What if we could reduce AI capex and opex by 90 percent? We can.
That’s right. I said it.
We can reduce AI capex and opex by 90%.
Before I tell you how, let me explain why this matters — because the importance is both multi-faceted and not immediately obvious.
In the meantime, know this: The problem isn’t intelligence. It’s that AI is designed to waste money — both at runtime and in the infrastructure built to support it — as I will explain further along.
This design flaw inflates both LLM operating expense and the amount of infrastructure companies must build to support it.
But allow me to start with some history, so you can see the pattern:
“Those who cannot remember the past are condemned to repeat it.” — George Santayana
Yes, it’s a familiar quote. But it’s also a warning.
In 1873, markets crashed because capital expenditures in railroads far outpaced revenue.
Railroads eventually became one of the great economic engines of the 19th and 20th centuries — but not before a massive destruction of capital. Many fortunes were lost before the industry found its footing.
I don’t need to go far out on a limb to suggest we’re heading for something similar.
And it’s not just me saying this. Jim Cramer and James Mackintosh both called the AI bubble two weeks ago.
According to The Information, just today:
“The market is turning skeptical of the AI game. Even as OpenAI, Meta Platforms and others commit hundreds of billions to data center build-outs, doubts are emerging over whether the AI business can justify spending on this scale.”
This isn’t a question of if. It’s a question of when.
Just like 1873, AI capex is racing far ahead of revenue. Valuations are increasingly grounded in narrative rather than fundamentals.
Today, AI capex is one of the single biggest drivers of the global economy. But what if that capital could be redirected — toward education, science, or healthcare — without slowing AI’s expansion?
Sounds impossible, right?
What if LLMs could grow exponentially…
- without turning a single spadeful of soil?
- without overloading the electrical grid?
- without stressing water supplies?
- without worsening environmental impact?
If you’re thinking “Where do I sign?” hang with me for a few more paragraphs.
Like the railroads in the 19th century, LLMs have the potential to reshape the world — on a scale orders of magnitude greater than the railroads or even the Industrial Revolution.
- Railroads transformed transportation.
- The Industrial Revolution transformed manufacturing.
LLMs will transform cognition itself — by providing automated alternatives to thinking across vast classes of work.
So how do we get there without blowing up the balance sheet?
Let’s get down to brass tacks.
What follows isn’t a model breakthrough. It’s a way to cut LLMs’ biggest costs — at runtime and at the infrastructure level — by changing what decisions are made and when they are made.
The problem
Once upon a time, there were three queries:
- One query was very complicated.
- One query was just so-so.
- And one query was duck-soup simple.
But all three queries received the same amount of compute — at the same cost.
Barely makes sense, right?
And yet, that’s exactly how LLMs work today.
Most LLMs operate the way early cloud computing did:
- A request arrives.
- Resources are spooled up.
- The job runs.
Crucially, to ensure that every query succeeds, the same generous compute headroom is provisioned for every request — simple and complex alike — even though the vast majority of requests never need it.
There is no meaningful pre-execution understanding of:
- how expensive a request will be
- whether that cost is justified
- or whether a cheaper execution path would produce an acceptable result
Compute is allocated first.
Cost is discovered later.
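To make the pattern concrete, here is a deliberately simplified sketch — the prices, model name, and helper functions are hypothetical, not a description of any real service — of how most inference stacks behave today: every request takes the same expensive path, and the cost only shows up in the usage report afterward.

```python
# A simplified sketch (hypothetical prices and model names) of the default
# pattern: every request takes the same expensive path, and cost is only
# discovered after the response comes back.

PRICE_PER_1K_TOKENS = 0.06           # hypothetical flagship-model price

def run_model(model: str, prompt: str) -> int:
    # Placeholder for a real inference call; returns an output token count.
    return 500

def handle_request(prompt: str) -> dict:
    # Compute is allocated first: the largest model, full headroom,
    # whether the prompt is trivial or genuinely hard.
    model = "frontier-xl"                             # hypothetical model name
    output_tokens = run_model(model, prompt)

    # Cost is discovered later, from the usage report.
    total_tokens = len(prompt.split()) + output_tokens
    cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS
    return {"model": model, "cost_usd": round(cost, 4)}

print(handle_request("What is 2 + 2?"))   # same path and price as a 50-page analysis
```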
That allocate-first, discover-cost-later pattern is the opposite of how companies like Amazon built their empires.
In every other part of Amazon’s business, provisioning is sacred.
- Inventory is forecast.
- Warehouses are right-sized.
- Logistics are optimized before trucks roll.
- Capacity is allocated based on expected value.
Waste is designed out before it happens.
AI breaks that discipline.
Most LLM workloads today are blindly provisioned:
- No cost estimate before execution
- No right-sizing of models to tasks
- No gating by value
- No cheaper alternatives offered up front
- No predictable spend envelopes
As a result, inference costs scale linearly — or worse — with usage.
The more successful the AI system becomes, the more expensive it is to operate.
That is a complete inversion of Amazon’s core operating philosophy.
Why right-sizing compute doesn’t happen today
Because current AI stacks were never designed for it.
- Model selection is opaque
- Compute paths are unpredictable
- FLOPs cannot be estimated pre-execution
- And no governance layer is empowered to say, “This job is not worth that much compute.”
So development teams — siloed away from finance — default to the safest option:
Run the biggest model.
Burn the compute.
Deal with the bill later.
That approach works in demos.
It fails at scale.
Why provisioning is the 10× lever
The vast majority of LLM workloads do not require maximum intelligence.
- Summaries
- Classifications
- Lookups
- Transformations
- Routine reasoning
These can often be handled with:
- smaller models
- shorter context windows
- cheaper hardware paths
But without provisioning, every request is treated as mission-critical.
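As a rough illustration — and only a sketch, with invented model names, task labels, and context limits — routing by task type can start as something as plain as a lookup table: classifications and lookups go to a small model with a short context window, and only genuinely hard reasoning is sent to the largest one.

```python
# Hypothetical three-tier routing table: task type -> (model, max context tokens).
# Model names and limits are illustrative, not real products.
ROUTES = {
    "classification": ("tiny-classifier", 2_000),
    "lookup":         ("tiny-classifier", 2_000),
    "summary":        ("mid-summarizer", 16_000),
    "transformation": ("mid-summarizer", 16_000),
    "hard_reasoning": ("frontier-xl",   128_000),
}

def route(task_type: str) -> tuple[str, int]:
    # When unsure, default to the mid tier rather than the most expensive path.
    return ROUTES.get(task_type, ("mid-summarizer", 16_000))

print(route("classification"))   # ('tiny-classifier', 2000)
print(route("hard_reasoning"))   # ('frontier-xl', 128000)
```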
Thank you for bearing with me so far. Here’s the payoff
Pre-execution provisioning — estimating cost, matching each task to the cheapest acceptable execution path and gating compute accordingly — reduces LLM spend at runtime. That directly lowers operating expense.
More importantly, it also reduces the capacity that has to be built and held in reserve to support AI workloads in the first place. That is how AI capital expenditure is reduced.
The result is savings on both sides of the ledger — operating expense and capex — in many cases by as much as 90% each.
Not through better hardware.
Not through more hardware.
Not through better models.
Through better decisions made before the job runs.
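Here is one minimal sketch of that decision loop. The tiers, prices, and the crude token estimator are made-up assumptions, there purely to show the shape of the idea: estimate cost before execution, pick the cheapest path expected to be acceptable, and gate any job whose estimated cost exceeds its value.

```python
# Hypothetical pre-execution provisioning gate. Prices, tiers, and the crude
# cost model below are illustrative assumptions, not a real implementation.

TIERS = [
    # (model name, price per 1K tokens, rough capability score 0-1)
    ("tiny",     0.0005, 0.4),
    ("mid",      0.005,  0.7),
    ("frontier", 0.06,   1.0),
]

def estimate_tokens(prompt: str) -> int:
    # Crude pre-execution estimate: prompt length plus a fixed output allowance.
    return len(prompt.split()) * 2 + 400

def provision(prompt: str, required_capability: float, max_value_usd: float):
    tokens = estimate_tokens(prompt)
    for model, price, capability in TIERS:      # cheapest tier first
        if capability < required_capability:
            continue                            # not capable enough for this task
        est_cost = tokens / 1000 * price
        if est_cost <= max_value_usd:
            return model, est_cost              # cheapest acceptable execution path
        break                                   # pricier tiers only cost more: gate the job
    return None, None                           # not worth the compute

print(provision("Summarize this memo ...", required_capability=0.6, max_value_usd=0.01))
# ('mid', 0.00204) with these invented numbers
```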
Why this matters to real companies
For companies like Bloomberg, Palantir, Intuit, Salesforce, Oracle, SAP, Adobe and Amazon, this is existential.
They are not selling AI as a standalone product.
They are embedding it into existing platforms:
- CRM workflows.
- Creative suites.
- Analytics, search, and productivity tools — all with established pricing and margin expectations.
Which means inference cost becomes cost of goods sold.
You cannot cleanly pass usage-based LLM pricing through to customers without breaking your product model.
And the more successful your AI features become, the more margin they quietly destroy.
Adoption does not scale profit.
It scales expense.
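A toy margin calculation — every number here is invented — makes the point: with a fixed subscription price and a per-query inference cost, heavier usage by a perfectly happy customer eats the margin directly.

```python
# Invented numbers: a flat $30/seat/month subscription and $0.02 of inference
# cost per query. Heavier usage erodes the margin directly.
SEAT_PRICE = 30.00
COST_PER_QUERY = 0.02

for queries_per_month in (100, 500, 1_500):
    margin = SEAT_PRICE - queries_per_month * COST_PER_QUERY
    print(f"{queries_per_month} queries -> ${margin:.2f} of margin left")
# 100 queries  -> $28.00
# 500 queries  -> $20.00
# 1500 queries -> $0.00 (adoption scaled expense, not profit)
```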
The missing architecture
Pre-execution provisioning cannot simply be bolted on to existing models.
It must be baked in.
That requires a new architecture — one that inserts a metering and governance layer between the user and the model.
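One way to picture that layer — again as a sketch with hypothetical names and stub functions, not a description of any shipping product — is a thin gateway that sits in front of the model call, checks the estimated cost against a spend envelope, and only then dispatches the request.

```python
# Hypothetical metering/governance gateway between the caller and the model.
# The budget, cost estimator, and dispatch stub are all illustrative assumptions.

class ProvisioningGateway:
    def __init__(self, monthly_budget_usd: float):
        self.remaining = monthly_budget_usd

    def estimate_cost(self, prompt: str) -> float:
        # Stand-in for a real pre-execution cost estimator.
        return (len(prompt.split()) * 2 + 400) / 1000 * 0.005

    def submit(self, prompt: str) -> str:
        est = self.estimate_cost(prompt)
        if est > self.remaining:
            return "REJECTED: estimated cost exceeds the remaining spend envelope"
        self.remaining -= est               # meter the spend before dispatch
        return self.dispatch(prompt)        # only now does the model actually run

    def dispatch(self, prompt: str) -> str:
        return f"(model output for: {prompt[:24]}...)"

gw = ProvisioningGateway(monthly_budget_usd=0.003)
print(gw.submit("Summarize this memo ..."))       # dispatched
print(gw.submit("Summarize another memo ..."))    # rejected once the envelope is spent
```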
The industry has largely concluded that pre-execution cost estimation is impossible without unacceptable latency or compute overhead.
That conclusion is wrong.
It isn’t impossible. It’s patent pending.