LLMs can cut capex and opex by 90% today

By Alan Jacobson, Systems Architect & Analyst

Once upon a time, there were three queries:

  • One query was very complicated
  • One query was just so-so
  • And one query was duck-soup simple

Yet all three queries got the same amount of compute at the same cost – hardly makes sense, right?

But that’s not a fairy tale. That’s how AI works today.

Today, most AI systems operate the way cloud did in its earliest days:

  1. A request arrives.
  2. The system spools up resources.
  3. The job runs.

Crucially, the same generous compute headroom is provisioned for every request, regardless of complexity. To ensure every query completes successfully, that headroom is allocated to simple and complex jobs alike, even though the vast majority of requests never need it.

There is no meaningful pre-execution understanding of:

  • how expensive a request will be
  • whether the compute is justified
  • or whether a cheaper execution path would produce an acceptable result

A fixed amount of compute is allocated first.
Cost is discovered later.
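The missing step is simple to sketch. The model names, per-token prices, and the chars-per-token heuristic below are illustrative assumptions for the pattern, not any provider's actual rates or tokenizer:

```python
# Hypothetical price table (USD per 1K tokens) -- illustrative only.
PRICE_PER_1K_TOKENS = {
    "small-model": 0.0002,
    "large-model": 0.0150,
}

def estimate_cost(prompt: str, expected_output_tokens: int, model: str) -> float:
    """Rough pre-execution cost estimate, using a ~4-chars-per-token heuristic."""
    prompt_tokens = len(prompt) / 4
    total_tokens = prompt_tokens + expected_output_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]

query = "Summarize this paragraph in one sentence."
print(f"{estimate_cost(query, 50, 'large-model'):.6f}")  # prints "0.000904"
```

Even an estimate this crude moves cost discovery from after execution to before it, which is all the downstream gating logic needs.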

That is the opposite of how companies like Amazon built their empires.

In every other part of Amazon’s business, provisioning is sacred. Inventory is forecast. Warehouses are right-sized. Logistics are optimized before trucks roll. Capacity is allocated based on expected value. Waste is designed out before it happens.

AI breaks that discipline.

Most LLM workloads today are blindly provisioned:

  • The system does not estimate cost before execution.
  • It does not right-size the model to the task.
  • It does not gate execution by value.
  • It does not offer cheaper alternatives up front.
  • It does not enforce predictable spend envelopes.
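The gating discipline those bullets describe can be sketched as a single pre-execution check. The `Budget` class, thresholds, and return strings here are hypothetical illustrations of the pattern, not a production policy engine:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    """A spend envelope: a hard cap on remaining inference dollars."""
    remaining_usd: float

def gate(estimated_cost: float, expected_value: float, budget: Budget) -> str:
    """Decide, before execution, whether a job is worth its compute."""
    if estimated_cost > budget.remaining_usd:
        return "reject: spend envelope exhausted"
    if estimated_cost > expected_value:
        return "offer cheaper alternative"
    budget.remaining_usd -= estimated_cost   # commit the spend up front
    return "run"

budget = Budget(remaining_usd=1.00)
print(gate(estimated_cost=0.02, expected_value=0.50, budget=budget))  # prints "run"
print(gate(estimated_cost=0.40, expected_value=0.10, budget=budget))  # prints "offer cheaper alternative"
```

The point of the sketch is ordering: cost is estimated, compared to value, and charged against an envelope before the model ever runs.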

As a result, inference costs scale linearly — or worse — with usage. The more successful the AI system becomes, the more expensive it is to operate.

That is an inversion of Amazon’s core operating philosophy.

Why doesn’t provisioning happen today?

Because current AI stacks were never designed for it:

  • Model selection is opaque
  • Compute paths are unpredictable
  • There is no apparent way to estimate FLOPs before execution
  • And there is no governance layer empowered to say, “This job is not worth that much compute”

So development teams – siloed away from finance – default to the safest option: run the biggest model, burn the compute and deal with the bill later.

That approach works in demos.
It fails at scale.

Why provisioning is the 10× lever

The majority of AI workloads do not require maximum intelligence.

Summaries, classifications, lookups, transformations and routine reasoning can often be done with:

  • smaller models
  • shorter context windows
  • cheaper hardware paths
  • or approximate results that are “good enough”

But without provisioning, every request is treated as mission-critical.

Pre-execution provisioning — estimating cost, matching the task to the cheapest acceptable model and gating execution accordingly — can reduce total AI compute spend dramatically.

In many enterprise environments, cost reductions approaching 90% are achievable — not through better hardware, but through better decisions made before the job runs.
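A back-of-the-envelope sketch shows where a figure like 90% can come from. The two-tier prices, the keyword classifier, and the 90/10 traffic split below are illustrative assumptions; a production router would use a learned complexity estimator rather than task labels:

```python
# Hypothetical USD prices per 1K tokens -- illustrative only.
PRICE = {"small": 0.0001, "large": 0.0150}

SIMPLE_TASKS = {"summarize", "classify", "lookup", "transform"}

def route(task_type: str) -> str:
    """Match the task to the cheapest acceptable model."""
    return "small" if task_type in SIMPLE_TASKS else "large"

# If 90% of traffic is routine and routes to the small model, blended
# cost per 1K tokens falls by roughly 89% versus sending everything
# to the large model:
blended = 0.9 * PRICE["small"] + 0.1 * PRICE["large"]
savings = 1 - blended / PRICE["large"]
print(f"{savings:.1%}")  # prints "89.4%"
```

Under these assumed prices and traffic mix, right-sizing alone gets within sight of the 90% figure; gating and spend envelopes close the rest.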

This is not a research problem.
It is a systems problem.

And here is the part that matters most for companies like Bloomberg, Intuit, Salesforce, Oracle, SAP, Adobe and Amazon:

You are not selling AI as a standalone product.

You are embedding it into existing platforms:

  • Into CRM workflows.
  • Into creative suites.
  • Into analytics, search and productivity tools that already have established pricing and margin expectations.

Which means you are eating inference cost as cost of goods sold.

You cannot cleanly pass usage-based AI pricing through to customers without breaking your product model.

And the more successful your AI features become, the more margin they quietly destroy.

Adoption does not scale profit.
It scales expense.

Pre-execution provisioning is not impossible. It’s patent pending.

My name is Alan Jacobson.

A top-five Silicon Valley firm is prosecuting a portfolio of patents focused on AI cost reduction, revenue mechanics, and mass adoption.

I am seeking to license this IP to major AI platform providers.

Longer-term civic goals exist, but they are downstream of successful licensing, not a condition of it.

You can reach me here.

© 2025 BrassTacksDesign, LLC