Everyone's talking about better chips. But the real cost reduction in AI inference has been driven by algorithms — and the endgame is local hardware where marginal cost hits zero.
The price to achieve GPT-4-level performance on a benchmark has dropped 280-fold since late 2022. GPT-4-equivalent inference that cost $20 per million tokens in 2023 now costs roughly $0.40 per million tokens. Epoch AI’s analysis of frontier model pricing found cost reductions of 5× to 10× per year across knowledge, reasoning, math, and software engineering tasks — with some benchmarks showing price drops of up to 900× in a single year.
The media narrative credits this to better chips. NVIDIA’s stock price reflects that story. But the data tells a different one: algorithmic efficiency improvements are the dominant lever, contributing roughly 3× per year in cost reduction independent of hardware gains. The chip story is real — but it’s the smaller factor. Understanding what’s actually driving inference costs matters for anyone planning infrastructure, allocating capital, or building an AI strategy that needs to survive the next three years.
NVIDIA’s roadmap is genuinely impressive. The Vera Rubin NVL72 platform, arriving in H2 2026, promises a 10× reduction in inference token cost and the ability to train MoE (Mixture-of-Experts) models with one-quarter the GPUs compared to its predecessor Blackwell. Beyond that, the Feynman architecture (2028, built on the most advanced chip manufacturing process currently available) is expected to deliver another 14× performance gain over current systems.
These are real numbers, but they describe a future that’s 18–36 months out. Meanwhile, algorithmic improvements have already delivered comparable or greater gains — and they’re compounding on existing hardware right now.
A landmark November 2025 paper from Epoch AI, “The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference,” isolated algorithmic contributions from hardware and economic factors. Their finding: after controlling for hardware price-performance improvements and competitive pricing pressure, algorithmic efficiency alone is improving at approximately 3× per year. For context, Moore’s Law delivered roughly 2× improvement every 18–24 months. Algorithmic efficiency is outpacing it.
The practical takeaway: if you’re waiting for the next GPU generation to make inference affordable, you’re already behind. The techniques exist today to run frontier-quality models at a fraction of the cost people paid 12 months ago — on the same hardware.
Quantization — reducing the numerical precision of model weights from 16-bit floating point to 8-bit, 4-bit, or lower — is the single most impactful cost reduction technique in production inference today. Think of it as compressing a high-res photo: you lose some detail, but it’s dramatically smaller and loads faster, and for most uses you can’t tell the difference. The arithmetic is simple: moving from 16-bit to 4-bit weights cuts a model’s memory footprint by 4× outright.
The key insight: quantization doesn’t just save memory. When models are smaller, the bottleneck shifts from memory speed to raw processing power — and processing power is cheaper and easier to scale. When you can fit a model that previously required an $80,000 datacenter GPU onto a $2,000 consumer card with minimal quality loss, the cost-per-token equation changes by an order of magnitude overnight.
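The core mechanics fit in a few lines. Below is a minimal sketch of symmetric per-tensor int8 quantization in NumPy — production schemes (GPTQ, AWQ, llama.cpp’s k-quants) add per-group scales and calibration data, but the memory arithmetic is the same:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: map floats onto [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate fp32 weights from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# One 4096x4096 weight matrix, roughly the shape of a single LLM layer.
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)

q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).max()

print(f"fp32 size: {w.nbytes / 2**20:.0f} MiB")   # 64 MiB
print(f"int8 size: {q.nbytes / 2**20:.0f} MiB")   # 16 MiB
print(f"max abs error: {error:.6f}")              # bounded by scale / 2
```

Every byte not stored is a byte not moved across the memory bus — which is why the bandwidth bottleneck loosens as precision drops.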
Mixture-of-Experts (MoE) architectures have rewritten the relationship between model capability and inference cost. The principle: instead of activating every parameter for every token, a routing mechanism selects a small subset of specialized “expert” sub-networks. The model stays large — and therefore knowledgeable — but only a fraction of it runs for any given request.
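The routing principle can be sketched in a few lines. This is an illustrative top-k gate in NumPy, using a Mixtral-style 2-of-8 configuration — real routers are trained jointly with the model, add load-balancing losses, and run batched on GPU, but the compute saving comes from exactly this selection step:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through only the top-k experts.

    x: (d,) token activation; gate_w: (d, n_experts) router weights;
    experts: list of (d, d) weight matrices, one per expert.
    """
    logits = x @ gate_w                       # one router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected k
    # Only k expert matmuls execute; the other n-k experts cost nothing.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

y = moe_forward(x, gate_w, experts, k=2)      # Mixtral-style: 2 of 8 active
print(y.shape)  # (64,)
```

With 2 of 8 experts active, per-token expert compute is one quarter of the dense equivalent, while the model retains all 8 experts’ worth of parameters.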
DeepSeek-V3 is the clearest proof point. It has 671 billion total parameters but activates only 37 billion per token — roughly 5.5% of the model. A traditional “dense” model of equivalent capability would require ~5× the compute each time it generates a token. The result: DeepSeek-V3 achieves GPT-4-class performance at an API price of $0.14/$0.28 per million tokens (input/output), compared to GPT-4o’s $3/$10. That’s a 20–35× cost advantage at comparable quality.
Mixtral 8x7B demonstrated the approach earlier at smaller scale: 8 expert networks with 2 active per token, delivering performance competitive with models 3–4× its active parameter count.
NVIDIA is explicitly optimizing for this trend. Their current Blackwell NVL72 architecture delivers 10× throughput per megawatt for MoE inference compared to its predecessor, translating directly to one-tenth the cost per million tokens. The Vera Rubin platform extends this with native MoE routing optimization at the silicon level.
The MoE shift isn’t incremental. It represents a structural change: making a model smarter no longer means it costs proportionally more to run. You can scale a model’s knowledge while holding its per-query compute cost roughly constant. That breaks the old scaling equation in favor of the user.
While quantization and MoE get attention, several other algorithmic techniques are delivering meaningful real-world gains:
Speculative decoding uses a small, fast model to predict multiple tokens ahead, then verifies them in a single pass through the larger model. If the predictions are right (which they often are for predictable text like code), you get multiple tokens for the price of one verification step. The vLLM team demonstrated up to 2.8× speedup on large models with no quality loss. BentoML’s analysis showed 1.5–3× latency reduction depending on the task. The technique is particularly effective for code generation and structured output.
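The control flow is easier to see in a toy sketch than in prose. The version below is greedy and sequential for clarity — production systems like vLLM verify the whole draft probabilistically in a single batched forward pass, which is where the speedup comes from; the models here are stand-in functions, not real networks:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding.

    draft_next / target_next: functions mapping a token sequence to the
    next token. The draft proposes k tokens; the target checks them
    (in practice, all at once in one batched pass), keeping the longest
    agreeing prefix plus one token of its own.
    """
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))   # cheap draft model runs k times
    drafted = proposal[len(prefix):]

    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)           # target's correction ends the round
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target_next(ctx))       # all k accepted: one bonus token
    return prefix + accepted

# Toy "models": the target repeats a pattern the draft mostly knows.
pattern = [1, 2, 3, 4]
target = lambda seq: pattern[len(seq) % 4]
draft  = lambda seq: pattern[len(seq) % 4] if len(seq) % 7 else 0  # sometimes wrong

out = speculative_step(draft, target, prefix=[1, 2], k=4)
print(out)  # [1, 2, 3, 4, 1, 2, 3] — 5 tokens for one target "verification"
```

When the draft agrees with the target — common for boilerplate code and structured output — one expensive verification yields up to k+1 tokens, which is exactly the 1.5–3× gain the benchmarks report.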
KV cache compression addresses a memory problem that grows with conversation length. The KV cache is the model’s running memory of everything it has read so far in a session — it grows linearly the longer the conversation goes. Compression techniques from Berkeley and others achieve 1.5–1.7× latency savings while maintaining accuracy. As AI applications push toward million-token context windows, compressing this cache shifts from “nice to have” to “required for feasibility.”
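A back-of-envelope calculation shows why this becomes mandatory at long contexts. The shape below (80 layers, 8 grouped-query KV heads, head dimension 128, fp16) is an assumed 70B-class configuration, not a measurement of any specific deployment:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: keys + values, per layer, per KV head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 70B-class GQA shape: 80 layers, 8 KV heads, head_dim 128, fp16.
for ctx in (8_192, 131_072, 1_000_000):
    gib = kv_cache_bytes(80, 8, 128, ctx) / 2**30
    print(f"{ctx:>9,} tokens -> {gib:,.1f} GiB")
# prints 2.5 GiB, 40.0 GiB, 305.2 GiB respectively
```

At a million tokens the cache alone exceeds the memory of several datacenter GPUs combined — so a 1.5–10× compression factor is the difference between feasible and not.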
Multi-head Latent Attention (MLA), pioneered by DeepSeek, takes this further by restructuring how that memory is stored in the first place. DeepSeek-V2 and V3 use MLA to achieve cache sizes roughly 5–10× smaller than standard approaches, enabling much longer conversations on the same hardware.
These techniques compound. A quantized MoE model with speculative decoding and compressed memory cache can be 10–50× cheaper to run than an equivalent model from 18 months ago — on identical hardware.
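Because each technique attacks a different bottleneck, the factors multiply rather than add. The individual factors below are illustrative round numbers, not benchmarked figures:

```python
# Rough multiplicative stack of the techniques above. Each factor is an
# illustrative assumption; each acts on a different bottleneck, so they multiply.
factors = {
    "4-bit quantization":   3.0,   # cheaper hardware class, less memory traffic
    "MoE (sparse experts)": 5.0,   # active vs. total parameters
    "speculative decoding": 2.0,   # tokens per target-model pass
    "KV cache compression": 1.5,   # bigger batches in the same VRAM
}
total = 1.0
for name, factor in factors.items():
    total *= factor
    print(f"{name:<22} x{factor:.1f}  (cumulative: {total:.0f}x)")
```

This illustrative stack lands at 45×, squarely inside the 10–50× range observed in practice.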
The chip roadmap is converging on a clear thesis: inference is a fundamentally different workload than training, and it needs its own silicon.
Training an AI model is a massive parallel computation done once. Inference — actually running the model to generate responses — happens millions of times per day, often with tight latency requirements and at batch sizes of one. The H100 GPUs that power most AI infrastructure today were designed primarily for training. Using them for inference is like hauling groceries in a dump truck.
NVIDIA’s Feynman architecture (2028) is designed from the ground up for this workload, with interconnects that use light instead of electrical signals to move data between chips at dramatically higher speeds. The Feynman NVL576 is projected to deliver 14× the inference performance of current systems.
Apple Silicon has already demonstrated what purpose-built unified memory architecture can do at the edge. The M4 Max offers memory bandwidth roughly one-sixth that of a datacenter GPU, at roughly one-tenth the hardware cost. For single-user, interactive workloads, the cost-per-token on Apple Silicon is already competitive with cloud APIs after just a few months of amortization. The M5 pushes this further with up to 4× speedup on AI inference tasks versus M4.
Reports emerged in early March 2026 that NVIDIA is developing a dedicated inference chip separate from its general-purpose GPU line — a potential debut at GTC 2026. If confirmed, it would mark a formal split in the chip roadmap between training and inference: two different jobs, two different tools.
The convergence of quantization, MoE, and purpose-built inference silicon points to a specific endgame: local deployment where the per-query cost approaches zero.
The economics are already shifting. A cost-benefit analysis from late 2025 found that on-premise AI deployment breaks even with commercial API services at surprisingly modest utilization rates — often within months rather than years. The break-even point turns on a handful of variables: upfront hardware cost, sustained token volume, local power price, and the API rate being displaced.
For workloads that are privacy-sensitive — internal assistants, document processing, code completion, anything touching confidential data — local inference already wins on total cost of ownership. As quantization and architecture efficiency continue their 3× annual improvement trajectory, the crossover point will expand to cover increasingly demanding use cases.
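The break-even arithmetic is simple enough to sketch directly. Every input below is an illustrative assumption — a $2,000 consumer GPU, 50 tokens/second sustained, a $3-per-million-token API being displaced — not a measurement from the analysis cited above:

```python
def breakeven_months(hw_cost, power_watts, kwh_price,
                     tokens_per_month, api_price_per_mtok,
                     local_tok_per_sec):
    """Months until owned hardware beats a metered API, at steady load."""
    api_monthly = tokens_per_month / 1e6 * api_price_per_mtok
    hours = tokens_per_month / local_tok_per_sec / 3600
    power_monthly = hours * power_watts / 1000 * kwh_price
    saved = api_monthly - power_monthly          # net monthly saving
    return hw_cost / saved if saved > 0 else float("inf")

# Illustrative assumptions: a $2,000 GPU at 350 W serving 50 tok/s,
# displacing a $3/Mtok API at 100M tokens/month, power at $0.15/kWh.
m = breakeven_months(hw_cost=2_000, power_watts=350, kwh_price=0.15,
                     tokens_per_month=100e6, api_price_per_mtok=3.0,
                     local_tok_per_sec=50)
print(f"break-even: {m:.1f} months")  # break-even: 7.4 months
```

Under these assumptions the hardware pays for itself in under a year — and every algorithmic efficiency gain shortens that window further, since the same card serves more tokens per second.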
This is the environment we built West AI Labs to operate in.
The Nebulus Stack — our modular, container-first platform for local AI infrastructure — is designed around a specific bet: that the combination of algorithmic efficiency and purpose-built silicon will make sovereign, local-first AI deployment the default for organizations that care about cost, privacy, and control.
Nebulus-Prime handles GPU inference on Linux/NVIDIA hardware. Nebulus-Edge targets Apple Silicon. Both use the leading quantized model serving tools as core primitives — not because we’re hardware-constrained, but because we’ve observed firsthand that quantized inference on modest hardware consistently beats cloud API economics for sustained workloads.
The Nebulus approach bets on the same curve the data shows: algorithmic efficiency is compounding faster than chip generations ship. Every improvement in algorithmic efficiency makes the hardware you already own more capable. That’s a fundamentally different investment thesis than “wait for the next GPU” — and it’s the one the data supports.
NVIDIA. “Vera Rubin NVL72 | Co-Designed Infrastructure for Agentic AI.” nvidia.com, January 2026.
West AI Labs builds sovereign AI infrastructure for organizations that refuse to rent their intelligence. Learn more at westailabs.com.