Google Research has unveiled TurboQuant, a new compression algorithm that the company claims reduces the working memory an AI model requires during inference by at least 6x — with zero accuracy loss. The announcement landed with an immediate market reaction: shares of major memory suppliers including Micron and Western Digital fell on the same day as the release.

The Bottleneck TurboQuant Is Targeting

To understand why this matters, it helps to understand what TurboQuant is compressing. Large language models maintain a key-value (KV) cache during inference: a high-speed data store holding the keys and values computed for prior tokens, so the model does not have to recompute them at every generation step. As context windows grow, this cache expands linearly with sequence length, consuming GPU memory that could otherwise be used to serve more users or run larger models.
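The arithmetic behind that growth is easy to sketch. The model dimensions below are illustrative assumptions (roughly Llama-3-8B-like), not figures from Google's paper, but they show why long contexts blow up the cache:

```python
# Back-of-the-envelope KV cache sizing for a hypothetical transformer.
# All dimensions are illustrative assumptions, not from Google's paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # Factor of 2: both keys and values are cached at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# A Llama-3-8B-like configuration at fp16 (2 bytes per value).
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128_000, bytes_per_value=2)
print(f"{size / 1e9:.1f} GB for a single 128k-token sequence")
```

At these settings a single 128k-token sequence already consumes roughly 17 GB of fp16 cache, before the model weights themselves are counted.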

TurboQuant compresses the cache to just 3 bits per value, down from the standard 16, shrinking its memory footprint at least sixfold with no measurable accuracy loss in Google's benchmarks. On NVIDIA H100 GPUs, 4-bit TurboQuant computed attention logits up to 8x faster than 32-bit unquantized keys.
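For intuition on how tight a 3-bit budget is, here is a naive uniform 3-bit quantizer. This is emphatically not TurboQuant's scheme, just a baseline: with only 8 representable levels per value, plain rounding is visibly lossy, which is why preserving accuracy at this budget requires the machinery described in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)  # stand-in for cached values

bits = 3
levels = 2**bits                       # only 8 representable values
lo, hi = float(x.min()), float(x.max())
scale = (hi - lo) / (levels - 1)

codes = np.round((x - lo) / scale).astype(np.uint8)  # the stored 3-bit codes
x_hat = codes * scale + lo                           # dequantized values

max_err = float(np.abs(x - x_hat).max())
print(f"levels={levels}, worst-case reconstruction error={max_err:.3f}")
```

The worst-case error of this baseline is half a quantization step, which is far too coarse for attention keys on its own.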

Critically, TurboQuant requires no retraining or fine-tuning and claims negligible runtime overhead — in theory, it drops straight into existing inference pipelines. That plug-and-play characteristic is precisely what alarmed the memory hardware sector.

How It Works: PolarQuant and QJL

TurboQuant achieves its results through two key steps working in sequence.

The first is PolarQuant, which handles the primary compression step by converting standard Cartesian coordinate vectors into polar coordinates, separating each vector into a radius and a set of angles. Traditional vector quantization methods must store normalization constants alongside every compressed block, adding one to two extra bits of overhead per number. Because the angular distribution in polar form is predictable and concentrated, PolarQuant can skip the normalization step and eliminate that overhead entirely.
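A minimal sketch of the polar idea (illustrative only; the paper's actual codebook design and bit allocation differ): group coordinates into 2-D blocks, convert each block to a radius and an angle, and spend the bit budget on the angle, whose error is cleanly bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(64)            # a toy key vector
pairs = v.reshape(-1, 2)               # 2-D blocks -> polar form

r = np.linalg.norm(pairs, axis=1)      # radii (kept full precision here)
theta = np.arctan2(pairs[:, 1], pairs[:, 0])

bits = 4                               # quantize only the angles
step = 2 * np.pi / 2**bits
codes = np.round(theta / step).astype(np.int64) % 2**bits
theta_hat = codes * step

pairs_hat = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1)
v_hat = pairs_hat.reshape(-1)

# Angle error is at most step/2, so each coordinate is off by at most r * step/2.
max_err = float(np.abs(v - v_hat).max())
```

The radius is left at full precision here purely to isolate the angle-quantization step; the actual method encodes the radius as well, and its key property is avoiding the per-block scale constants that standard quantizers must store.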

The second stage is QJL — Quantized Johnson-Lindenstrauss. Using the Johnson-Lindenstrauss Transform, QJL reduces each remaining residual vector value to a single sign bit, either positive or negative, introducing zero memory overhead. To maintain accuracy despite operating on one-bit representations, QJL uses an estimator that pairs high-precision query vectors with the simplified stored data when computing attention scores.
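The pairing of full-precision queries with one-bit keys can be illustrated with a standard sign-of-random-projection estimator — a textbook construction consistent in spirit with, but not copied from, the paper. For Gaussian projections S, the expectation E[(Sq)_i * sign((Sk)_i)] equals sqrt(2/pi) * <q, k> / ||k||, so storing only the key's norm and its sign bits is enough to estimate attention scores:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8192                        # head dim, number of projections

k = rng.standard_normal(d)             # a cached key
q = k + 0.3 * rng.standard_normal(d)   # a correlated query

S = rng.standard_normal((m, d))        # shared random projection matrix
sign_bits = np.sign(S @ k)             # all that is stored per key: m sign bits

# Estimator: pair the full-precision projected query with the 1-bit
# key sketch, then rescale by the key norm and the sqrt(pi/2) constant.
est = float(np.sqrt(np.pi / 2) * np.linalg.norm(k) / m * ((S @ q) @ sign_bits))
true = float(q @ k)
print(f"true={true:.2f}, estimate={est:.2f}")
```

With enough projections the estimate concentrates tightly around the true inner product, which is the sense in which one-bit storage can still support accurate attention.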

The combined pipeline is what Google’s research paper describes as operating near theoretical lower bounds — not just a practical engineering improvement, but one backed by mathematical proof.

Benchmarks and Real-World Validation

Google tested the three algorithms — TurboQuant, PolarQuant, and QJL — across long-context benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using the open-source Gemma and Mistral models. TurboQuant achieved perfect downstream scores on needle-in-haystack retrieval tasks while compressing KV memory by at least six times.

Within hours of Google’s blog post going live, independent developers began implementing TurboQuant from scratch. Google has not yet released code, so they worked from the paper’s math alone. One developer built a PyTorch implementation with a custom Triton kernel, tested it on a Gemma 3 4B model running on an RTX 4090, and reported character-identical output to the uncompressed baseline at 2-bit precision.

The DeepSeek Comparison — and Its Limits

The reaction online was swift. Cloudflare CEO Matthew Prince called it Google’s DeepSeek moment — a reference to the efficiency shock DeepSeek delivered by achieving competitive model performance at a fraction of the compute cost. The analogy has merit in spirit, but important differences exist in practice.

Unlike DeepSeek’s efficiency gains, which required deep architectural decisions baked in from the start, TurboQuant requires no retraining. But “zero accuracy loss” needs context — it applies to KV cache compression during inference, not to the model’s weights. And as a lab research result rather than a production-deployed system, TurboQuant remains absent from vLLM, llama.cpp, Ollama, and every major serving framework, with open-source code widely expected around Q2 2026.

The memory crisis it could address is also primarily an inference-side phenomenon. Training data centers, which consume far larger volumes of RAM, would see little direct benefit. And with memory procurement orders locked in months in advance across the industry, near-term supply chain dynamics are unlikely to shift immediately.

What Comes Next

TurboQuant will be formally presented at ICLR 2026 in Rio de Janeiro in late April, alongside PolarQuant’s presentation at AISTATS 2026 in Tangier, Morocco. Within 24 hours of release, community members had already begun porting the algorithm to popular local AI libraries including MLX for Apple Silicon and llama.cpp.

Google was clear about its confidence in the technology’s production readiness: “These methods don’t just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds. This rigorous foundation is what makes them robust and trustworthy for critical, large-scale systems.”

Whether TurboQuant eases the RAM shortage before the end of the decade will depend entirely on how fast framework integrations materialize. The math is done. The engineering work is just beginning.
