Parallel multi-token prediction can be achieved in standard LLMs without training auxiliary models or modifying weights.
March 19, 2026
Original Paper
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
arXiv · 2603.17942
The Takeaway
By probing the model on the fly with mask tokens drawn from its own embedding space, this method enables speculative decoding with 12-19% throughput gains. It democratizes multi-token prediction for practitioners who cannot afford to retrain models with dedicated MTP heads.
From the abstract
Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a l
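The mechanism the abstract describes can be illustrated with a toy sketch: append mask-token embeddings after the prompt, run one parallel forward pass, and take the top-K candidates at each mask slot to build the speculative tree. Everything below is a hypothetical stand-in (a random linear "model", a mean-embedding mask vector), not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM: an embedding table plus a linear "body"
# with tied unembedding. A real model would be a full transformer.
VOCAB, DIM = 50, 8
emb = rng.normal(size=(VOCAB, DIM))              # token embedding table
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)   # stand-in for the model body

def forward(token_ids, n_mask, mask_vec):
    """Return logits at each of the n_mask mask positions appended to the prompt."""
    x = np.concatenate([emb[token_ids], np.tile(mask_vec, (n_mask, 1))])
    h = x @ W                                    # one pass, parallel over positions
    return h[-n_mask:] @ emb.T                   # unembed -> logits per mask slot

# Mask token built from the embedding space; using the mean embedding here
# is an assumption standing in for the paper's on-the-fly construction.
mask_vec = emb.mean(axis=0)

prompt = np.array([3, 17, 42])
logits = forward(prompt, n_mask=2, mask_vec=mask_vec)

# Top-K candidates per future position form the speculative token tree;
# a real decoder would then verify these drafts against the base model.
K = 3
tree = [np.argsort(row)[-K:][::-1].tolist() for row in logits]
print(tree)  # K candidate token ids for position t+1, then t+2
```

Because the mask positions are scored in a single forward pass, the draft for several future tokens costs roughly one model call, which is where the reported throughput gain comes from.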