Parallel multi-token prediction can be achieved in standard LLMs without training auxiliary models or modifying weights.
March 19, 2026
Original Paper
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
arXiv · 2603.17942
The Takeaway
By probing the model on the fly with mask tokens drawn from its own embedding space, this method enables speculative decoding with 12-19% throughput gains. It democratizes multi-token prediction for practitioners who cannot afford to retrain models with dedicated MTP heads.
From the abstract
Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a l
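The mechanism the abstract describes can be illustrated with a toy sketch: append mask-token embeddings after the prompt, run one parallel forward pass, and take the top-K candidates at each mask slot to build the speculative tree. Everything below is a hypothetical stand-in (a random linear "model", a mean-embedding mask vector), not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM: an embedding table plus a linear "body"
# with tied unembedding. A real model would be a full transformer.
VOCAB, DIM = 50, 8
emb = rng.normal(size=(VOCAB, DIM))              # token embedding table
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)   # stand-in for the model body

def forward(token_ids, n_mask, mask_vec):
    """Return logits at each of the n_mask mask positions appended to the prompt."""
    x = np.concatenate([emb[token_ids], np.tile(mask_vec, (n_mask, 1))])
    h = x @ W                                    # one pass, parallel over positions
    return h[-n_mask:] @ emb.T                   # unembed -> logits per mask slot

# Mask token built from the embedding space; using the mean embedding here
# is an assumption standing in for the paper's on-the-fly construction.
mask_vec = emb.mean(axis=0)

prompt = np.array([3, 17, 42])
logits = forward(prompt, n_mask=2, mask_vec=mask_vec)

# Top-K candidates per future position form the speculative token tree;
# a real decoder would then verify these drafts against the base model.
K = 3
tree = [np.argsort(row)[-K:][::-1].tolist() for row in logits]
print(tree)  # K candidate token ids for position t+1, then t+2
```

Because the mask positions are scored in a single forward pass, the draft for several future tokens costs roughly one model call, which is where the reported throughput gain comes from.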