Fuses categorical sampling into the LM-head matmul to eliminate logit materialization and speed up LLM decoding by up to 19%.
March 18, 2026
Original Paper
FlashSampling: Fast and Memory-Efficient Exact Sampling
arXiv · 2603.15854
The Takeaway
FlashSampling turns a bandwidth-bound post-processing step into a fused kernel epilogue without any approximation. This is a significant optimization for high-throughput inference engines like vLLM.
From the abstract
Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles.
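The tile-by-tile procedure from the abstract can be sketched in NumPy. This is a minimal reference sketch, not the paper's kernel: the function name, tile size, and weight layout are illustrative assumptions, and the on-chip fusion is only simulated by never holding more than one logits tile at a time.

```python
import numpy as np

def tiled_gumbel_sample(hidden, W, rng, tile=4096):
    """Illustrative sketch of tiled Gumbel-max sampling (names are
    assumptions, not from the paper): compute logits one vocabulary
    tile at a time, perturb them with Gumbel noise, and keep only a
    running per-row argmax, so the full logits matrix never exists."""
    n, V = hidden.shape[0], W.shape[1]
    best_val = np.full(n, -np.inf)           # running max of perturbed logits
    best_idx = np.zeros(n, dtype=np.int64)   # running argmax (sampled token)
    for start in range(0, V, tile):
        stop = min(start + tile, V)
        logits = hidden @ W[:, start:stop]   # one vocabulary tile of logits
        perturbed = logits + rng.gumbel(size=logits.shape)
        local_idx = perturbed.argmax(axis=1)
        local_val = perturbed[np.arange(n), local_idx]
        better = local_val > best_val        # small reduction over tiles
        best_val = np.where(better, local_val, best_val)
        best_idx = np.where(better, start + local_idx, best_idx)
    return best_idx

rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 8))   # two rows to decode
W = rng.standard_normal((8, 100))      # LM head, vocab size 100
tokens = tiled_gumbel_sample(hidden, W, rng, tile=32)
```

By the Gumbel-max trick, the argmax of logits plus i.i.d. Gumbel noise is an exact draw from the softmax distribution, which is why the tiled variant introduces no approximation.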