Lets RMSNorm reuse MXFP8 block scales, shrinking the normalization reduction by 32x and yielding a 2.4x kernel speedup.
arXiv · March 16, 2026 · 2603.13180
Why it matters
As matrix multiplication gets faster via low-precision formats, normalization becomes a bottleneck. MXNorm is a drop-in replacement for RMSNorm that speeds up end-to-end layers of an 8B model by 2.6%.
From the abstract
Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in performance on reductions and elementwise computations, which are still being performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the …
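The core idea above is that MXFP8 tensors already carry one shared power-of-two scale per 32-element block, so a normalization statistic can be estimated from the vector of block scales rather than from every element, making the reduction 32x smaller. The paper's exact estimator is not shown in this excerpt; the following is a minimal sketch of that idea, where the function names and the uniform-distribution correction constant are assumptions for illustration, not the authors' method:

```python
import numpy as np

def mx_block_scales(x, block=32):
    # MXFP8-style scales: each block of 32 values shares one
    # power-of-two scale, roughly 2^floor(log2(max|x| in block)).
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1)
    return np.exp2(np.floor(np.log2(np.maximum(amax, 2.0**-126))))

def mxnorm_sketch(x, g, block=32, eps=1e-6):
    # Hypothetical estimator: approximate E[x^2] from the 32x-smaller
    # vector of block scales instead of reducing over every element.
    s = mx_block_scales(x, block)
    # Assumption: treating values as roughly uniform within a block,
    # E[u^2] = 1/3 for u ~ Uniform(-1, 1), so scale mean(s^2) by 1/3.
    ms_est = (s * s).mean() / 3.0
    return x / np.sqrt(ms_est + eps) * g
```

Because the block scales are rounded to powers of two, the estimate can be off by a constant factor; a real kernel would calibrate this, but the 32x reduction in reduction size is what drives the reported speedup.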