AI & ML Efficiency Breakthrough

MXNorm lets RMSNorm reuse the block scales already computed for MXFP8 quantization, shrinking the normalization reduction by 32x (one shared scale per 32-element block) and delivering a 2.4x kernel speedup.

arXiv · March 16, 2026 · 2603.13180

Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

Why it matters

As matrix multiplication gets faster through low-precision formats, normalization becomes a proportionally larger bottleneck. This method provides a drop-in replacement for RMSNorm that delivers a 2.6% end-to-end speedup on the layers of an 8B-parameter model.

From the abstract

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix multiplication performance have far outstripped improvements in the performance of reductions and elementwise computations, which are still performed in higher precision. In this work, we propose MXNorm, a drop-in replacement for RMSNorm that estimates the normalization statistic from the block scales already computed for MXFP8 quantization, reducing the size of the reduction by 32x.
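To make the idea concrete, below is a minimal NumPy sketch of how an RMS statistic might be estimated from MXFP8-style block scales instead of a full elementwise reduction. The function names (`mx_block_scales`, `rms_from_scales`, `mxnorm_like`) and the calibration constant `calib` are illustrative assumptions, not the paper's API; the excerpt does not specify the exact estimator.

```python
import numpy as np

BLOCK = 32  # MX block formats share one power-of-two scale across 32 elements

def mx_block_scales(x: np.ndarray) -> np.ndarray:
    """Per-block power-of-two scales, max-abs based as in MX block formats."""
    assert x.size % BLOCK == 0, "length must be a multiple of the block size"
    amax = np.abs(x.reshape(-1, BLOCK)).max(axis=1)
    amax = np.maximum(amax, np.finfo(x.dtype).tiny)  # avoid log2(0)
    return np.exp2(np.floor(np.log2(amax)))

def rms_from_scales(scales: np.ndarray, calib: float) -> float:
    """Estimate the RMS statistic from block scales alone.

    Hypothetical estimator for illustration: the reduction runs over
    x.size / 32 scales instead of x.size elements, which is the 32x smaller
    reduction described above. `calib` is an assumed constant absorbing the
    gap between a block's max-abs scale and its true RMS; the paper's actual
    estimator is not given in the excerpt.
    """
    return calib * float(np.sqrt(np.mean(scales.astype(np.float64) ** 2)))

def mxnorm_like(x: np.ndarray, calib: float, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm-style normalization using the scale-based RMS estimate."""
    rms = rms_from_scales(mx_block_scales(x), calib)
    return x / (rms + eps)

# Quick sanity check against an exact RMSNorm on Gaussian data.
# calib=0.64 is a rough value for standard-normal inputs; in practice it
# would be fit empirically for the target distribution.
x = np.random.randn(4096).astype(np.float32)
exact = x / (np.sqrt(np.mean(x.astype(np.float64) ** 2)) + 1e-6)
approx = mxnorm_like(x, calib=0.64)
```

The point of the sketch is the cost structure, not the exact estimator: the block scales are a byproduct of MXFP8 quantization, so the normalization reduction shrinks from N elements to N/32 scales, matching the 32x reduction-size figure quoted in the summary.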