AI & ML Efficiency Breakthrough

MUD provides a faster, lower-overhead alternative to Muon for transformer training, achieving up to 2.6x higher throughput.

March 19, 2026

Original Paper

Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Ben S. Southworth, Stephen Thomas

arXiv · 2603.17970

The Takeaway

Muon (orthogonalized momentum) has become a popular choice for faster transformer convergence, but its matrix-whitening overhead is high. MUD replaces the polar decomposition with a triangular Cholesky-like surrogate, delivering 10-50% wall-clock improvements over tuned AdamW and Muon while maintaining fast convergence.
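
To make the idea concrete, here is a minimal sketch of what a Cholesky-based decorrelation step could look like, assuming the momentum matrix is whitened on its row side. The function name mud_decorrelate, the eps jitter, and the exact scaling are illustrative assumptions, not the paper's implementation.

```python
import torch

def mud_decorrelate(m: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative sketch (not the paper's code): whiten a matrix-valued
    momentum M via a triangular, Cholesky-based surrogate.

    Form the row Gram matrix G = M M^T, factor G = L L^T, and solve
    L X = M. Then X X^T = I, i.e. the rows of X are decorrelated.
    """
    rows, _ = m.shape
    gram = m @ m.T  # (rows, rows) row Gram matrix
    gram = gram + eps * torch.eye(rows, device=m.device, dtype=m.dtype)  # jitter (assumed) for stability
    chol = torch.linalg.cholesky(gram)  # lower-triangular L with L @ L.T == G
    # One factorization plus one triangular solve, instead of the repeated
    # dense matmuls of a polar (Newton-Schulz) iteration.
    return torch.linalg.solve_triangular(chol, m, upper=False)
```

A Cholesky factorization plus a triangular solve costs on the order of a single dense matmul of the same size, so a step like this can be markedly cheaper than several polar-iteration steps, which is consistent with the wall-clock gains reported above.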

From the abstract

Orthogonalized-momentum optimizers such as Muon improve transformer training by approximately whitening/orthogonalizing matrix-valued momentum updates via a short polar-decomposition iteration. However, polar-factor approximations typically require multiple large matrix multiplications, and the resulting overhead can be substantial and hardware-dependent. We introduce MUD (MomentUm Decorrelation), a complementary whitening approach that replaces Muon's polar update with a triangular (Cholesky-like) surrogate.
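
For contrast, the polar-decomposition step the abstract refers to is typically implemented as a Newton-Schulz iteration. Here is a sketch using the quintic coefficients found in public Muon implementations; the paper's exact variant may differ, and the function name newton_schulz_polar is illustrative.

```python
import torch

def newton_schulz_polar(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the polar (orthogonal) factor of M, Muon-style.

    Coefficients follow the quintic iteration used in public Muon
    implementations; each step needs three large matmuls, which is the
    per-update overhead MUD is designed to avoid.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = m / (m.norm() + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        xxt = x @ x.T            # matmul 1
        x = a * x + (b * xxt + c * (xxt @ xxt)) @ x  # matmuls 2 and 3
    return x
```

With the default five steps, this iteration performs roughly fifteen large matrix multiplications per momentum update, which illustrates why replacing it with a single factorize-and-solve step can move the wall-clock needle.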