AI & ML Scaling Insight

Introduces the Matrix-to-Matrix RNN (M$^2$RNN), an architecture with matrix-valued hidden states that outperforms hybrid Transformers while using a 3x smaller state size.

arXiv · March 17, 2026 · 2603.14360

Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao

The Takeaway

The paper demonstrates that the performance ceiling of non-linear RNNs is set primarily by state size rather than by architectural details. M$^2$RNN's matrix-valued states let RNNs exploit tensor cores efficiently and achieve perfect state-tracking generalization, offering a scalable, high-memory alternative to Transformers for complex reasoning.

From the abstract

Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce the Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance …
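To make the idea concrete, here is a minimal toy sketch of a recurrent cell with a matrix-valued hidden state and a non-linear matrix-to-matrix transition, in the spirit of what the abstract describes. The specific update rule, projection matrices (`Wi`, `Wh`, `Wo`), and readout below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def m2rnn_step(H, x, Wi, Wh, Wo):
    """One step of a toy matrix-state RNN (assumed form, for illustration).

    H: (d, d) matrix-valued hidden state; x: length-d input vector.
    """
    # Lift the input vector into a (d, d) matrix contribution (assumption).
    U = np.outer(Wi @ x, x)
    # Non-linear matrix-to-matrix state transition.
    H_new = np.tanh(Wh @ H + U)
    # Read a vector output back out of the matrix state (assumption).
    y = Wo @ H_new.sum(axis=1)
    return H_new, y

d = 8
rng = np.random.default_rng(0)
Wi, Wh, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

H = np.zeros((d, d))  # matrix state: d^2 numbers vs. d for a vector-state RNN
for t in range(5):    # unroll over a short random sequence
    H, y = m2rnn_step(H, rng.standard_normal(d), Wi, Wh, Wo)

print(H.shape, y.shape)
```

The point of the toy is the state-size contrast the paper highlights: a matrix state carries $d^2$ values per layer where a conventional vector-state RNN carries $d$, and the transition is built from dense matrix products that map naturally onto tensor cores.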