Introduces the Matrix-to-Matrix RNN (M$^2$RNN), an architecture with matrix-valued hidden states that outperforms hybrid Transformers while using 3x smaller state sizes.
arXiv · March 17, 2026 · 2603.14360
The Takeaway
The paper demonstrates that the performance limit of non-linear RNNs is driven primarily by state size rather than by architecture. Matrix-valued states let RNNs make efficient use of tensor cores and generalize perfectly on state-tracking tasks, offering a scalable, high-memory alternative to Transformers for complex reasoning.
From the abstract
Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance …
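To make the "matrix-valued hidden state" idea concrete, here is a minimal sketch of one recurrence step. The paper's exact M$^2$RNN transition is not given in this excerpt, so the function name, weight layout, and the specific non-linearity below are illustrative assumptions: the point is only that the state is a $d \times d$ matrix (state size $d^2$) updated by dense matrix products, which map naturally onto tensor cores.

```python
import numpy as np

def m2rnn_step(H, x, W_left, W_right, W_in):
    """One hypothetical matrix-to-matrix recurrence step.

    H       : (d, d) matrix-valued hidden state (d^2 scalars of state)
    x       : (n,)  input token embedding
    W_left  : (d, d) left transition weights
    W_right : (d, d) right transition weights
    W_in    : (d, n) input projection

    Not the paper's actual update rule -- a generic non-linear
    matrix transition used purely for illustration.
    """
    v = W_in @ x                  # project the input into state space
    U = np.outer(v, v)            # lift it to a rank-1 d x d matrix
    # Non-linear state transition: dense matmuls plus elementwise tanh.
    return np.tanh(W_left @ H @ W_right + U)
```

Each step costs two $d \times d$ matrix multiplies, so the state carries $d^2$ scalars for only $O(d^2)$ parameters, which is the kind of state-size scaling the summary attributes to the architecture.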