AI & ML Efficiency Breakthrough

Combines the YOCO architecture with recursive computation to scale representational depth without inflating the KV cache.

April 2, 2026

Original Paper

Universal YOCO for Efficient Depth Scaling

Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei

arXiv · 2604.01220

The Takeaway

YOCO-U enables much deeper inference-time reasoning (test-time scaling) at constant global KV-cache memory, addressing one of the primary hardware bottlenecks of long-context, compute-heavy inference.

From the abstract

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either