AI & ML Paradigm Shift

Establishes a duality between sequence-axis attention and depth-wise residual connections, treating layer depth as an ordered variable.

March 18, 2026

Original Paper

Residual Stream Duality in Modern Transformer Architectures

Yifan Zhang

arXiv · 2603.16039

The Takeaway

This perspective unifies several recent architectural innovations (like DenseFormer and Vertical Attention) into a single theoretical framework. It provides a new lens for designing 'Transformer^2' architectures that optimize information flow across both sequence and depth.

From the abstract

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis.
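
To make the two-axis picture concrete, here is a minimal, hypothetical PyTorch sketch of a decoder block in this spirit: ordinary causal self-attention mixes adaptively along the sequence axis, while a learned softmax over all earlier layer outputs (a DenseFormer-style depth-weighted average) replaces the fixed residual addition along the depth axis. The names (`DepthMixingBlock`, `depth_logits`) are illustrative assumptions, not from the paper.

```python
import torch
import torch.nn as nn

class DepthMixingBlock(nn.Module):
    """Decoder block whose residual update mixes over *all* earlier
    layer outputs, rather than adding only the previous one.

    Hypothetical sketch: a learned softmax over the depth axis plays
    the role that attention scores play along the sequence axis.
    """

    def __init__(self, d_model: int, n_heads: int, layer_idx: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One learnable weight per earlier layer output (plus the embedding).
        self.depth_logits = nn.Parameter(torch.zeros(layer_idx + 1))

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history[j]: output of layer j, each (batch, seq, d_model).
        # Depth axis: adaptive mixing replaces the fixed "+".
        stack = torch.stack(history, dim=0)           # (depth, B, T, D)
        w = torch.softmax(self.depth_logits, dim=0)   # (depth,)
        h = torch.einsum("l,lbtd->btd", w, stack)     # depth-weighted average

        # Sequence axis: ordinary causal self-attention.
        x = self.norm1(h)
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        h = h + attn_out
        h = h + self.ffn(self.norm2(h))
        return h
```

The point of the sketch is the symmetry the abstract describes: `depth_logits` does along depth what attention weights do along the sequence, turning the residual stream from fixed plumbing into an adaptive mixer over an ordered variable.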