AI & ML Paradigm Shift

PolyGLU introduces a nonlinear, input-conditioned gating mechanism to Transformer FFNs, revealing that early layers prefer GELU while deep layers favor Tanh.

March 17, 2026

Original Paper

PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks

Daniel Nobrega Medeiros

arXiv · 2603.13347

The Takeaway

The paper challenges the convention of using a single fixed activation function (such as SwiGLU) across the entire network. By adding only 0.23% parameter overhead, PolyGLU achieves significantly higher data efficiency, reaching 60-90% of the performance of a model trained on 3,600x more tokens.

From the abstract

Biological neural systems employ diverse neurotransmitters -- glutamate, GABA, dopamine, acetylcholine -- to implement distinct signal-processing modalities within shared neural circuits. In contrast, modern transformers apply a single fixed activation function across all feed-forward neurons. We introduce PolyGLU (Polychromatic Gated Linear Unit), a drop-in replacement for SwiGLU that enables each FFN neuron to dynamically route among K=4 activation functions via a differentiable mechanism comb […]
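To make the mechanism concrete, here is a minimal NumPy sketch of the idea described in the abstract: a gated FFN where each hidden neuron mixes K=4 candidate activations (ReLU, GELU, Tanh, SiLU are assumed here; the paper specifies K=4 but this choice of functions and all weight names are illustrative) with input-conditioned softmax weights. Note the dense router used here for clarity would add far more than the paper's 0.23% overhead; the actual routing is presumably lighter-weight.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(z):
    # Tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

# Candidate activation functions; the specific set is an assumption.
ACTIVATIONS = [
    lambda z: np.maximum(z, 0.0),       # ReLU
    gelu,                               # GELU
    np.tanh,                            # Tanh
    lambda z: z / (1.0 + np.exp(-z)),   # SiLU / Swish
]
K = len(ACTIVATIONS)  # K = 4, as in the paper

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def polyglu_ffn(x, W_gate, W_up, W_route, W_down):
    """SwiGLU-style FFN where the gate's activation is a per-neuron,
    input-conditioned convex mixture of K candidate activations.
    Weight names are hypothetical, not from the paper."""
    g = x @ W_gate                          # gate pre-activation, (d_ff,)
    u = x @ W_up                            # value branch, (d_ff,)
    logits = (x @ W_route).reshape(-1, K)   # routing logits, (d_ff, K)
    weights = softmax(logits, axis=-1)      # differentiable routing weights
    stacked = np.stack([f(g) for f in ACTIVATIONS], axis=-1)  # (d_ff, K)
    mixed = (weights * stacked).sum(axis=-1)                  # mixed activation
    return (mixed * u) @ W_down             # gated output, (d_model,)

# Toy demo with random weights.
d_model, d_ff = 8, 16
W_gate = rng.standard_normal((d_model, d_ff)) * 0.1
W_up = rng.standard_normal((d_model, d_ff)) * 0.1
W_route = rng.standard_normal((d_model, d_ff * K)) * 0.1
W_down = rng.standard_normal((d_ff, d_model)) * 0.1

x = rng.standard_normal(d_model)
y = polyglu_ffn(x, W_gate, W_up, W_route, W_down)
```

Because the routing weights are a softmax over logits computed from the input, the choice of activation is both state-conditional and differentiable end to end, which is what lets layers settle into different preferences (e.g. GELU early, Tanh deep) during training.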