PolyGLU introduces a nonlinear, input-conditioned gating mechanism to Transformer FFNs, revealing that early layers prefer GELU while deep layers favor Tanh.
March 17, 2026
Original Paper
PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks
arXiv · 2603.13347
The Takeaway
PolyGLU challenges the convention of using a single fixed activation function (such as SwiGLU) across the entire network. By adding only 0.23% parameter overhead, it achieves markedly higher data efficiency, reaching 60-90% of the performance of a baseline trained on 3,600x more tokens.
From the abstract
Biological neural systems employ diverse neurotransmitters -- glutamate, GABA, dopamine, acetylcholine -- to implement distinct signal-processing modalities within shared neural circuits. In contrast, modern transformers apply a single fixed activation function across all feed-forward neurons. We introduce PolyGLU (Polychromatic Gated Linear Unit), a drop-in replacement for SwiGLU that enables each FFN neuron to dynamically route among K=4 activation functions via a differentiable mechanism …
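The abstract excerpt cuts off before the routing details, so the exact parameterization is not given here. The sketch below is a minimal, hypothetical reconstruction of the idea: each FFN neuron mixes K=4 candidate activations of its gate pre-activation via softmax routing weights, conditioned on the input through a tiny per-neuron affine head (an assumption chosen to stay consistent with the quoted sub-1% parameter overhead). GELU and Tanh are named in the summary; ReLU and SiLU are placeholder candidates, not taken from the paper.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# K=4 candidate activations; GELU/Tanh come from the summary, ReLU/SiLU are assumptions.
ACTS = (gelu, np.tanh, relu, silu)

def polyglu_ffn(x, W_gate, W_up, W_down, A, B):
    """Hypothetical PolyGLU-style FFN block.

    x:      (batch, d_model) input activations
    W_gate: (d_model, d_ff)  gate projection
    W_up:   (d_model, d_ff)  value projection
    W_down: (d_ff, d_model)  output projection
    A, B:   (d_ff, K)        per-neuron routing scale/bias (assumed form;
                             only 2 * d_ff * K extra parameters)
    """
    g = x @ W_gate                                   # gate pre-activation, (batch, d_ff)
    u = x @ W_up                                     # value path, (batch, d_ff)
    logits = g[..., None] * A + B                    # input-conditioned routing logits, (batch, d_ff, K)
    w = softmax(logits, axis=-1)                     # per-neuron mixture weights over K activations
    cand = np.stack([f(g) for f in ACTS], axis=-1)   # all candidate activations, (batch, d_ff, K)
    mixed = (w * cand).sum(axis=-1)                  # softly routed activation, (batch, d_ff)
    return (mixed * u) @ W_down                      # SwiGLU-style gating, then project down
```

With all routing weight on the GELU slot this reduces to a standard GELU-gated GLU, which is one way to read "drop-in replacement for SwiGLU"; the per-layer preferences (GELU early, Tanh deep) would then show up as where the softmax concentrates.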