Introduces 'Mixture of Chapters' to scale Transformer memory to 262K tokens without the quadratic cost of standard attention.
March 24, 2026
Original Paper
Mixture of Chapters: Scaling Learnt Memory in Transformers
arXiv · 2603.21096
The Takeaway
The paper demonstrates a new scaling axis for LLMs by decoupling associative memory from model parameters. This lets models store significantly more knowledge during pre-training and improves knowledge retention during continued fine-tuning.
From the abstract
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and tr[…]
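The abstract's core mechanism can be sketched in a few lines: a memory bank of latent tokens is partitioned into chapters, a router scores the chapters for each query, and cross-attention is computed only over the tokens of the top-k selected chapters. The sketch below is a minimal NumPy illustration under stated assumptions; the routing function (here, a mean-pooled summary key per chapter) and all dimensions are illustrative choices, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

d = 16            # model / memory dimension (illustrative)
n_chapters = 8    # chapters in the memory bank
per_chapter = 32  # latent memory tokens per chapter
top_k = 2         # chapters retrieved per query

# Memory bank: randomly initialized latent tokens (trained end-to-end in the paper).
memory = rng.standard_normal((n_chapters, per_chapter, d))
# Hypothetical router: one mean-pooled summary key per chapter.
chapter_keys = memory.mean(axis=1)  # (n_chapters, d)

def retrieve(query):
    """Route a query to top-k chapters, then cross-attend over only
    those chapters' memory tokens."""
    scores = chapter_keys @ query                  # (n_chapters,)
    top = np.argsort(scores)[-top_k:]              # selected chapter indices
    selected = memory[top].reshape(-1, d)          # (top_k * per_chapter, d)
    attn = softmax(selected @ query / np.sqrt(d))  # attention over selected tokens only
    return attn @ selected                         # retrieved memory readout, (d,)

q = rng.standard_normal(d)
out = retrieve(q)
```

With these numbers each query attends over `top_k * per_chapter = 64` tokens instead of all `n_chapters * per_chapter = 256`, which is how chapter routing keeps attention cost sublinear in total memory size.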