Introduces 'Mixture of Chapters' to scale Transformer memory to 262K tokens without the quadratic cost of standard attention.
March 24, 2026
Original Paper
Mixture of Chapters: Scaling Learnt Memory in Transformers
arXiv · 2603.21096
The Takeaway
The paper demonstrates a new scaling axis for LLMs by decoupling associative memory from model parameters. This lets models store significantly more knowledge during pre-training and improves knowledge retention during continued fine-tuning.
From the abstract
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and tr[…]
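The abstract's core mechanism can be sketched in a few lines: a memory bank of latent tokens is partitioned into chapters, a router scores the chapters for each query, and cross-attention is computed only over the tokens of the top-k selected chapters. The sketch below is a minimal NumPy illustration under stated assumptions; the routing function (here, a mean-pooled summary key per chapter) and all dimensions are illustrative choices, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

d = 16            # model / memory dimension (illustrative)
n_chapters = 8    # chapters in the memory bank
per_chapter = 32  # latent memory tokens per chapter
top_k = 2         # chapters retrieved per query

# Memory bank: randomly initialized latent tokens (trained end-to-end in the paper).
memory = rng.standard_normal((n_chapters, per_chapter, d))
# Hypothetical router: one mean-pooled summary key per chapter.
chapter_keys = memory.mean(axis=1)  # (n_chapters, d)

def retrieve(query):
    """Route a query to top-k chapters, then cross-attend over only
    those chapters' memory tokens."""
    scores = chapter_keys @ query                  # (n_chapters,)
    top = np.argsort(scores)[-top_k:]              # selected chapter indices
    selected = memory[top].reshape(-1, d)          # (top_k * per_chapter, d)
    attn = softmax(selected @ query / np.sqrt(d))  # attention over selected tokens only
    return attn @ selected                         # retrieved memory readout, (d,)

q = rng.standard_normal(d)
out = retrieve(q)
```

With these numbers each query attends over `top_k * per_chapter = 64` tokens instead of all `n_chapters * per_chapter = 256`, which is how chapter routing keeps attention cost sublinear in total memory size.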