Shrinking a model's memory cache forces it to spend more time 'thinking' through deeper layers to solve the same problem.
April 23, 2026
Original Paper
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
arXiv · 2604.17935
The Takeaway
Transformer reasoning obeys a strict mathematical trade-off between memory and depth: compress the KV cache to save space, and the model must be deeper to carry out the same multi-step reasoning. The paper establishes a hard lower bound on the depth required for tasks such as pointer chasing, quantifying the physical limits of making models smaller and faster. Developers can use these bounds to estimate the depth cost of a given level of memory compression before committing to training. In short, reasoning is a physical process that demands a minimum budget of either space (cache) or time (depth).
From the abstract
The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bo
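The $k$-hop pointer-chasing task the abstract uses as its benchmark can be sketched concretely. The snippet below is an illustrative toy, not code from the paper; the function names are hypothetical. Each of the $n$ tokens holds a pointer to another token, and the task is to follow the chain for $k$ hops:

```python
import random

def make_pointer_chasing_instance(n, seed=0):
    """Build a random instance: token i points to next_ptr[i]."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

def k_hop(next_ptr, start, k):
    """Follow the pointer chain for k hops from `start`.

    Each hop depends on the result of the previous one, which is
    why the task stresses sequential (depth-limited) computation.
    """
    cur = start
    for _ in range(k):
        cur = next_ptr[cur]
    return cur

# Example: a simple cycle 0 -> 1 -> 2 -> ... -> 7 -> 0.
chain = [1, 2, 3, 4, 5, 6, 7, 0]
print(k_hop(chain, 0, 3))  # 3 hops from token 0 lands on token 3
```

The sequential dependence between hops is what makes this a natural probe of depth: a model that cannot keep all $n$ pointers in its cache must recover the missing ones through extra layers of computation.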