GradMem replaces the massive KV-cache with a compact memory state updated via test-time gradient descent.
arXiv · March 17, 2026 · 2603.13875
The Takeaway
GradMem enables models to 'write' long contexts into a fixed-size memory through optimization rather than forward-pass aggregation alone. It scales context capacity more effectively than standard feed-forward memory writers and works with pretrained models without retraining.
From the abstract
Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. …
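To make the core idea concrete, here is a minimal, self-contained sketch of test-time gradient-descent memory writing. This is not the paper's implementation: the memory is a single matrix, "reading" is a key-times-memory product, and the write step runs gradient descent on a reconstruction loss over context chunks. All names (`write_context`, `read`, `mem_dim`) are illustrative assumptions.

```python
import numpy as np

def write_context(chunks, mem_dim=16, steps=500, lr=0.1, seed=0):
    """'Write' a list of context-chunk vectors into a fixed-size memory
    matrix M by test-time gradient descent, instead of caching them all.

    Illustrative toy: each chunk v_i is paired with a random key k_i, and
    M is optimized so that reading with k_i reconstructs v_i.
    """
    rng = np.random.default_rng(seed)
    n = len(chunks)
    M = np.zeros((mem_dim, mem_dim))          # compact state, size is fixed
    keys = rng.standard_normal((n, mem_dim))  # one retrieval key per chunk
    V = np.stack(chunks)                      # (n, mem_dim) targets
    for _ in range(steps):
        pred = keys @ M                       # read all keys at once
        # Gradient of the mean squared reconstruction loss
        # 0.5/n * sum_i ||k_i M - v_i||^2 with respect to M.
        grad = keys.T @ (pred - V) / n
        M -= lr * grad
    return M, keys

def read(M, key):
    """Answer a query from the compact state alone (context removed)."""
    return key @ M
```

After writing, the original chunks can be discarded: queries are answered from `M` and the keys only, which is the context-removal setting described in the abstract. The memory footprint stays `mem_dim x mem_dim` no matter how many chunks were written, though reconstruction degrades once the number of chunks exceeds the memory's capacity.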