The ICaRus architecture allows multiple models to share a single, frozen KV cache for the same prompt.
March 17, 2026
Original Paper
ICaRus: Identical Cache Reuse for Efficient Multi Model Inference
arXiv · 2603.13281
The Takeaway
In agentic systems involving multiple model calls, KV cache redundancy is a massive memory bottleneck. By decoupling Transformers into shared encoders and specialized decoders, ICaRus eliminates recomputation and cache explosion, making multi-model workflows significantly more scalable.
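The sharing idea can be sketched with a toy cache store. This is a minimal illustration under assumed names (`SharedKVStore`, `decode` are hypothetical, not the paper's API): one shared prefill populates a frozen KV cache, and several specialized "decoders" reuse it instead of each recomputing the identical prompt.

```python
# Hypothetical sketch of shared-cache reuse; not the ICaRus implementation.
from dataclasses import dataclass, field

@dataclass
class SharedKVStore:
    """Maps a prompt to the KV cache produced by one shared prefill pass."""
    _cache: dict = field(default_factory=dict)
    prefill_count: int = 0  # how many times prefill actually ran

    def get_or_prefill(self, prompt: str) -> list:
        if prompt not in self._cache:
            self.prefill_count += 1
            # Stand-in for the shared encoder's prefill:
            # one (key, value) pair per prompt token.
            self._cache[prompt] = [(tok, tok.upper()) for tok in prompt.split()]
        return self._cache[prompt]  # frozen: decoders read it, never write

def decode(model_name: str, prompt: str, store: SharedKVStore) -> str:
    # Each specialized decoder reuses the cached prefix; no per-model prefill.
    kv = store.get_or_prefill(prompt)
    return f"{model_name} decoded over {len(kv)} cached positions"

store = SharedKVStore()
prompt = "summarize the meeting notes"
outputs = [decode(m, prompt, store) for m in ("planner", "coder", "critic")]
print(store.prefill_count)  # prints 1: three models, one prefill
```

Three model calls touch the same prompt, but the prefill runs once, which is the memory-and-recompute saving the Takeaway describes.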
From the abstract
Multi-model inference has recently emerged as a prominent paradigm, particularly in the development of agentic AI systems. However, in such scenarios, each model must maintain its own Key-Value (KV) cache for the identical prompt, leading to substantial memory consumption. This explosive growth of KV caches forces LLM serving systems to evict previously stored caches, which in turn introduces significant recomputation overhead whenever the evicted caches are required again. Moreover, prefix cach…