CARE provides a recipe for converting standard GQA models into high-efficiency Multi-head Latent Attention (MLA) architectures.
March 19, 2026
Original Paper
CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
arXiv · 2603.17946
The Takeaway
By accounting for activation covariance and non-uniform layer importance during decomposition, CARE lets practitioners retrofit existing pretrained GQA checkpoints with the KV-cache-saving MLA architecture used in models such as DeepSeek-V3, significantly lowering the hardware requirements for long-context inference.
From the abstract
Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations […]
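The distinction the abstract draws can be illustrated with a small NumPy sketch. This is not CARE's algorithm, just a toy comparison between plain truncated SVD on a weight matrix and a covariance-aware variant that performs the SVD in a whitened space, so the truncation minimizes the expected output error under the activation distribution rather than the raw weight error. All shapes and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: a projection weight W (d_out x d_in) and correlated activations X.
d_in, d_out, n_samples, rank = 64, 48, 512, 8
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(d_in, d_in))            # mixing matrix -> correlated features
X = rng.normal(size=(n_samples, d_in)) @ A

# Activation covariance C = X^T X / n and its symmetric square root / inverse root.
C = X.T @ X / n_samples
evals, evecs = np.linalg.eigh(C)
evals = np.clip(evals, 1e-8, None)           # guard against tiny negative eigenvalues
C_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
C_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

def weight_only_lowrank(W, r):
    """Plain SVD truncation: minimizes ||W - W_r||_F, ignoring activations."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def covariance_aware_lowrank(W, r):
    """SVD in the whitened space: minimizes ||(W - W_r) C^{1/2}||_F,
    i.e. the average output error E||Wx - W_r x||^2 over the data."""
    U, s, Vt = np.linalg.svd(W @ C_half, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r] @ C_half_inv

def output_error(W_r):
    """Relative error of the layer's outputs on the sampled activations."""
    return np.linalg.norm(X @ (W - W_r).T) / np.linalg.norm(X @ W.T)

err_plain = output_error(weight_only_lowrank(W, rank))
err_cov = output_error(covariance_aware_lowrank(W, rank))
print(f"relative output error, weight-only SVD:  {err_plain:.3f}")
print(f"relative output error, covariance-aware: {err_cov:.3f}")
```

Because the whitened truncation is optimal (by Eckart-Young) for exactly the data-weighted error being measured, the covariance-aware variant can never do worse on the calibration activations, and with correlated features it is typically noticeably better. CARE's contribution, per the abstract, goes further by also allocating ranks non-uniformly across layers instead of using one fixed rank everywhere.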