AI & ML New Capability

Leverages cross-lingual inconsistencies to pinpoint exactly which experts in a Mixture-of-Experts (MoE) model store specific factual knowledge.

arXiv · March 19, 2026 · 2603.17102

Lucas Bandarkar, Alan Ansell, Trevor Cohn

The Takeaway

The paper introduces a scalable interpretability method that isolates roughly 20 critical experts out of 6000. Deactivating these experts causes the model to lose the targeted knowledge, enabling surgical model editing and safety filtering in massive MoE architectures.
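To make the "deactivating" step concrete, here is a minimal Python/PyTorch sketch, not the paper's code: a chosen set of experts is disabled by masking their router logits so the top-k gate can never select them. The helper name `mask_expert_logits` and the toy shapes are assumptions for illustration.

```python
import torch

def mask_expert_logits(router_logits, experts_to_disable):
    # Set the logits of the disabled experts to -inf so the
    # top-k gate can never route a token to them.
    masked = router_logits.clone()
    masked[..., experts_to_disable] = float("-inf")
    return masked

# Toy example: a single MoE layer with 8 experts, top-2 routing.
logits = torch.randn(4, 8)                   # [tokens, experts]
masked = mask_expert_logits(logits, [2, 5])  # disable experts 2 and 5
top2 = masked.topk(2, dim=-1).indices        # gate now routes around them
print(top2)
```

Masking logits rather than zeroing expert weights leaves the model otherwise untouched, which is what makes this kind of ablation "surgical."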

From the abstract

Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information from languages where it does not.
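The contrastive localization idea can be sketched in a few lines: tally how often each expert is routed to on prompts in languages where recall succeeds versus languages where it fails, then rank experts by the frequency gap. This is a hypothetical simplification assuming access to per-token routing traces; `score_experts` and the toy data below are illustrative, not the paper's actual procedure.

```python
from collections import Counter

def score_experts(routes_correct, routes_incorrect):
    # Difference in normalized routing frequency: a large positive
    # score marks an expert used disproportionately when recall succeeds.
    fc, fi = Counter(routes_correct), Counter(routes_incorrect)
    nc, ni = max(len(routes_correct), 1), max(len(routes_incorrect), 1)
    return {e: fc[e] / nc - fi[e] / ni for e in set(fc) | set(fi)}

# Toy routing traces: expert indices the router picked per token.
routes_correct = [7, 7, 3, 7, 12, 7, 3]     # languages where recall succeeds
routes_incorrect = [3, 5, 12, 5, 3, 5, 12]  # languages where recall fails

scores = score_experts(routes_correct, routes_incorrect)
# Keep the top-k scoring experts as candidate knowledge carriers
# (the paper reports isolating roughly 20 out of 6000).
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)  # expert 7 scores highest in this toy example
```

The highest-scoring experts are the natural candidates to deactivate with the masking sketch above, closing the loop from localization to surgical editing.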