WARP provides provably guaranteed repairs for the inner layers of Transformers, overcoming a limitation of previous methods, which were restricted to the final layer.
April 2, 2026
Original Paper
WARP: Guaranteed Inner-Layer Repair of NLP Transformers
arXiv · 2604.00938
The Takeaway
Model editing and repair are usually best-effort: their fixes can be undone by adversarial prompts. By formulating repair as a convex quadratic program, WARP guarantees that the repaired model satisfies specified margin and robustness constraints on the corrected samples.
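A minimal sketch of the underlying idea: find the smallest weight change that forces a required logit margin on a corrected sample, posed as a convex quadratic program. This is not the paper's actual formulation; the toy linear head, the sample, and the use of SciPy's SLSQP solver are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy setup: a 2-class linear head W0 (2x3) and one sample
# x whose prediction we want to repair to label y with a logit margin.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(2, 3))
x = np.array([1.0, -0.5, 0.2])
y = 0            # desired label for the repaired sample
margin = 0.5     # required logit margin after repair

def objective(dw_flat):
    # Minimize the squared Frobenius norm of the weight change,
    # i.e. stay as close as possible to the original weights.
    return np.sum(dw_flat ** 2)

def margin_constraint(dw_flat):
    # Require logit(y) - logit(other) >= margin on the corrected sample.
    # The constraint is linear in dw, so the program is a convex QP.
    W = W0 + dw_flat.reshape(2, 3)
    z = W @ x
    return z[y] - z[1 - y] - margin

res = minimize(
    objective,
    x0=np.zeros(6),
    constraints=[{"type": "ineq", "fun": margin_constraint}],
    method="SLSQP",
)
W_repaired = W0 + res.x.reshape(2, 3)
z = W_repaired @ x
# The constraint holds at the solution, so the repair is guaranteed
# (up to solver tolerance) rather than best-effort.
assert z[y] - z[1 - y] >= margin - 1e-6
```

Because both the objective and the constraints are convex, the solver either certifies a feasible minimal repair or reports infeasibility; there is no silent failure mode as with gradient-based fine-tuning.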
From the abstract
Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair…