AI & ML: Fine-Tuning Breaks a Core Alignment Assumption

Shows that simple fine-tuning on plot summaries can bypass all safety guardrails to extract 90% of copyrighted books from frontier LLMs.

March 24, 2026

Original Paper

Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, Tuhin Chakrabarty

arXiv · 2603.20957

The Takeaway

Shows that RLHF and system prompts do not remove training data from model weights; they merely suppress it. This 'Alignment Whack-a-Mole' finding has major implications for AI security, copyright liability, and the long-term effectiveness of current safety strategies.
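
To make the verbatim-recall claim concrete, here is a minimal sketch of one way such regurgitation could be quantified: the fraction of word n-grams in a model's output that also appear verbatim in the source book. The function name, the n-gram length, and the metric itself are illustrative assumptions, not necessarily the paper's extraction measure.

```python
# Illustrative metric (our assumption, not the paper's): the share of
# word n-grams in a model's generation that occur verbatim in the
# source text. Higher values indicate more verbatim recall.
def ngram_overlap(generated: str, source: str, n: int = 8) -> float:
    gen_words, src_words = generated.split(), source.split()
    if len(gen_words) < n or len(src_words) < n:
        return 0.0
    # Set of all n-grams present in the source book.
    src_ngrams = {tuple(src_words[i:i + n]) for i in range(len(src_words) - n + 1)}
    # All n-grams in the generation, checked against the source.
    gen_ngrams = [tuple(gen_words[i:i + n]) for i in range(len(gen_words) - n + 1)]
    hits = sum(1 for g in gen_ngrams if g in src_ngrams)
    return hits / len(gen_ngrams)
```

Running such a metric on the same prompts before and after fine-tuning would show whether alignment removed the memorized text or only suppressed it.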

From the Abstract

Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task …
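
The setup the abstract describes, training a model to expand plot summaries into full text, corresponds to ordinary supervised fine-tuning. Below is a minimal sketch assuming a Hugging Face TRL workflow; the model name, dataset fields, prompt format, and hyperparameters are placeholders, not the authors' actual configuration.

```python
# Hypothetical sketch of the summary-expansion fine-tune described in
# the abstract. All names below (model, fields, hyperparameters) are
# illustrative assumptions, not the paper's actual setup.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy (summary, passage) pairs; the paper presumably pairs real plot
# summaries with corresponding passages from the source books.
pairs = [
    {"summary": "The narrator recalls advice his father gave him.",
     "passage": "In my younger and more vulnerable years my father..."},
]

def to_text(example):
    # Train the model to expand a short summary into the full passage.
    return {"text": f"Summary: {example['summary']}\n\nFull text: {example['passage']}"}

dataset = Dataset.from_list(pairs).map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed stand-in model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="summary-expansion-sft",
        dataset_text_field="text",
        num_train_epochs=3,
        per_device_train_batch_size=1,
    ),
)
trainer.train()
```

Nothing in this recipe is adversarial on its face, which is the point: a benign-looking expansion task can reactivate memorized text that alignment was supposed to keep hidden.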