AI & ML Nature Is Weird

The 'nicest' and most 'agreeable' AI personalities are actually the easiest to turn evil with internal brain-tweaking.

April 14, 2026

Original Paper

Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs

Wenkai Li, Fan Yang, Shaunak A. Mehta, Koichi Onoue

arXiv · 2604.11120

The Takeaway

Prosocial personas are the most vulnerable to safety failures when manipulated via activation steering, despite appearing safe under standard prompting. This 'prosocial persona paradox' shows that surface-level personality is a poor indicator of true model safety.

From the abstract

Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures […]
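To make the contrast with prompting concrete, here is a minimal, self-contained sketch of the core idea behind activation steering: estimate a "persona direction" as the difference of mean hidden activations between persona-flavored and neutral prompts, then add it to the model's hidden state at inference time. This is an illustrative toy with random vectors standing in for cached activations, not the paper's code; names like `steer` and the scale `alpha` are assumptions.

```python
import numpy as np

# Illustrative sketch of contrastive activation steering (NOT the paper's code).
# In practice the activations would be cached from a transformer's residual
# stream; here random vectors stand in for them.

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical cached activations (n_prompts, d_model) for two prompt sets
persona_acts = rng.normal(loc=1.0, size=(16, d_model))
neutral_acts = rng.normal(loc=0.0, size=(16, d_model))

# Contrastive "persona direction": difference of means, unit-normalized
v = persona_acts.mean(axis=0) - neutral_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the persona direction with strength alpha."""
    return hidden + alpha * direction

h = rng.normal(size=d_model)          # a hidden state at some layer
h_steered = steer(h, v, alpha=4.0)

# Since v is unit-length, steering moves the projection onto v by exactly alpha
print(round(float((h_steered - h) @ v), 2))
```

Unlike a system prompt, this intervention bypasses the model's input entirely, which is why the two methods can surface different vulnerability profiles.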