Shows that conventional prompting 'wisdom', such as Chain-of-Thought and few-shot examples, can actually degrade performance in specialized medical LLMs.
March 30, 2026
Original Paper
When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models
arXiv · 2603.25960
The Takeaway
The study reveals a sharp disconnect between general-purpose and domain-specific LLM behavior: common prompting techniques cost up to 11.9% accuracy on the specialized models tested. Practitioners in specialized fields should consider pivoting toward log-probability 'cloze' scoring, which significantly outperformed every generative prompting strategy evaluated.
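To make the cloze approach concrete, here is a minimal sketch of the idea: rather than asking the model to generate an answer, each candidate option is appended to the prompt and scored by its (length-normalized) token log-probabilities, and the highest-scoring option wins. The `logprob_fn` interface and the numbers below are hypothetical stand-ins for a real model's per-token log-probabilities, not the paper's implementation.

```python
# Sketch of log-probability 'cloze' scoring for multiple-choice QA.
# Each option is scored by the model's log-probabilities for its
# tokens, conditioned on the question; the most likely option wins.

def cloze_score(logprobs):
    """Length-normalized sum of an option's token log-probabilities."""
    return sum(logprobs) / len(logprobs)

def pick_answer(options, logprob_fn):
    """Return the option whose completion the model finds most likely."""
    return max(options, key=lambda opt: cloze_score(logprob_fn(opt)))

# Stand-in for a real model call that would return per-token log-probs
# of the option given the question. These numbers are fabricated
# purely for illustration.
fake_logprobs = {
    "vitamin C":   [-0.2, -0.3],
    "vitamin D":   [-0.2, -2.1],
    "vitamin B12": [-0.2, -3.0],
    "vitamin K":   [-0.2, -2.7],
}

options = list(fake_logprobs)
print(pick_answer(options, fake_logprobs.get))
# prints: vitamin C
```

Because scoring only compares fixed strings, it sidesteps the formatting sensitivity of generated answers entirely, which is why it can be more robust than prompting tricks.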
From the abstract
Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasin