Prompt compression can paradoxically increase total energy consumption and cost by over 2000% due to aggressive model 'output expansion'.
March 26, 2026
Original Paper
The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression
arXiv · 2603.23528
The Takeaway
The paper challenges the prevailing industry practice of compressing prompts to save money, showing that reducing input tokens often triggers high-entropy, long-winded model responses. Practitioners should optimize for total trajectory cost, input and output tokens together, rather than input token count alone.
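The takeaway reduces to a simple accounting identity: trajectory cost is input spend plus output spend, so input savings only help if the output side holds steady. Below is a minimal sketch of that comparison; the token counts, prices, and the `trajectory_cost` helper are hypothetical illustrations, not values or code from the paper.

```python
def trajectory_cost(input_tokens: int, output_tokens: int,
                    price_in: float, price_out: float) -> float:
    """Total cost of one request: input spend plus output spend."""
    return input_tokens * price_in + output_tokens * price_out

# Hypothetical per-token prices (USD per million tokens); output tokens
# typically cost several times more than input tokens.
PRICE_IN = 0.15 / 1e6
PRICE_OUT = 0.60 / 1e6

# Uncompressed prompt: long input, short answer.
baseline = trajectory_cost(2_000, 300, PRICE_IN, PRICE_OUT)

# Compressed prompt: input shrinks, but the model "expands" its output,
# e.g. re-deriving context the compressed prompt no longer carries.
compressed = trajectory_cost(400, 8_000, PRICE_IN, PRICE_OUT)

change_pct = 100 * (compressed - baseline) / baseline
print(f"baseline:   ${baseline:.6f}")
print(f"compressed: ${compressed:.6f}")
print(f"change:     {change_pct:+.0f}%")
```

With output tokens priced several times above input tokens, even a modest burst of output expansion (here, hypothetically, ~27x) swamps the input savings, which is why optimizing input token count in isolation can backfire.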
From the abstract
The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four co […]