AI & ML Efficiency Breakthrough

LongFlow provides an 11x throughput boost for reasoning models by specifically optimizing KV cache for long-output (vs long-input) scenarios.

arXiv · March 13, 2026 · 2603.11504

Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

Why it matters

Most existing KV cache compression methods are designed for long input prompts, but reasoning models (o1/R1) generate massive outputs. LongFlow introduces a low-overhead importance metric and a custom kernel that fuses FlashAttention with token eviction, maintaining accuracy at 80% compression.
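To make the idea concrete, here is a minimal NumPy sketch of importance-based KV cache eviction. The paper's actual metric and fused kernel are not reproduced here; this sketch assumes a common proxy (cumulative attention mass per cached token, as in H2O-style methods) and a hypothetical `evict_kv_cache` helper, keeping the top 20% of tokens at the 80% compression ratio mentioned above.

```python
import numpy as np

def evict_kv_cache(keys, values, attn_scores, compression=0.8):
    """Keep only the highest-importance cached tokens.

    keys, values: (seq_len, head_dim) cached tensors.
    attn_scores:  (seq_len,) importance per cached token -- here a
                  proxy (cumulative attention received), NOT LongFlow's
                  actual metric, which the excerpt does not specify.
    compression:  fraction of the cache to evict (0.8 -> keep 20%).
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(round(seq_len * (1.0 - compression))))
    # Select the n_keep most important tokens, then restore their
    # original order so positional structure is preserved.
    keep = np.sort(np.argpartition(attn_scores, -n_keep)[-n_keep:])
    return keys[keep], values[keep], keep

# Example: a cache of 10 tokens, head_dim 4, at 80% compression.
keys = np.random.randn(10, 4)
values = np.random.randn(10, 4)
scores = np.random.rand(10)
k, v, kept = evict_kv_cache(keys, values, scores)
print(kept)  # indices of the 2 surviving tokens, in order
```

A real deployment would apply this per attention head inside the decode loop; LongFlow's contribution is fusing exactly this kind of eviction into the FlashAttention kernel so it adds negligible overhead.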

From the abstract

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed …