TIDE is a post-training early-exit system that allows individual tokens to skip unnecessary layers, improving throughput by up to 8% with minimal calibration.
March 24, 2026
Original Paper
TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
arXiv · 2603.21365
The Takeaway
Unlike most early-exit schemes, TIDE requires no retraining: it attaches to standard causal LMs and runs through fused CUDA kernels. It demonstrates that nearly all tokens can exit early during decoding without sacrificing accuracy, even on complex tasks like multi-step math.
From the abstract
Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1
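The core idea, choosing for each token the earliest checkpoint layer at which its hidden state has converged, can be illustrated with a toy sketch. The snippet below is not TIDE's actual mechanism (the paper uses tiny learned routers); it stands in a simple cosine-similarity convergence check between consecutive layers, and all names (`earliest_exit_layer`, `checkpoints`, `threshold`) are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two hidden-state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def earliest_exit_layer(hidden_states, checkpoints, threshold=0.99):
    """Toy per-token exit rule: at each checkpoint layer, compare the
    token's hidden state with the previous layer's; if it has barely
    changed (high cosine similarity), treat it as converged and exit.
    `hidden_states` is a list of per-layer vectors for ONE token."""
    for layer in checkpoints:
        if cosine_sim(hidden_states[layer - 1], hidden_states[layer]) >= threshold:
            return layer  # earliest converged checkpoint
    return len(hidden_states) - 1  # no convergence: run the full depth

# One token whose state stabilizes at layer 2:
states = [np.array([1.0, 0.0]),   # layer 0
          np.array([0.0, 1.0]),   # layer 1 (still changing)
          np.array([0.0, 1.0]),   # layer 2 (converged)
          np.array([0.0, 1.0])]   # layer 3
print(earliest_exit_layer(states, checkpoints=[1, 2, 3]))  # → 2
```

In a real decoder this decision would be made independently per token at each checkpoint, which is why "easy" tokens skip most of the stack while hard ones run the full depth.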