IsoQuant leverages SO(4) isoclinic rotations to achieve a 4.5x-4.7x speedup in low-bit KV-cache quantization over existing methods.
March 31, 2026
Original Paper
IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression
arXiv · 2603.28430
The Takeaway
IsoQuant replaces computationally expensive dense random orthogonal transforms with blockwise quaternion-based rotations that map well onto modern GPU hardware. This reduces the overhead of feature decorrelation, which is critical for maintaining accuracy in 2-bit or 3-bit LLM deployment.
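To make the idea of a blockwise quaternion rotation concrete, here is a minimal pure-Python sketch. It relies on the standard fact that any rotation in SO(4) can be written as v -> q_L * v * q_R for unit quaternions q_L and q_R, with the 4D vector v treated as a quaternion. The function names, the (w, x, y, z) component order, and the random sampling of the quaternions are illustrative assumptions, not the paper's implementation or kernel layout.

```python
import math
import random

def qmul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def qconj(q):
    """Conjugate of a quaternion; for a unit quaternion this is its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def random_unit_quaternion(rng):
    """Sample a unit quaternion by normalizing a 4D Gaussian draw (assumed scheme)."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def rotate_blocks(vec, q_left, q_right):
    """Rotate each consecutive 4D block of vec as v -> q_L * v * q_R."""
    assert len(vec) % 4 == 0, "dimension must be a multiple of 4"
    out = []
    for i in range(0, len(vec), 4):
        block = tuple(vec[i:i + 4])
        ql, qr = q_left[i // 4], q_right[i // 4]
        out.extend(qmul(qmul(ql, block), qr))
    return out

def inverse_rotate_blocks(vec, q_left, q_right):
    """Undo rotate_blocks by applying the conjugate quaternions to each block."""
    assert len(vec) % 4 == 0, "dimension must be a multiple of 4"
    out = []
    for i in range(0, len(vec), 4):
        block = tuple(vec[i:i + 4])
        ql, qr = q_left[i // 4], q_right[i // 4]
        out.extend(qmul(qmul(qconj(ql), block), qconj(qr)))
    return out
```

In this sketch a d-dimensional vector needs d/4 quaternion pairs, so 2d stored numbers and a constant amount of arithmetic per block, compared with d^2 entries for a dense orthogonal matrix. Because each block transform is a rotation, vector norms are preserved and the transform is exactly invertible before and after quantization.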
From the abstract
Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose IsoQuant, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D