Enables high-rank (r=384) DoRA training on single GPUs through factored norms and fused Triton kernels.
March 24, 2026
Original Paper
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
arXiv · 2603.22276
The Takeaway
Standard DoRA implementations materialize the dense low-rank product, making high-rank adaptation memory-prohibitive. This implementation cuts memory traffic by 4x and peak VRAM by up to 7 GB, letting practitioners use higher-capacity adapters on 8B-32B models without high-end hardware.
From the abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules are involved.
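For intuition, here is a minimal PyTorch sketch of one way the row-wise norm of W + sBA can be computed from the factors alone, by expanding the squared norm of each row. This is an illustrative assumption about the factored-norm idea, not the paper's implementation: it omits the fused Triton kernels and mixed-precision handling, and the function name and shapes below are hypothetical.

```python
# Minimal sketch of a factored row-wise norm; NOT the paper's fused Triton kernel.
# Function and variable names are illustrative assumptions.
import torch

def dora_row_norms_factored(W: torch.Tensor, B: torch.Tensor, A: torch.Tensor, s: float) -> torch.Tensor:
    """Row-wise L2 norms of W + s * B @ A without forming the [d_out, d_in] product.

    Shapes: W [d_out, d_in], B [d_out, r], A [r, d_in].
    Per-row identity:
      ||w_i + s*b_i A||^2 = ||w_i||^2 + 2s * b_i (A w_i^T) + s^2 * b_i (A A^T) b_i^T
    Transient memory is O(d_out*r + r^2) instead of O(d_out*d_in).
    """
    w_sq  = (W * W).sum(dim=1)                  # ||w_i||^2, shape [d_out]
    WAt   = W @ A.T                             # A w_i^T for every row, shape [d_out, r]
    cross = 2.0 * s * (B * WAt).sum(dim=1)      # 2s * b_i (A w_i^T), shape [d_out]
    G     = A @ A.T                             # r x r Gram matrix of A
    quad  = (s * s) * ((B @ G) * B).sum(dim=1)  # s^2 * b_i (A A^T) b_i^T, shape [d_out]
    return (w_sq + cross + quad).clamp_min(0.0).sqrt()

# Sanity check against the naive norm that materializes B @ A (small dims for the demo).
d_out, d_in, r, s = 512, 256, 32, 2.0
W = torch.randn(d_out, d_in)
B = torch.randn(d_out, r) * 0.02
A = torch.randn(r, d_in) * 0.02
naive = torch.linalg.norm(W + s * (B @ A), dim=1)  # allocates the full [d_out, d_in] update
assert torch.allclose(naive, dora_row_norms_factored(W, B, A, s), rtol=1e-4, atol=1e-4)
```

In this sketch the largest transient buffers are [d_out, r] and [r, r], far smaller than the dense [d_out, d_in] product whose roughly 512 MB footprint the abstract cites.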