Enables high-rank (r=384) DoRA training on single GPUs through factored norms and fused Triton kernels.
March 24, 2026
Original Paper
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
arXiv · 2603.22276
The Takeaway
Standard DoRA implementations materialize the dense low-rank product, making high-rank adaptation memory-prohibitive. This implementation cuts memory traffic by 4x and peak VRAM by up to 7 GB, letting practitioners use higher-capacity adapters on 8B-32B models without high-end hardware.
From the abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules are involved.
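For intuition, here is a minimal PyTorch sketch of one way the row-wise norm of W + sBA can be computed from the factors alone, by expanding the squared norm of each row. This is an illustrative assumption about the factored-norm idea, not the paper's implementation: it omits the fused Triton kernels and mixed-precision handling, and the function name and shapes below are hypothetical.

```python
# Minimal sketch of a factored row-wise norm; NOT the paper's fused Triton kernel.
# Function and variable names are illustrative assumptions.
import torch

def dora_row_norms_factored(W: torch.Tensor, B: torch.Tensor, A: torch.Tensor, s: float) -> torch.Tensor:
    """Row-wise L2 norms of W + s * B @ A without forming the [d_out, d_in] product.

    Shapes: W [d_out, d_in], B [d_out, r], A [r, d_in].
    Per-row identity:
      ||w_i + s*b_i A||^2 = ||w_i||^2 + 2s * b_i (A w_i^T) + s^2 * b_i (A A^T) b_i^T
    Transient memory is O(d_out*r + r^2) instead of O(d_out*d_in).
    """
    w_sq  = (W * W).sum(dim=1)                  # ||w_i||^2, shape [d_out]
    WAt   = W @ A.T                             # A w_i^T for every row, shape [d_out, r]
    cross = 2.0 * s * (B * WAt).sum(dim=1)      # 2s * b_i (A w_i^T), shape [d_out]
    G     = A @ A.T                             # r x r Gram matrix of A
    quad  = (s * s) * ((B @ G) * B).sum(dim=1)  # s^2 * b_i (A A^T) b_i^T, shape [d_out]
    return (w_sq + cross + quad).clamp_min(0.0).sqrt()

# Sanity check against the naive norm that materializes B @ A (small dims for the demo).
d_out, d_in, r, s = 512, 256, 32, 2.0
W = torch.randn(d_out, d_in)
B = torch.randn(d_out, r) * 0.02
A = torch.randn(r, d_in) * 0.02
naive = torch.linalg.norm(W + s * (B @ A), dim=1)  # allocates the full [d_out, d_in] update
assert torch.allclose(naive, dora_row_norms_factored(W, B, A, s), rtol=1e-4, atol=1e-4)
```

In this sketch the largest transient buffers are [d_out, r] and [r, r], far smaller than the dense [d_out, d_in] product whose roughly 512 MB footprint the abstract cites.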