AI & ML Paradigm Challenge

It is mathematically impossible for standard gradient descent to achieve the optimal last-iterate convergence rate without knowing the exact time horizon in advance.

April 17, 2026

Original Paper

Gradient Descent's Last Iterate is Often (slightly) Suboptimal

arXiv · 2604.13870

The Takeaway

Practitioners usually assume that running gradient descent longer always brings the last iterate closer to the optimum. This paper proves that a poly-logarithmic error factor is unavoidable for the last iterate on convex Lipschitz functions unless the time horizon is fixed in advance. Even with careful tuning, you are likely leaving performance on the table if you aren't averaging iterates or using horizon-aware schedules. This challenges the 'just one more epoch' mentality by showing that the final weights are often slightly suboptimal by design, and it forces a rethink of training loops for anyone squeezing every last drop of performance out of a model.
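The gap between the last iterate and the averaged iterate is easy to see on a toy problem. The following is a minimal sketch (not from the paper): subgradient descent on the convex 1-Lipschitz function $f(x) = |x|$ with the standard step size $\eta_t = 1/\sqrt{t}$. The step sizes, starting point, and horizon are illustrative choices, not values taken from the paper.

```python
import numpy as np

def subgradient_descent_abs(T, x0=2.0):
    """Subgradient descent on f(x) = |x| with step size 1/sqrt(t+1).

    Returns the full trajectory so we can compare the last iterate
    with the running average of all iterates.
    """
    x = x0
    iterates = []
    for t in range(T):
        eta = 1.0 / np.sqrt(t + 1)   # standard decaying step size
        x = x - eta * np.sign(x)     # subgradient of |x| is sign(x)
        iterates.append(x)
    return np.array(iterates)

xs = subgradient_descent_abs(1000)
last = abs(xs[-1])       # last iterate: oscillates at the step-size scale
avg = abs(xs.mean())     # averaged iterate: the oscillations largely cancel
print(f"f(last iterate) = {last:.4f}, f(averaged iterate) = {avg:.4f}")
```

The last iterate keeps bouncing around the minimizer at roughly the current step-size scale, while the average of the trajectory damps those oscillations, which is the intuition behind averaging schemes like Polyak–Ruppert averaging.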

From the abstract

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance.