Demonstrates that the stochasticity in standard regularized model training (like cross-validation) can serve as a 'free' and effective exploration strategy for contextual bandits.
arXiv · March 13, 2026 · 2603.11276
Why it matters
Practitioners often struggle to apply Thompson Sampling or UCB to complex black-box estimators like Gradient Boosted Trees. This work shows that pure-greedy selection using regularized models naturally induces exploration, providing a simpler and more scalable alternative for large-scale production bandit systems.
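To make the idea concrete, here is a minimal sketch of a pure-greedy contextual bandit whose only source of exploration is the randomness already present in regularized training, here the per-tree subsampling of stochastic gradient boosting. The environment (linear rewards), hyperparameters, and function name are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

def greedy_gbt_bandit(n_rounds=60, n_arms=3, d=5, warm=5):
    """Pure-greedy contextual bandit on stochastic gradient boosted trees.

    No Thompson Sampling, no UCB bonus: the model is refit each round and
    the arm with the highest point prediction is pulled. Exploration comes
    'for free' from the stochasticity of regularized training (subsample < 1).
    The linear reward model below is a hypothetical test environment.
    """
    theta = rng.normal(size=(n_arms, d))  # assumed true per-arm reward weights
    contexts, arms, rewards = [], [], []
    for t in range(n_rounds):
        x = rng.normal(size=d)
        if t < n_arms * warm:
            a = t % n_arms  # warm start: play each arm a few times
        else:
            preds = []
            for arm in range(n_arms):
                idx = [i for i, ai in enumerate(arms) if ai == arm]
                model = GradientBoostingRegressor(
                    n_estimators=25,
                    subsample=0.7,  # < 1 enables stochastic boosting
                    random_state=t * n_arms + arm,  # refit noise varies per round
                )
                model.fit(np.asarray(contexts)[idx], np.asarray(rewards)[idx])
                preds.append(model.predict(x.reshape(1, -1))[0])
            a = int(np.argmax(preds))  # greedy: pick the highest point estimate
        r = float(theta[a] @ x) + rng.normal(scale=0.1)
        contexts.append(x)
        arms.append(a)
        rewards.append(r)
    return arms, rewards
```

The point of the sketch is the selection rule: `argmax` over plain point predictions, with the per-round refit of a subsampled ensemble supplying the variability that an explicit exploration bonus would otherwise provide.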
From the abstract
Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosted trees. However, it is difficult to directly apply simple and effective exploration strategies, such as Thompson Sampling or UCB, on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) …