Learns task-specific dense reward functions directly from images using vision foundation models, without requiring privileged simulator states.
arXiv · March 19, 2026 · 2603.16978
The Takeaway
This addresses a major bottleneck in real-world robotics: the inability to provide dense feedback for reinforcement learning outside of simulation. By using a language-conditioned DINO-based model, it generalizes reward structures to new real-world environments and objects where analytical reward functions cannot be defined.
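The core idea can be illustrated with a toy sketch. This is not the paper's actual architecture: assume a frozen vision encoder (e.g. DINO-style) yields an image embedding and a text encoder yields a goal embedding for the language instruction; their similarity then serves as a dense reward. The embedding sources and the `dense_reward` mapping below are illustrative placeholders.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dense_reward(image_feat, goal_feat):
    """Map cosine similarity from [-1, 1] to a dense reward in [0, 1]."""
    return 0.5 * (1.0 + cosine(image_feat, goal_feat))

# Usage with stand-in embeddings (in practice these would come from
# pretrained vision / language encoders):
img = np.random.randn(768)   # stand-in for an image embedding
txt = np.random.randn(768)   # stand-in for a language goal embedding
r = dense_reward(img, txt)   # r lies in [0, 1]
```

The reward rises as the observed scene moves closer to the language-specified goal in embedding space, which is what makes the signal dense rather than a binary success indicator.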
From the abstract
Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual …
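The abstract's last sentence points at demonstration-based reward prediction. One simple member of that family, offered here as a hedged sketch rather than the paper's method, scores an observation by how far along an expert demonstration its nearest embedding neighbor lies:

```python
import numpy as np

def progress_reward(obs_feat, demo_feats):
    """Dense progress reward from a single expert demonstration.

    obs_feat:   (d,) embedding of the current camera image.
    demo_feats: (T, d) embeddings of the T demonstration frames, in order.
    Returns the normalized time index of the nearest demo frame, in [0, 1].
    """
    dists = np.linalg.norm(demo_feats - obs_feat, axis=1)
    idx = int(np.argmin(dists))
    return idx / (len(demo_feats) - 1)
```

An observation resembling an early demo frame earns a low reward, one resembling the final frame earns a reward near 1, giving the learner graded feedback about progress without privileged simulator state.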