R-C2 uses cycle consistency as a label-free reward signal for reinforcement learning, resolving contradictions in multimodal reasoning.
March 27, 2026
Original Paper
R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
arXiv · 2603.25720
The Takeaway
Instead of relying on human labels or standard RLAIF, R-C2 requires the model to perform backward inference and switch modalities while keeping its reasoning consistent. This autonomous alignment reduces modality-specific errors and improves multimodal reasoning.
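To make the idea concrete, here is a minimal sketch of what a cycle-consistency reward could look like. It is an illustration under stated assumptions, not the paper's implementation: the names `cycle_reward`, `toy_forward`, and `toy_backward`, the exact-match comparison, and the binary 0/1 reward are all hypothetical, and the toy "models" are lookup tables standing in for a real multimodal model.

```python
from typing import Callable

# Hypothetical interfaces, assumed for illustration only.
Forward = Callable[[str, str], str]   # (query, modality) -> answer
Backward = Callable[[str, str], str]  # (answer, modality) -> reconstructed query


def cycle_reward(forward: Forward, backward: Backward, query: str,
                 src: str = "text", dst: str = "image") -> float:
    """Return 1.0 if the answer survives a cross-modal round trip, else 0.0."""
    answer = forward(query, src)               # forward inference in the source modality
    recon_query = backward(answer, dst)        # backward inference, expressed in the other modality
    cycled_answer = forward(recon_query, dst)  # forward inference again in the new modality
    # Assumption: exact string match stands in for a real consistency check.
    return 1.0 if cycled_answer == answer else 0.0


def toy_forward(query: str, modality: str) -> str:
    # Stub model: consistent on one concept, contradictory across modalities on the other.
    table = {
        ("capital of France?", "text"): "Paris",
        ("capital of France?", "image"): "Paris",
        ("2+2?", "text"): "4",
        ("2+2?", "image"): "5",
    }
    return table[(query, modality)]


def toy_backward(answer: str, modality: str) -> str:
    # Stub inverse: maps an answer back to the query that would produce it.
    # The modality argument is ignored here; a real model would render the query in it.
    return {"Paris": "capital of France?", "4": "2+2?", "5": "2+2?"}[answer]


consistent = cycle_reward(toy_forward, toy_backward, "capital of France?")
inconsistent = cycle_reward(toy_forward, toy_backward, "2+2?")
```

In this sketch no ground-truth label is consulted: the reward depends only on whether the model agrees with itself after the round trip, so a policy-gradient update driven by it would penalize the cross-modal contradiction in the second query.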
From the abstract
Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves intern