Fine-tunes Vision-Language Models on raw images alone, using a text-to-image model to provide a cycle-consistency reward.
March 20, 2026
Original Paper
CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning
arXiv · 2603.18282
The Takeaway
This removes the dependency on expensive human-annotated caption datasets for VLM improvement. By closing the loop (Image -> Caption -> Reconstructed Image), practitioners can use the reconstruction error as a self-supervised training signal via GRPO to improve grounding and reduce hallucinations.
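The reward computation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes images are compared in some embedding space (the names `cycle_consistency_reward` and `grpo_advantages` are hypothetical), and it shows only the group-relative reward normalization that GRPO uses to turn raw rewards into advantages.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cycle_consistency_reward(image_embed: np.ndarray,
                             recon_embed: np.ndarray) -> float:
    """Hypothetical reward: how closely the image reconstructed from a
    generated caption matches the original image, in embedding space.
    Higher (closer to 1.0) means the caption preserved more visual detail."""
    return cosine_similarity(image_embed, recon_embed)

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantages: normalize each caption's reward relative to
    the group of captions sampled for the same image (mean 0, unit std)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy example: one image, three sampled captions, three reconstructions.
rng = np.random.default_rng(0)
image = rng.normal(size=64)
recons = [image + rng.normal(scale=s, size=64) for s in (0.1, 0.5, 2.0)]
rewards = [cycle_consistency_reward(image, r) for r in recons]
advantages = grpo_advantages(rewards)
```

The least-perturbed reconstruction (smallest noise scale) gets the highest reward, so its caption receives a positive advantage and is reinforced; no human caption labels are needed anywhere in the loop.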
From the abstract
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: