AI & ML Paradigm Shift

The paper fine-tunes large vision language models (LVLMs) for medical tasks using only image-description pairs, bypassing the need for expensive, expert-curated instructions.

March 23, 2026

Original Paper

Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

arXiv · 2603.19482

The Takeaway

The approach challenges the assumption that visual instruction tuning requires curated image-instruction-output triplets. It enables high-performance domain adaptation in fields such as medicine, where generating high-quality instruction-output pairs is a major bottleneck.
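
To make the difference in data requirements concrete, here is a minimal sketch of the two sample formats. It is an illustration only, not the paper's code or data schema: the class names, field names, and the `text_fields` helper are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InstructionTriplet:
    """Hypothetical sample for conventional visual instruction tuning."""
    image_path: str
    instruction: str  # expert-written question or task prompt
    output: str       # expert-written answer to that instruction


@dataclass(frozen=True)
class ImageDescriptionPair:
    """Hypothetical sample with no instruction: an image and its description."""
    image_path: str
    description: str


def text_fields(sample_type) -> list[str]:
    """List the text annotations a sample type needs besides the image."""
    return [f.name for f in fields(sample_type) if f.name != "image_path"]


# Compare what each format asks annotators to supply.
triplet_needs = text_fields(InstructionTriplet)
pair_needs = text_fields(ImageDescriptionPair)
```

Under these assumed schemas, a triplet needs two text annotations per image while a pair needs one, which is where the annotation savings described above would come from.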

From the abstract

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning appr