Challenges the 'filter-first' data paradigm by showing that training on uncurated data with quality-score labels outperforms training on high-quality filtered subsets.
March 31, 2026
Original Paper
LACON: Training Text-to-Image Model from Uncurated Data
arXiv · 2603.26866
The Takeaway
Instead of aggressively discarding 'bad' data, the LACON framework teaches the model the explicit boundary between high- and low-quality samples. This approach lets generative models leverage the full distribution of available data, achieving better results under the same compute budget by improving the model's understanding of aesthetic and structural quality markers.
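The paper does not spell out the mechanics here, but the "Labeling-and-Conditioning" name suggests a data pipeline along these lines: every sample is kept, scored, and tagged with a discrete quality token that the model conditions on during training. The sketch below is a hypothetical illustration of that idea; the label thresholds, token names, and scoring function are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of LACON-style "labeling-and-conditioning" data prep.
# Instead of filtering out low-quality samples (the filter-first paradigm),
# every sample is kept and tagged with a quality token the model can
# condition on. Thresholds and token names below are illustrative.

def quality_label(score: float) -> str:
    """Map a raw quality score in [0, 1] to a discrete label token."""
    if score >= 0.8:
        return "<quality:high>"
    if score >= 0.4:
        return "<quality:mid>"
    return "<quality:low>"

def prepare_sample(caption: str, score: float) -> str:
    """Prepend the quality token so the model learns the quality boundary."""
    return f"{quality_label(score)} {caption}"

# Uncurated data: nothing is discarded. Scores here are made up; in
# practice they might come from an aesthetic or alignment scorer.
raw = [
    ("a watercolor fox in a forest", 0.92),
    ("blurry photo of a cat", 0.31),
    ("stock image, low res, watermark", 0.12),
]

train_texts = [prepare_sample(caption, score) for caption, score in raw]
for text in train_texts:
    print(text)
```

At inference time, prompts would be conditioned on the high-quality token (e.g. `f"<quality:high> {user_prompt}"`), steering generation toward the high-quality region the model has learned to distinguish.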
From the abstract
The success of modern text-to-image generation is largely attributed to massive, high-quality datasets. Currently, these datasets are curated through a filter-first paradigm that aggressively discards low-quality raw data based on the assumption that it is detrimental to model performance. Is the discarded bad data truly useless, or does it hold untapped potential? In this work, we critically re-examine this question. We propose LACON (Labeling-and-Conditioning), a novel training framework that …