SeriesFusion
Science, curated & edited by AI

The best websites are blocking AI from reading them, so future bots are going to be trained on the absolute trash left behind.

AI-generated illustration

As top-tier publishers block AI crawlers, the pool of training data is increasingly filled with low-quality content and misinformation. This 'adverse selection' means that the more the web is mined for AI training, the more high-quality publishers opt out, and the more future models end up trained on garbage.

Original Paper

Adverse Selection in the AI Data Commons

Kai Zhu

SSRN  ·  6438640

Generative AI depends on high-quality web content, yet no market compensates its producers. We document adverse selection in this AI data commons: facing a binary opt-out choice, the highest-quality producers exit first, degrading the remaining commons. Studying media and news sites at scale, we find a steep quality-blocking gradient: high-factual outlets block at nearly six times the rate of low-factual sources, with misinformation sources remaining most accessible. Publishers strategically tar […]
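In practice, the binary opt-out the paper describes is usually exercised through a publisher's robots.txt file, which can disallow AI crawlers by user agent. As a rough illustration only (the crawler list and the blocking criterion below are assumptions for the sketch, not the paper's methodology), a minimal Python check of whether a site's robots.txt blocks well-known AI user agents might look like this:

```python
from urllib import robotparser

# Illustrative list of widely used AI crawler user agents; the paper's
# actual crawler set is not specified in this excerpt.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def blocks_ai_crawlers(site: str) -> dict:
    """Return, per AI user agent, whether the site's robots.txt
    disallows fetching the site root. An unreachable robots.txt is
    treated here as not blocking (an assumption of this sketch)."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{site}/robots.txt")
    try:
        rp.read()
    except OSError:
        return {agent: False for agent in AI_CRAWLERS}
    return {
        agent: not rp.can_fetch(agent, f"https://{site}/")
        for agent in AI_CRAWLERS
    }


if __name__ == "__main__":
    print(blocks_ai_crawlers("example.com"))
```

A quality-blocking gradient like the one reported in the paper would then come from running such a check across many outlets binned by factuality ratings; this sketch shows only the per-site opt-out signal.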