This theoretical work challenges the 'Garbage In, Garbage Out' mantra for modern ML, arguing that high-dimensional model capacity can asymptotically overcome predictor error and structural uncertainty.
arXiv · March 16, 2026 · 2603.12288
Why it matters
It provides a formal explanation for why modern models thrive on noisy, collinear tabular data. This insight, grounded in Information Theory and Latent Factor Models, helps practitioners decide when to focus on adding more features versus cleaning existing ones.
From the abstract
Tabular machine learning presents a paradox: modern models achieve state-of-the-art performance using high-dimensional (high-D), collinear, error-prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor-space "noise" …
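The core claim lends itself to a quick illustration. Below is a minimal simulation sketch (ours, not from the paper), assuming the latent-factor setup the abstract names: every feature is an error-prone proxy of a single latent factor z that drives the target, so a high-dimensional linear model can average per-feature measurement error away as the feature count grows. All specifics here (the noise_sd level, Ridge(alpha=1.0), sample sizes) are illustrative choices, not the authors' method.

```python
# Sketch: collinear, error-prone features as noisy proxies of one latent
# factor. As p grows, ridge regression averages out per-feature noise,
# so test R^2 improves even though each individual feature is "garbage".
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_train, n_test = 500, 500

def simulate(p, noise_sd=2.0):
    """Latent factor z drives y; each of p features is z plus heavy noise."""
    z = rng.normal(size=n_train + n_test)
    y = z + 0.5 * rng.normal(size=z.shape)                     # target driven by z
    X = z[:, None] + noise_sd * rng.normal(size=(z.size, p))   # collinear noisy proxies
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

for p in [1, 10, 100, 1000]:
    X_tr, X_te, y_tr, y_te = simulate(p)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print(f"p={p:5d}  test R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```

Under these settings, test R² should climb from roughly 0.16 at p = 1 toward the ceiling of about 0.8 set by the irreducible target noise, illustrating how added (still noisy) features can substitute for cleaning existing ones.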