Limits To (Machine) Learning
This paper addresses the gap between a model's measured predictive performance and true predictability in machine learning, a concern chiefly for researchers and practitioners in finance. It provides a theoretical correction, though the contribution is incremental because it builds on existing bounds.
The paper tackles the inability of machine learning models to approximate the true data-generating process from finite samples. It introduces the Limits-to-Learning Gap (LLG), a universal lower bound that quantifies this discrepancy, and shows that the gap is large in financial data, which indicates that standard ML approaches can substantially understate true predictability.
Machine learning (ML) methods are highly flexible, but their ability to approximate the true data-generating process is fundamentally constrained by finite samples. We characterize a universal lower bound, the Limits-to-Learning Gap (LLG), quantifying the unavoidable discrepancy between a model's empirical fit and the population benchmark. Recovering the true population $R^2$, therefore, requires correcting observed predictive performance by this bound. Using a broad set of variables, including excess returns, yields, credit spreads, and valuation ratios, we find that the implied LLGs are large. This indicates that standard ML approaches can substantially understate true predictability in financial data. We also derive LLG-based refinements to the classic Hansen and Jagannathan (1991) bounds, analyze implications for parameter learning in general-equilibrium settings, and show that the LLG provides a natural mechanism for generating excess volatility.
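The finite-sample effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's LLG formula, which this summary does not state. It is a toy linear model, with ordinary least squares standing in for a generic ML method, and every name and parameter value (`simulate_r2_gap`, `n_train`, `n_features`, `true_r2`) is a hypothetical choice made for illustration. It fixes the population $R^2$ by construction and compares it with the out-of-sample $R^2$ that a model fitted on a short sample achieves.

```python
import numpy as np


def simulate_r2_gap(n_train=200, n_features=50, true_r2=0.05,
                    n_test=100_000, seed=0):
    """Compare a known population R^2 with the out-of-sample R^2 of OLS.

    Toy setup (an assumption, not the paper's model): y = X @ beta + noise,
    with i.i.d. standard normal predictors and Var(y) = 1.
    """
    rng = np.random.default_rng(seed)

    # Scale beta so the signal variance equals true_r2 and Var(y) = 1.
    beta = rng.standard_normal(n_features)
    beta *= np.sqrt(true_r2) / np.linalg.norm(beta)
    noise_sd = np.sqrt(1.0 - true_r2)

    # Short training sample, large test sample approximating the population.
    X_train = rng.standard_normal((n_train, n_features))
    y_train = X_train @ beta + noise_sd * rng.standard_normal(n_train)
    X_test = rng.standard_normal((n_test, n_features))
    y_test = X_test @ beta + noise_sd * rng.standard_normal(n_test)

    # Fit by least squares on the training sample only.
    beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    # Out-of-sample R^2 against a zero-forecast benchmark.
    resid = y_test - X_test @ beta_hat
    oos_r2 = 1.0 - np.mean(resid ** 2) / np.mean(y_test ** 2)
    return true_r2, float(oos_r2)


if __name__ == "__main__":
    population_r2, oos_r2 = simulate_r2_gap()
    print(f"population R^2: {population_r2:.3f}")
    print(f"out-of-sample R^2: {oos_r2:.3f}")
```

In this setup the out-of-sample $R^2$ should come out below the population value, because estimation error in `beta_hat` adds to the forecast error. The shortfall is a crude analogue of the gap the abstract describes, in the sense that the observed fit alone understates how predictable `y` really is.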