Loss Given Default Prediction Under Measurement-Induced Mixture Distributions: An Information-Theoretic Approach
The work provides practical guidance for financial institutions under Basel III requirements and generalizes to domains such as medical outcomes and climate forecasting, where mixture data structures are unavoidable.
The paper tackles Loss Given Default (LGD) prediction under a data quality constraint: 90% of the training data consists of proxy estimates, which causes methods such as Random Forest to fail systematically, with a negative r-squared (-0.664). Using an information-theoretic approach, the authors achieve an r-squared of 0.191 and an RMSE of 0.284 on 1,218 corporate bankruptcies.
Loss Given Default (LGD) modeling faces a fundamental data quality constraint: 90% of available training data consists of proxy estimates based on pre-distress balance sheets rather than actual recovery outcomes from completed bankruptcy proceedings. We demonstrate that this mixture-contaminated training structure causes systematic failure of recursive partitioning methods, with Random Forest achieving negative r-squared (-0.664, worse than predicting the mean) on held-out test data. Information-theoretic approaches based on Shannon entropy and mutual information provide superior generalization, achieving r-squared of 0.191 and RMSE of 0.284 on 1,218 corporate bankruptcies (1980-2023). Analysis reveals that leverage-based features contain 1.510 bits of mutual information while size effects contribute only 0.086 bits, contradicting regulatory assumptions about scale-dependent recovery. These results establish practical guidance for financial institutions deploying LGD models under Basel III requirements when representative outcome data is unavailable at sufficient scale. The findings generalize to medical outcomes research, climate forecasting, and technology reliability, domains where extended observation periods create unavoidable mixture structure in training data.
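Two quantities in the abstract can be made concrete with a small sketch: why a negative r-squared means "worse than predicting the mean", and what it means for a feature to carry mutual information measured in bits. The code below is illustrative only. The function names, the toy values, and the plug-in estimator on already-discretized variables are assumptions made for this example; the abstract does not state which estimator or binning the authors used.

```python
import math
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information_bits(xs, ys):
    """Plug-in estimate I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete data.

    Assumption: xs and ys are already discretized (e.g. binned features).
    """
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


# Toy LGD values (hypothetical, not from the paper's data).
lgd_true = [0.1, 0.4, 0.6, 0.9]
mean_prediction = [0.5] * 4               # always predict the sample mean
inverted_prediction = [0.9, 0.6, 0.4, 0.1]  # systematically wrong ordering

r2_mean = r_squared(lgd_true, mean_prediction)          # about 0.0
r2_inverted = r_squared(lgd_true, inverted_prediction)  # about -3.0

# Toy discretized feature and outcome labels (hypothetical).
leverage_bin = ["high", "high", "low", "low"]
lgd_bin = ["high", "high", "low", "low"]   # fully determined by leverage_bin
size_bin = ["small", "large", "small", "large"]  # independent of lgd_bin

mi_leverage = mutual_information_bits(leverage_bin, lgd_bin)  # 1.0 bit
mi_size = mutual_information_bits(size_bin, lgd_bin)          # 0.0 bits

print(r2_mean, r2_inverted, mi_leverage, mi_size)
```

Predicting the mean yields an r-squared of zero by construction, so any model scoring below zero has a larger squared error than that baseline. In the toy mutual-information case, a feature that fully determines a two-level outcome carries one bit, while an independent feature carries none; the paper's reported values (1.510 and 0.086 bits) sit on the same scale but come from its own data and estimator.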