Are you using test log-likelihood correctly?
This work shows that test log-likelihood, a metric that researchers and practitioners in machine learning and statistics widely rely on to evaluate models, can be misleading.
The paper examines the use of test log-likelihood for model comparison, presenting examples where its rankings contradict those based on other objectives, such as posterior approximation accuracy and forecast error metrics like root mean squared error.
Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
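Point (ii) can be illustrated with a small synthetic sketch. This is not one of the paper's examples: the data, the two forecasters, and the helper names gaussian_test_log_likelihood and rmse are all assumptions made for illustration. Forecaster A predicts the correct mean but is badly overconfident about its variance, while forecaster B has a biased mean but a reasonable variance. RMSE depends only on the predicted means, whereas test log-likelihood also penalizes a miscalibrated variance, so the two metrics can rank the forecasters in opposite order.

```python
import numpy as np


def gaussian_test_log_likelihood(y, mean, std):
    """Average log density of the test points under N(mean, std^2)."""
    var = std ** 2
    log_density = -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var
    return float(np.mean(log_density))


def rmse(y, mean):
    """Root mean squared error of the point forecast `mean`."""
    return float(np.sqrt(np.mean((y - mean) ** 2)))


# Synthetic held-out data (an assumption for illustration): y ~ N(0, 1).
rng = np.random.default_rng(0)
y_test = rng.normal(loc=0.0, scale=1.0, size=1000)

# Forecaster A: correct mean, but a far too small predictive std.
mean_a, std_a = 0.0, 0.1
# Forecaster B: biased mean, but a roughly sensible predictive std.
mean_b, std_b = 0.5, 1.2

rmse_a, rmse_b = rmse(y_test, mean_a), rmse(y_test, mean_b)
tll_a = gaussian_test_log_likelihood(y_test, mean_a, std_a)
tll_b = gaussian_test_log_likelihood(y_test, mean_b, std_b)

print(f"A: RMSE = {rmse_a:.3f}, test log-likelihood = {tll_a:.3f}")
print(f"B: RMSE = {rmse_b:.3f}, test log-likelihood = {tll_b:.3f}")
print("Lower RMSE:", "A" if rmse_a < rmse_b else "B")
print("Higher test log-likelihood:", "A" if tll_a > tll_b else "B")
```

In this construction A has the lower RMSE while B has the higher test log-likelihood, so the choice of metric determines which forecaster looks better.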