A Note on the Inception Score
The paper highlights a critical flaw in evaluation practices for generative models. The contribution is incremental, since it builds on existing metrics, but it is essential for advancing the field.
The paper critiques the Inception Score, a widely used evaluation metric for generative models, showing that it fails to provide useful guidance for model comparisons due to suboptimalities in the metric and issues in its application.
Deep generative models are powerful tools that have produced impressive results in recent years. These advances have been for the most part empirically driven, making it essential that we use high-quality evaluation metrics. In this paper, we provide new insights into the Inception Score, a recently proposed and widely used evaluation metric for generative models, and demonstrate that it fails to provide useful guidance when comparing models. We discuss both suboptimalities of the metric itself and issues with its application. Finally, we call for researchers to be more systematic and careful when evaluating and comparing generative models, as the advancement of the field depends upon it.
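
For readers unfamiliar with the metric, the sketch below illustrates the commonly cited definition of the Inception Score, IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's label distribution for a generated image x and p(y) is the average of those distributions over all samples. This definition is not stated in the abstract above and is assumed here. The function name `inception_score`, the `eps` smoothing term, and the toy inputs are hypothetical; in practice p(y|x) would come from a pretrained Inception network, which this sketch leaves out.

```python
import math


def inception_score(probs, eps=1e-12):
    """Compute exp(mean_x KL(p(y|x) || p(y))) from per-sample class probabilities.

    probs: list of N rows, each a list of K probabilities summing to 1.
    eps:   small constant (an assumption of this sketch) to avoid log(0).
    """
    n = len(probs)
    k = len(probs[0])

    # Marginal label distribution p(y): the average of the conditional rows.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]

    # Sum the KL divergence between each conditional row and the marginal.
    total_kl = 0.0
    for row in probs:
        total_kl += sum(
            p * (math.log(p + eps) - math.log(q + eps))
            for p, q in zip(row, marginal)
        )

    return math.exp(total_kl / n)


# Toy inputs (hypothetical): 8 samples, 4 classes.
# Uniform predictions carry no class information, so the score is 1.
uniform = [[0.25] * 4 for _ in range(8)]
# Confident predictions spread evenly over the classes push the score toward 4.
one_hot = [[1.0 if j == i % 4 else 0.0 for j in range(4)] for i in range(8)]

print(inception_score(uniform))
print(inception_score(one_hot))
```

Under this definition the score ranges from 1 up to the number of classes. It depends only on the classifier's outputs for generated samples, which gives some context for the abstract's concern about how the metric is applied.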