The Practice of Averaging Rate-Distortion Curves over Testsets to Compare Learned Video Codecs Can Cause Misleading Conclusions
The paper addresses a methodological flaw in the learned video compression community that could affect fair codec comparisons, though the contribution is incremental because it builds on established practices from traditional video coding.
The paper demonstrates that averaging rate-distortion curves across test videos can mislead evaluations of learned video codecs, showing how a single outlier video can skew results and lead to conclusions that contradict those drawn from per-sequence metrics.
This paper aims to demonstrate how the prevalent practice in the learned video compression community of averaging rate-distortion (RD) curves across a test video set can lead to misleading conclusions when evaluating codec performance. Through an analytical treatment of a simple case and experimental results with two recent learned video codecs, we show how averaged RD curves can mislead the comparative evaluation of different codecs, particularly when the videos in a dataset have varying characteristics and operating ranges. We illustrate how a single video whose RD characteristics differ from those of the rest of the test set can disproportionately influence the average RD curve, potentially overshadowing a codec's superior performance on most individual sequences. Using two recent learned video codecs on the UVG dataset as a case study, we demonstrate that computing performance metrics, such as the BD rate, from the average RD curve suggests conclusions that contradict those reached by averaging per-sequence metrics. Hence, we argue that the learned video compression community should also report per-sequence RD curves, and that performance metrics for a test set should be computed as the average of per-sequence metrics, similar to the established practice in traditional video coding, to ensure fair and accurate codec comparisons.
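
The effect described in the abstract can be reproduced with a toy calculation. The sketch below is not the paper's code or data. It assumes a common cubic-fit form of the BD rate, assumes the "average RD curve" is built by averaging rate and PSNR across videos at each quality level, and uses made-up numbers for four hypothetical videos: codec B saves 10% rate on three of them and costs 20% more on one high-bitrate outlier.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """BD rate (%) of test vs. anchor; negative means the test codec saves bits.

    Assumed variant: cubic fit of log-rate as a function of PSNR,
    integrated over the overlapping PSNR range.
    """
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_a = (np.polyval(ia, hi) - np.polyval(ia, lo)) / (hi - lo)
    avg_t = (np.polyval(it, hi) - np.polyval(it, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0

# Hypothetical toy data (not from the paper): 4 videos, 4 quality levels.
multipliers = np.array([1.0, 2.0, 4.0, 8.0])     # rate scaling per quality level
psnr_steps = np.array([0.0, 3.0, 6.0, 9.0])      # PSNR gain per quality level (dB)
base_rate = np.array([1.0, 1.0, 1.0, 30.0])      # last video is a high-rate outlier
psnr_offset = np.array([34.0, 35.0, 36.0, 30.0])
# Rate of codec B relative to codec A at equal PSNR, per video.
b_factor = np.array([0.9, 0.9, 0.9, 1.2])

rates_a = base_rate[:, None] * multipliers[None, :]
psnr_a = psnr_offset[:, None] + psnr_steps[None, :]
rates_b = rates_a * b_factor[:, None]
psnr_b = psnr_a.copy()

# Option 1: BD rate per sequence, then average the metric.
per_seq = [bd_rate(rates_a[v], psnr_a[v], rates_b[v], psnr_b[v]) for v in range(4)]
mean_per_seq = float(np.mean(per_seq))

# Option 2: average the RD points across videos, then compute one BD rate.
from_avg_curve = float(bd_rate(rates_a.mean(axis=0), psnr_a.mean(axis=0),
                               rates_b.mean(axis=0), psnr_b.mean(axis=0)))

print("per-sequence BD rates (%):", np.round(per_seq, 2))
print("mean of per-sequence BD rates (%):", round(mean_per_seq, 2))
print("BD rate from averaged RD curve (%):", round(from_avg_curve, 2))
```

In this toy setup, codec B is better on three of four videos and the mean of the per-sequence BD rates is about -2.5%, yet the BD rate computed from the averaged curve is about +17.3%, because the outlier's rates dominate the averaged rate axis. The magnitude depends entirely on the invented numbers; only the direction of the discrepancy is the point.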