The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition, with source-separable uncertainty as a case study
For researchers in intrinsic image decomposition, this work identifies a critical evaluation flaw (frame-level leakage) that inflates performance metrics, and provides a corrected protocol and a cost-efficient method with interpretable uncertainty.
The paper shows that frame-level splits in MPI Sintel inflate intrinsic image decomposition test metrics by 1.6–2.0 dB (and by more than 10 dB under extended training) compared to scene-level splits, and advocates scene-level splits as the standard. It also presents a physics-informed decomposition with source-separable uncertainty that achieves 15.98 dB R_PSNR at one-fifth the cost of a Deep Ensemble. Its uncertainty channels specialize in non-Lambertian errors (r = 0.67), and filtering out high-uncertainty pixels reduces MSE by 77%.
Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both the train and test partitions. We quantify this leakage effect for the first time across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p < 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an architecture-independent protocol effect. A three-point gradient of split strategies (random, temporal, scene) shows that the gap is continuous, and under extended training the frame-level inflation exceeds 10 dB. We advocate scene-level splits as the community standard and provide reference numbers for six representative models under this protocol. As a case study within the corrected protocol, we present a physics-informed decomposition I = R ∘ S + N with a source-separable three-way heteroscedastic uncertainty head. We empirically verify channel specialization: the non-Lambertian uncertainty channel shows a cross-correlation of r = 0.67 with the non-Lambertian residual error, more than 4 times the texture channel's correlation. We further demonstrate downstream utility: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE by 77% on the retained pixels, whereas random filtering produces no improvement. The specialization also holds on out-of-distribution real photographs. We report negative results for a more elaborate variant combining frequency decomposition, cross-task supervision, evidential learning, contrastive loss, and test-time adaptation. Our method reaches 15.98 ± 0.41 dB R_PSNR, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the cost, and is the only one of the two to provide source-separated uncertainty.
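
To make the difference between the two protocols concrete, the following is a minimal sketch of a frame-level split and a scene-level split over a list of (scene, frame index) pairs. It is an illustration under stated assumptions, not the paper's actual pipeline: the function names, the 20% test fraction, and the toy scene names are all hypothetical, and the real MPI Sintel scene list is not reproduced here.

```python
import random
from collections import defaultdict


def frame_level_split(frames, test_fraction=0.2, seed=0):
    """Shuffle individual frames, so one scene can end up on both sides."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def scene_level_split(frames, test_fraction=0.2, seed=0):
    """Assign whole scenes to train or test, so no scene is shared."""
    by_scene = defaultdict(list)
    for scene, index in frames:
        by_scene[scene].append((scene, index))
    scenes = sorted(by_scene)  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_test = max(1, round(len(scenes) * test_fraction))
    test_scenes = set(scenes[:n_test])
    train = [f for s in scenes if s not in test_scenes for f in by_scene[s]]
    test = [f for s in scenes if s in test_scenes for f in by_scene[s]]
    return train, test


def shared_scenes(train, test):
    """Scenes that contribute frames to both partitions (the leakage)."""
    return {s for s, _ in train} & {s for s, _ in test}


# Hypothetical toy dataset: 5 scenes with 10 frames each.
frames = [(f"scene_{s}", i) for s in range(5) for i in range(10)]
train_f, test_f = frame_level_split(frames)
train_s, test_s = scene_level_split(frames)

print("frame-level shared scenes:", sorted(shared_scenes(train_f, test_f)))
print("scene-level shared scenes:", sorted(shared_scenes(train_s, test_s)))
```

Checking `shared_scenes` on an existing split is a cheap way to audit a published protocol: any non-empty result means test frames have near-duplicates in the training set.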