Beyond Mixtures and Products for Ensemble Aggregation: A Likelihood Perspective on Generalized Means

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Damien Garreau, Pierre-Alexandre Mattei

arXiv:2603.04204v11.7h-index: 9

AI Analysis

This work gives machine learning practitioners, particularly those using Deep Ensembles, a principled justification for the effectiveness of two common ensemble aggregation techniques: linear pooling and geometric pooling.

This paper investigates density aggregation methods for combining predictions from Deep Ensembles, focusing on normalized generalized means. It demonstrates that only aggregation rules within the range $r \in [0,1]$ consistently improve upon individual distributions, providing a theoretical basis for the reliability of linear ($r=1$) and geometric ($r=0$) pooling. Conversely, rules outside this range may not offer consistent gains, a finding supported by empirical evaluations on image and text classification.

Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation rule remains an open question, with two commonly proposed approaches: linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty,+\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows that different situations call for different optimal configurations. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we show that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains, as we demonstrate with explicit counterexamples. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
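
To make the aggregation family concrete, here is a minimal sketch of a normalized generalized mean over class probabilities. It assumes the standard power-mean form, $p_r(y) \propto \left(\sum_k w_k\, p_k(y)^r\right)^{1/r}$, with the usual limits at $r=0$ (weighted geometric mean), $r=+\infty$ (maximum) and $r=-\infty$ (minimum). The function name `generalized_mean_pool`, the uniform default weights, and the requirement of strictly positive probabilities are assumptions made for illustration; the paper's exact definition may differ in detail.

```python
import numpy as np

def generalized_mean_pool(probs, r, weights=None):
    """Pool K categorical distributions with a normalized generalized mean.

    probs   : array of shape (K, C), each row a strictly positive distribution
    r       : order of the mean (any real number, or +/- np.inf)
    weights : optional positive weights of shape (K,) summing to 1
              (hypothetical parameter; uniform weights by default)
    """
    probs = np.asarray(probs, dtype=float)
    k = probs.shape[0]
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    log_p = np.log(probs)

    if r == 0:
        # Limit r -> 0: weighted geometric mean (averaging in log space).
        log_m = w @ log_p
    elif r == np.inf:
        # Limit r -> +inf: pointwise maximum over members.
        log_m = log_p.max(axis=0)
    elif r == -np.inf:
        # Limit r -> -inf: pointwise minimum over members.
        log_m = log_p.min(axis=0)
    else:
        # log of (sum_k w_k p_k^r)^(1/r), computed with a stable log-sum-exp.
        a = r * log_p + np.log(w)[:, None]
        a_max = a.max(axis=0)
        log_m = (a_max + np.log(np.exp(a - a_max).sum(axis=0))) / r

    # Renormalize so the pooled vector is a probability distribution.
    m = np.exp(log_m - log_m.max())
    return m / m.sum()

# Toy usage: two ensemble members over three classes.
members = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.3, 0.2]])
linear = generalized_mean_pool(members, r=1)     # probability averaging
geometric = generalized_mean_pool(members, r=0)  # normalized geometric mean
```

With $r=1$ the normalization step is a no-op, since an average of distributions already sums to one. For softmax outputs, the $r=0$ case matches averaging logits and applying softmax again, which is why geometric pooling is described as logit averaging.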
