LG, Sep 27, 2025

More Data or Better Algorithms: Latent Diffusion Augmentation for Deep Imbalanced Regression

arXiv:2509.23240v1

h-index: 2

Originality: Highly original

AI Analysis

The paper addresses the lack of data-level solutions for deep imbalanced regression, a problem specific to applications where high-dimensional inputs such as images and text come with skewed data distributions.

The paper tackles deep imbalanced regression, where models struggle to predict minority labels in high-dimensional data. It proposes LatentDiff, a framework that uses conditional diffusion models to synthesize features in the latent representation space, and reports substantial improvements in minority regions while maintaining overall accuracy on three benchmarks.

In many real-world regression tasks, the data distribution is heavily skewed, and models learn predominantly from abundant majority samples while failing to predict minority labels accurately. While imbalanced classification has been extensively studied, imbalanced regression remains relatively unexplored. Deep imbalanced regression (DIR) represents cases where the input data are high-dimensional and unstructured. Although several data-level approaches for tabular imbalanced regression exist, deep imbalanced regression currently lacks dedicated data-level solutions suitable for high-dimensional data and relies primarily on algorithmic modifications. To fill this gap, we propose LatentDiff, a novel framework that uses conditional diffusion models with priority-based generation to synthesize high-quality features in the latent representation space. LatentDiff is computationally efficient and applicable across diverse data modalities, including images, text, and other high-dimensional inputs. Experiments on three DIR benchmarks demonstrate substantial improvements in minority regions while maintaining overall accuracy.
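The abstract does not spell out how priority-based generation chooses which labels to synthesize. As a rough illustration of the general idea only, the sketch below allocates a synthetic-sample budget across label bins by inverse frequency, so that rare label regions receive more generated features. The function name `generation_budget`, the binning scheme, and the inverse-frequency rule are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def generation_budget(labels, bin_width, total_synthetic):
    """Allocate a synthetic-sample budget across label bins.

    Assumption (not from the paper): a bin's priority is the inverse of
    its sample count, so rarer label regions get more synthetic features.
    """
    # Map each continuous label to the index of its bin.
    counts = Counter(int(y // bin_width) for y in labels)
    # Inverse-frequency priority per bin.
    inv = {b: 1.0 / n for b, n in counts.items()}
    z = sum(inv.values())
    # Each bin gets its normalized share of the budget, rounded to a whole count.
    return {b: round(total_synthetic * w / z) for b, w in sorted(inv.items())}


# Toy labels: eight samples in the 20s, two in the 70s, one in the 90s.
labels = [21, 22, 23, 24, 25, 26, 27, 28, 71, 72, 95]
budget = generation_budget(labels, bin_width=10, total_synthetic=130)
print(budget)
```

In this toy run the single sample in the 90s bin receives the largest share of the budget and the crowded 20s bin the smallest. In a pipeline like the one the abstract describes, such per-bin counts would decide how many latent features a label-conditioned generator is asked to produce for each region.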
