Mitigating Catastrophic Forgetting and Mode Collapse in Text-to-Image Diffusion via Latent Replay
This addresses the problem of maintaining model versatility and reducing memory usage for generative AI models, enabling personalized text-to-image systems that evolve incrementally without high computational costs.
The paper tackles catastrophic forgetting and mode collapse in text-to-image diffusion models during continual learning by applying Latent Replay, which stores compact feature representations instead of raw images. It achieved 77.59% Image Alignment on the earliest concept after learning five sequentially, 14% higher than baselines, while maintaining output diversity.
Continual learning -- the ability to acquire knowledge incrementally without forgetting previous skills -- is fundamental to natural intelligence. While the human brain excels at this, artificial neural networks struggle with "catastrophic forgetting," where learning new tasks erases previously acquired knowledge. This challenge is particularly severe for text-to-image diffusion models, which generate images from textual prompts. Additionally, these models face "mode collapse," where their outputs become increasingly repetitive over time. To address these challenges, we apply Latent Replay, a neuroscience-inspired approach, to diffusion models. Traditional replay methods mitigate forgetting by storing and revisiting past examples, typically requiring large collections of images. Latent Replay instead retains only compact, high-level feature representations extracted from the model's internal architecture. This mirrors the hippocampal process of storing neural activity patterns rather than raw sensory inputs, reducing memory usage while preserving critical information. Through experiments with five sequentially learned visual concepts, we demonstrate that Latent Replay significantly outperforms existing methods in maintaining model versatility. After learning all concepts, our approach retained 77.59% Image Alignment (IA) on the earliest concept, 14% higher than baseline methods, while maintaining diverse outputs. Surprisingly, random selection of stored latent examples outperforms similarity-based strategies. Our findings suggest that Latent Replay enables efficient continual learning for generative AI models, paving the way for personalized text-to-image models that evolve with user needs without excessive computational costs.