CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization
This work addresses a core limitation of prompt tuning for vision-language models, the trade-off between task specialization and generalization, and offers an incremental improvement for researchers and practitioners in this domain.
The paper proposes CoCoA-Mix, a confusion-and-confidence-aware mixture model, to improve both specialization for specific tasks and generalization to unseen domains in prompt tuning for vision-language models. The authors report that it outperforms state-of-the-art methods, though the abstract provides no concrete numbers.
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at https://github.com/url-kaist/CoCoA-Mix.
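The abstract describes the mixture model only in words, so the following is a minimal sketch of the general idea of mixing several class predictions with per-prediction weights. It is not the authors' implementation: the function names, the example logits, and the fixed weights 0.7 and 0.3 are all hypothetical, and the way CoA-weights are actually computed from confidence is not specified in the abstract.

```python
import math


def softmax(logits):
    """Turn raw logits into class probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_predictions(predictions, weights):
    """Convex combination of class-probability vectors.

    predictions: list of probability vectors, one per prompt.
    weights: one non-negative weight per prediction (hypothetical stand-in
    for confidence-aware weights; normalized here to sum to 1).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    num_classes = len(predictions[0])
    return [
        sum((w / total) * p[c] for w, p in zip(weights, predictions))
        for c in range(num_classes)
    ]


# Hypothetical logits: one prediction from a task-tuned prompt and one from
# a generic prompt, over three classes.
tuned = softmax([2.0, 0.5, 0.1])
generic = softmax([0.3, 1.2, 0.4])

# Assumed weights; in the paper these would depend on confidence.
mixed = mix_predictions([tuned, generic], [0.7, 0.3])
print(mixed)
```

Because the weights are normalized, the mixed output is itself a probability distribution, and each class probability lies between the values given by the individual predictions. Setting one weight to zero recovers the other prediction unchanged, which is one simple way to see how a mixture can fall back on a specialized prediction where it is trusted.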