Independent Density Estimation
This addresses a key limitation in Vision-Language models for tasks like image captioning and generation, though it appears incremental as it builds on existing disentanglement and VAE methods.
The paper tackles the challenge of achieving human-like compositional generalization in Vision-Language models by proposing Independent Density Estimation (IDE), which learns connections between words and image features, resulting in superior generalization to unseen compositions across datasets.
Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Neverthe- less, these models still encounter difficulties in achieving human-like composi- tional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connec- tion between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy- based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.