Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images
This work addresses the challenge of measuring image realism for AI systems, though the contribution appears incremental because it builds on existing LVLMs and datasets.
The paper tackles the problem of evaluating common sense consistency in images (for example, a boy with a vacuum cleaner in a desert) by introducing the Through the Looking Glass (TLG) method, which achieves state-of-the-art performance on the WHOOPS! and WEIRD datasets.
Measuring how real images look is a complex task in artificial intelligence research. For example, an image of a boy with a vacuum cleaner in a desert violates common sense. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common sense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging LVLMs to extract atomic facts from these images, we obtain a mix of accurate facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while relying on only a compact fine-tuning component.
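
To make the idea of an attention-pooling classifier over encoded atomic facts more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the architecture, so the single learned query vector, the dot-product scoring, the logistic output layer, and all names (`attention_pool`, `weirdness_probability`, `query`, `clf_weights`, `clf_bias`) are assumptions made for illustration. In practice the fact embeddings would come from a Transformer-based text encoder applied to LVLM-generated facts; here they are toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(fact_embeddings, query):
    """Pool a variable number of fact embeddings into one vector.

    Each fact is scored by its dot product with a (hypothetical) learned
    query vector; the softmax of the scores weights the average.
    """
    scores = [sum(x * q for x, q in zip(emb, query)) for emb in fact_embeddings]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * emb[i] for w, emb in zip(weights, fact_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def weirdness_probability(fact_embeddings, query, clf_weights, clf_bias):
    """Logistic classifier on the pooled vector (assumed output head)."""
    pooled, _ = attention_pool(fact_embeddings, query)
    logit = sum(p * c for p, c in zip(pooled, clf_weights)) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-ins for encoded atomic facts about one image (assumed values).
facts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives every fact the same weight

pooled, weights = attention_pool(facts, query)
prob = weirdness_probability(facts, query, clf_weights=[0.0, 0.0], clf_bias=0.0)
```

Under these assumptions, only `query`, `clf_weights`, and `clf_bias` would be trained, which is one way a fine-tuning component can stay compact while the LVLM and the text encoder remain frozen.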