Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
This work addresses the detection of disturbing images, a domain-specific content-moderation task; its contribution is incremental, as it builds on existing LMM and CLIP methods.
The paper tackles Disturbing Image Detection by using Large Multimodal Models to extract semantic descriptions and elicited emotions, which are combined with CLIP embeddings. The method reports state-of-the-art performance on an augmented dataset, with accuracy improvements over the baselines.
In this paper, we address the task of Disturbing Image Detection (DID) by exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we exploit LMM knowledge in two ways: first, by extracting generic semantic descriptions, and second, by extracting elicited emotions. We then use CLIP's text encoder to obtain text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings together with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
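The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the three embeddings are combined or which classifier is used, so the concatenation of L2-normalized vectors and the linear head with a sigmoid below are assumptions made purely for illustration. The function names (`l2_normalize`, `fuse_embeddings`, `disturbing_probability`) are hypothetical, and the short toy vectors stand in for real CLIP embeddings.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_embeddings(
    image_emb: Sequence[float],
    description_emb: Sequence[float],
    emotion_emb: Sequence[float],
) -> Vector:
    """Combine the image, description, and emotion embeddings.

    ASSUMPTION: fusion is done by concatenating unit-normalized vectors.
    The abstract only states that the embeddings are used together.
    """
    return (
        l2_normalize(image_emb)
        + l2_normalize(description_emb)
        + l2_normalize(emotion_emb)
    )


def disturbing_probability(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Hypothetical linear classification head followed by a sigmoid."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-ins for CLIP outputs (real embeddings have hundreds of dimensions).
image_emb = [3.0, 4.0]        # would come from CLIP's image encoder
description_emb = [0.0, 2.0]  # CLIP text embedding of the LMM description
emotion_emb = [1.0, 0.0]      # CLIP text embedding of the LMM-elicited emotion

features = fuse_embeddings(image_emb, description_emb, emotion_emb)
weights = [0.0] * len(features)  # untrained head, for illustration only
prob = disturbing_probability(features, weights, bias=0.0)
```

In practice the weights would be learned on the DID training set, and the stand-in vectors would be replaced by the outputs of CLIP's image and text encoders.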