Understanding Art through Multi-Modal Retrieval in Paintings
This work addresses a gap in computer vision for art analysis by integrating aesthetics and semantics, though it appears incremental, since it applies existing multi-modal methods to a new domain.
The paper tackles the problem of understanding art by bridging visual appearance and underlying meaning through multi-modal techniques. It contributes a multi-modal dataset of fine-art paintings with comments and explores robust visual and textual representations of artistic images.
In computer vision, visual arts are often studied from a purely aesthetic perspective, mostly by analysing the visual appearance of an artistic reproduction to infer its style, its author, or its representative features. In this work, however, we explore art from both a visual and a language perspective. Our aim is to bridge the gap between the visual appearance of an artwork and its underlying meaning, by jointly analysing its aesthetics and its semantics. We introduce the use of multi-modal techniques in the field of automatic art analysis by 1) collecting a multi-modal dataset with fine-art paintings and comments, and 2) exploring robust visual and textual representations in artistic images.