Speaking images. A novel framework for the automated self-description of artworks
This work addresses the need for innovative tools that improve access to and interpretation of digital art collections, but it appears incremental, since it combines existing models in a new application domain.
The authors address the challenge of making digitized artworks more accessible by developing a framework that automatically generates a short video in which the artwork's main character is animated to explain its content. The framework chains open-source large language, face detection, text-to-speech, and audio-to-animation models. No concrete performance numbers are provided.
Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease access to digital collections and highlight their content. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation with the original historical object. Based on the concept of the autonomous image, we propose a new framework for the production of self-explaining cultural artifacts using open-source large language, face detection, text-to-speech, and audio-to-animation models. The goal is to start from a digitized artwork and automatically assemble a short video of it in which the main character is animated to explain its content. The whole process questions the cultural biases encapsulated in large language models, the potential of digital images and deepfakes of artworks for educational purposes, and the concerns of the field of art history regarding such creative diversions.
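The four-stage process described above can be outlined as a minimal sketch. It is not the authors' implementation: the class `SpeakingImagePipeline`, the helper `pick_main_face`, and all field names are hypothetical, each stage is passed in as a plain callable rather than tied to a specific model, and choosing the largest detected face as the main character is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Face bounding box as (x, y, width, height) in pixels (hypothetical convention).
Box = Tuple[int, int, int, int]


def pick_main_face(boxes: List[Box]) -> Box:
    """Assumption: treat the largest detected face as the main character."""
    if not boxes:
        raise ValueError("no face detected; nothing to animate")
    return max(boxes, key=lambda b: b[2] * b[3])


@dataclass
class SpeakingImagePipeline:
    """Chains the four model stages; each field is a user-supplied callable."""

    write_script: Callable[[str], str]                   # large language model
    detect_faces: Callable[[bytes], List[Box]]           # face detection
    synthesize_speech: Callable[[str], bytes]            # text-to-speech
    animate_face: Callable[[bytes, Box, bytes], bytes]   # audio-to-animation

    def run(self, image: bytes, artwork_info: str) -> bytes:
        # 1. Generate a short explanation of the artwork.
        script = self.write_script(artwork_info)
        # 2. Locate the character who will speak.
        face = pick_main_face(self.detect_faces(image))
        # 3. Turn the script into an audio track.
        audio = self.synthesize_speech(script)
        # 4. Animate the face in sync with the audio and return the video.
        return self.animate_face(image, face, audio)
```

Keeping each stage behind a callable means any open-source model can be swapped in without changing the orchestration, and the pipeline can be exercised with simple stand-in functions before real models are attached.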