Multi-Modal Summary Generation using Multi-Objective Optimization
The work addresses the need for more comprehensive multi-modal summarization, though its contribution is incremental: it extends existing text-and-image methods to include videos.
The paper tackles the problem of generating multi-modal summaries that include text, images, and videos by proposing an extractive multi-objective optimization model, which is reported to outperform state-of-the-art approaches when each modality is evaluated separately.
Significant development of communication technology over the past few years has motivated research in multi-modal summarization techniques. Most previous work on multi-modal summarization focuses on text and images. In this paper, we propose a novel extractive model, based on multi-objective optimization, that produces a multi-modal summary containing text, images, and videos. Important objectives such as intra-modality salience, cross-modal redundancy, and cross-modal similarity are optimized simultaneously in a multi-objective optimization framework to produce effective multi-modal output. The proposed model has been evaluated separately for each modality and has been found to perform better than state-of-the-art approaches.
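
To make "optimized simultaneously" concrete, the following is a minimal sketch of Pareto dominance, the comparison that multi-objective optimizers commonly rely on. It is not the paper's algorithm: the abstract does not name the optimizer, and the score tuples, the sign convention (redundancy negated so that larger is always better), and the function names `dominates` and `pareto_front` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumption: each candidate summary is scored on three objectives, all
# expressed so that larger is better:
# (intra-modality salience, -cross-modal redundancy, cross-modal similarity).
Scores = Tuple[float, float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: List[Scores]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, a in enumerate(candidates)
        if not any(dominates(b, a) for j, b in enumerate(candidates) if j != i)
    ]


# Hypothetical, made-up scores for four candidate summaries.
candidates = [
    (0.80, -0.30, 0.60),
    (0.70, -0.10, 0.65),
    (0.60, -0.40, 0.50),  # worse than the first candidate on every objective
    (0.90, -0.50, 0.40),
]
front = pareto_front(candidates)
```

Under these assumed scores, the third candidate is dropped because the first one beats it on all three objectives, while the remaining candidates represent different trade-offs that none of the others strictly improves upon. A full optimizer would search over candidate summaries rather than filter a fixed list.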