CV · Sep 28, 2022

Thinking Hallucination for Video Captioning

arXiv:2209.13853v1 · 6.5 · 10 citations · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses hallucination issues in video captioning, which is crucial for improving the reliability of automated video descriptions, though it is incremental as it builds on existing methods to mitigate specific bottlenecks.

The paper tackles hallucination in video captioning, where models generate descriptions detached from the source video. It identifies three sources of the problem: inadequate visual features from pre-trained models, improper influence of source and target contexts during multi-modal fusion, and exposure bias in training. It proposes auxiliary heads and context gates as remedies and reports state-of-the-art performance on the MSR-VTT and MSVD datasets, with the largest gain in CIDEr score.

With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite the performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work, we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the introduction of auxiliary heads trained in multi-label settings on top of the extracted visual features and (b) the addition of context gates, which dynamically select the features during fusion. The standard evaluation metrics for video captioning measure similarity with ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets, especially by a massive margin in CIDEr score.
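To make the two proposed components more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a generic element-wise context gate of the form g = sigmoid(W_s s + W_t t + b), with the fused output g * s + (1 - g) * t, and a standard multi-label binary cross-entropy loss for an auxiliary head. The function names, weight shapes, and the exact gating formula are illustrative assumptions; the abstract does not specify them.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def context_gate(source, target, w_src, w_tgt, bias):
    """Fuse a source (visual) vector and a target (language) vector.

    Hypothetical gate: g = sigmoid(W_s s + W_t t + b), computed per dimension.
    Output: g * source + (1 - g) * target, so g near 1 favors the video
    context and g near 0 favors the already-generated text context.
    """
    fused = []
    for i in range(len(source)):
        z = bias[i]
        z += sum(w_src[i][j] * source[j] for j in range(len(source)))
        z += sum(w_tgt[i][j] * target[j] for j in range(len(target)))
        g = sigmoid(z)
        fused.append(g * source[i] + (1.0 - g) * target[i])
    return fused


def multilabel_bce(logits, labels):
    """Mean binary cross-entropy over independent labels.

    Illustrative loss for an auxiliary head that predicts which objects or
    actions are present; each label gets its own sigmoid, so several labels
    can be active for the same video.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # With zero weights and zero bias the gate is 0.5: an even blend.
    print(context_gate([1.0, 3.0], [3.0, 1.0], zeros, zeros, [0.0, 0.0]))
    # A large positive bias pushes the gate toward the source context.
    print(context_gate([1.0, 3.0], [3.0, 1.0], zeros, zeros, [20.0, 20.0]))
    # Uninformative logits on two labels give a loss close to ln(2).
    print(multilabel_bce([0.0, 0.0], [1, 0]))
```

In a trained model the gate weights would be learned, so the balance between visual and textual context could differ at each decoding step. The auxiliary loss would be added to the captioning loss rather than replace it.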
