TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning
This work addresses interpretable video understanding for AI systems, offering a neuro-symbolic alternative to black-box models. The contribution is incremental in that it builds on existing entailment tree concepts.
The paper tackles the challenge of understanding complex multimodal content such as television clips by proposing TV-TREES, the first multimodal entailment tree generator. The method achieves state-of-the-art zero-shot performance on full clips from the TVQA benchmark while keeping its reasoning interpretable.
It is challenging for models to understand complex, multimodal content such as television clips, in part because video-language models often rely on single-modality reasoning and lack interpretability. To combat these issues, we propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by searching for trees of entailment relationships between simple text-video evidence and higher-level conclusions that prove question-answer pairs. We also introduce the task of multimodal entailment tree generation to evaluate reasoning quality. On the challenging TVQA benchmark, our method achieves interpretable, state-of-the-art zero-shot performance on full clips, illustrating that multimodal entailment tree generation can be a best-of-both-worlds alternative to black-box systems.
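The idea of searching for a tree that links simple text-video evidence to a higher-level conclusion can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, `build_tree`, the `retrieve` and `decompose` callbacks, and the toy facts are all hypothetical names chosen for illustration. In a real system the callbacks would presumably be backed by learned models over transcripts and video frames; here they are plain dictionary lookups.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One hypothesis in a (hypothetical) entailment tree."""
    hypothesis: str
    evidence: Optional[str] = None   # supporting fact, if proven directly
    modality: Optional[str] = None   # "dialogue" or "vision" (assumed labels)
    children: list[Node] = field(default_factory=list)

    def is_proven(self) -> bool:
        # A leaf is proven by direct evidence; an inner node is proven
        # only if it has children and every child is proven.
        if self.evidence is not None:
            return True
        return bool(self.children) and all(c.is_proven() for c in self.children)


def build_tree(
    hypothesis: str,
    retrieve: Callable[[str], Optional[tuple[str, str]]],
    decompose: Callable[[str], list[str]],
    max_depth: int = 3,
) -> Node:
    """Recursively try to prove `hypothesis` from evidence or sub-hypotheses."""
    node = Node(hypothesis)
    hit = retrieve(hypothesis)
    if hit is not None:
        node.evidence, node.modality = hit
        return node
    if max_depth == 0:
        return node  # give up: the node stays unproven
    for sub in decompose(hypothesis):
        node.children.append(build_tree(sub, retrieve, decompose, max_depth - 1))
    return node


# Toy stand-ins for evidence retrieval and hypothesis decomposition.
FACTS = {
    "The man is holding a mug": ("frame shows a man holding a mug", "vision"),
    "The man says he made tea": ("MAN: 'I made tea.'", "dialogue"),
}
SPLITS = {
    "The man is holding a mug of tea he made": [
        "The man is holding a mug",
        "The man says he made tea",
    ],
}


def retrieve(h: str) -> Optional[tuple[str, str]]:
    return FACTS.get(h)


def decompose(h: str) -> list[str]:
    return SPLITS.get(h, [])


tree = build_tree("The man is holding a mug of tea he made", retrieve, decompose)
print(tree.is_proven())  # True
```

In this toy run the root hypothesis is proven jointly by one visual leaf and one dialogue leaf, which is the kind of joint-modality proof the abstract describes. The resulting tree can be inspected node by node, which is what makes the reasoning interpretable.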