CV · AI · Apr 23, 2025

Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

arXiv:2504.16788v12 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the need for interpretable video analysis in applications like intelligent monitoring and autonomous systems, but it is incremental as it builds on existing transformer and feature extraction methods.

The paper tackles the problem of generating natural language descriptions from video datasets to enhance explainable AI, achieving BLEU-4 scores of 0.755 on BDD-X and 0.778 on MSVD, along with improvements in CIDEr, METEOR, and ROUGE-L metrics.
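To make the headline metric concrete, here is a minimal sketch of unsmoothed, sentence-level BLEU-4 in plain Python. It is an illustration only, not the paper's evaluation code: published scores are typically computed at corpus level with an established toolkit, and the function names and the example sentences below are made up for this sketch rather than taken from BDD-X or MSVD.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate has fewer than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing: a single zero precision zeroes the score
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


# Hypothetical driving-style caption pair, invented for illustration.
reference = "the car slows down since the light is red".split()
candidate = "the car slows down because the light is red".split()
print(round(bleu(candidate, [reference]), 3))
```

Because the score is a geometric mean of 1-gram to 4-gram precisions, a single substituted word lowers it noticeably: the substitution breaks every higher-order n-gram that spans it.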

Understanding and analyzing video actions is essential for producing insightful, contextualized descriptions, especially for video-based applications such as intelligent monitoring and autonomous systems. This work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The architecture uses ResNet50 to extract visual features from video frames taken from the Microsoft Research Video Description Corpus (MSVD) and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual features are converted into patch embeddings and passed through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). To align textual and visual representations and support high-quality description generation, the system uses multi-head self-attention and cross-attention. Performance is evaluated with BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The framework outperforms traditional methods, with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, this research strengthens interpretability, improves real-world applicability, and advances explainable AI.
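The cross-attention step named in the abstract can be sketched as single-head scaled dot-product attention, where each text token state queries the visual patch embeddings. This is a simplified illustration under stated assumptions, not the authors' implementation: the paper uses multi-head attention inside a GPT-2-based model, whereas this sketch omits the learned query, key, and value projections and the multiple heads. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, patch_embeddings):
    """Single-head scaled dot-product cross-attention.

    Queries come from text token states; keys and values are the visual
    patch embeddings. Assumption: identity projections (no learned weights).
    """
    dim = len(patch_embeddings[0])
    attended, weights = [], []
    for query in text_states:
        # Similarity of this text token to every visual patch, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, patch)) / math.sqrt(dim)
            for patch in patch_embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of patch embeddings: the visual context for this token.
        attended.append([
            sum(w * patch[i] for w, patch in zip(row, patch_embeddings))
            for i in range(dim)
        ])
    return attended, weights


# Toy 2-dimensional example: two text token states, three visual patches.
text_states = [[4.0, 0.0], [0.0, 4.0]]
patch_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, attention = cross_attention(text_states, patch_embeddings)
for row in attention:
    print([round(w, 3) for w in row])
```

Each row of the returned weights sums to 1 and shows how strongly one text token draws on each patch, which is the quantity one would inspect when relating a generated word to image regions.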

