IR · May 15

MERVIN: A Unified Framework for Multimodal Event Retrieval in Vietnamese News Videos

Anh-Tai Pham-Nguyen, Tung-Duong Le-Duc, Anh-Duy Le, Trung-Hieu Truong-Le

arXiv:2605.16120 · 52.6

Predicted impact: top 68% in IR · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers and practitioners working on Vietnamese video retrieval, MERVIN provides an effective multimodal solution, though it is an incremental application of existing methods.

MERVIN is a unified multimodal framework for event retrieval in Vietnamese news videos, integrating keyframes, transcripts, and summaries. It achieved 79/88 points in the AI Challenge HCMC 2025 qualification phase and retrieved all results for every query in the final round.

The growth of online video platforms drives the need for effective, semantically grounded event retrieval. We present MERVIN, a unified multimodal framework for Vietnamese news videos that integrates keyframes, transcripts, and video summaries. Transcript quality is enhanced via Gemini 1.5 Flash, reducing noise from accents, background sounds, and recognition errors. Visual features are extracted with Perception Encoder, while a Vietnamese language model produces textual embeddings; both are indexed in Milvus for efficient similarity-based retrieval. In addition, a React-based interface enables iterative query refinement across modalities, improving semantic alignment. Experimental results on Vietnamese news videos demonstrate the effectiveness of the proposed system, with MERVIN achieving 79 out of 88 points in the AI Challenge HCMC 2025 qualification phase and successfully retrieving all results for every query in the final round.
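The similarity-based retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain in-memory list stands in for Milvus, hand-written toy vectors stand in for the Perception Encoder and Vietnamese language model embeddings, and the names `Entry` and `search` are hypothetical. The abstract does not say how scores from different modalities are combined, so summing the best cosine similarity per modality is an assumption made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """One indexed item (hypothetical schema, not taken from the paper)."""
    video_id: str
    modality: str  # e.g. "keyframe", "transcript", or "summary"
    vector: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[Entry],
           query_vectors: Dict[str, List[float]],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank videos by the sum of their best per-modality cosine similarity.

    Assumption: sum fusion across modalities; the source does not specify
    the fusion rule.
    """
    best: Dict[Tuple[str, str], float] = {}
    for entry in index:
        query = query_vectors.get(entry.modality)
        if query is None:
            continue  # no query vector supplied for this modality
        key = (entry.video_id, entry.modality)
        best[key] = max(best.get(key, -1.0), cosine(query, entry.vector))

    totals: Dict[str, float] = {}
    for (video_id, _modality), score in best.items():
        totals[video_id] = totals.get(video_id, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy 2-dimensional vectors, chosen by hand for illustration only.
index = [
    Entry("v1", "keyframe", [1.0, 0.0]),
    Entry("v1", "transcript", [0.0, 1.0]),
    Entry("v2", "keyframe", [0.0, 1.0]),
    Entry("v2", "transcript", [1.0, 0.0]),
]
query = {"keyframe": [1.0, 0.0], "transcript": [0.0, 1.0]}
results = search(index, query)
```

In this toy setup, `v1` matches the query in both modalities and is ranked ahead of `v2`. A production system would delegate the nearest-neighbour search to a vector database rather than scanning a list.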
