CL, LG, MM, SD, AS | May 24, 2025

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

arXiv:2505.18614v41 citations | h-index: 3 | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of producing natural-sounding translations for animated musicals, an incremental advance in a domain-specific area.

The paper tackles the problem of singable lyrics translation for animated songs by introducing MAVL, a multilingual audio-video dataset, and proposes SylAVL-CoT, a method that outperforms text-based models in singability and contextual accuracy.

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
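
To make the idea of a syllabic constraint concrete, here is a minimal sketch of one way to compare the syllable counts of a source lyric line and a candidate translation. It is not the paper's method: the function names `count_syllables` and `syllable_gap` are hypothetical, and the vowel-run heuristic is a rough approximation that only behaves sensibly for English-like orthography.

```python
import re

# Assumption: a run of consecutive vowel letters approximates one syllable.
# This is a crude heuristic for English-like spelling, not a phonetic analysis.
VOWEL_GROUP = re.compile(r"[aeiouy]+", re.IGNORECASE)


def count_syllables(line: str) -> int:
    """Estimate syllables by counting vowel runs in each whitespace-separated word."""
    return sum(len(VOWEL_GROUP.findall(word)) for word in line.split())


def syllable_gap(source_line: str, translated_line: str) -> int:
    """Absolute difference in estimated syllables between two lyric lines.

    A gap of 0 means the candidate matches the source line's estimated
    syllable count; larger gaps suggest the line is harder to sing to
    the original melody.
    """
    return abs(count_syllables(source_line) - count_syllables(translated_line))


if __name__ == "__main__":
    source = "do you want to build a snowman"
    candidate = "let it go now"
    print(count_syllables(source), count_syllables(candidate))
    print(syllable_gap(source, candidate))
```

A check like this could serve as a filter or score over candidate translations; a realistic multilingual setup would need a language-specific syllabifier for each target language rather than a single vowel-run rule.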

Code Implementations: 1 repo