CL · Sep 16, 2025

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

arXiv:2509.12591v1 · 3 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating captions for audio clips without extensive training. The advance is incremental, since it builds on existing pre-trained models.

The paper tackles automated audio captioning under limited training data by proposing a zero-shot system that combines a pre-trained audio CLIP model with a large language model, raising the NLG mean score from 4.7 to 7.3 (reported as a 35% improvement).
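The reported percentage can be checked against the two quoted scores. The sketch below is only an arithmetic check, not something stated in the source: the function names are illustrative, and the reading that the 35% figure was taken relative to the improved score is an inference from the numbers.

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change of `new` relative to the baseline `old`."""
    return (new - old) / old * 100.0


def gain_share_of_new(old: float, new: float) -> float:
    """Gain expressed as a percentage of the improved score `new`."""
    return (new - old) / new * 100.0


baseline, improved = 4.7, 7.3  # NLG mean scores quoted in the summary

# Conventional relative improvement over the baseline.
print(round(relative_change(baseline, improved), 1))    # 55.3

# Gain as a share of the new score; this is the reading closest to 35%.
print(round(gain_share_of_new(baseline, improved), 1))  # 35.6
```

Under the conventional definition, the gain over the 4.7 baseline is about 55%. The 35% figure is consistent with dividing the same 2.6-point gain by the improved score of 7.3.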

Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. Performance depends heavily on the audio-text matching model and on keyword selection: the best results are achieved with a single-keyword prompt, and performance drops by 50% when no keyword list is used.
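The two steps described above, keyword prompting and audio-guided token selection, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the embeddings are hand-written vectors rather than outputs of an audio CLIP model, the prompt template is hypothetical, and the scoring rule is a simplified MAGIC-style combination (language-model probability plus a weighted softmax over audio-text similarity) that omits the degeneration penalty. It is not the paper's implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_keywords(audio_emb: Sequence[float],
                    keyword_embs: Dict[str, Sequence[float]],
                    top_n: int = 1) -> List[str]:
    """Rank keywords by similarity to the audio embedding, keep the top_n."""
    ranked = sorted(keyword_embs,
                    key=lambda k: cosine(audio_emb, keyword_embs[k]),
                    reverse=True)
    return ranked[:top_n]


def build_prompt(keywords: List[str]) -> str:
    """Hypothetical prompt template; the source does not give the wording."""
    return f"Keywords: {', '.join(keywords)}. This is a sound of"


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(lm_probs: Dict[str, float],
                    audio_scores: Dict[str, float],
                    beta: float = 2.0,
                    top_k: int = 3) -> str:
    """Choose among the LM's top_k tokens using LM probability plus
    beta times the softmax-normalised audio-text similarity."""
    candidates = sorted(lm_probs, key=lm_probs.get, reverse=True)[:top_k]
    weights = softmax([audio_scores[c] for c in candidates])
    combined = {c: lm_probs[c] + beta * w for c, w in zip(candidates, weights)}
    return max(combined, key=combined.get)


# Toy data standing in for real model outputs (all values are made up).
audio_emb = [0.9, 0.1, 0.0]
keyword_embs = {"dog": [1.0, 0.0, 0.0],
                "rain": [0.0, 1.0, 0.0],
                "engine": [0.0, 0.0, 1.0]}
keywords = select_keywords(audio_emb, keyword_embs, top_n=1)
print(build_prompt(keywords))  # Keywords: dog. This is a sound of

lm_probs = {"music": 0.5, "barking": 0.3, "rain": 0.2}
audio_scores = {"music": 0.1, "barking": 0.8, "rain": 0.2}
print(max(lm_probs, key=lm_probs.get))          # greedy choice: music
print(pick_next_token(lm_probs, audio_scores))  # audio-guided choice: barking
```

In this toy case greedy decoding picks the most probable token under the language model alone, while the audio-guided score promotes the candidate that matches the audio better. In a real system the similarity scores would come from the audio-text matching model, which the abstract identifies as a major factor in performance.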

