CL Sep 16, 2025

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap

arXiv:2509.12591v12.73 citations h-index: 12

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating captions for audio clips without extensive training, though the advance is incremental because it builds on existing pre-trained models.

The paper tackles automated audio captioning under limited training data by proposing a zero-shot system that combines a pre-trained audio CLIP model with a large language model. Using MAGIC search with the WavCaps model, it reports a 35% improvement in NLG mean score (from 4.7 to 7.3).

Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. Performance is heavily influenced by the audio-text matching model and by keyword selection: the best results are achieved with a single-keyword prompt, and performance drops by 50% when no keyword list is used.
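The two steps described in the abstract, keyword prompting and audio-guided token selection, can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings and probabilities are hand-made numbers, the prompt template in `build_prompt` is hypothetical, and `magic_step` uses a simplified scoring rule (LLM probability plus a weighted softmax over audio-text similarities) that omits the degeneration penalty found in the original MAGIC formulation. The weight `beta` and all function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def select_keywords(audio_emb, keyword_embs, n=1):
    """Rank candidate keywords by audio-text similarity and keep the top n."""
    ranked = sorted(
        keyword_embs,
        key=lambda k: cosine(audio_emb, keyword_embs[k]),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(keywords):
    """Hypothetical prompt template; the paper's actual wording may differ."""
    return f"Keywords: {', '.join(keywords)}. This is the sound of"


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def magic_step(lm_probs, audio_sims, beta=2.0):
    """Choose the next token among the LLM's top-k candidates.

    lm_probs:   token -> LLM probability for that candidate
    audio_sims: token -> similarity between the audio and (prefix + token)
    beta:       assumed weight of the audio-matching term; 0 gives greedy decoding
    """
    tokens = list(lm_probs)
    match = softmax([audio_sims[t] for t in tokens])
    scores = {t: lm_probs[t] + beta * m for t, m in zip(tokens, match)}
    return max(scores, key=scores.get)


# Toy data standing in for audio CLIP embeddings and LLM outputs.
audio_emb = [0.9, 0.1, 0.0]
keyword_embs = {
    "dog": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "engine": [0.0, 0.0, 1.0],
}
lm_probs = {"a": 0.5, "barking": 0.3, "music": 0.2}
audio_sims = {"a": 0.1, "barking": 0.9, "music": 0.0}

keywords = select_keywords(audio_emb, keyword_embs, n=1)
print(build_prompt(keywords))
print(magic_step(lm_probs, audio_sims, beta=0.0))  # greedy choice
print(magic_step(lm_probs, audio_sims, beta=2.0))  # audio-guided choice
```

With these toy numbers, greedy decoding (`beta=0.0`) picks the most probable token "a", while the audio-guided score picks "barking", the candidate that best matches the audio. In a real system the similarities would come from an audio-text matching model and the candidates from the LLM's top-k predictions at each step.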
