SD · AI · AS · Sep 29, 2024

PALM: Few-Shot Prompt Learning for Audio Language Models

arXiv:2409.19806v1 · 25 citations · h-index: 8 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving few-shot learning efficiency for audio recognition tasks. The advance is incremental because it adapts existing prompt learning techniques from vision to audio.

The paper tackles the sensitivity of zero-shot audio recognition in Audio-Language Models to hand-crafted text prompts. It proposes PALM, a method that optimizes the feature space of the text encoder rather than its input space. In a few-shot setup on 11 audio recognition datasets, PALM performs on par with or better than three baselines while being computationally less demanding.

Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/
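
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `zero_shot_predict` and `adapt_text_feats`, the toy feature vectors, and the blending rule are all assumptions made for illustration. The first function scores each audio feature against each class text feature by cosine similarity. The second shows one hypothetical way to adjust features at the output of a frozen text encoder with learnable per-class vectors, instead of learning tokens in the encoder's input space.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_predict(audio_feats, text_feats):
    """For each audio clip, pick the class whose text feature is most similar."""
    logits = l2_normalize(audio_feats) @ l2_normalize(text_feats).T
    return logits.argmax(axis=1), logits  # logits: (n_clips, n_classes)

def adapt_text_feats(text_feats, learned, mix):
    """Hypothetical feature-space adaptation (an assumption, not the paper's code).

    Blends frozen text-encoder outputs with learnable per-class vectors;
    mix[i] in [0, 1] controls how much class i relies on its learned vector.
    """
    mix = mix[:, None]
    return (1.0 - mix) * text_feats + mix * learned

# Toy stand-ins for encoder outputs: 3 classes, 8-dimensional features.
text_feats = np.eye(3, 8)
# Each "clip" is a slightly perturbed copy of one class's text feature.
audio_feats = text_feats[[2, 0, 1]] + 0.1

preds, _ = zero_shot_predict(audio_feats, text_feats)
print(preds)
```

In a few-shot setting, `learned` and `mix` would be the only trainable parameters, fitted with a classification loss on a handful of labelled clips per class while both encoders stay frozen. Because nothing is prepended to the text encoder's input in this sketch, the frozen text features can be computed once and reused.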

