Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification
For practitioners of zero-shot audio classification, DAS offers a lightweight, text-only method to improve robustness to acoustic noise without retraining or test-time adaptation.
Contrastive audio-language models like CLAP lose 12-30 percentage points of accuracy and mAP under acoustic noise at 0 dB SNR. The authors propose Drift-Augmented Scoring (DAS), a text-derived per-class bonus that improves accuracy by +2.60 to +5.75 points on UrbanSound8K and mAP by +1.50 to +1.74 points on FSD50K across the tested SNRs.
Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding against the embeddings of text prompts, without any labelled audio. This matching breaks down under acoustic noise: at 0 dB SNR, accuracy and mAP fall by 12-30 percentage points on standard benchmarks. We propose Drift-Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the reported metric in every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and by +1.50 to +1.74 mAP points on FSD50K.
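The scoring rule described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: the drift direction is taken as the normalized difference between a class's noise-conditioned and clean text embeddings, the bonus is the inner product of the audio embedding with that direction, and the names `drift_directions`, `das_scores`, and `weight` are hypothetical. The toy 3-dimensional vectors stand in for real CLAP embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def drift_directions(clean_text_emb, noisy_text_emb):
    """Per-class unit drift direction, from text embeddings only.

    Assumed form: normalize(noise-conditioned prompt - clean prompt).
    Computed once per class set and cached.
    """
    return l2_normalize(l2_normalize(noisy_text_emb) - l2_normalize(clean_text_emb))

def das_scores(audio_emb, clean_text_emb, drift_dirs, weight=0.1):
    """Cosine score plus a weighted drift bonus (one extra inner product per class).

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    a = l2_normalize(audio_emb)
    cosine = l2_normalize(clean_text_emb) @ a   # standard zero-shot score
    bonus = drift_dirs @ a                      # alignment with predicted drift
    return cosine + weight * bonus

# Toy example: two classes in a 3-d embedding space.
clean = np.array([[1.0, 0.0, 0.0],    # class 0, clean prompt
                  [0.0, 1.0, 0.0]])   # class 1, clean prompt
noisy = np.array([[1.0, 0.0, 1.0],    # class 0 drifts toward +z under noise
                  [0.0, 1.0, -1.0]])  # class 1 drifts toward -z under noise
dirs = drift_directions(clean, noisy)  # cached, text-only

# A noisy audio embedding that slightly favours class 1 by cosine alone,
# but has moved toward +z, the direction class 0's prompts predict.
audio = np.array([0.9, 1.0, 1.0])

plain = das_scores(audio, clean, dirs, weight=0.0)
with_bonus = das_scores(audio, clean, dirs, weight=0.1)
print("cosine only :", plain, "->", int(np.argmax(plain)))
print("with bonus  :", with_bonus, "->", int(np.argmax(with_bonus)))
```

In this toy setting the bonus changes the predicted class because the audio embedding aligns with one class's text-predicted drift and opposes the other's. Whether the published method normalizes the drift, how it weights the bonus, and how it builds the noise-conditioned prompts are not stated in the abstract.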