SD · AI · AS · Sep 20, 2025

AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval

arXiv:2509.16649v1 · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work addresses audio retrieval for researchers in sound event detection, but it is incremental as it builds on prior methods from the DCASE challenge.

The paper tackles language-based audio retrieval with a dual-encoder system trained by contrastive learning, LLM-based data augmentation, and an auxiliary classification task derived from clustering. It reports a mAP@16 of 46.62 for a single system and 48.83 for an ensemble on the Clotho development test split.
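For readers unfamiliar with the metric, mAP@16 can be sketched as follows. This is a minimal illustration, not the challenge's official scoring code: the function names are hypothetical, and dividing by min(k, number of relevant items) is one common normalization that is assumed here. The function returns a value in [0, 1]; the figures quoted above appear to be on a 0 to 100 scale.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 16) -> float:
    """AP@k for one text query over a ranked list of audio ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    # Only the top-k retrieved items count towards the score.
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalization: best achievable number of hits within k.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: List[Sequence[str]], relevants: List[Set[str]], k: int = 16
) -> float:
    """Mean of AP@k over all queries."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Two toy queries: the first finds its audio at rank 1, the second at rank 3.
example_rankings = [["a", "b", "c"], ["x", "y", "z"]]
example_relevants = [{"a"}, {"z"}]
example_map = mean_average_precision_at_k(example_rankings, example_relevants, k=16)
```

When each caption has exactly one matching audio clip, AP@16 reduces to the reciprocal rank of that clip if it appears in the top 16, and zero otherwise.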

This report presents the AISTAT team's submission to DCASE 2025 Task 6, language-based audio retrieval. Our proposed system employs a dual-encoder architecture in which the audio and text modalities are encoded separately and their representations are aligned using contrastive learning. Drawing on methods from the previous year's challenge, we implemented a distillation approach and used large language models (LLMs) for data augmentation, including back-translation and LLM mix. Additionally, we used clustering to introduce an auxiliary classification task for further fine-tuning. Our best single system achieved a mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.
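The contrastive alignment of separately encoded audio and text can be illustrated with a symmetric cross-entropy loss over a batch similarity matrix. The abstract does not give the exact loss, so this is a generic sketch under assumptions: the function names are hypothetical, the temperature of 0.07 is an arbitrary illustrative value, and the embeddings stand in for the outputs of the two encoders. Distillation, augmentation, and the auxiliary classification task are not modelled.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row: List[float]) -> List[float]:
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(
    audio: List[Vector], text: List[Vector], temperature: float = 0.07
) -> float:
    """Average of audio-to-text and text-to-audio cross-entropy.

    audio[i] and text[i] are assumed to be a matching pair; every other
    combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio]
    t = [l2_normalize(v) for v in text]
    n = len(a)
    # sim[i][j]: temperature-scaled cosine similarity of audio i and caption j.
    sim = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio to text: row i should concentrate probability on column i.
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text to audio: column j should concentrate probability on row j.
    t2a = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


# Toy batch: aligned pairs give a near-zero loss, swapped pairs a large one.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_contrastive_loss(audio_emb, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_contrastive_loss(audio_emb, [[0.0, 1.0], [1.0, 0.0]])
```

At retrieval time the same similarity matrix is used directly: for each caption, audio clips are ranked by their similarity to the caption embedding.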

