SD, CL, IR, AS · Jul 8, 2022

Automated Audio Captioning and Language-Based Audio Retrieval

CMU
arXiv:2207.04156v2 · 2 citations · h-index: 34
Originality: Synthesis-oriented
AI Analysis

This work addresses audio understanding tasks for multimedia applications, but it is incremental as it modifies existing baseline models.

The project tackled automated audio captioning and language-based audio retrieval on the Clotho dataset, coming close to baseline performance for captioning and surpassing the baseline for retrieval.

This project involved participation in the DCASE 2022 Competition (Task 6), which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The first subtask involved generating a textual description of an audio sample, while the goal of the second was to find audio samples within a fixed dataset that match a given description. Both subtasks used the Clotho dataset. The models were evaluated on BLEU-1, BLEU-2, BLEU-3, ROUGE-L, METEOR, CIDEr, SPICE, and SPIDEr scores for audio captioning and on R@1, R@5, R@10, and mAP@10 scores for audio retrieval. We conducted a handful of experiments that modify the baseline models for these tasks. Our final architecture for Automated Audio Captioning performs close to the baseline, while our model for Language-Based Audio Retrieval surpasses its baseline counterpart.
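The retrieval metrics named above can be illustrated with a minimal sketch. It assumes the R-prefixed scores are recall at rank K and that the last metric is a mean average precision cut off at rank 10. The function names, the toy queries, and the audio IDs are hypothetical, and the normalization by min(number of relevant items, K) is one common convention, not necessarily the one used by the challenge's evaluation code.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Average of precision values at each rank (up to k) where a relevant item occurs.

    Assumption: normalized by min(len(relevant), k).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def mean_over_queries(
    metric: Callable[[List[str], Set[str], int], float],
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over all text queries."""
    scores = [metric(rankings[q], ground_truth[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical toy data: each caption query maps to a ranked list of audio IDs.
rankings = {
    "a dog barks": ["a3", "a1", "a2"],
    "rain on a roof": ["a2", "a3", "a1"],
}
ground_truth = {
    "a dog barks": {"a1"},
    "rain on a roof": {"a2"},
}

for k in (1, 5, 10):
    print(f"R@{k}:", mean_over_queries(recall_at_k, rankings, ground_truth, k))
print("mAP@10:", mean_over_queries(average_precision_at_k, rankings, ground_truth, 10))
```

In this toy setup each query has exactly one matching audio clip, so recall at K is either 0 or 1 per query and the average precision reduces to the reciprocal rank of the matching clip.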

Code Implementations: 1 repo
