SD CL AS | Sep 10, 2024

Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models

Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

arXiv:2409.06223v3

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in audio AI for commercial and edge applications, representing an incremental improvement in temporal reasoning for LALMs.

The paper tackles limited temporal reasoning in Large Audio Language Models (LALMs) for Audio Question Answering (AQA). It introduces a data augmentation technique for temporal questions and answers and fine-tunes an existing baseline with curriculum learning, reporting improved performance on public audio benchmarks. It also investigates on-device CPU inference for edge applications.

The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. Recently, AQA has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models (LLMs) through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we further fine-tune an existing baseline using a curriculum learning strategy to specialize in temporal reasoning without compromising performance on fine-tuned tasks. We demonstrate the performance of our model using state-of-the-art LALMs on public audio benchmark datasets. Third, we implement our AQA model on-device and investigate its CPU inference for edge applications.
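
The abstract does not spell out how the temporal questions and answers are generated, beyond saying that an LLM is used. As a rough illustration of what a temporal question-answer pair might look like, the sketch below builds "which sound starts first" questions from timestamped event labels. It is a rule-based stand-in, not the paper's LLM-based method; the `AudioEvent` type, the `temporal_qa_pairs` function, and the question template are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioEvent:
    """Hypothetical annotation: an event label with onset/offset in seconds."""
    label: str
    onset: float
    offset: float


def temporal_qa_pairs(events):
    """Build (question, answer) pairs about onset order.

    Assumption: each label occurs once per clip. Pairs with the same
    label or identical onsets are skipped because the answer would be
    ambiguous.
    """
    ordered = sorted(events, key=lambda e: e.onset)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if first.label == second.label or first.onset == second.onset:
                continue
            question = (
                f"Which sound starts first, {second.label} or {first.label}?"
            )
            pairs.append((question, first.label))
    return pairs


# Made-up annotations for a single clip, for illustration only.
events = [
    AudioEvent("dog barking", 0.5, 2.0),
    AudioEvent("door slam", 3.1, 3.4),
    AudioEvent("speech", 4.0, 9.5),
]
qa = temporal_qa_pairs(events)
for question, answer in qa:
    print(question, "->", answer)
```

In a pipeline like the one the abstract describes, an LLM would presumably take the place of the fixed template to produce more varied phrasing, with the timestamps serving as ground truth for checking the generated answers.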
