CL Jan 14

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bin Qin

arXiv:2601.09270v1 0.6 h-index: 7 Has Code

Originality Synthesis-oriented

AI Analysis

This addresses a gap in audio data for researchers in Chinese Classical Studies, but it is incremental as it extends existing multimodal approaches to a new domain.

The authors address the lack of audio resources in Chinese Classical Studies by creating MCGA, a multi-task audio corpus covering six tasks, including speech recognition and speech reasoning. They find that current multimodal large language models perform poorly on it, though no specific numbers are provided.

With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA
