CLMar 17
On the Emotion Understanding of Synthesized SpeechYuan Ge, Haishu Zhao, Aokai Hao et al.
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models can not generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.
CLAug 28, 2025
SageLM: A Multi-aspect and Explainable Large Language Model for Speech JudgementYuan Ge, Junxiang Zhang, Xiaoqian Liu et al.
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose \texttt{SageLM}, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLMs evaluation. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce \textit{SpeechFeedback}, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79\% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42\% and 26.20\%, respectively.
CVJul 13, 2025
SDTN and TRN: Adaptive Spectral-Spatial Feature Extraction for Hyperspectral Image ClassificationFuyin Ye, Erwen Yao, Jianyong Chen et al.
Hyperspectral image classification plays a pivotal role in precision agriculture, providing accurate insights into crop health monitoring, disease detection, and soil analysis. However, traditional methods struggle with high-dimensional data, spectral-spatial redundancy, and the scarcity of labeled samples, often leading to suboptimal performance. To address these challenges, we propose the Self-Adaptive Tensor- Regularized Network (SDTN), which combines tensor decomposition with regularization mechanisms to dynamically adjust tensor ranks, ensuring optimal feature representation tailored to the complexity of the data. Building upon SDTN, we propose the Tensor-Regularized Network (TRN), which integrates the features extracted by SDTN into a lightweight network capable of capturing spectral-spatial features at multiple scales. This approach not only maintains high classification accuracy but also significantly reduces computational complexity, making the framework highly suitable for real-time deployment in resource-constrained environments. Experiments on PaviaU datasets demonstrate significant improvements in accuracy and reduced model parameters compared to state-of-the-art methods.