Sunan Li

AI
4 papers
42 citations
Novelty 28%
AI Score 38

4 Papers

SD Oct 22, 2022
Speech Emotion Recognition via an Attentive Time-Frequency Neural Network

Cheng Lu, Wenming Zheng, Hailun Lian et al.

Spectrograms are commonly used as the input features of deep neural networks to learn the higher-level time-frequency patterns of speech signals for speech emotion recognition (SER). Generally, different emotions correspond to specific energy activations within both frequency bands and time frames on the spectrogram, which indicates that the frequency and time domains are both essential for representing emotion in SER. However, recent spectrogram-based works mainly focus on modeling long-term dependencies in the time domain, which leads to two issues: (1) they neglect the emotion-related correlations within the frequency domain during time-frequency joint learning; (2) they fail to capture the specific frequency bands associated with emotions. To cope with these issues, we propose an attentive time-frequency neural network (ATFNN) for SER, consisting of a time-frequency neural network (TFNN) and time-frequency attention. Specifically, to address the first issue, we design a TFNN with a frequency-domain encoder (F-Encoder) based on the Transformer encoder and a time-domain encoder (T-Encoder) based on Bidirectional Long Short-Term Memory (Bi-LSTM). The F-Encoder and T-Encoder model the correlations within frequency bands and time frames, respectively, and they are embedded into a time-frequency joint learning strategy to obtain the time-frequency patterns of speech emotions. Moreover, to handle the second issue, we adopt time-frequency attention with a frequency-attention network (F-Attention) and a time-attention network (T-Attention) to focus on the emotion-related frequency band ranges and time frame ranges, which can enhance the discriminability of speech emotion features.
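
As a rough illustration of the attention idea described in this abstract, the sketch below applies softmax weights first over frequency bands and then over time frames of a toy spectrogram. It is not the authors' ATFNN: the function name `attention_pool` is hypothetical, the Transformer and Bi-LSTM encoders are omitted, and mean energy is used as a stand-in for the attention scores that the paper's F-Attention and T-Attention networks would learn.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(spec):
    """Pool a toy spectrogram with frequency attention, then time attention.

    spec: list of time frames, each a list of frequency-band energies.
    ASSUMPTION: scores are plain energies, not learned parameters.
    """
    n_t, n_f = len(spec), len(spec[0])

    # Frequency attention: score each band by its mean energy over time.
    band_scores = [sum(spec[t][f] for t in range(n_t)) / n_t for f in range(n_f)]
    f_weights = softmax(band_scores)

    # Collapse each frame to one value using the frequency weights.
    frame_feats = [sum(w * x for w, x in zip(f_weights, frame)) for frame in spec]

    # Time attention: weight frames by their pooled value.
    t_weights = softmax(frame_feats)
    pooled = sum(w * x for w, x in zip(t_weights, frame_feats))
    return f_weights, t_weights, pooled


# Toy input: 3 time frames x 3 frequency bands; band 1 carries most energy.
spec = [
    [0.1, 2.0, 0.3],
    [0.2, 3.0, 0.1],
    [0.1, 0.5, 0.2],
]
f_weights, t_weights, pooled = attention_pool(spec)
```

In this toy input, the band and the frame with the most energy receive the largest weights, which mirrors the stated goal of focusing on emotion-related frequency band ranges and time frame ranges.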

CV Jul 17, 2024
Temporal Label Hierachical Network for Compound Emotion Recognition

Sunan Li, Hailun Lian, Cheng Lu et al.

Emotion recognition has attracted increasing attention in recent decades. Although significant progress has been made in recognizing the seven basic emotions, existing methods still struggle with compound emotion recognition, which occurs commonly in practical applications. This article introduces our results in the 7th Affective Behavior Analysis in-the-wild (ABAW) competition. In the competition, we selected pre-trained ResNet18 and Transformer, both of which have been widely validated, as the basic network framework. Considering the continuity of emotions over time, we propose a time pyramid structure network for frame-level emotion prediction. Furthermore, to address the lack of data in compound emotion recognition, we used fine-grained labels from the DFEW database to construct training data for the emotion categories in the competition. Taking into account the valence-arousal characteristics of various complex emotions, we constructed a coarse-to-fine classification framework in the label space.

HC May 18
A Brief Overview: On-Policy Self-Distillation In Large Language Models

Fangming Cui, Sunan Li, Jiahong Li

On-Policy Self-Distillation (OPSD) introduces a unified learning framework in which a single large language model simultaneously serves as both teacher and student. Unlike conventional knowledge distillation that relies on a separate, often larger teacher model, OPSD operates under different contextual roles: the teacher policy is granted privileged access to verified reasoning traces, while the student policy observes only the problem statement. OPSD is trained to minimize per-token distributional divergence between the two roles over trajectories sampled from the student itself, thereby aligning its own reasoning behavior with solution-aware rationalizations. OPSD eliminates the need for an external teacher, directly leverages ground-truth solution information, and resolves the distribution mismatch inherent in off-policy distillation. OPSD typically reduces GPU memory consumption by approximately 40%-60% compared to standard On-Policy Distillation (OPD). In this paper, we present a brief analysis of the conceptual foundations, methodological innovations, and principled designs underlying recent advances in OPSD for large language models. This discussion, crafted from the perspective of beginners in this field, aims to provide a concise overview of the design principles and emerging patterns of OPSD in LLMs, intended for researchers who are similarly new to this area.
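
The per-token objective described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any specific OPSD paper: the function name `opsd_loss` is hypothetical, the logits are toy values, and the divergence is taken to be the forward KL from teacher to student, whereas the abstract only says "distributional divergence" and does not fix the direction.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def opsd_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one trajectory.

    Both arguments hold one logit vector per token of a trajectory that was
    sampled from the student. In OPSD the two sets of logits would come from
    the same model under different contexts (the teacher role also sees the
    verified solution); here they are just lists of floats.
    ASSUMPTION: forward KL is one possible choice of divergence.
    """
    assert len(teacher_logits) == len(student_logits)
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy trajectory: 2 tokens, vocabulary of 3.
teacher = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2]]
student = [[1.0, 1.0, 1.0], [0.1, 3.0, 0.2]]

loss_mismatch = opsd_loss(teacher, student)  # first token disagrees
loss_matched = opsd_loss(teacher, teacher)   # identical distributions
```

Because the trajectory is sampled from the student, the loss is evaluated on the states the student actually visits, which is the distribution-mismatch point the abstract makes about off-policy distillation.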

AI Apr 30
Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang et al.

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we take an in-depth look at the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.