Yuanzhe Chen

SD
h-index65
12papers
523citations
Novelty42%
AI Score44

12 Papers

SDNov 11, 2025
SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

Xueyao Zhang, Chaoren Wang, Huan Liao et al.

Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can be also employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.

SDMay 19, 2025
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Ziyang Ma, Yinghao Ma, Yanqiao Zhu et al.

We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.

SDApr 27, 2024
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Yi Yuan, Zhuo Chen, Xubo Liu et al.

Contrastive language-audio pretraining~(CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models~(LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events and outperforms state-of-the-art models by a significant margin.

SDAug 22, 2025
Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning

Xueyao Zhang, Junan Zhang, Yuancheng Wang et al.

Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a low-frame-rate (12.5 Hz) content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly, during pre-training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the AR model's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are are available at https://versasinger.github.io/.

SDMay 25, 2025
Towards Reliable Large Audio Language Model

Ziyang Ma, Xiquan Li, Yakun Song et al.

Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and refuse to answer questions they don't know proactively. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliable methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a "meta ability", which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.

CVOct 8, 2025
Heptapod: Language Modeling on Visual Signals

Yongxin Zhu, Jiawei Chen, Yuanzhe Chen et al.

We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs \textbf{causal attention}, \textbf{eliminates reliance on CFG}, and \textbf{eschews the trend of semantic tokenizers}. Our key innovation is \textit{next 2D distribution prediction}: a causal Transformer with reconstruction-focused visual tokenizer, learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of $2.70$, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.

ASOct 7, 2021
Cloning one's voice using very limited data in the wild

Dongyang Dai, Yuanzhe Chen, Li Chen et al.

With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and timbre are modeled separately using two modules, therefore, the independent control of timbre and the other characteristics of audio can be achieved while generating speech. The practice shows that, for very limited target speaker data in the wild, Hieratron has obvious advantages over the traditional method, in addition to controlling the style and language of the generated speech, the mean opinion score on speech quality of the generated speech has also been improved by more than 0.2 points.

HCMay 7, 2020
DFSeer: A Visual Analytics Approach to Facilitate Model Selection for Demand Forecasting

Dong Sun, Zezheng Feng, Yuanzhe Chen et al.

Selecting an appropriate model to forecast product demand is critical to the manufacturing industry. However, due to the data complexity, market uncertainty and users' demanding requirements for the model, it is challenging for demand analysts to select a proper model. Although existing model selection methods can reduce the manual burden to some extent, they often fail to present model performance details on individual products and reveal the potential risk of the selected model. This paper presents DFSeer, an interactive visualization system to conduct reliable model selection for demand forecasting based on the products with similar historical demand. It supports model comparison and selection with different levels of details. Besides, it shows the difference in model performance on similar products to reveal the risk of model selection and increase users' confidence in choosing a forecasting model. Two case studies and interviews with domain experts demonstrate the effectiveness and usability of DFSeer.

HCJul 29, 2019
PlanningVis: A Visual Analytics Approach to Production Planning in Smart Factories

Dong Sun, Renfei Huang, Yuanzhe Chen et al.

Production planning in the manufacturing industry is crucial for fully utilizing factory resources (e.g., machines, raw materials and workers) and reducing costs. With the advent of industry 4.0, plenty of data recording the status of factory resources have been collected and further involved in production planning, which brings an unprecedented opportunity to understand, evaluate and adjust complex production plans through a data-driven approach. However, developing a systematic analytics approach for production planning is challenging due to the large volume of production data, the complex dependency between products, and unexpected changes in the market and the plant. Previous studies only provide summarized results and fail to show details for comparative analysis of production plans. Besides, the rapid adjustment to the plan in the case of an unanticipated incident is also not supported. In this paper, we propose PlanningVis, a visual analytics system to support the exploration and comparison of production plans with three levels of details: a plan overview presenting the overall difference between plans, a product view visualizing various properties of individual products, and a production detail view displaying the product dependency and the daily production details in related factories. By integrating an automatic planning algorithm with interactive visual explorations, PlanningVis can facilitate the efficient optimization of daily production planning as well as support a quick response to unanticipated incidents in manufacturing. Two case studies with real-world data and carefully designed interviews with domain experts demonstrate the effectiveness and usability of PlanningVis.

CLOct 30, 2017
Understanding Hidden Memories of Recurrent Neural Networks

Yao Ming, Shaozu Cao, Ruixiang Zhang et al.

Recurrent neural networks (RNNs) have been successfully applied to various natural language processing (NLP) tasks and achieved better results than conventional methods. However, the lack of understanding of the mechanisms behind their effectiveness limits further improvements on their architectures. In this paper, we present a visual analytics method for understanding and comparing RNN models for NLP tasks. We propose a technique to explain the function of individual hidden state units based on their expected response to input texts. We then co-cluster hidden state units and words based on the expected response and visualize co-clustering results as memory chips and word clouds to provide more structured knowledge on RNNs' hidden states. We also propose a glyph-based sequence visualization based on aggregate information to analyze the behavior of an RNN's hidden state at the sentence-level. The usability and effectiveness of our method are demonstrated through case studies and reviews from domain experts.

CVFeb 21, 2015
Intra-and-Inter-Constraint-based Video Enhancement based on Piecewise Tone Mapping

Yuanzhe Chen, Weiyao Lin, Chongyang Zhang et al.

Video enhancement plays an important role in various video applications. In this paper, we propose a new intra-and-inter-constraint-based video enhancement approach aiming to 1) achieve high intra-frame quality of the entire picture where multiple region-of-interests (ROIs) can be adaptively and simultaneously enhanced, and 2) guarantee the inter-frame quality consistencies among video frames. We first analyze features from different ROIs and create a piecewise tone mapping curve for the entire frame such that the intra-frame quality of a frame can be enhanced. We further introduce new inter-frame constraints to improve the temporal quality consistency. Experimental results show that the proposed algorithm obviously outperforms the state-of-the-art algorithms.

CVFeb 21, 2015
A new network-based algorithm for human activity recognition in video

Weiyao Lin, Yuanzhe Chen, Jianxin Wu et al.

In this paper, a new network-transmission-based (NTB) algorithm is proposed for human activity recognition in videos. The proposed NTB algorithm models the entire scene as an error-free network. In this network, each node corresponds to a patch of the scene and each edge represents the activity correlation between the corresponding patches. Based on this network, we further model people in the scene as packages while human activities can be modeled as the process of package transmission in the network. By analyzing these specific "package transmission" processes, various activities can be effectively detected. The implementation of our NTB algorithm into abnormal activity detection and group activity recognition are described in detail in the paper. Experimental results demonstrate the effectiveness of our proposed algorithm.