Haoxiang Shi

CL
h-index39
14papers
691citations
Novelty45%
AI Score61

14 Papers

93.5SDMay 27Code
VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

Jashin Ye, Dongxiao Wang, Yixuan Ye et al.

While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates. The E2E inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.

CLMar 7, 2023
Is ChatGPT a Good NLG Evaluator? A Preliminary Study

Jiaan Wang, Yunlong Liang, Fandong Meng et al. · tsinghua

Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously show their poor correlation with human judgments, we wonder whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instruction to prompt ChatGPT to evaluate the generated results of NLG models. We conduct experiments on five NLG meta-evaluation datasets (including summarization, story generation and data-to-text tasks). Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator might be influenced by the creation method of the meta-evaluation datasets. For the meta-evaluation datasets which are created greatly depending on the reference and thus are biased, the ChatGPT evaluator might lose its effectiveness. We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.

CLJul 18, 2022Code
GOAL: Towards Benchmarking Few-Shot Sports Game Summarization

Jiaan Wang, Tingyi Zhang, Haoxiang Shi

Sports game summarization aims to generate sports news based on real-time commentaries. The task has attracted wide research attention but is still under-explored probably due to the lack of corresponding English datasets. Therefore, in this paper, we release GOAL, the first English sports game summarization dataset. Specifically, there are 103 commentary-news pairs in GOAL, where the average lengths of commentaries and news are 2724.9 and 476.3 words, respectively. Moreover, to support the research in the semi-supervised setting, GOAL additionally provides 2,160 unlabeled commentary documents. Based on our GOAL, we build and evaluate several baselines, including extractive and abstractive baselines. The experimental results show the challenges of this task still remain. We hope our work could promote the research of sports game summarization. The dataset has been released at https://github.com/krystalan/goal.

IRAug 5, 2023
Towards Consistency Filtering-Free Unsupervised Learning for Dense Retrieval

Haoxiang Shi, Sumio Fujita, Tetsuya Sakai

Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to finetune a general ranker and produce a domain-specific ranker. However, training such consistency filters are computationally expensive, which significantly reduces the model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods for achieving consistent filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo relevance feedback outperforms other methods. Furthermore, we analyzed the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance based on particular datasets.

SDSep 9, 2025Code
VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Jun Zhan, Mingyang Han, Yuxuan Xie et al.

Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona following natural language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human centered spoken interaction. The dataset and code are publicly available at \href{https://junzhan2000.github.io/VStyle.github.io/}{project's homepage}.

CLMay 20, 2024Code
CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models

Haoxiang Shi, Jiaan Wang, Jiarong Xu et al.

Text-to-Table aims to generate structured tables to convey the key information from unstructured documents. Existing text-to-table datasets are typically oriented English, limiting the research in non-English languages. Meanwhile, the emergence of large language models (LLMs) has shown great success as general task solvers in multi-lingual settings (e.g., ChatGPT), theoretically enabling text-to-table in other languages. In this paper, we propose a Chinese text-to-table dataset, CT-Eval, to benchmark LLMs on this task. Our preliminary analysis of English text-to-table datasets highlights two key factors for dataset construction: data diversity and data hallucination. Inspired by this, the CT-Eval dataset selects a popular Chinese multidisciplinary online encyclopedia as the source and covers 28 domains to ensure data diversity. To minimize data hallucination, we first train an LLM to judge and filter out the task samples with hallucination, then employ human annotators to clean the hallucinations in the validation and testing sets. After this process, CT-Eval contains 88.6K task samples. Using CT-Eval, we evaluate the performance of open-source and closed-source LLMs. Our results reveal that zero-shot LLMs (including GPT-4) still have a significant performance gap compared with human judgment. Furthermore, after fine-tuning, open-source LLMs can significantly improve their text-to-table ability, outperforming GPT-4 by a large margin. In short, CT-Eval not only helps researchers evaluate and quickly understand the Chinese text-to-table ability of existing LLMs but also serves as a valuable resource to significantly improve the text-to-table performance of LLMs.

ROAug 13, 2025Code
DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation

Haoxiang Shi, Xiang Deng, Zaijing Li et al.

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework, where a high-level waypoint predictor generates the navigable waypoints, and then a navigation planner suggests the intermediate goals in the high-level action space. However, this two-stage decomposition framework suffers from: (1) global sub-optimization due to the proxy objective in each stage, and (2) a performance bottleneck caused by the strong reliance on the quality of the first-stage predicted waypoints. To address these limitations, we propose DAgger Diffusion Navigation (DifNav), an end-to-end optimized VLN-CE policy that unifies the traditional two stages, i.e. waypoint generation and planning, into a single diffusion policy. Notably, DifNav employs a conditional diffusion policy to directly model multi-modal action distributions over future actions in continuous navigation space, eliminating the need for a waypoint predictor while enabling the agent to capture multiple possible instruction-following behaviors. To address the issues of compounding error in imitation learning and enhance spatial reasoning in long-horizon navigation tasks, we employ DAgger for online policy training and expert trajectory augmentation, and use the aggregated data to further fine-tune the policy. This approach significantly improves the policy's robustness and its ability to recover from error states. Extensive experiments on benchmark datasets demonstrate that, even without a waypoint predictor, the proposed method substantially outperforms previous state-of-the-art two-stage waypoint-based models in terms of navigation performance. Our code is available at: https://github.com/Tokishx/DifNav.

CLMay 19, 2025Code
An Empirical Study of Many-to-Many Summarization with Large Language Models

Jiaan Wang, Fandong Meng, Zengkui Sun et al.

Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which could be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also conducted for comparisons. Our experiments reveal that, zero-shot LLMs achieve competitive results with fine-tuned traditional models. After instruct-tuning, open-source LLMs can significantly improve their M2MS ability, and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs' general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and the instruction tuning might intensify the issue. Thus, how to control factual errors becomes the key when building LLM summarizers in real applications, and is worth noting in future research.

CLOct 24, 2024
AlignCap: Aligning Speech Emotion Captioning to Human Preferences

Ziqi Liang, Haoxiang Shi, Hanhui Chen

Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech are often complex, and classifying them into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which Aligning Speech Emotion Captioning to Human Preferences based on large language model (LLM) with two properties: 1) Speech-Text Alignment, which minimizing the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) Regularization. 2) Human Preference Alignment, where we design Preference Optimization (PO) Regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt for enriching fine-grained information under KD-Regularization. Experiments demonstrate that AlignCap presents stronger performance to other state-of-the-art methods on Zero-shot SEC task.

CVAug 25, 2025
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Yaqi Li, Peng Chen, Mingyang Han et al.

Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.

CVAug 3, 2025
Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Qiaohui Chu, Haoyu Zhang, Meng Liu et al.

Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.

CVNov 20, 2024
ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations

Xulong Zhang, Xiaoyang Qu, Haoxiang Shi et al.

This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework in generating high-quality, emotionally rich 3D animations that are better aligned with human preferences.

SDMar 13, 2024
EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech

Ziqi Liang, Haoxiang Shi, Jiawei Wang et al.

Recently, deep learning-based Text-to-Speech (TTS) systems have achieved high-quality speech synthesis results. Recurrent neural networks have become a standard modeling technique for sequential data in TTS systems and are widely used. However, training a TTS model which includes RNN components requires powerful GPU performance and takes a long time. In contrast, CNN-based sequence synthesis techniques can significantly reduce the parameters and training time of a TTS model while guaranteeing a certain performance due to their high parallelism, which alleviate these economic costs of training. In this paper, we propose a lightweight TTS system based on deep convolutional neural networks, which is a two-stage training end-to-end TTS model and does not employ any recurrent units. Our model consists of two stages: Text2Spectrum and SSRN. The former is used to encode phonemes into a coarse mel spectrogram and the latter is used to synthesize the complete spectrum from the coarse mel spectrogram. Meanwhile, we improve the robustness of our model by a series of data augmentations, such as noise suppression, time warping, frequency masking and time masking, for solving the low resource mongolian problem. Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech compared to using mainstream TTS models. Our method uses NCMMSC2022-MTTSC Challenge dataset for validation, which significantly reduces training time while maintaining a certain accuracy.

CLNov 17, 2020
Self-supervised Document Clustering Based on BERT with Data Augment

Haoxiang Shi, Cen Wang

Contrastive learning is a promising approach to unsupervised learning, as it inherits the advantages of well-studied deep models without a dedicated and complex model design. In this paper, based on bidirectional encoder representations from transformers, we propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches for short texts and those for long texts in terms of several clustering evaluation measures. FCL achieves performance close to supervised learning, and FCL with UDA further improves the performance for short texts.