LGFeb 9, 2023
CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement LearningSheng Yue, Guanbo Wang, Wei Shao et al.
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining subtle two-tier tradeoffs between the exploitation (on both expert and diverse data) and exploration (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right exploitation-exploration balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning.
82.1SDMar 24
The Interspeech 2026 Audio Encoder Capability Challenge for Large Audio Language ModelsHeinrich Dinkel, Jiahao Zhou, Guanbo Wang et al.
This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable understanding of complex acoustic scenes, their performance depends on the semantic richness of the underlying audio encoder representations. This challenge addresses the integration gap by providing a unified generative evaluation framework, XARES-LLM, which assesses submitted encoders across a diverse suite of downstream classification and generation tasks. By decoupling encoder development from LLM fine-tuning, the challenge establishes a standardized protocol for general-purpose audio representations that can effectively be used for the next generation of multimodal language models.
89.7CLMay 9Code
Dolphin-CN-Dialect: Where Chinese Dialects MatterYangyang Meng, Huihang Zhong, Guodong Lin et al.
We present Dolphin-CN-Dialect, a streaming-capable ASR model with a focus on Chinese and dialect-rich scenarios. Compared to the previous version, Dolphin-CN-Dialect introduces substantial improvements in data processing, tokenization, training stability, and data sampling strategies. To address the challenges of highly imbalanced dialect data, we propose a temperature-based sampling strategy that effectively balances standard Mandarin and low-resource dialects, leading to significant gains in dialect recognition performance. In addition, we redesign the tokenizer to better align with linguistic characteristics, adopting character-level modeling for Chinese and subword modeling for English, while introducing extensible dialect tokens. Experimental results show that Dolphin-CN-Dialect achieves improvement in dialect recognition accuracy and CER reduction compared to Dolphin. Furthermore, Dolphin-CN-Dialect reaches competitive performance with recent SOTA open-source ASR models, while maintaining a significantly smaller model size. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling a practical balance between latency and accuracy. It also provides flexible customization through hotword support and efficient deployment optimized for specialized hardware. These improvements make Dolphin-CN-Dialect a strong and practical solution for real-world multi-dialect ASR applications.
36.7CVMar 21Code
SupScene: Scene-Structured Overlap Supervision for Image Retrieval in Unconstrained SfMXulei Shi, Maoyu Wang, Yuning Peng et al.
Image retrieval is a critical step for reducing the quadratic cost of image matching in unconstrained Structure-from-Motion (SfM). Unlike generic image retrieval, however, the relevant goal of SfM is to identify geometrically matchable image pairs rather than merely semantically similar images. Prevailing methods are largely trained under anchor-centric tuple guidance, which organizes the training around isolated tuples and under-utilizes the dense, graded overlap structure naturally established within a SfM scene. In this work, we present SupScene, a scene-structured training framework that samples connected local subgraphs from SfM overlap graphs and jointly supervises all valid within-subgraph pairwise relations. To explicitly align the trained descriptor with geometric co-visibility, we further introduce an overlap-ordered objective that combines multi-similarity optimization with a continuous relative-overlap ranking term. In addition, the proposed framework is instantiated with a lightweight Structural Context Probe Pooling (SCPP) head that aggregates complementary structural responses into a compact global descriptor. Extensive experimental results on multiple benchmarks demonstrate that our method can significantly improve overall retrieval performance and enhance the completeness of downstream SfM reconstructions. Code and models are available at https://github.com/Suxilan/SupScene.
AIMay 28, 2025Code
Reinforced Reasoning for Embodied PlanningDi Wu, Jiaxin Fan, Junzhe Zang et al.
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Generalized Reinforced Preference Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments. This work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.
CLMar 26, 2025Code
Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern LanguagesYangyang Meng, Jinpeng Li, Guodong Lin et al.
This report introduces Dolphin, a large-scale multilingual automatic speech recognition (ASR) model that extends the Whisper architecture to support a wider range of languages. Our approach integrates in-house proprietary and open-source datasets to refine and optimize Dolphin's performance. The model is specifically designed to achieve notable recognition accuracy for 40 Eastern languages across East Asia, South Asia, Southeast Asia, and the Middle East, while also supporting 22 Chinese dialects. Experimental evaluations show that Dolphin significantly outperforms current state-of-the-art open-source models across various languages. To promote reproducibility and community-driven innovation, we are making our trained models and inference source code publicly available.
LGOct 12, 2025Code
Gains: Fine-grained Federated Domain Adaptation in Open SetZhengyi Zhong, Wenzheng Jiang, Weidong Bao et al.
Conventional federated learning (FL) assumes a closed world with a fixed total number of clients. In contrast, new clients continuously join the FL process in real-world scenarios, introducing new knowledge. This raises two critical demands: detecting new knowledge, i.e., knowledge discovery, and integrating it into the global model, i.e., knowledge adaptation. Existing research focuses on coarse-grained knowledge discovery, and often sacrifices source domain performance and adaptation efficiency. To this end, we propose a fine-grained federated domain adaptation approach in open set (Gains). Gains splits the model into an encoder and a classifier, empirically revealing features extracted by the encoder are sensitive to domain shifts while classifier parameters are sensitive to class increments. Based on this, we develop fine-grained knowledge discovery and contribution-driven aggregation techniques to identify and incorporate new knowledge. Additionally, an anti-forgetting mechanism is designed to preserve source domain performance, ensuring balanced adaptation. Experimental results on multi-domain datasets across three typical data-shift scenarios demonstrate that Gains significantly outperforms other baselines in performance for both source-domain and target-domain clients. Code is available at: https://github.com/Zhong-Zhengyi/Gains.
LGMay 8, 2025
ConCISE: Confidence-guided Compression in Step-by-step Efficient ReasoningZiqing Qiao, Yongheng Deng, Jiali Zeng et al.
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs, increasing computational overhead. Existing fine-tuning-based compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to remove redundant content thoroughly. To address these limitations, this work begins by framing two key patterns of redundant reflection in LRMs--Confidence Deficit, wherein the model reflects on correct intermediate steps, and Termination Delay, where reflection continues after a verified, confident answer--through a confidence-guided perspective. Based on this, we introduce ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that compared to baseline methods, fine-tuning LRMs on ConCISE-generated data yields a better balance between compression and task performance, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy.
LGFeb 6, 2025
Efficient Randomized Experiments Using Foundation ModelsPiersilvio De Bartolomeis, Javier Abad, Guanbo Wang et al. · eth-zurich
Randomized experiments are the preferred approach for evaluating the effects of interventions, but they are costly and often yield estimates with substantial uncertainty. On the other hand, in silico experiments leveraging foundation models offer a cost-effective alternative that can potentially attain higher statistical precision. However, the benefits of in silico experiments come with a significant risk: statistical inferences are not valid if the models fail to accurately predict experimental responses to interventions. In this paper, we propose a novel approach that integrates the predictions from multiple foundation models with experimental data while preserving valid statistical inference. Our estimator is consistent and asymptotically normal, with asymptotic variance no larger than the standard estimator based on experimental data alone. Importantly, these statistical properties hold even when model predictions are arbitrarily biased. Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to match the same precision as the standard estimator based on experimental data alone.
SDJun 13, 2021
GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed AudioGuoguo Chen, Shuzhou Chai, Guanbo Wang et al.
This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.