Dongming Shen

AI
h-index3
5papers
19citations
Novelty55%
AI Score47

5 Papers

27.7CLMay 27
TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

Xinyu Wang, Ziyu Zhao, Ke Bai et al.

Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For \textbf{A}utomatic \textbf{S}peech \textbf{R}ecognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose \textbf{Tail-Aware Reconstruction Quantization} (\TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via \textbf{\rareBAL}, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. \TARQ\ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, \TARQ\ improves mean rare-\textbf{W}ord \textbf{E}rror \textbf{R}ate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.

91.0LGMay 9
ProactBench: Beyond What The User Asked For

Sepehr Harfi, Ahmad Salimi, Dongming Shen et al.

Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this \emph{conversational proactivity}. ProactBench decomposes it into three phase-tied types: \textsc{Emergent}, inference from a single disclosed anchor; \textsc{Critical}, synthesis across multiple anchors; and \textsc{Recovery}, grounded forward-looking value after task completion. We operationalise the benchmark with three agents: a Planner, a User Agent, and an Assistant Model. Their information asymmetries defend against style-confounded scoring, rubric leakage, external-context contamination, and information dumps. The released corpus contains 198 curated dialogues with 624 trigger points across 24 communication styles drawn from a psychometric inventory and audited by an independent LLM judge. Across 16 frontier and open-weight models, \textsc{Recovery} is both difficult and weakly predicted by six standard benchmarks, making it a useful new evaluation signal.

CLFeb 1, 2025
RPGBENCH: Evaluating Large Language Models as Role-Playing Game Engines

Pengfei Yu, Dongming Shen, Silin Meng et al.

We present RPGBench, the first benchmark designed to evaluate large language models (LLMs) as text-based role-playing game (RPG) engines. RPGBench comprises two core tasks: Game Creation (GC) and Game Simulation (GS). In GC, an LLM must craft a valid and playable RPG world using a structured event-state representation, ensuring logical coherence and proper termination conditions. In GS, the LLM simulates interactive gameplay across multiple rounds while consistently updating states and enforcing game rules. To comprehensively assess performance, RPGBench integrates objective and subjective evaluation methodologies. Objective measures verify adherence to event mechanics and check variable updates without requiring human intervention. Subjective measures, such as content interestingness, action quality, and role-playing capability, are evaluated via an LLM-as-a-judge framework, where a strong LLM grades each candidate's outputs. Empirical results demonstrate that state-of-the-art LLMs can produce engaging stories but often struggle to implement consistent, verifiable game mechanics, particularly in long or complex scenarios. By combining structured, rule-based assessments with LLM-based judgments, RPGBench provides a new standard for evaluating how well LLMs can balance creativity, coherence, and complexity in text-based RPGs, opening avenues for more immersive and controllable interactive storytelling.

75.8AIMar 26
Back to Basics: Revisiting ASR in the Age of Voice Agents

Geeyang Tay, Wentao Ma, Jaewon Lee et al.

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.

AIMay 24, 2023
Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision Processes

Krishna C. Kalagarla, Dhruva Kartik, Dongming Shen et al.

Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.