CV · Apr 20
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Zhiyuan Jiang, Weihao Hong, Xinlei Guan et al.
Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families -- text-illegibility, time-reading, and object-absence -- each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels -- patterns that aggregate metrics obscure.
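The dual-track protocol above can be illustrated with a small aggregation sketch. The record fields (`level`, `hallucinated`, `score`), the function name, and the toy data are all hypothetical; the abstract does not specify the rule-based detector or the judge prompt, so both are treated as already-computed inputs. Averaging the judge score only over hallucinated responses is one possible reading of "once it occurs".

```python
from collections import defaultdict

def summarize_by_level(records):
    """Aggregate hypothetical per-response records by prompt intensity level.

    Each record is a dict with:
      level        -- prompt intensity level (1..5)
      hallucinated -- bool from some rule-based detector (not shown here)
      score        -- judge score on a 1-5 scale, or None if nothing was fabricated
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    scores = defaultdict(list)
    for r in records:
        lvl = r["level"]
        totals[lvl] += 1
        if r["hallucinated"]:
            hits[lvl] += 1
            if r["score"] is not None:
                scores[lvl].append(r["score"])
    summary = {}
    for lvl in sorted(totals):
        # H-Rate: share of responses that commit to an unsupported answer.
        h_rate = hits[lvl] / totals[lvl]
        # Mean H-Score over hallucinated responses only (assumption).
        h_score = sum(scores[lvl]) / len(scores[lvl]) if scores[lvl] else None
        summary[lvl] = {"h_rate": h_rate, "h_score": h_score}
    return summary

# Made-up toy records, for illustration only.
records = [
    {"level": 1, "hallucinated": False, "score": None},
    {"level": 1, "hallucinated": True, "score": 2},
    {"level": 3, "hallucinated": True, "score": 5},
    {"level": 3, "hallucinated": True, "score": 3},
]
summary = summarize_by_level(records)
```

Keeping the two quantities separate per level is what lets incidence and intensity be compared independently, rather than folded into one aggregate number.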
RO · Sep 26, 2025 · Code
WoW: Towards a World omniscient World model Through Embodied Interaction
Xiaowei Chi, Peidong Jia, Chun-Kai Fan et al.
Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
RO · Jan 7
Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test
Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju et al.
As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Building upon 609 robot manipulation data samples, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to $\approx$ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
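As a reminder of what the reported Pearson agreement measures, the sketch below computes the standard Pearson coefficient between an overall benchmark score and a human preference rating. The variable names and the numbers are made up for illustration and are not taken from the benchmark.

```python
import math

def pearson(xs, ys):
    """Standard Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model values, for illustration only.
overall_scores = [41.0, 55.0, 62.0, 70.0]
human_preference = [0.30, 0.45, 0.58, 0.66]
r = pearson(overall_scores, human_preference)
```

A value of `r` close to 1 would mean that models ranked highly by the automatic score are also the ones human raters prefer.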
DB · Mar 23
FuzzySQL: Uncovering Hidden Vulnerabilities in DBMS Special Features with LLM-Driven Fuzzing
Yongxin Chen, Zhiyuan Jiang, Chao Zhang et al.
Traditional database fuzzing techniques primarily focus on syntactic correctness and general SQL structures, leaving critical yet obscure DBMS features, such as system-level modes (e.g., GTID), programmatic constructs (e.g., PROCEDURE), and advanced process commands (e.g., KILL), largely underexplored. Although rarely triggered by typical inputs, these features can lead to severe crashes or security issues when executed under edge-case conditions. In this paper, we present FuzzySQL, a novel LLM-powered adaptive fuzzing framework designed to uncover subtle vulnerabilities in DBMS special features. FuzzySQL combines grammar-guided SQL generation with logic-shifting progressive mutation, a novel technique that explores alternative control paths by negating conditions and restructuring execution logic, synthesizing structurally and semantically diverse test cases. To further ensure deeper execution coverage of the back end, FuzzySQL employs a hybrid error repair pipeline that unifies rule-based patching with LLM-driven semantic repair, enabling automatic correction of syntactic and context-sensitive failures. We evaluate FuzzySQL across multiple DBMSs, including MySQL, MariaDB, SQLite, PostgreSQL, and ClickHouse, uncovering 64 vulnerabilities, 27 of which are tied to under-tested DBMS special features. As of this writing, 60 cases have been confirmed with 9 assigned CVE identifiers, 31 already fixed by vendors, and additional vulnerabilities scheduled to be patched in upcoming releases. Our results highlight the limitations of conventional fuzzers in semantic feature coverage and demonstrate the potential of LLM-based fuzzing to discover deeply hidden bugs in complex database systems.
SP · May 7
TGPP: Trajectory-Guided Plug-and-Play Priors for Sparse Radio Map Reconstruction
Jiawen Zhang, Zhiyuan Jiang, Sheng Zhou et al.
Radio map (RM) reconstruction is essential for environment-aware wireless networks, but practical measurements are often collected along mobility trajectories rather than randomly scattered over the target region. Such trajectory-sampled observations induce spatially heterogeneous uncertainty: near-trajectory regions are directly constrained, whereas distant or occluded regions remain weakly observed, leading to degraded reconstruction accuracy in under-constrained areas. To address this problem, we propose Trajectory-Guided Plug-and-Play Priors (TGPP), a general guidance module for sparse RM reconstruction. TGPP learns an explicit guidance map as an interpretable input-space risk prior, and an implicit guide feature that is projected and fused with backbone hidden representations. TGPP can be attached to different reconstruction backbones without changing their original task formulation. We further introduce RadioFlow-LDM, a latent flow-based generative backbone, and apply TGPP to deterministic, adversarial, graph-based, and latent generative reconstruction models. Experiments on RadioMapSeer with five trajectory sampling rates show that trajectory-sampled reconstruction differs substantially from random sparse interpolation. TGPP improves most reconstruction metrics across backbones, achieving up to 43.1% NMSE reduction relative to the corresponding base backbone without trajectory-guided priors.
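For readers unfamiliar with the metric, the sketch below shows one common definition of normalized mean squared error (squared error divided by the energy of the ground truth) and how a relative reduction between two backbones would be computed. The exact normalization used in the paper is not stated in the abstract, so this definition, the function names, and the toy values are assumptions.

```python
def nmse(pred, target):
    """NMSE under a common convention: sum of squared errors / sum of squared targets.

    `pred` and `target` are flat sequences of radio map values (hypothetical layout).
    """
    num = sum((p - t) ** 2 for p, t in zip(pred, target))
    den = sum(t ** 2 for t in target)
    return num / den

def relative_reduction(base, improved):
    """Fractional reduction of `improved` relative to `base` (0.431 would mean 43.1%)."""
    return (base - improved) / base

# Made-up toy values, for illustration only.
target = [1.0, 2.0, 2.0]
base_pred = [1.0, 2.0, 5.0]     # backbone without guidance (hypothetical)
guided_pred = [1.0, 2.0, 3.5]   # backbone with guidance (hypothetical)

base_nmse = nmse(base_pred, target)
guided_nmse = nmse(guided_pred, target)
reduction = relative_reduction(base_nmse, guided_nmse)
```

Because the reduction is relative to each backbone's own baseline, the same percentage can correspond to very different absolute NMSE changes across backbones.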
CV · Dec 5, 2025
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
Zhiyuan Jiang, Shenghao Xie, Wenyi Li et al.
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
IT · Feb 13, 2020
Deep Reinforcement Learning-Based Beam Tracking for Low-Latency Services in Vehicular Networks
Yan Liu, Zhiyuan Jiang, Shunqing Zhang et al.
Ultra-Reliable and Low-Latency Communications (URLLC) services in vehicular networks on millimeter-wave bands present a significant challenge, considering the necessity of constantly adjusting the beam directions. Conventional methods are mostly based on classical control theory, e.g., the Kalman filter and its variations, which mainly deal with stationary scenarios. Therefore, severe application limitations exist, especially with complicated, dynamic Vehicle-to-Everything (V2X) channels. This paper gives a thorough study of this subject, by first modifying the classical approaches, e.g., Extended Kalman Filter (EKF) and Particle Filter (PF), for non-stationary scenarios, and then proposing a Reinforcement Learning (RL)-based approach that can achieve the URLLC requirements in a typical intersection scenario. Simulation results based on a commercial ray-tracing simulator show that the enhanced EKF and PF methods incur packet delays of more than $10$ ms, whereas the proposed deep RL-based method can reduce the latency to about $6$ ms by extracting context information from the training data.
IT · Mar 6, 2019
Distributed Policy Learning Based Random Access for Diversified QoS Requirements
Zhiyuan Jiang, Sheng Zhou, Zhisheng Niu
Future wireless access networks need to support diversified quality of service (QoS) metrics required by various types of Internet-of-Things (IoT) devices, e.g., age of information (AoI) for status generating sources and ultra low latency for safety information in vehicular networks. In this paper, a novel inner-state driven random access (ISDA) framework is proposed based on distributed policy learning, in particular a cross-entropy method. Conventional random access schemes, e.g., $p$-CSMA, assume state-less terminals and thus assign equal priorities to all. In ISDA, the inner-states of terminals are described by a time-varying state vector, and the transmission probabilities of terminals in the contention period are determined by their respective inner-states. Neural networks are leveraged to approximate the function mappings from inner-states to transmission probabilities, and an iterative approach is adopted to improve these mappings in a distributed manner. Experimental results show that ISDA can improve the QoS of heterogeneous terminals simultaneously compared to conventional CSMA schemes.
IT · Dec 4, 2018
A Two-Step Learning and Interpolation Method for Location-Based Channel Database
Ruichen Deng, Zhiyuan Jiang, Sheng Zhou et al.
Timely and accurate knowledge of channel state information (CSI) is necessary to support scheduling operations at both physical and network layers. In order to support pilot-free channel estimation in cell sleeping scenarios, we propose to adopt a channel database that stores the CSI as a function of geographic locations. Such a channel database is generated from historical user records, which usually cannot cover all the locations in the cell. Therefore, we develop a two-step interpolation method to infer the channels at the uncovered locations. The method first applies the K-nearest-neighbor method to form a coarse database and then refines it with a deep convolutional neural network. When applied to the channel data generated by ray tracing software, our method shows a great advantage in performance over the conventional interpolation methods.
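The first (coarse) step of the two-step method can be sketched as a plain K-nearest-neighbor fill. This is only an illustration under stated assumptions: Euclidean distance on 2-D locations, an unweighted mean of the neighbors, a scalar channel value per location, and hypothetical names and data. The CNN refinement step is not shown.

```python
import math

def knn_interpolate(known, query, k=3):
    """Estimate the value at `query` as the mean of its k nearest known samples.

    known -- list of ((x, y), value) pairs from historical records (hypothetical format)
    query -- (x, y) location with no record
    """
    nearest = sorted(known, key=lambda item: math.dist(item[0], query))[:k]
    return sum(value for _, value in nearest) / len(nearest)

# Made-up toy records, for illustration only.
known = [
    ((0.0, 0.0), 1.0),
    ((1.0, 0.0), 3.0),
    ((0.0, 1.0), 5.0),
    ((10.0, 10.0), 100.0),
]
coarse_value = knn_interpolate(known, (0.2, 0.2), k=3)
```

Running this over every uncovered grid location would yield a coarse database of the kind that a learned model could then refine.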
IT · Dec 4, 2018
Inferring Remote Channel State Information: Cramér-Rao Lower Bound and Deep Learning Implementation
Zhiyuan Jiang, Ziyan He, Sheng Chen et al.
Channel state information (CSI) is of vital importance in wireless communication systems. Existing CSI acquisition methods usually rely on pilot transmissions, and geographically separated base stations (BSs) with non-correlated CSI need to be assigned orthogonal pilots, which occupy excessive system resources. Our previous work adopts a data-driven deep learning based approach which leverages the CSI at a local BS to infer the CSI remotely; however, the relevance of CSI between separated BSs is not specified explicitly. In this paper, we exploit a model-based methodology to derive the Cramér-Rao lower bound (CRLB) of remote CSI inference given the local CSI. Although the model is simplified, the derived CRLB explicitly illustrates the relationship between the inference performance and several key system parameters, e.g., terminal distance and antenna array size. In particular, it shows that by leveraging multiple local BSs, the inference error exhibits a larger power-law decay rate (w.r.t. the number of antennas), compared with a single local BS; this explains and validates our findings in evaluating the deep-neural-network-based (DNN-based) CSI inference. We further improve on the DNN-based method by employing dropout and deeper networks, and show an inference performance of approximately $90\%$ accuracy in a realistic scenario with CSI generated by a ray-tracing simulator.
IT · Dec 4, 2018
Time-Sequence Channel Inference for Beam Alignment in Vehicular Networks
Sheng Chen, Zhiyuan Jiang, Sheng Zhou et al.
In this paper, we propose a learning-based low-overhead beam alignment method for vehicle-to-infrastructure communication in vehicular networks. The main idea is to remotely infer the optimal beam directions at a target base station in future time slots, based on the CSI of a source base station in previous time slots. The proposed scheme can reduce channel acquisition and beam training overhead by replacing pilot-aided beam training with online inference from a sequence-to-sequence neural network. Simulation results based on ray-tracing channel data show that our proposed scheme achieves an $8.86\%$ improvement over location-based beamforming schemes with a positioning error of $1$ m, and is within a $4.93\%$ performance loss compared with the genie-aided optimal beamformer.
IT · Dec 3, 2018
Exploiting Wireless Channel State Information Structures Beyond Linear Correlations: A Deep Learning Approach
Zhiyuan Jiang, Sheng Chen, Andreas F. Molisch et al.
Knowledge of the propagation channel in which a wireless system operates enables better, more efficient approaches for signal transmission. Therefore, channel state information (CSI) plays a pivotal role in system performance. The importance of CSI is in fact growing in the upcoming 5G and beyond systems, e.g., for the implementation of massive multiple-input multiple-output (MIMO). However, the acquisition of timely and accurate CSI has long been considered a major issue, and it becomes increasingly challenging due to the need to obtain CSI for many antenna elements in massive MIMO systems. To cope with this challenge, existing works mainly focus on exploiting linear structures of CSI, such as CSI correlations in the spatial domain, to achieve dimensionality reduction. In this article, we first systematically review the state of the art in CSI structure exploitation; we then go on to seek deeper structures that enable remote CSI inference, wherein a data-driven deep neural network (DNN) approach is necessary due to model inadequacy. We develop specific DNN designs suitable for CSI data. Case studies are provided to demonstrate the great potential of this direction for future performance enhancement.
IT · Mar 23, 2018
SENATE: A Permissionless Byzantine Consensus Protocol in Wireless Networks
Zhiyuan Jiang, Bhaskar Krishnamachari, Sheng Zhou et al.
Blockchain technology has achieved tremendous success in open (permissionless) decentralized consensus by employing proof-of-work (PoW) or its variants, whereby unauthorized nodes cannot gain disproportionate impact on consensus beyond their computational power. However, PoW-based systems incur high delay and low throughput, making them ineffective in dealing with real-time applications. On the other hand, Byzantine fault-tolerant (BFT) consensus algorithms with better delay and throughput performance have been employed in closed (permissioned) settings to avoid Sybil attacks. In this paper, we present Sybil-proof wirelEss Network coordinAte based byzanTine consEnsus (SENATE), which is based on the conventional BFT consensus framework yet works in open systems of wireless devices where faulty nodes may launch Sybil attacks. As in a Senate in the legislature, where the quota of senators per state (district) is a constant irrespective of the population of the state, "senators" in SENATE are selected from participating distributed nodes based on their wireless network coordinates (WNC), with a fixed number of nodes per district in the WNC space. Elected senators then participate in the subsequent consensus-reaching process and broadcast the result. Thereby, SENATE is proof against Sybil attacks, since pseudonyms of a faulty node are likely to be adjacent in the WNC space and hence fail to be elected.
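The per-district quota idea can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: a 2-D WNC space, square grid cells as districts, and choosing the lexicographically smallest node IDs within a district. The abstract does not describe how coordinates are obtained or how senators are chosen inside a district, so this is not the protocol's actual election rule.

```python
from collections import defaultdict

def elect_senators(nodes, cell_size=1.0, quota=1):
    """Pick at most `quota` nodes per grid cell of a hypothetical 2-D WNC space.

    nodes -- dict mapping node_id -> (x, y) coordinate
    """
    districts = defaultdict(list)
    for node_id, (x, y) in nodes.items():
        cell = (int(x // cell_size), int(y // cell_size))
        districts[cell].append(node_id)
    senators = []
    for cell in sorted(districts):
        # Placeholder in-district rule (assumption): smallest IDs win.
        senators.extend(sorted(districts[cell])[:quota])
    return senators

# Made-up toy network: three pseudonyms of one faulty node sit close together.
nodes = {
    "a": (0.2, 0.3),
    "b": (1.5, 0.4),
    "sybil1": (3.1, 3.1),
    "sybil2": (3.2, 3.15),
    "sybil3": (3.3, 3.2),
}
senators = elect_senators(nodes, cell_size=1.0, quota=1)
```

In this toy case the three adjacent pseudonyms share one district and so obtain a single seat, which is the intuition behind the Sybil resistance described above.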