Ziyuan Zheng

CV
h-index: 6
4 papers
4 citations
Novelty: 39%
AI Score: 38

4 Papers

55.2 IT May 22
Multi-User MIMO with Rotatable Antennas and IRS: Joint Antenna Boresight and IRS Orientation Design

Guoying Zhang, Qingqing Wu, Ziyuan Zheng et al.

In this paper, we investigate an intelligent reflecting surface (IRS)-assisted multi-user system, where the base station (BS) employs rotatable antennas (RAs) and the IRS can adjust its panel orientation. To alleviate the severe multiplicative path loss of the cascaded channel, the IRS is deployed near the BS, while the user-BS and user-IRS links remain in the far field. We formulate a sum-rate maximization problem by jointly optimizing the receive beamforming, IRS phase shifts, BS antenna boresights, and IRS panel orientation. To tackle the resulting highly coupled and non-convex problem, we first study a single-user case to reveal the structure of the dual-rotation gain, which is shown to be multiplicatively separable in the far field but coupled in the near field. For the general multi-user case, we develop an alternating optimization algorithm, where the receive beamforming is updated in closed form, the IRS phase shifts are optimized by an FP-assisted Riemannian conjugate gradient method, and the BS antenna boresights and IRS panel orientation are updated via projected gradient methods. Simulation results demonstrate the significant sum-rate gains achieved by the proposed coordinated rotation design over fixed-orientation and single-rotation benchmark schemes, and provide useful insights into near-field dual-rotation design.
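The projected gradient update mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the objective, the step size, and the names `project_to_unit_sphere`, `projected_gradient_ascent`, and `u` are assumptions for illustration, and the only constraint modeled is that a boresight is a unit-norm pointing vector (the paper's actual constraints may differ, for example a limit on the deflection angle).

```python
import numpy as np

def project_to_unit_sphere(v):
    """Project a nonzero vector onto the unit sphere (a valid pointing direction)."""
    return v / np.linalg.norm(v)

def projected_gradient_ascent(grad_fn, f0, step=0.1, iters=200):
    """Repeat: take a gradient ascent step, then project back onto the constraint set."""
    f = project_to_unit_sphere(np.asarray(f0, dtype=float))
    for _ in range(iters):
        f = project_to_unit_sphere(f + step * grad_fn(f))
    return f

# Toy objective (an assumption, not from the paper): gain(f) = (f . u)^2,
# which is largest when the boresight f points along the user direction u.
u = project_to_unit_sphere(np.array([1.0, 2.0, 2.0]))

def toy_gain_gradient(f):
    """Gradient of (f . u)^2 with respect to f."""
    return 2.0 * (f @ u) * u

f_opt = projected_gradient_ascent(toy_gain_gradient, f0=[1.0, 0.0, 0.0])
print(f_opt)  # converges toward u for this toy objective
```

In an alternating optimization loop, an update of this kind would be applied to one block of variables while the others are held fixed.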

81.9 IT May 14
Joint Transmit and Receive Antenna Orientation Design for Secure MIMO Communications

Ailing Zheng, Qingqing Wu, Xingxiang Peng et al.

Physical layer security (PLS) is a promising paradigm for safeguarding 6G wireless networks by exploiting the inherent characteristics of wireless channels. However, the efficiency of conventional PLS is often limited by fixed-orientation antennas. This paper investigates a rotatable antenna (RA)-aided secure multiple-input multiple-output (MIMO) communication system, where both the transmitter and the receiver are equipped with RAs in the presence of an eavesdropper. By dynamically optimizing the orientations of the RAs, we can proactively reshape the effective MIMO channels to enhance legitimate transmission while simultaneously suppressing information leakage to the eavesdropper. We formulate a secrecy rate maximization problem by jointly optimizing the transmit beamforming, the artificial noise (AN) covariance matrix, and the transmit/receive RA orientations, subject to the transmit power budget and antenna orientation constraints. To tackle the resulting highly coupled and non-convex problem, we first study a simplified single-input single-output (SISO) case to reveal the structure of the optimal RA orientation. For the general MIMO case, we develop an alternating optimization algorithm by reformulating the original problem through the minimum mean-square error framework. In particular, the transmit beamforming and AN covariance matrix are derived in semi-closed forms, while the RA orientations are updated via the Riemannian Frank-Wolfe method. The proposed design is further extended to the multi-receiver secure transmission scenario. Simulation results show that the proposed scheme converges rapidly and achieves significant secrecy rate gains over the conventional fixed-orientation scheme.
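For readers unfamiliar with the objective, the secrecy rate in the simplest scalar (SISO) Gaussian wiretap setting is the positive part of the difference between the legitimate and eavesdropper rates. The sketch below assumes that scalar form; the paper's MIMO formulation with artificial noise would involve matrix expressions instead, and the function name and SNR values are hypothetical.

```python
import math

def secrecy_rate(snr_legit, snr_eve):
    """Secrecy rate in bits/s/Hz: positive part of the rate difference
    between the legitimate receiver and the eavesdropper (linear SNRs)."""
    return max(0.0, math.log2(1.0 + snr_legit) - math.log2(1.0 + snr_eve))

# Legitimate link stronger than the eavesdropper link: log2(16) - log2(4)
print(secrecy_rate(15.0, 3.0))  # 2.0

# Eavesdropper link stronger: the secrecy rate is clipped at zero
print(secrecy_rate(1.0, 7.0))   # 0.0
```

Rotating the antennas changes both SNR terms at once, which is why an orientation that raises the legitimate SNR is only useful if it does not raise the eavesdropper SNR by as much.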

CV May 30, 2025
Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT

Zhuobai Dong, Junchao Yi, Ziyuan Zheng et al.

Understanding the physical world - governed by laws of motion, spatial relations, and causality - poses a fundamental challenge for multimodal large language models (MLLMs). While recent advances such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals these models struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To rigorously evaluate this capability, we introduce MVPBench, a curated benchmark that assesses visual physical reasoning through the lens of visual chain-of-thought (CoT). Each example features interleaved multi-image inputs and demands not only the correct final answer but also a coherent, step-by-step reasoning path grounded in evolving visual cues. This setup mirrors how humans reason through real-world physical processes over time. To ensure fine-grained evaluation, we introduce a graph-based CoT consistency metric that verifies whether the model's reasoning path adheres to valid physical logic. Additionally, we minimize shortcut exploitation from text priors, encouraging models to rely on visual understanding. Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains. Surprisingly, RL-based post-training alignment - commonly believed to improve visual reasoning performance - often harms spatial reasoning, suggesting a need to rethink current fine-tuning practices.
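The abstract does not define the graph-based CoT consistency metric, so the following is only a hypothetical sketch of the general idea: valid reasoning is represented as a directed graph of allowed step transitions (with more than one valid path), and a model's step sequence is checked against it. The step labels, the graph `EDGES`, and both function names are invented for illustration and are not taken from MVPBench.

```python
# Hypothetical graph of valid step transitions with two valid paths:
# release -> fall -> impact, and release -> roll -> impact.
EDGES = {
    ("release", "fall"),
    ("fall", "impact"),
    ("release", "roll"),
    ("roll", "impact"),
}

def path_is_consistent(steps, valid_edges, start, goals):
    """True if the path begins at `start`, ends at a goal step,
    and every consecutive pair of steps is a valid edge."""
    if not steps or steps[0] != start or steps[-1] not in goals:
        return False
    return all((a, b) in valid_edges for a, b in zip(steps, steps[1:]))

def consistency_score(steps, valid_edges):
    """Fraction of consecutive step pairs that are valid edges
    (defined as 1.0 when there are no pairs to check)."""
    pairs = list(zip(steps, steps[1:]))
    if not pairs:
        return 1.0
    return sum((a, b) in valid_edges for a, b in pairs) / len(pairs)

print(path_is_consistent(["release", "fall", "impact"], EDGES, "release", {"impact"}))  # True
print(path_is_consistent(["release", "impact"], EDGES, "release", {"impact"}))          # False
```

A strict check like `path_is_consistent` rejects any path with a skipped or invalid step, while a fractional score gives partial credit; which of these, if either, matches the benchmark's metric is not stated in the abstract.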

CV Jun 3, 2025
LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering

Xiaoyi Feng, Kaifeng Zou, Caichun Cen et al.

Existing optical flow datasets focus primarily on real-world simulation or synthetic human motion, but few are tailored to celluloid (cel) anime character motion, a domain with unique visual and motion characteristics. To bridge this gap and facilitate research in optical flow estimation and downstream tasks such as anime video generation and line drawing colorization, we introduce LinkTo-Anime, the first high-quality dataset specifically designed for cel anime character motion generated with 3D model rendering. LinkTo-Anime provides rich annotations including forward and backward optical flow, occlusion masks, and Mixamo Skeleton. The dataset comprises 395 video sequences, totaling 24,230 training frames, 720 validation frames, and 4,320 test frames. Furthermore, a comprehensive benchmark is constructed with various optical flow estimation methods to analyze the shortcomings and limitations across multiple datasets.
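The abstract does not say which metric the benchmark reports. Optical flow methods are commonly compared by average end-point error (EPE), so the sketch below assumes that metric; the function name, array layout (height, width, 2), and the optional validity mask are illustrative assumptions rather than details of LinkTo-Anime.

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt, valid_mask=None):
    """Average end-point error: mean Euclidean distance between predicted
    and ground-truth flow vectors, optionally restricted to valid pixels."""
    flow_pred = np.asarray(flow_pred, dtype=float)
    flow_gt = np.asarray(flow_gt, dtype=float)
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel error, shape (H, W)
    if valid_mask is not None:
        err = err[np.asarray(valid_mask, dtype=bool)]
    return float(err.mean())

# Toy 2x2 example: ground truth is zero flow, prediction is (3, 4) everywhere,
# so every pixel has error sqrt(3^2 + 4^2) = 5.
gt = np.zeros((2, 2, 2))
pred = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
print(endpoint_error(pred, gt))  # 5.0
```

A mask argument of this kind is one way occlusion annotations could be used, for example to report error separately on occluded and non-occluded pixels.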