Shiyao Zhang

LG
h-index10
15papers
297citations
Novelty49%
AI Score54

15 Papers

ROMay 27
World Models for Robotic Manipulation: A Survey

Fangyuan Wang, Ziyuan Wang, Guorui Pei et al.

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

LGApr 23, 2023
DiffTraj: Generating GPS Trajectory with Diffusion Probabilistic Model

Yuanshao Zhu, Yongchao Ye, Shiyao Zhang et al.

Pervasive integration of GPS-enabled devices and data acquisition technologies has led to an exponential increase in GPS trajectory data, fostering advancements in spatial-temporal data mining research. Nonetheless, GPS trajectories contain personal geolocation information, rendering serious privacy concerns when working with raw data. A promising approach to address this issue is trajectory generation, which involves replacing original data with generated, privacy-free alternatives. Despite the potential of trajectory generation, the complex nature of human behavior and its inherent stochastic characteristics pose challenges in generating high-quality trajectories. In this work, we propose a spatial-temporal diffusion probabilistic model for trajectory generation (DiffTraj). This model effectively combines the generative abilities of diffusion models with the spatial-temporal features derived from real trajectories. The core idea is to reconstruct and synthesize geographic trajectories from white noise through a reverse trajectory denoising process. Furthermore, we propose a Trajectory UNet (Traj-UNet) deep neural network to embed conditional information and accurately estimate noise levels during the reverse process. Experiments on two real-world datasets show that DiffTraj can be intuitively applied to generate high-fidelity trajectories while retaining the original distributions. Moreover, the generated results can support downstream trajectory analysis tasks and significantly outperform other methods in terms of geo-distribution evaluations.

LGMar 13, 2023
Traffic Prediction with Transfer Learning: A Mutual Information-based Approach

Yunjie Huang, Xiaozhuang Song, Yuanshao Zhu et al.

In modern traffic management, one of the most essential yet challenging tasks is accurately and timely predicting traffic. It has been well investigated and examined that deep learning-based Spatio-temporal models have an edge when exploiting Spatio-temporal relationships in traffic data. Typically, data-driven models require vast volumes of data, but gathering data in small cities can be difficult owing to constraints such as equipment deployment and maintenance costs. To resolve this problem, we propose TrafficTL, a cross-city traffic prediction approach that uses big data from other cities to aid data-scarce cities in traffic prediction. Utilizing a periodicity-based transfer paradigm, it identifies data similarity and reduces negative transfer caused by the disparity between two data distributions from distant cities. In addition, the suggested method employs graph reconstruction techniques to rectify defects in data from small data cities. TrafficTL is evaluated by comprehensive case studies on three real-world datasets and outperforms the state-of-the-art baseline by around 8 to 25 percent.

CPAug 13, 2024
Harnessing Earnings Reports for Stock Predictions: A QLoRA-Enhanced LLM Approach

Haowei Ni, Shuchen Meng, Xupeng Chen et al.

Accurate stock market predictions following earnings reports are crucial for investors. Traditional methods, particularly classical machine learning models, struggle with these predictions because they cannot effectively process and interpret extensive textual data contained in earnings reports and often overlook nuances that influence market movements. This paper introduces an advanced approach by employing Large Language Models (LLMs) instruction fine-tuned with a novel combination of instruction-based techniques and quantized low-rank adaptation (QLoRA) compression. Our methodology integrates 'base factors', such as financial metric growth and earnings transcripts, with 'external factors', including recent market indices performances and analyst grades, to create a rich, supervised dataset. This comprehensive dataset enables our models to achieve superior predictive performance in terms of accuracy, weighted F1, and Matthews correlation coefficient (MCC), especially evident in the comparison with benchmarks such as GPT-4. We specifically highlight the efficacy of the llama-3-8b-Instruct-4bit model, which showcases significant improvements over baseline models. The paper also discusses the potential of expanding the output capabilities to include a 'Hold' option and extending the prediction horizon, aiming to accommodate various investment styles and time frames. This study not only demonstrates the power of integrating cutting-edge AI with fine-tuned financial data but also paves the way for future research in enhancing AI-driven financial analysis tools.

CVAug 8, 2024
Evaluating Modern Approaches in 3D Scene Reconstruction: NeRF vs Gaussian-Based Methods

Yiming Zhou, Zixuan Zeng, Andi Chen et al.

Exploring the capabilities of Neural Radiance Fields (NeRF) and Gaussian-based methods in the context of 3D scene reconstruction, this study contrasts these modern approaches with traditional Simultaneous Localization and Mapping (SLAM) systems. Utilizing datasets such as Replica and ScanNet, we assess performance based on tracking accuracy, mapping fidelity, and view synthesis. Findings reveal that NeRF excels in view synthesis, offering unique capabilities in generating new perspectives from existing data, albeit at slower processing speeds. Conversely, Gaussian-based methods provide rapid processing and significant expressiveness but lack comprehensive scene completion. Enhanced by global optimization and loop closure techniques, newer methods like NICE-SLAM and SplaTAM not only surpass older frameworks such as ORB-SLAM2 in terms of robustness but also demonstrate superior performance in dynamic and complex environments. This comparative analysis bridges theoretical research with practical implications, shedding light on future developments in robust 3D scene reconstruction across various real-world applications.

AIJul 12, 2024
FedVAE: Trajectory privacy preserving based on Federated Variational AutoEncoder

Yuchen Jiang, Ying Wu, Shiyao Zhang et al.

The use of trajectory data with abundant spatial-temporal information is pivotal in Intelligent Transport Systems (ITS) and various traffic system tasks. Location-Based Services (LBS) capitalize on this trajectory data to offer users personalized services tailored to their location information. However, this trajectory data contains sensitive information about users' movement patterns and habits, necessitating confidentiality and protection from unknown collectors. To address this challenge, privacy-preserving methods like K-anonymity and Differential Privacy have been proposed to safeguard private information in the dataset. Despite their effectiveness, these methods can impact the original features by introducing perturbations or generating unrealistic trajectory data, leading to suboptimal performance in downstream tasks. To overcome these limitations, we propose a Federated Variational AutoEncoder (FedVAE) approach, which effectively generates a new trajectory dataset while preserving the confidentiality of private information and retaining the structure of the original features. In addition, FedVAE leverages Variational AutoEncoder (VAE) to maintain the original feature space and generate new trajectory data, and incorporates Federated Learning (FL) during the training stage, ensuring that users' data remains locally stored to protect their personal information. The results demonstrate its superior performance compared to other existing methods, affirming FedVAE as a promising solution for enhancing data privacy and utility in location-based applications.

LGAug 28, 2023
Meta Attentive Graph Convolutional Recurrent Network for Traffic Forecasting

Adnan Zeb, Yongchao Ye, Shiyao Zhang et al.

Traffic forecasting is a fundamental problem in intelligent transportation systems. Existing traffic predictors are limited by their expressive power to model the complex spatial-temporal dependencies in traffic data, mainly due to the following limitations. Firstly, most approaches are primarily designed to model the local shared patterns, which makes them insufficient to capture the specific patterns associated with each node globally. Hence, they fail to learn each node's unique properties and diversified patterns. Secondly, most existing approaches struggle to accurately model both short- and long-term dependencies simultaneously. In this paper, we propose a novel traffic predictor, named Meta Attentive Graph Convolutional Recurrent Network (MAGCRN). MAGCRN utilizes a Graph Convolutional Recurrent Network (GCRN) as a core module to model local dependencies and improves its operation with two novel modules: 1) a Node-Specific Meta Pattern Learning (NMPL) module to capture node-specific patterns globally and 2) a Node Attention Weight Generation Module (NAWG) module to capture short- and long-term dependencies by connecting the node-specific features with the ones learned initially at each time step during GCRN operation. Experiments on six real-world traffic datasets demonstrate that NMPL and NAWG together enable MAGCRN to outperform state-of-the-art baselines on both short- and long-term predictions.

ROApr 22
Unveiling Uncertainty-Aware Autonomous Cooperative Learning Based Planning Strategy

Shiyao Zhang, Liwei Deng, Shuyu Zhang et al.

In future intelligent transportation systems, autonomous cooperative planning (ACP), becomes a promising technique to increase the effectiveness and security of multi-vehicle interactions. However, multiple uncertainties cannot be fully addressed for existing ACP strategies, e.g. perception, planning, and communication uncertainties. To address these, a novel deep reinforcement learning-based autonomous cooperative planning (DRLACP) framework is proposed to tackle various uncertainties on cooperative motion planning schemes. Specifically, the soft actor-critic (SAC) with the implementation of gate recurrent units (GRUs) is adopted to learn the deterministic optimal time-varying actions with imperfect state information occurred by planning, communication, and perception uncertainties. In addition, the real-time actions of autonomous vehicles (AVs) are demonstrated via the Car Learning to Act (CARLA) simulation platform. Evaluation results show that the proposed DRLACP learns and performs cooperative planning effectively, which outperforms other baseline methods under different scenarios with imperfect AV state information.

CVNov 18, 2023
Mesh Watermark Removal Attack and Mitigation: A Novel Perspective of Function Space

Xingyu Zhu, Guanhui Ye, Chengdong Dong et al.

Mesh watermark embeds secret messages in 3D meshes and decodes the message from watermarked meshes for ownership verification. Current watermarking methods directly hide secret messages in vertex and face sets of meshes. However, mesh is a discrete representation that uses vertex and face sets to describe a continuous signal, which can be discretized in other discrete representations with different vertex and face sets. This raises the question of whether the watermark can still be verified on the different discrete representations of the watermarked mesh. We conduct this research in an attack-then-defense manner by proposing a novel function space mesh watermark removal attack FuncEvade and then mitigating it through function space mesh watermarking FuncMark. In detail, FuncEvade generates a different discrete representation of a watermarked mesh by extracting it from the signed distance function of the watermarked mesh. We observe that the generated mesh can evade ALL previous watermarking methods. FuncMark mitigates FuncEvade by watermarking signed distance function through message-guided deformation. Such deformation can survive isosurfacing and thus be inherited by the extracted meshes for further watermark decoding. Extensive experiments demonstrate that FuncEvade achieves 100% evasion rate among all previous watermarking methods while achieving only 0.3% evasion rate on FuncMark. Besides, our FuncMark performs similarly on other metrics compared to state-of-the-art mesh watermarking methods.

SYApr 23
A Convexified Eulerian Framework for Scalable Coordination of Massive DER Populations

Ge Chen, Yiwei Qiu, Shiyao Zhang et al.

This paper proposes a scalable coordination framework with aggregator-side privacy protection for storage-like distributed energy resources (DERs). The framework adopts a two-layer architecture. At the macroscopic layer, building upon an \emph{Eulerian} modeling perspective, the DER population is represented as a continuum whose density evolution is governed by a partial differential equation (PDE), such that the computational complexity is independent of the population size. To address the bilinear non-convexity in this PDE-constrained optimization problem, we develop a convexification method that combines finite-volume discretization with a flux-lifting technique, reformulating the macroscopic problem into a sparse linear program (LP). The LP solution yields a unified, state-dependent broadcast signal for population coordination. Furthermore, a Wasserstein-based relaxation is introduced to replace rigid cyclic constraints and provide additional operational flexibility for improved economic performance. At the microscopic layer, individual resources autonomously recover local setpoints from the broadcast signal and their local states, while an upstream data-mixing protocol aggregates individual states into a macroscopic density histogram without exposing raw individual states to the aggregator. Numerical studies validate the scalability, feasibility, and economic effectiveness of the proposed framework.

AIAug 17, 2024
Siamese Multiple Attention Temporal Convolution Networks for Human Mobility Signature Identification

Zhipeng Zheng, Yuchen Jiang, Shiyao Zhang et al.

The Human Mobility Signature Identification (HuMID) problem stands as a fundamental task within the realm of driving style representation, dedicated to discerning latent driving behaviors and preferences from diverse driver trajectories for driver identification. Its solutions hold significant implications across various domains (e.g., ride-hailing, insurance), wherein their application serves to safeguard users and mitigate potential fraudulent activities. Present HuMID solutions often exhibit limitations in adaptability when confronted with lengthy trajectories, consequently incurring substantial computational overhead. Furthermore, their inability to effectively extract crucial local information further impedes their performance. To address this problem, we propose a Siamese Multiple Attention Temporal Convolutional Network (Siamese MA-TCN) to capitalize on the strengths of both TCN architecture and multi-head self-attention, enabling the proficient extraction of both local and long-term dependencies. Additionally, we devise a novel attention mechanism tailored for the efficient aggregation of multi-scale representations derived from our model. Experimental evaluations conducted on two real-world taxi trajectory datasets reveal that our proposed model effectively extracts both local key information and long-term dependencies. These findings highlight the model's outstanding generalization capabilities, demonstrating its robustness and adaptability across datasets of varying sizes.

CRMay 14, 2024
Achieving Resolution-Agnostic DNN-based Image Watermarking: A Novel Perspective of Implicit Neural Representation

Yuchen Wang, Xingyu Zhu, Guanhui Ye et al.

DNN-based watermarking methods are rapidly developing and delivering impressive performances. Recent advances achieve resolution-agnostic image watermarking by reducing the variant resolution watermarking problem to a fixed resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework by watermarking the implicit neural representation (INR) of image. Unlike previous methods, our method does not rely on the previous reduction process by directly watermarking the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with variant resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled images with arbitrary resolutions. By directly watermarking INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements: averagely improved bit accuracy by 7%$\sim$29%. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours are robust against all watermarking attacks.

LGFeb 21
RadioGen3D: 3D Radio Map Generation via Adversarial Learning on Large-Scale Synthetic Data

Junshen Chen, Angzi Xu, Zezhong Zhang et al.

Radio maps are essential for efficient radio resource management in future 6G and low-altitude networks. While deep learning (DL) techniques have emerged as an efficient alternative to conventional ray-tracing for radio map estimation (RME), most existing DL approaches are confined to 2D near-ground scenarios. They often fail to capture essential 3D signal propagation characteristics and antenna polarization effects, primarily due to the scarcity of 3D data and training challenges. To address these limitations, we present the RadioGen3D framework. First, we propose an efficient data synthesis method to generate high-quality 3D radio map data. By establishing a parametric target model that captures 2D ray-tracing and 3D channel fading characteristics, we derive realistic coefficient combinations from minimal real measurements, enabling the construction of a large-scale synthetic dataset, Radio3DMix. Utilizing this dataset, we propose a 3D model training scheme based on a conditional generative adversarial network (cGAN), yielding a 3D U-Net capable of accurate RME under diverse input feature combinations. Experimental results demonstrate that RadioGen3D surpasses all baselines in both estimation accuracy and speed. Furthermore, fine-tuning experiments verify its strong generalization capability via successful knowledge transfer.

CYAug 17, 2025
Disentangling the Drivers of LLM Social Conformity: An Uncertainty-Moderated Dual-Process Mechanism

Huixin Zhong, Yanan Liu, Qi Cao et al.

As large language models (LLMs) integrate into collaborative teams, their social conformity -- the tendency to align with majority opinions -- has emerged as a key concern. In humans, conformity arises from informational influence (rational use of group cues for accuracy) or normative influence (social pressure for approval), with uncertainty moderating this balance by shifting from purely analytical to heuristic processing. It remains unclear whether these human psychological mechanisms apply to LLMs. This study adapts the information cascade paradigm from behavioral economics to quantitatively disentangle the two drivers to investigate the moderate effect. We evaluated nine leading LLMs across three decision-making scenarios (medical, legal, investment), manipulating information uncertainty (q = 0.667, 0.55, and 0.70, respectively). Our results indicate that informational influence underpins the models' behavior across all contexts, with accuracy and confidence consistently rising with stronger evidence. However, this foundational mechanism is dramatically modulated by uncertainty. In low-to-medium uncertainty scenarios, this informational process is expressed as a conservative strategy, where LLMs systematically underweight all evidence sources. In contrast, high uncertainty triggers a critical shift: while still processing information, the models additionally exhibit a normative-like amplification, causing them to overweight public signals (beta > 1.55 vs. private beta = 0.81).

LGJun 18, 2024
Time Series Modeling for Heart Rate Prediction: From ARIMA to Transformers

Haowei Ni, Shuchen Meng, Xieming Geng et al.

Cardiovascular disease (CVD) is a leading cause of death globally, necessitating precise forecasting models for monitoring vital signs like heart rate, blood pressure, and ECG. Traditional models, such as ARIMA and Prophet, are limited by their need for manual parameter tuning and challenges in handling noisy, sparse, and highly variable medical data. This study investigates advanced deep learning models, including LSTM, and transformer-based architectures, for predicting heart rate time series from the MIT-BIH Database. Results demonstrate that deep learning models, particularly PatchTST, significantly outperform traditional models across multiple metrics, capturing complex patterns and dependencies more effectively. This research underscores the potential of deep learning to enhance patient monitoring and CVD management, suggesting substantial clinical benefits. Future work should extend these findings to larger, more diverse datasets and real-world clinical applications to further validate and optimize model performance.