Jie Gu

RO
h-index: 11
Papers: 19
Citations: 329
Novelty: 56%
AI Score: 57

19 Papers

61.0 · RO · May 27
SANTS: A State-Adaptive Scheduler for World Action Models

Yirui Sun, Guangyu Zhuge, Keliang Liu et al.

World Action Models (WAMs) improve robot manipulation by using video-based future representations to condition action generation. In pixel-space WAMs, however, the best action condition is not necessarily the fully denoised video. Controlled denoising-depth scans show that video refinement can reduce action error up to a state-dependent point, after which the gain may saturate or even reverse when late predictions become less action-relevant or physically unreliable. This suggests that action generation should use a state-dependent point along the video noise trajectory rather than a fixed terminal denoising depth. We introduce State-Adaptive Noise Trajectory Scheduler (SANTS), a lightweight scheduler for video-to-action diffusion policies. At each video decision point, SANTS reads the current video-state representation and noise level, then jointly predicts a cumulative stopping hazard and a relative noise-progression ratio. SANTS is post-trained with a path-level reward computed after the frozen action branch generates the final action chunk, so the scheduler is optimized for downstream action quality rather than intermediate video fidelity, while redundant video-state updates are explicitly penalized. Experiments show that SANTS reaches 94.4% overall success on RoboTwin 2.0 and 73.1% average success across seven real-robot tasks, while reducing latency by 81.7% and 79.0% relative to full video denoising, respectively. These results indicate that adaptive selection along the video noise trajectory can preserve the control benefits of WAM-style future reasoning while removing much of its redundant inference cost.

SY · Jun 6, 2022
Continuous and Distribution-free Probabilistic Wind Power Forecasting: A Conditional Normalizing Flow Approach

Honglin Wen, Pierre Pinson, Jinghuan Ma et al.

We present a data-driven approach for probabilistic wind power forecasting based on conditional normalizing flow (CNF). In contrast with existing methods, this approach is distribution-free (as are non-parametric and quantile-based approaches) and can directly yield continuous probability densities, hence avoiding quantile crossing. It relies on a base distribution and a set of bijective mappings. Both the shape parameters of the base distribution and the bijective mappings are approximated with neural networks. Spline-based conditional normalizing flow is considered owing to its non-affine characteristics. Over the training phase, the model sequentially maps input examples onto samples of the base distribution, given the conditional contexts, where parameters are estimated through maximum likelihood. To issue probabilistic forecasts, one eventually maps samples of the base distribution into samples of a desired distribution. Case studies based on open datasets validate the effectiveness of the proposed model, and allow us to discuss its advantages and caveats with respect to the state of the art.
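
The change-of-variables idea behind a conditional normalizing flow can be illustrated with a minimal sketch. It assumes a single conditional affine map and a standard normal base distribution, which is deliberately simpler than the spline-based flow the abstract describes. The names `shift_fn` and `log_scale_fn` are hypothetical stand-ins for the neural networks that would produce the flow parameters from the context.

```python
import math
import random


def conditional_log_density(y, context, shift_fn, log_scale_fn):
    """Return log p(y | context) for the map z = (y - shift) / scale.

    The base distribution is a standard normal (an assumption of this sketch).
    """
    shift = shift_fn(context)
    log_scale = log_scale_fn(context)
    # Training direction: map the observation onto the base distribution.
    z = (y - shift) * math.exp(-log_scale)
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
    # Change of variables: log|dz/dy| = -log_scale.
    return log_base - log_scale


def conditional_sample(context, shift_fn, log_scale_fn, rng):
    """Forecasting direction: map a base sample into the target space."""
    z = rng.gauss(0.0, 1.0)
    return shift_fn(context) + math.exp(log_scale_fn(context)) * z


# Hypothetical parameter functions; a real model would learn these.
def example_shift(context):
    return 2.0 * context


def example_log_scale(context):
    return 0.0


log_p = conditional_log_density(2.0, 1.0, example_shift, example_log_scale)
draw = conditional_sample(1.0, example_shift, example_log_scale, random.Random(0))
```

Maximum likelihood training would sum `conditional_log_density` over the training pairs and maximize it with respect to the parameters inside the two functions.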

32.8 · RO · Mar 19
RhoMorph: Rhombus-shaped Deformable Modular Robots for Stable, Medium-Independent Reconfiguration Motion

Jie Gu, Yirui Sun, Zhihao Xia et al.

In this paper, we present RhoMorph, a novel deformable planar lattice modular self-reconfigurable robot (MSRR) with a rhombus-shaped module. Each module consists of a parallelogram skeleton with a single centrally mounted actuator that enables folding and unfolding along its diagonal. The core design philosophy is to achieve essential MSRR functionalities such as morphing, docking, and locomotion with minimal control complexity. This enables a continuous and stable reconfiguration process that is independent of the surrounding medium, allowing the system to reliably form various configurations in diverse environments. To leverage the unique kinematics of RhoMorph, we introduce morphpivoting, a novel motion primitive for reconfiguration that differs from those of advanced MSRR systems, and propose a strategy for its continuous execution. Finally, a series of physical experiments validate the module's stable reconfiguration ability, as well as its positional and docking accuracy.

CR · Dec 26, 2025
LLA: Enhancing Security and Privacy for Generative Models with Logic-Locked Accelerators

You Li, Guannan Zhao, Yuhao Ju et al.

We introduce LLA, an effective intellectual property (IP) protection scheme for generative AI models. LLA leverages the synergy between hardware and software to defend against various supply chain threats, including model theft, model corruption, and information leakage. On the software side, it embeds key bits into neurons that can trigger outliers to degrade performance and applies invariance transformations to obscure the key values. On the hardware side, it integrates a lightweight locking module into the AI accelerator while maintaining compatibility with various dataflow patterns and toolchains. An accelerator with a pre-stored secret key acts as a license to access the model services provided by the IP owner. The evaluation results show that LLA can withstand a broad range of oracle-guided key optimization attacks, while incurring a minimal computational overhead of less than 0.1% for 7,168 key bits.

87.8 · CV · May 10
DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

Can Li, Zhoujian Li, Ren Li et al.

World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as a distributed compliant actuator for hand-continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis.

RO · Aug 3, 2024
Stimulating Imagination: Towards General-purpose "Something Something Placement"

Jianyang Wu, Jie Gu, Xiaokang Ma et al.

General-purpose object placement is a fundamental capability of an intelligent generalist robot: being capable of rearranging objects following precise human instructions even in novel environments. This work is dedicated to achieving general-purpose object placement with "something something" instructions. Specifically, we break the entire process down into three parts, including object localization, goal imagination and robot control, and propose a method named SPORT. SPORT leverages a pre-trained large vision model for broad semantic reasoning about objects, and learns a diffusion-based pose estimator to ensure physically-realistic results in 3D space. Only object types (movable or reference) are communicated between these two parts, which brings two benefits. One is that we can fully leverage the powerful ability of open-set object recognition and localization since no specific fine-tuning is needed for the robotic scenario. Moreover, the diffusion-based estimator only needs to "imagine" the object poses after the placement, with no need for their semantic information. Thus the training burden is greatly reduced and no massive training is required. The training data for the goal pose estimation is collected in simulation and annotated by using GPT-4. Experimental results demonstrate the effectiveness of our approach. SPORT can not only generate promising 3D goal poses for unseen simulated objects, but also be seamlessly applied to real-world settings.

86.6 · RO · Apr 26
Move-Then-Operate: Behavioral Phasing for Human-Like Robotic Manipulation

Haoming Xu, Lei Lei, Jie Gu et al.

We present Move-Then-Operate, a vision-language-action framework that explicitly decouples robotic manipulation into two distinct behavioral phases: coarse relocation (move) and contact-critical interaction (operate). Unlike monolithic policies that conflate these heterogeneous regimes, our architecture employs a dual-expert policy routed by a learnable phase selector, introducing a structural inductive bias that isolates phase-specific dynamics. Phase labels are automatically generated via an MLLM-based pipeline conditioned on lightweight contextual cues such as end-effector velocity and subtask decomposition to ensure alignment with human motor patterns. Evaluated on the RoboTwin2 benchmark, our method achieves an average success rate of 68.9%, outperforming the monolithic $\pi_0$ baseline by 24%. It matches or exceeds models trained on 10× more data and reaches peak performance in 40% fewer training steps, demonstrating that architectural disentanglement of move and operate phases is a highly effective and efficient strategy for mastering high-precision manipulation.

CV · Jun 1, 2025
Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective

Lei Lei, Jie Gu, Xiaokang Ma et al.

Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Previous works generally assume that all visual tokens are necessary in the shallow layers of LLMs, and therefore token compression typically occurs in intermediate layers. In contrast, our study reveals an interesting insight: with proper selection, token compression is feasible at the input stage of the LLM with negligible performance loss. Specifically, we reveal that explainability methods can effectively evaluate the importance of each visual token with respect to the given instruction, which can in turn guide the token compression. Furthermore, we propose to learn a mapping from the attention map of the first LLM layer to the explanation results, thereby avoiding the need for a full inference pass and facilitating practical deployment. Interestingly, this mapping can be learned using a simple and lightweight convolutional network, whose training is efficient and independent of MLLMs. Extensive experiments on 10 image and video benchmarks across three leading MLLMs (Qwen2-VL, LLaVA-OneVision, and VILA1.5) demonstrate the effectiveness of our approach, e.g., pruning 50% of visual tokens while retaining more than 96% of the original performance across all benchmarks for all three MLLMs. It also exhibits strong generalization, even when the number of tokens in inference far exceeds that used in training.
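
Input-stage token compression guided by per-token importance can be sketched as a simple top-k selection. The function below is illustrative only: `prune_tokens`, `keep_ratio`, and the example scores are hypothetical, and the scores are assumed to come from some importance estimator (the abstract uses explainability results or a learned mapping from first-layer attention, neither of which is reproduced here).

```python
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: sequence of visual tokens (any objects).
    scores: one importance score per token (assumed to be given).
    keep_ratio: fraction of tokens to keep, rounded down, at least one.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices by descending importance and take the top ones.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])  # restore positional order
    return [tokens[i] for i in kept_indices], kept_indices


kept_tokens, kept_indices = prune_tokens(
    ["t0", "t1", "t2", "t3"], [0.1, 0.9, 0.3, 0.8], keep_ratio=0.5
)
```

Restoring positional order after selection keeps the surviving tokens in the sequence order the language model would otherwise have seen.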

LG · Mar 6, 2024
Tackling Missing Values in Probabilistic Wind Power Forecasting: A Generative Approach

Honglin Wen, Pierre Pinson, Jie Gu et al.

Machine learning techniques have been successfully used in probabilistic wind power forecasting. However, the issue of missing values within datasets, due to sensor failure for instance, has long been overlooked. Although it is natural to consider addressing this issue by imputing missing values before model estimation and forecasting, we suggest treating missing values and forecasting targets alike and predicting all unknown values simultaneously based on observations. In this paper, we offer an efficient probabilistic forecasting approach by estimating the joint distribution of features and targets based on a generative model. It is free of preprocessing, and thus avoids introducing potential errors. Compared with the traditional "impute, then predict" pipeline, the proposed approach achieves better performance in terms of continuous ranked probability score.
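
For readers unfamiliar with the evaluation metric, the continuous ranked probability score (CRPS) of a forecast represented by samples can be estimated with the standard identity CRPS = E|X - y| - 0.5 E|X - X'|. The sketch below is a generic estimator, not the evaluation code of the paper; the function name is hypothetical.

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate; lower values indicate better forecasts."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    # First term: mean absolute error between samples and the observation.
    term_obs = sum(abs(x - observation) for x in samples) / n
    # Second term: mean absolute difference over all ordered sample pairs.
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


point_forecast_score = crps_from_samples([1.0], 3.0)
two_sample_score = crps_from_samples([0.0, 2.0], 1.0)
```

With a single sample the score reduces to the absolute error, which is why CRPS is often described as a probabilistic generalization of mean absolute error.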

RO · Aug 25, 2025
Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

Bokai Ji, Jie Gu, Xiaokang Ma et al.

Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves outstanding performance broadly.

QUANT-PH · Jan 24, 2022
Automated machine learning for secure key rate in discrete-modulated continuous-variable quantum key distribution

Zhi-Ping Liu, Min-Gang Zhou, Wen-Bo Liu et al.

Continuous-variable quantum key distribution (CV QKD) with discrete modulation has attracted increasing attention due to its experimental simplicity, lower-cost implementation and compatibility with classical optical communication. Correspondingly, some novel numerical methods have been proposed to analyze the security of these protocols against collective attacks, which promotes key rates over one hundred kilometers of fiber distance. However, numerical methods are limited by their calculation time and resource consumption, which prevents their wider use on mobile platforms in quantum networks. To address this issue, a neural network model predicting key rates in nearly real time has been proposed previously. Here, we go further and show a neural network model combined with Bayesian optimization. This model automatically designs the best neural network architecture for computing key rates in real time. We demonstrate our model with two variants of CV QKD protocols with quaternary modulation. The results show high reliability with secure probability as high as 99.15%-99.59%, considerable tightness and high efficiency with speedup of approximately $10^7$ in both cases. This inspiring model enables the real-time computation of unstructured quantum key distribution protocols' key rate more automatically and efficiently, which meets the growing need to implement QKD protocols on moving platforms.

LG · Oct 20, 2021
Empowering General-purpose User Representation with Full-life Cycle Behavior Modeling

Bei Yang, Jie Gu, Ke Liu et al.

User Modeling plays an essential role in industry. In this field, task-agnostic approaches, which generate general-purpose representations applicable to diverse downstream user cognition tasks, are a promising direction, being more valuable and economical than task-specific representation learning. With the rapid development of Internet service platforms, user behaviors have been accumulated continuously. However, existing general-purpose user representation research has little ability for full-life cycle modeling on extremely long behavior sequences since user registration. In this study, we propose a novel framework called full-Life cycle User Representation Model (LURM) to tackle this challenge. Specifically, LURM consists of two cascaded sub-models: (I) Bag-of-Interests (BoI) encodes user behaviors in any time period into a sparse vector with super-high dimension (e.g., 10^5); (II) Self-supervised Multi-anchor Encoder Network (SMEN) maps sequences of BoI features to multiple low-dimensional user representations. In particular, SMEN achieves almost lossless dimensionality reduction, benefiting from a novel multi-anchor module which can learn different aspects of user interests. Experiments on several benchmark datasets show that our approach outperforms state-of-the-art general-purpose representation methods.

QUANT-PH · Sep 23, 2021
Finite-key Analysis for Quantum Conference Key Agreement with Asymmetric Channels

Zhao Li, Xiao-Yu Cao, Chen-Long Li et al.

As an essential ingredient of quantum networks, quantum conference key agreement (QCKA) provides unconditional secret keys among multiple parties, which enables only legitimate users to decrypt the encrypted message. Recently, some QCKA protocols employing twin-field were proposed to promote transmission distance. These protocols, however, suffer from relatively low conference key rate and short transmission distance over asymmetric channels, which demands a prompt solution in practice. Here, we consider a tripartite QCKA protocol utilizing the idea of the sending-or-not-sending twin-field scheme and propose a high-efficiency QCKA over asymmetric channels by removing the symmetry parameters condition. Besides, we provide a composable finite-key analysis with rigorous security proof against general attacks by exploiting the entropic uncertainty relation for multiparty systems. Our protocol greatly improves the feasibility of establishing conference keys over asymmetric channels.

LG · Sep 18, 2021
Interest-oriented Universal User Representation via Contrastive Learning

Qinghui Sun, Jie Gu, Bei Yang et al.

User representation is essential for providing high-quality commercial services in industry. Universal user representation has received much interest recently, as it frees us from the cumbersome work of training a specific model for each downstream application. In this paper, we attempt to improve universal user representation from two points of view. First, a contrastive self-supervised learning paradigm is presented to guide the representation model training. It provides a unified framework that allows for long-term or short-term interest representation learning in a data-driven manner. Moreover, a novel multi-interest extraction module is presented. The module introduces an interest dictionary to capture principal interests of the given user, and then generates his/her interest-oriented representations via behavior aggregation. Experimental results demonstrate the effectiveness and applicability of the learned user representations.
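
The abstract does not name the contrastive objective, so the following is a sketch of InfoNCE, a commonly used contrastive loss, chosen here as a stand-in. Each anchor embedding is paired with one positive, and the other positives in the batch act as negatives. The function names and the temperature value are assumptions for illustration.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the positive for anchors[i]."""
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    losses = []
    for i, anchor in enumerate(a):
        # Cosine similarities to every candidate, scaled by temperature.
        logits = [_dot(anchor, cand) / temperature for cand in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own positive and large when the pairing is wrong, which is the property a contrastive representation objective relies on.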

QUANT-PH · Sep 6, 2021
Coherent one-way quantum conference key agreement based on twin field

Xiao-Yu Cao, Jie Gu, Yu-Shuo Lu et al.

Quantum conference key agreement (CKA) enables key sharing among multiple trusted users with information-theoretic security. Currently, the key rates of most quantum CKA protocols suffer from the limit of the total efficiency among quantum channels. Inspired by the coherent one-way and twin-field quantum key distribution (QKD) protocols, we propose a quantum CKA protocol of three users. Exploiting coherent states with intensity 0 and $μ$ to encode logic bits, our protocol can break the limit. Additionally, the requirements of phase randomization and multiple intensity modulation are removed in our protocol, making its experimental demonstration simple.

LG · Dec 11, 2020
Exploiting Behavioral Consistence for Universal User Representation

Jie Gu, Feng Wang, Qinghui Sun et al.

User modeling is critical for developing personalized services in industry. A common way for user modeling is to learn user representations that can be distinguished by their interests or preferences. In this work, we focus on developing a universal user representation model. The obtained universal representations are expected to contain rich information, and be applicable to various downstream applications without further modifications (e.g., user preference prediction and user profiling). Accordingly, we can be free from the heavy work of training task-specific models for every downstream task as in previous works. Specifically, we propose Self-supervised User Modeling Network (SUMN) to encode behavior data into the universal representation. It includes two key components. The first one is a new learning objective, which guides the model to fully identify and preserve valuable user information under a self-supervised learning framework. The other one is a multi-hop aggregation layer, which benefits the model capacity in aggregating diverse behaviors. Extensive experiments on benchmark datasets show that our approach can outperform state-of-the-art unsupervised representation methods, and even compete with supervised ones.

SP · Sep 9, 2019
Sequential Convolutional Recurrent Neural Networks for Fast Automatic Modulation Classification

Kaisheng Liao, Yaodong Zhao, Jie Gu et al.

A novel and efficient end-to-end learning model for automatic modulation classification is proposed for wireless spectrum monitoring applications, which automatically learns from the time domain in-phase and quadrature data without requiring the design of hand-crafted expert features. With the intuition of convolutional layers with pooling serving as the role of front-end feature distillation and dimensionality reduction, sequential convolutional recurrent neural networks are developed to take complementary advantage of parallel computing capability of convolutional neural networks and temporal sensitivity of recurrent neural networks. Experimental results demonstrate that the proposed architecture delivers overall superior performance in the signal-to-noise ratio range above -10 dB, and achieves significantly improved classification accuracy from 80% to 92.1% in the high signal-to-noise ratio range, while drastically reducing the average training and prediction time by approximately 74% and 67%, respectively. Response patterns learned by the proposed architecture are visualized to better understand the physics of the model. Furthermore, a comparative study is performed to investigate the impacts of various sequential convolutional recurrent neural network structure settings on classification performance. A representative sequential convolutional recurrent neural network architecture with the two-layer convolutional neural network and subsequent two-layer long short-term memory neural network is developed to suggest the option for fast automatic modulation classification.

CV · Mar 21, 2019
Progressive Sparse Local Attention for Video object detection

Chaoxu Guo, Bin Fan, Jie Gu et al.

Transferring image-based object detectors to the domain of videos remains a challenging problem. Previous efforts mostly exploit optical flow to propagate features across frames, aiming to achieve a good trade-off between accuracy and efficiency. However, introducing an extra model to estimate optical flow can significantly increase the overall model size. The gap between optical flow and high-level features can also hinder it from establishing spatial correspondence accurately. Instead of relying on optical flow, this paper proposes a novel module called Progressive Sparse Local Attention (PSLA), which establishes the spatial correspondence between features across frames in a local region with progressively sparser stride and uses the correspondence to propagate features. Based on PSLA, Recursive Feature Updating (RFU) and Dense Feature Transforming (DenseFT) are proposed to model temporal appearance and enrich feature representation respectively in a novel video object detection framework. Experiments on ImageNet VID show that our method achieves the best accuracy compared to existing methods with smaller model size and acceptable runtime speed.

ME · Jan 31, 2015
A Random Matrix Theoretical Approach to Early Event Detection in Smart Grid

Xing He, Robert Caiming Qiu, Qian Ai et al.

Power systems are developing very fast nowadays, both in size and in complexity; this situation is a challenge for Early Event Detection (EED). This paper proposes a data-driven unsupervised learning method to handle this challenge. Specifically, the random matrix theories (RMTs) are introduced as the statistical foundations for random matrix models (RMMs); based on the RMMs, linear eigenvalue statistics (LESs) are defined via the test functions as the system indicators. By comparing the values of the LES between the experimental and the theoretical ones, the anomaly detection is conducted. Furthermore, we develop a 3D power-map to visualize the LES; it provides a robust auxiliary decision-making mechanism to the operators. In this sense, the proposed method conducts EED with a purely statistical procedure, requiring no knowledge of system topologies, unit operation/control models, etc. The LES, as a key ingredient during this procedure, is a high dimensional indicator derived directly from raw data. As an unsupervised learning indicator, the LES is much more sensitive than the low dimensional indicators obtained from supervised learning. With the statistical procedure, the proposed method is universal and fast; moreover, it is robust against traditional EED challenges (such as error accumulations, spurious correlations, and even bad data in core area). Case studies, with both simulated data and real ones, validate the proposed method. To manage large-scale distributed systems, data fusion is mentioned as another data processing ingredient.
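
A linear eigenvalue statistic is the sum of a test function over the eigenvalues of a random matrix. As a minimal sketch, assume the matrix is the sample covariance S = X Xᵀ / T of an N × T data window and the test function is the polynomial φ(x) = 2x² - 1. Both choices are made here for illustration and are not taken from the abstract. For a polynomial test function the sum can be computed from traces, so no eigen-decomposition is needed.

```python
def sample_covariance(rows):
    """Return S = X X^T / T for a data matrix given as N rows of T samples."""
    t = len(rows[0])
    if any(len(r) != t for r in rows):
        raise ValueError("all rows must have the same number of samples")
    return [[sum(a * b for a, b in zip(ri, rj)) / t for rj in rows] for ri in rows]


def les_quadratic(rows):
    """LES with phi(x) = 2x^2 - 1: sum_i phi(lambda_i) = 2 * tr(S^2) - N."""
    s = sample_covariance(rows)
    n = len(s)
    # S is symmetric, so tr(S^2) equals the sum of its squared entries.
    trace_s_squared = sum(v * v for row in s for v in row)
    return 2.0 * trace_s_squared - n


# Two uncorrelated channels versus two perfectly correlated channels.
les_uncorrelated = les_quadratic([[1.0, -1.0], [1.0, 1.0]])
les_correlated = les_quadratic([[1.0, -1.0], [1.0, -1.0]])
```

In a detection setting, the value computed on each data window would be compared against the value expected under normal operation, and a large deviation would be flagged as a candidate event.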