CV | May 30
MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory Forecasting
Yu Liu, Ming Huang, Xiao Ren et al.
Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-NET achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.
SY | Jul 5, 2018
Metamorphic Moving Horizon Estimation
He Kong, Salah Sukkarieh
This paper considers a practical scenario where a classical estimation method might have already been implemented on a certain platform when one tries to apply more advanced techniques such as moving horizon estimation (MHE). We are interested in utilizing MHE to upgrade, rather than completely discard, the existing estimation technique. This immediately raises the question of how one can improve the estimation performance gradually based on the pre-estimator. To this end, we propose a general methodology which incorporates the pre-estimator with a tuning parameter λ between 0 and 1 into the quadratic cost functions that are usually adopted in MHE. We examine the above idea in two standard MHE frameworks that have been proposed in the existing literature. For both frameworks, when λ = 0, the proposed strategy exactly matches the existing classical estimator; when the value of λ is increased, the proposed strategy exhibits a more aggressive normalized forgetting effect towards the old data, thereby increasing the estimation performance gradually.
HC | Feb 9, 2024 | Code
ScreenAgent: A Vision Language Model-driven Computer Control Agent
Runliang Niu, Jindong Li, Shiqi Wang et al.
Existing Large Language Models (LLMs) can invoke a variety of tools and APIs to complete complex tasks. The computer, as the most powerful and universal tool, could potentially be controlled directly by a trained LLM agent. Powered by the computer, we can hopefully build a more generalized agent to assist humans in various everyday digital tasks. In this paper, we construct an environment for a Vision Language Model (VLM) agent to interact with a real computer screen. Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions. We also design an automated control pipeline that includes planning, acting, and reflecting phases, guiding the agent to continuously interact with the environment and complete multi-step tasks. Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing a variety of daily computer tasks. Finally, we trained a model, ScreenAgent, which achieved computer control capabilities comparable to GPT-4V and demonstrated more precise UI positioning capabilities. Our attempts could inspire further research on building a generalist LLM agent. The code is available at https://github.com/niuzaisheng/ScreenAgent.
AS | Apr 28
ASAP: An Azimuth-Priority Strip-Based Search Approach to Planar Microphone Array DOA Estimation in 3D
Ming Huang, Shuting Xu, Leying Yang et al.
Direction-of-arrival (DOA) estimation is an important task in microphone array processing and many downstream applications. The steered response power with phase transform (SRP-PHAT) method has been widely adopted for DOA estimation in recent years. However, accurate SRP-PHAT estimation in 3D scenarios requires evaluating steering responses over thousands of candidate directions, severely limiting real-time performance on resource-constrained platforms. This challenge becomes even more critical for planar arrays, which are widely used in robotics due to their structural simplicity. Motivated by the fact that azimuth estimation is usually more reliable than elevation estimation for most arrays, we propose ASAP, an azimuth-priority strip-based search approach to planar microphone array DOA estimation in 3D. In the first stage, ASAP performs coarse-to-fine region contraction within azimuthal strips to lock azimuth angles while retaining multiple maxima through spherical caps. In the second stage, it refines elevation along the great-circle arc between two close candidates. Extensive simulations and real-world experiments validate the efficiency and merits of the proposed method over existing approaches.
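The coarse-to-fine region contraction mentioned in the abstract above can be illustrated with a minimal one-dimensional sketch. This is not the ASAP algorithm and it does not compute SRP-PHAT: the function name `coarse_to_fine_argmax`, its parameters, and the toy score used in the usage note are all assumptions made for illustration. It only shows how repeatedly shrinking the search interval around the best grid point reduces the number of candidate directions that must be evaluated.

```python
def coarse_to_fine_argmax(score, lo, hi, points=9, rounds=4):
    """Repeatedly shrink [lo, hi] around the best grid point of `score`.

    `score` is any callable mapping an angle (in degrees here) to a
    response value; in a real system it would be a steered response,
    which this sketch does not implement.
    """
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best = max(grid, key=score)          # best candidate on this grid
        lo, hi = best - step, best + step    # contract around it
    return best
```

With `points=9` and `rounds=4` the sketch evaluates 36 candidates instead of the 360 needed by a uniform 1-degree azimuth grid. A single contracted interval can miss the true peak when the response has several local maxima, which is presumably why the abstract describes retaining multiple maxima.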
AI | Aug 10, 2025 | Code
Pentest-R1: Towards Autonomous Penetration Testing Reasoning Optimized via Two-Stage Reinforcement Learning
He Kong, Die Hu, Jingguo Ge et al.
Automating penetration testing is crucial for enhancing cybersecurity, yet current Large Language Models (LLMs) face significant limitations in this domain, including poor error handling, inefficient reasoning, and an inability to perform complex end-to-end tasks autonomously. To address these challenges, we introduce Pentest-R1, a novel framework designed to optimize LLM reasoning capabilities for this task through a two-stage reinforcement learning pipeline. We first construct a dataset of over 500 real-world, multi-step walkthroughs, which Pentest-R1 leverages for offline reinforcement learning (RL) to instill foundational attack logic. Subsequently, the LLM is fine-tuned via online RL in an interactive Capture The Flag (CTF) environment, where it learns directly from environmental feedback to develop robust error self-correction and adaptive strategies. Our extensive experiments on the Cybench and AutoPenBench benchmarks demonstrate the framework's effectiveness. On AutoPenBench, Pentest-R1 achieves a 24.2% success rate, surpassing most state-of-the-art models and ranking second only to Gemini 2.5 Flash. On Cybench, it attains a 15.0% success rate in unguided tasks, establishing a new state-of-the-art for open-source LLMs and matching the performance of top proprietary models. Ablation studies confirm that the synergy of both training stages is critical to its success.
CV | Feb 13
GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction
Xiao Ren, Yu Liu, Ning An et al.
Recently, 3D Gaussian Splatting has emerged as a prominent research direction owing to its ultrarapid training speed and high-fidelity rendering capabilities. However, the unstructured and irregular nature of Gaussian point clouds poses challenges to reconstruction accuracy. This limitation frequently causes high-frequency detail loss in complex surface microstructures when relying solely on routine strategies. To address this limitation, we propose GSM-GS: a synergistic optimization framework integrating single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. For single-view optimization, we leverage image gradient features to partition scenes into texture-rich and texture-less sub-regions. The reconstruction quality is enhanced through adaptive filtering mechanisms guided by depth discrepancy features. This preserves high-weight regions while implementing a dual-branch constraint strategy tailored to regional texture variations, thereby improving geometric detail characterization. For multi-view optimization, we introduce a geometry-guided cross-view point cloud association method combined with a dynamic weight sampling strategy. This constructs 3D structural normal constraints across adjacent point cloud frames, effectively reinforcing multi-view consistency and reconstruction fidelity. Extensive experiments on public datasets demonstrate that our method achieves both competitive rendering quality and geometric reconstruction.
CV | Nov 2, 2025
Occlusion-Aware Diffusion Model for Pedestrian Intention Prediction
Yu Liu, Zhijie Liu, Zedong Yang et al.
Predicting pedestrian crossing intentions is crucial for the navigation of mobile robots and intelligent vehicles. Although recent deep learning-based models have shown significant success in forecasting intentions, few consider incomplete observation under occlusion scenarios. To tackle this challenge, we propose an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded motion patterns and leverages them to guide future intention prediction. During the denoising stage, we introduce an occlusion-aware diffusion transformer architecture to estimate noise features associated with occluded patterns, thereby enhancing the model's ability to capture contextual relationships in occluded semantic scenarios. Furthermore, an occlusion mask-guided reverse process is introduced to effectively utilize observation information, reducing the accumulation of prediction errors and enhancing the accuracy of reconstructed motion features. The performance of the proposed method under various occlusion scenarios is comprehensively evaluated and compared with existing methods on popular benchmarks, namely PIE and JAAD. Extensive experimental results demonstrate that the proposed method achieves more robust performance than existing methods in the literature.
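One common way to let observations guide a reverse diffusion step is to overwrite the entries that were actually observed and keep the denoiser output only where the input was occluded. The abstract above does not specify how its mask-guided reverse process works, so the sketch below is an assumption in the style of inpainting-type diffusion samplers, and the function name `mask_guided_update` is hypothetical.

```python
def mask_guided_update(observed, denoised, occluded):
    """Keep observed entries and fill occluded ones from the denoiser.

    observed: feature values from the observation (ignored where occluded)
    denoised: the model's current estimate for every entry
    occluded: booleans, True where the observation is missing
    This blending rule is an illustrative assumption, not the paper's method.
    """
    return [d if m else o for o, d, m in zip(observed, denoised, occluded)]
```

For example, `mask_guided_update([1.0, 0.0, 3.0], [0.9, 2.1, 3.2], [False, True, False])` keeps the first and third observed values and takes only the middle entry from the denoiser.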
RO | Jun 10, 2025
TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization
Zengjue Chen, Runliang Niu, He Kong et al.
Visual-Language-Action (VLA) models have demonstrated strong cross-scenario generalization capabilities in various robotic tasks through large-scale pre-training and task-specific fine-tuning. However, their training paradigm mainly relies on manually collected successful demonstrations, making it difficult to adapt to complex environments when encountering out-of-distribution (OOD) scenarios or execution biases. While Reinforcement Learning (RL) provides a closed-loop optimization framework via an active trial-and-error mechanism, it suffers from sparse rewards, high variance, and unstable optimization in long-horizon robotic tasks. To address these limitations, we propose Trajectory-based Group Relative Policy Optimization (TGRPO), an online RL-based training framework for VLA models. TGRPO leverages task analysis generated by a large language model to automatically construct dense reward functions, providing fine-grained feedback to accelerate convergence and improve credit assignment. The core of our method is a group-based strategy that samples and normalizes multiple trajectories in parallel, reducing variance through relative comparison. By integrating trajectory-level and step-level advantage estimation, TGRPO captures both global and local optimization signals without relying on a value network. Experiments on four task categories of the LIBERO benchmark demonstrate that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT) and outperforms other representative RL-based post-training methods.
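The group-based normalization described in the abstract above can be sketched as follows. Each trajectory's return is compared with the mean and standard deviation of its sampling group, so no value network is needed. The way the trajectory-level and step-level terms are blended here (a weighted sum with hypothetical `traj_weight` and `step_weight` parameters) and the equal-length assumption are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(values, eps=1e-8):
    """Normalize each value against the mean and std of its group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def combined_advantages(step_rewards, traj_weight=1.0, step_weight=1.0):
    """Blend trajectory-level and step-level group-relative advantages.

    step_rewards: one list of per-step rewards per sampled trajectory.
    All trajectories are assumed to have equal length (a simplification).
    """
    # Trajectory level: compare total returns across the group.
    traj_adv = group_relative_advantages([sum(t) for t in step_rewards])
    horizon = len(step_rewards[0])
    # Step level: compare rewards at the same time index across the group.
    step_adv = [
        group_relative_advantages([t[k] for t in step_rewards])
        for k in range(horizon)
    ]
    return [
        [traj_weight * traj_adv[i] + step_weight * step_adv[k][i]
         for k in range(horizon)]
        for i in range(len(step_rewards))
    ]
```

Because the advantages are relative, a trajectory with an average return receives an advantage of zero regardless of the absolute reward scale.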
CV | Jan 10, 2024
Knowledge-aware Graph Transformer for Pedestrian Trajectory Prediction
Yu Liu, Yuexin Zhang, Kunming Li et al.
Predicting pedestrian motion trajectories is crucial for path planning and motion control of autonomous vehicles. Accurately forecasting crowd trajectories is challenging due to the uncertain nature of human motions in different environments. For training, recent deep learning-based prediction approaches mainly utilize information like trajectory history and interactions between pedestrians, among others. This can limit the prediction performance across various scenarios since the discrepancies between training datasets have not been properly incorporated. To overcome this limitation, this paper proposes a graph transformer structure to improve prediction performance, capturing the differences between the various sites and scenarios contained in the datasets. In particular, a self-attention mechanism and a domain adaptation module have been designed to improve the generalization ability of the model. Moreover, an additional metric considering cross-dataset sequences is introduced for training and performance evaluation purposes. The proposed framework is validated and compared against existing methods using popular public datasets, i.e., ETH and UCY. Experimental results demonstrate the improved performance of our proposed scheme.
CV | Aug 10, 2025
Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction
Yu Liu, Zhijie Liu, Xiao Ren et al.
Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.
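To make the idea of decoupling direction and magnitude concrete, the sketch below converts a 2D trajectory into per-step (magnitude, heading) pairs and then takes differences between consecutive steps. The abstract above does not define its residual polar representation, so treating the residual as a step-to-step difference is an assumption, and all function names are hypothetical.

```python
import math


def to_polar_steps(points):
    """Convert a 2D trajectory into per-step (magnitude, heading) pairs."""
    steps = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        steps.append((math.hypot(dx, dy), math.atan2(dy, dx)))
    return steps


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def polar_residuals(points):
    """Change in step magnitude and heading between consecutive steps."""
    steps = to_polar_steps(points)
    return [
        (m1 - m0, wrap_angle(h1 - h0))
        for (m0, h0), (m1, h1) in zip(steps, steps[1:])
    ]
```

In this form a pedestrian who speeds up without turning changes only the first component, while one who turns at constant speed changes only the second.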
CV | Aug 6, 2025
Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction
Yu Liu, Zhijie Liu, Xiao Ren et al.
Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians' motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.
SE | Jan 30, 2025
Cogito, ergo sum: A Neurobiologically-Inspired Cognition-Memory-Growth System for Code Generation
Yanlong Li, Jindong Li, Qi Wang et al.
Large language model based Multi-Agent Systems (MAS) have demonstrated promising performance for enhancing the efficiency and accuracy of code generation tasks. However, most existing methods follow a conventional sequence of planning, coding, and debugging, which contradicts the growth-driven nature of the human learning process. Additionally, the frequent information interaction between multiple agents inevitably involves high computational costs. In this paper, we propose Cogito, a neurobiologically inspired multi-agent framework to enhance the problem-solving capabilities in code generation tasks with lower cost. Specifically, Cogito adopts a reverse sequence: it first undergoes debugging, then coding, and finally planning. This approach mimics human learning and development, where knowledge is acquired progressively. Accordingly, a hippocampus-like memory module with different functions is designed to work with the pipeline to provide quick retrieval in similar tasks. Through this growth-based learning model, Cogito accumulates knowledge and cognitive skills at each stage, ultimately forming a Super Role, an all-capable agent, to perform the code generation task. Extensive experiments against representative baselines demonstrate the superior performance and efficiency of Cogito. The code is publicly available at https://anonymous.4open.science/r/Cogito-0083.
RO | Sep 19, 2021
Active Information Acquisition under Arbitrary Unknown Disturbances
Jennifer Wakulicz, He Kong, Salah Sukkarieh
Trajectory optimization of sensing robots to actively gather information about targets has received much attention in the past. It is well-known that under the assumption of linear Gaussian target dynamics and sensor models the stochastic Active Information Acquisition problem is equivalent to a deterministic optimal control problem. However, the above-mentioned assumptions regarding the target dynamic model are limiting. In real-world scenarios, the target may be subject to disturbances whose models or statistical properties are hard or impossible to obtain. Typical scenarios include abrupt maneuvers, jumping disturbances due to interactions with the environment, anomalous misbehaviors due to system faults/attacks, etc. Motivated by the above considerations, in this paper we consider targets whose dynamic models are subject to arbitrary unknown inputs whose models or statistical properties are not assumed to be available. In particular, with the aid of an unknown input decoupled filter, we formulate the sensor trajectory planning problem to track the evolution of the target state and analyse the resulting performance for both the state and unknown input evolution tracking. Inspired by concepts of Reduced Value Iteration, a suboptimal solution that expands a search tree via Forward Value Iteration with informativeness-based pruning is proposed. Concrete suboptimality performance guarantees for tracking both the state and the unknown input are established. Numerical simulations of a target tracking example are presented to compare the proposed solution with a greedy policy.
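The equivalence to a deterministic problem in the linear Gaussian case rests on a standard property of the Kalman filter: the error covariance evolves through the Riccati recursion, which depends on the model and sensor parameters but not on the measured values. The scalar sketch below illustrates this. Representing the effect of the sensor trajectory by a per-step sensor gain `c_values` is a hypothetical simplification, and the sketch does not cover the unknown-input filter studied in the paper.

```python
def covariance_trajectory(p0, a, q, c_values, r):
    """Propagate the scalar Kalman filter error covariance.

    p0: initial covariance, a: state transition, q: process noise variance,
    c_values: sensor gain at each step (stand-in for the sensor trajectory),
    r: measurement noise variance. No measurement values are needed.
    """
    p = p0
    history = []
    for c in c_values:
        p_pred = a * p * a + q                  # time update
        k = p_pred * c / (c * p_pred * c + r)   # Kalman gain
        p = (1 - k * c) * p_pred                # measurement update
        history.append(p)
    return history
```

Since the returned covariances are known before any data arrive, a planner can compare candidate sensor trajectories offline, for instance by preferring the sequence of gains that yields the smallest final covariance.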
RO | May 23, 2021
Experimental Evaluation of a Hierarchical Operating Framework for Ground Robots in Agriculture
Stuart Eiffert, Nathan D. Wallace, He Kong et al.
For mobile robots to be effectively applied to real world unstructured environments -- such as large scale farming -- they require the ability to generate adaptive plans that account both for limited onboard resources and for the presence of dynamic changes, including nearby moving individuals. This work provides a real world empirical evaluation of our proposed hierarchical framework for long-term autonomy of field robots, conducted on the University of Sydney's Swagbot agricultural robot platform. We demonstrate the ability of the framework to navigate an unstructured and dynamic environment in an effective manner, validating its use for long-term deployment in large scale farming, for tasks such as autonomous weeding in the presence of moving individuals.
RO | May 22, 2021
Resource and Response Aware Path Planning for Long-term Autonomy of Ground Robots in Agriculture
Stuart Eiffert, Nathan D. Wallace, He Kong et al.
Achieving long-term autonomy for mobile robots operating in real-world unstructured environments such as farms remains a significant challenge. This is made increasingly complex in the presence of moving humans or livestock. These environments require a robot to be adaptive in its immediate plans, accounting for the state of nearby individuals and the response that they might have to the robot's actions. Additionally, in order to achieve longer-term goals, consideration of the limited on-board resources available to the robot is required, especially for extended missions such as weeding an agricultural field. To achieve efficient long-term autonomy, it is thus crucial to understand the impact that online dynamic updates to an energy efficient offline plan might have on resource usage whilst navigating through crowds or herds. To address these challenges, a hierarchical planning framework is proposed, integrating an online local dynamic path planner with an offline longer-term objective-based planner. This framework acts to achieve long-term autonomy through awareness of both dynamic responses of individuals to a robot's motion and the limited resources available. This paper details the hierarchical approach and its integration on a robotic platform, including a comprehensive description of the planning framework and associated perception modules. The approach is evaluated in real-world trials on farms, requiring both consideration of limited battery capacity and the presence of nearby moving individuals. These trials additionally demonstrate the ability of the framework to adapt resource use through variation of the local dynamic planner, allowing adaptive behaviour in changing environments. A summary video is available at https://youtu.be/DGVTrYwJ304.
RO | Jun 24, 2020
A Hierarchical Framework for Long-term and Robust Deployment of Field Ground Robots in Large-Scale Farming
Stuart Eiffert, Nathan D. Wallace, He Kong et al.
Achieving long term autonomy of robots operating in dynamic environments such as farms remains a significant challenge. Arguably, the most demanding factors to achieve this are the on-board resource constraints such as energy, planning in the presence of moving individuals such as livestock and people, and handling unknown and undulating terrain. These considerations require a robot to be adaptive in its immediate actions in order to successfully achieve long-term, resource-efficient and robust autonomy. To achieve this, we propose a hierarchical framework that integrates a local dynamic path planner with a longer term objective based planner and advanced motion control methods, whilst taking into consideration the dynamic responses of moving individuals within the environment. The framework is motivated by and synthesizes our recent work on energy aware mission planning, path planning in dynamic environments, and receding horizon motion control. In this paper we detail the proposed framework and outline its integration on a robotic platform. We evaluate the strategy in extensive simulated trials, traversing between objective waypoints to complete tasks such as soil sampling, weeding and recharging across a dynamic environment, demonstrating its capability to robustly adapt long term mission plans in the presence of moving individuals and obstacles for real world applications such as large scale farming.
RO | Jan 30, 2020
Path Planning in Dynamic Environments using Generative RNNs and Monte Carlo Tree Search
Stuart Eiffert, He Kong, Navid Pirmarzdashti et al.
State of the art methods for robotic path planning in dynamic environments, such as crowds or traffic, rely on hand crafted motion models for agents. These models often do not reflect interactions of agents in real world scenarios. To overcome this limitation, this paper proposes an integrated path planning framework using generative Recurrent Neural Networks within a Monte Carlo Tree Search (MCTS). This approach uses a learnt model of social response to predict crowd dynamics during planning across the action space. This extends our recent work using generative RNNs to learn the relationship between planned robotic actions and the likely response of a crowd. We show that the proposed framework can considerably improve motion prediction accuracy during interactions, allowing more effective path planning. The performance of our method is compared in simulation with existing methods for collision avoidance in a crowd of pedestrians, demonstrating the ability to control future states of nearby individuals. We also conduct preliminary real world tests to validate the effectiveness of our method.
RO | Oct 10, 2018
Receding horizon estimation and control with structured noise blocking for mobile robot slip compensation
Nathan Wallace, He Kong, Andrew Hill et al.
The control of field robots in varying and uncertain terrain conditions presents a challenge for autonomous navigation. Online estimation of the wheel-terrain slip characteristics is essential for generating the accurate control predictions necessary for tracking trajectories in off-road environments. Receding horizon estimation (RHE) provides a powerful framework for constrained estimation, and when combined with receding horizon control (RHC), yields an adaptive optimisation-based control method. Presently, such methods assume slip to be constant over the estimation horizon, while our proposed structured blocking approach relaxes this assumption, resulting in improved state and parameter estimation. We demonstrate and compare the performance of this method in simulation, and propose an overlapping-block strategy to ameliorate some of the limitations encountered in applying noise-blocking in a receding horizon estimation and control (RHEC) context.
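Blocking in a receding horizon problem generally means that the unknown sequence is held constant within each block, so the optimizer works with one value per block rather than one per time step. The sketch below shows only that parameterization. The function name `expand_blocks` and the block lengths are hypothetical, and the paper's structured and overlapping-block strategies are not reproduced here.

```python
def expand_blocks(block_values, block_lengths):
    """Expand one value per block into a per-step sequence.

    block_values: the decision variables (e.g. one slip value per block)
    block_lengths: number of time steps each block covers
    """
    if len(block_values) != len(block_lengths):
        raise ValueError("need exactly one length per block")
    sequence = []
    for value, length in zip(block_values, block_lengths):
        sequence.extend([value] * length)
    return sequence
```

A single block spanning the whole horizon, such as `expand_blocks([0.2], [5])`, recovers the constant-slip assumption, while `expand_blocks([0.1, 0.3], [2, 3])` lets the slip change once inside a five-step horizon.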
SY | Jun 19, 2017
A Divide and Conquer Approach to Cooperative Distributed Model Predictive Control
He Kong, Stefano Longo, Gabriele Pannocchia et al.
This paper is concerned with the design of cooperative distributed Model Predictive Control (MPC) for linear systems. Motivated by the special structure of the distributed models in some existing literature, we propose to apply a state transformation to the original system and global cost function. This has major implications on the closed-loop stability analysis and the mechanism of the resultant cooperative framework. It turns out that the proposed framework can be implemented without cooperative iterations being performed in the local optimizations, thus allowing one to compute the local inputs in parallel and independently from each other while requiring only partial plant-wide state information. The proposed framework can also be realized with cooperative iterations, thereby keeping the advantages of the technique in the former reference. Under certain conditions, closed-loop stability for both implementation procedures can be guaranteed a priori by appropriate selections of the original local cost functions. The strengths and benefits of the proposed method are highlighted by means of two numerical examples.