59.7 RO May 25
HeLoM: Hierarchical Learning for Whole-Body Loco-Manipulation by a Hexapod Robot
Xinrong Yang, Peizhuo Li, Hongyi Li et al.
In nature, animals often need to move or manipulate objects comparable in weight or size to their own bodies. Compared to grasping and carrying, pushing provides a more straightforward and efficient non-prehensile manipulation strategy, avoiding complex grasp design while leveraging direct contact to regulate an object's pose during interaction. Achieving effective pushing, however, requires both sufficient manipulation capability and stable whole-body coordination, which is particularly challenging when dealing with heavy or irregular objects. To address these challenges, we propose HeLoM, a learning-based hierarchical whole-body manipulation framework for hexapod robots that exploits coordinated multi-limb control and is applicable to multi-legged robotic systems. Inspired by the cooperative strategies of multi-legged insects, our framework leverages multiple contact points and high degrees of freedom to enable efficient and dynamic whole-body coordination during object interaction. HeLoM's high-level planner plans pushing behaviors, while its low-level controller maintains locomotion stability and generates dynamically consistent joint actions. This design enables the robot to maintain balance while executing continuous and controllable pushing behaviors through coordinated foreleg interaction and supportive hind-leg propulsion. We validate the effectiveness of HeLoM through both simulation and real-world experiments. Results show that our framework can stably push objects of varying sizes and unknown physical properties to designated goal poses in the real world.
CV Sep 2, 2024 Code
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
Leqi Shen, Tianxiang Hao, Tao He et al.
Most text-video retrieval methods use text-image pre-trained models such as CLIP as a backbone. These methods process each sampled frame independently with the image encoder, resulting in high computational overhead and limiting practical deployment. To address this, we focus on efficient text-video retrieval by tackling two key challenges: 1. From the perspective of trainable parameters, current parameter-efficient fine-tuning methods incur high inference costs; 2. From the perspective of model complexity, current token compression methods are mainly designed for images to reduce spatial redundancy but overlook temporal redundancy in consecutive frames of a video. To tackle these challenges, we propose Temporal Token Merging (TempMe), a parameter-efficient and training-inference efficient text-video retrieval architecture that minimizes trainable parameters and model complexity. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we reduce spatio-temporal redundancy and enhance temporal modeling across different frames, leading to improved efficiency and performance. Extensive experiments validate the superiority of our TempMe. Compared to previous parameter-efficient text-video retrieval methods, TempMe achieves superior performance with just 0.50M trainable parameters. It significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8× speedup and a 4.4% R-Sum improvement. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57× faster, and uses 75.2% of the GPU memory. The code is available at https://github.com/LunarShen/TempMe.
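To make the idea of merging redundant tokens across neighboring clips concrete, here is a minimal sketch in plain Python. It is not the TempMe implementation: the greedy best-match rule, the cosine similarity measure, the averaging step, and the names `cosine` and `merge_adjacent_clips` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_adjacent_clips(tokens_a, tokens_b, r):
    """Merge the r tokens of clip B that best match a token of clip A.

    Hypothetical simplification: each token of B is paired with its most
    similar token in A, the r most similar pairs are averaged into A, and
    the remaining tokens of B are kept unchanged.
    """
    matches = []
    for j, tb in enumerate(tokens_b):
        i, sim = max(
            ((i, cosine(ta, tb)) for i, ta in enumerate(tokens_a)),
            key=lambda pair: pair[1],
        )
        matches.append((sim, j, i))
    matches.sort(reverse=True)
    merged_from_b = {j: i for _, j, i in matches[:r]}

    # Running sums and counts so that several B tokens can merge into one A token.
    sums = [list(t) for t in tokens_a]
    counts = [1] * len(tokens_a)
    for j, i in merged_from_b.items():
        sums[i] = [s + x for s, x in zip(sums[i], tokens_b[j])]
        counts[i] += 1

    merged_a = [[s / c for s in row] for row, c in zip(sums, counts)]
    kept_b = [list(t) for j, t in enumerate(tokens_b) if j not in merged_from_b]
    return merged_a + kept_b


clip_a = [[1.0, 0.0], [0.0, 1.0]]
clip_b = [[1.0, 0.1], [-1.0, -1.0]]
merged = merge_adjacent_clips(clip_a, clip_b, r=1)
print(len(clip_a) + len(clip_b), "->", len(merged))
```

Each merged pair removes one token from the combined sequence, so the token count shrinks by `r` per merge step while near-duplicate content from consecutive clips is averaged rather than discarded.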
55.1 LG Mar 25 Code
CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control
Yifeng Zhang, Harsh Goel, Peizhuo Li et al.
Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents' capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight
LG Apr 20, 2023
SocialLight: Distributed Cooperation Learning towards Network-Wide Traffic Signal Control
Harsh Goel, Yifeng Zhang, Mehul Damani et al.
Many recent works have turned to multi-agent reinforcement learning (MARL) for adaptive traffic signal control to optimize the travel time of vehicles over large urban networks. However, achieving effective and scalable cooperation among junctions (agents) remains an open challenge, as existing methods often rely on extensive, non-generalizable reward shaping or on non-scalable centralized learning. To address these problems, we propose a new MARL method for traffic signal control, SocialLight, which learns cooperative traffic control policies by distributedly estimating the individual marginal contribution of agents on their local neighborhood. SocialLight relies on the Asynchronous Actor Critic (A3C) framework, and makes learning scalable by learning a locally-centralized critic conditioned over the states and actions of neighboring agents, used by agents to estimate individual contributions by counterfactual reasoning. We further introduce important modifications to the advantage calculation that help stabilize policy updates. These modifications decouple the impact of the neighbors' actions on the computed advantages, thereby reducing the variance in the gradient updates. We benchmark our trained network against state-of-the-art traffic signal control methods on standard benchmarks in two traffic simulators, SUMO and CityFlow. Our results show that SocialLight exhibits improved scalability to larger road networks and better performance across usual traffic metrics.
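The counterfactual reasoning described above can be illustrated with a COMA-style advantage, in which an agent's chosen action is compared against the policy-weighted average of what the critic predicts for its other actions while the neighbors' actions stay fixed. This is a generic sketch under that assumption, not SocialLight's exact advantage formula; the function name and inputs are hypothetical.

```python
def counterfactual_advantage(q_values, policy, taken_action):
    """COMA-style counterfactual advantage for a single agent.

    q_values[a] is the (locally centralized) critic's estimate if this agent
    had taken action a while its neighbors' actions are held fixed.
    policy[a] is the agent's probability of choosing action a.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken_action] - baseline


# Two signal phases, uniform policy, agent chose phase 1.
print(counterfactual_advantage([1.0, 3.0], [0.5, 0.5], taken_action=1))
```

Because the baseline marginalizes only the agent's own action, the resulting advantage isolates that agent's marginal contribution within its neighborhood.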
56.3 RO Mar 20
Legged Autonomous Surface Science In Analogue Environments (LASSIE): Making Every Robotic Step Count in Planetary Exploration
Cristina G. Wilson, Marion Nachon, Shipeng Liu et al.
The ability to efficiently and effectively explore planetary surfaces is currently limited by the capability of wheeled rovers to traverse challenging terrains, and by pre-programmed data acquisition plans with limited in-situ flexibility. In this paper, we present two novel approaches to address these limitations: (i) high-mobility legged robots that use direct surface interactions to collect rich information about the terrain's mechanics to guide exploration; (ii) human-inspired data acquisition algorithms that enable robots to reason about scientific hypotheses and adapt exploration priorities based on incoming ground-sensing measurements. We successfully verify our approach through lab work and field deployments in two planetary analog environments. The new capability for legged robots to measure soil mechanical properties is shown to enable effective traversal of challenging terrains. When coupled with other geologic properties (e.g., composition, thermal properties, and grain size data), soil mechanical measurements reveal key factors governing the formation and development of geologic environments. We then demonstrate how human-inspired algorithms turn terrain-sensing robots into teammates, by supporting more flexible and adaptive data collection decisions with human scientists. Our approach therefore enables exploration of a wider range of planetary environments and new substrate investigation opportunities through integrated human-robot systems that support maximum scientific return.
24.2 RO May 18
REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots
Jianghong Dong, Yifeng Zhang, Jiawei Wang et al.
Formation control of wheeled mobile robots (WMRs) has been extensively studied due to its broad applications in fields such as logistics transportation, environmental monitoring, and search and rescue. However, most existing works mainly focus on tracking predefined formations, which limits their adaptability to complex real-world environments. To address this, we propose REACT (Real-time Environment-Adaptive architecture for Continuous formation navigaTion), a hierarchical architecture integrating centralized formation generation and distributed formation maintenance. Specifically, our upper layer generates new environment-adaptive formations when necessary and uses our proposed TCF-R2T (Trajectory-Conflict-Free Robot-to-Target assignment) algorithm to compute conflict-free WMR-to-target assignments in polynomial time, enabling timely formation transitions without trajectory conflicts. At the lower layer, each WMR executes our developed JSTP (Joint Spatio-Temporal trajectory Planning) method to maintain the generated formation by simultaneously optimizing spatial positions and temporal durations, thereby enhancing coordination among WMRs and enabling continuous navigation in obstacle-rich environments and dynamic-obstacle scenarios. Both simulation and real-world experiments validate the effectiveness and practical applicability of REACT. Experimental videos are available on our project website: https://dongjh20.github.io/REACT-website.
32.2 AI May 18
Learning to Solve Compositional Geometry Routing Problems
Mingfeng Fan, Jianan Zhou, Jiaqi Cheng et al.
We study the Compositional Geometry Routing Problem (CGRP), a unified superclass of traditional routing problems that covers point-only, line-only, area-only, and arbitrary hybrid task geometries, providing a broad abstraction for real-world routing scenarios. Beyond standard point-based routing, CGRP with non-point tasks can be inherently asymmetric, tightly couples travel routes with the intrinsic path, and enlarges the action space with numerous feasible yet often irrelevant options, thereby posing significant challenges for both representation learning and decision-making. To address these challenges, we propose DiCon, a differential attention-assisted solver with contrastive learning, as a plug-and-play framework that tackles the problem from two complementary angles. First, we introduce a differential attention mechanism that actively suppresses the probability mass on less competitive candidate actions. Second, we design a double-level contrastive learning objective to promote robust global instance representations and regularize geometry-aware task representations. Extensive experiments demonstrate that DiCon achieves strong performance, broad versatility, and superior generalization across diverse CGRP instances with different compositions.
42.2 RO Mar 26
COIN: Collaborative Interaction-Aware Multi-Agent Reinforcement Learning for Self-Driving Systems
Yifeng Zhang, Jieming Chen, Tingguang Zhou et al.
Multi-Agent Self-Driving (MASD) systems provide an effective solution for coordinating autonomous vehicles to reduce congestion and enhance both safety and operational efficiency in future intelligent transportation systems. Multi-Agent Reinforcement Learning (MARL) has emerged as a promising approach for developing advanced end-to-end MASD systems. However, achieving efficient and safe collaboration in dynamic MASD systems remains a significant challenge in dense scenarios with complex agent interactions. To address this challenge, we propose a novel collaborative (CO-) interaction-aware (-IN) MARL framework, named COIN. Specifically, we develop a new counterfactual individual-global twin delayed deep deterministic policy gradient (CIG-TD3) algorithm, crafted in a "centralized training, decentralized execution" (CTDE) manner, which aims to jointly optimize the individual objectives (navigation) and the global objectives (collaboration) of agents. We further introduce a dual-level interaction-aware centralized critic architecture that captures both local pairwise interactions and global system-level dependencies, enabling more accurate global value estimation and improved credit assignment for collaborative policy learning. We conduct extensive simulation experiments in dense urban traffic environments, which demonstrate that COIN consistently outperforms other advanced baseline methods in both safety and efficiency across various system sizes. These results highlight its superiority in complex and dynamic MASD scenarios, as further validated through real-world robot demonstrations. Supplementary videos are available at https://marmotlab.github.io/COIN/
CV Mar 14, 2025 Code
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen, Guoqiang Gong, Tao He et al.
Video Large Language Models have demonstrated strong video understanding capabilities, yet their practical deployment is hindered by substantial inference costs caused by redundant video tokens. Existing pruning techniques fail to fully exploit the spatiotemporal redundancy inherent in video data. To bridge this gap, we perform a systematic analysis of video redundancy from two perspectives: temporal context and visual context. Leveraging these insights, we propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. Specifically, FastVID dynamically partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity. Extensive evaluations show that FastVID achieves state-of-the-art performance across various short- and long-video benchmarks on leading Video LLMs, including LLaVA-OneVision and LLaVA-Video. Notably, on LLaVA-OneVision-7B, FastVID effectively prunes 90.3% of video tokens, reduces FLOPs to 8.3%, and accelerates the prefilling stage by 7.1×, while maintaining 98.0% of the original accuracy. The code is available at https://github.com/LunarShen/FastVID.
53.7 RO Mar 25
LATS: Large Language Model Assisted Teacher-Student Framework for Multi-Agent Reinforcement Learning in Traffic Signal Control
Yifeng Zhang, Peizhuo Li, Tingguang Zhou et al.
Adaptive Traffic Signal Control (ATSC) aims to optimize traffic flow and minimize delays by adjusting traffic lights in real time. Recent advances in Multi-agent Reinforcement Learning (MARL) have shown promise for ATSC, yet existing approaches still suffer from limited representational capacity, often leading to suboptimal performance and poor generalization in complex and dynamic traffic environments. On the other hand, Large Language Models (LLMs) excel at semantic representation, reasoning, and analysis, yet their propensity for hallucination and slow inference speeds often hinder their direct application to decision-making tasks. To address these challenges, we propose a novel learning paradigm named LATS that integrates LLMs and MARL, leveraging the former's strong prior knowledge and inductive abilities to enhance the latter's decision-making process. Specifically, we introduce a plug-and-play teacher-student learning module, where a trained embedding LLM serves as a teacher to generate rich semantic features that capture each intersection's topology structures and traffic dynamics. A much simpler (student) neural network then learns to emulate these features through knowledge distillation in the latent space, enabling the final model to operate independently from the LLM for downstream use in the RL decision-making process. This integration significantly enhances the overall model's representational capacity across diverse traffic scenarios, thus leading to more efficient and generalizable control strategies. Extensive experiments across diverse traffic datasets empirically demonstrate that our method enhances the representation learning capability of RL models, thereby leading to improved overall performance and generalization over both traditional RL and LLM-only approaches. [...]
20.1 RO Mar 26
CROSS: A Mixture-of-Experts Reinforcement Learning Framework for Generalizable Large-Scale Traffic Signal Control
Xibei Chen, Yifeng Zhang, Yuxiang Xiao et al.
Recent advances in robotics, automation, and artificial intelligence have enabled urban traffic systems to operate with increasing autonomy towards future smart cities, powered in part by the development of adaptive traffic signal control (ATSC), which dynamically optimizes signal phases to mitigate congestion and optimize traffic. However, achieving effective and generalizable large-scale ATSC remains a significant challenge due to the diverse intersection topologies and highly dynamic, complex traffic demand patterns across the network. Existing RL-based methods typically use a single shared policy for all scenarios, whose limited representational capacity makes it difficult to capture diverse traffic dynamics and generalize to unseen environments. To address these challenges, we propose CROSS, a novel Mixture-of-Experts (MoE)-based decentralized RL framework for generalizable ATSC. We first introduce a Predictive Contrastive Clustering (PCC) module that forecasts short-term state transitions to identify latent traffic patterns, followed by clustering and contrastive learning to enhance pattern-level representation. We further design a Scenario-Adaptive MoE module that augments a shared policy with multiple experts, thus enabling adaptive specialization and more flexible scenario-specific strategies. We conduct extensive experiments in the SUMO simulator on both synthetic and real-world traffic datasets. Compared with state-of-the-art baselines, CROSS achieves superior performance and generalization through improved representation of diverse traffic scenarios.
41.1 RO Mar 19
CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem
Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan et al.
Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.
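The two competing objectives named above, total travel cost and makespan, and the role of a preference vector can be illustrated with a small sketch. The weighted-sum scalarization, the shared depot, and the function names are assumptions for illustration only; the abstract does not state how CAMO combines the objectives.

```python
import math


def tour_length(depot, targets):
    """Length of a closed tour: depot -> targets in order -> depot."""
    stops = [depot, *targets, depot]
    return sum(math.dist(p, q) for p, q in zip(stops, stops[1:]))


def objectives(depot, tours):
    """Return (total travel cost, makespan) for one tour per agent."""
    lengths = [tour_length(depot, tour) for tour in tours]
    return sum(lengths), max(lengths)


def scalarize(objs, preference):
    """Weighted-sum scalarization of the objective vector (assumed form)."""
    return sum(w * f for w, f in zip(preference, objs))


depot = (0.0, 0.0)
tours = [[(3.0, 0.0)], [(0.0, 4.0)]]  # two agents, one target each
objs = objectives(depot, tours)
print(objs, scalarize(objs, (0.5, 0.5)))
```

Sweeping the preference vector from `(1.0, 0.0)` to `(0.0, 1.0)` and solving for each setting is one common way to trace an approximation of the Pareto front.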
CV Jun 10, 2025 Code
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
Leqi Shen, Guoqiang Gong, Tianxiang Hao et al.
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
CV Mar 14, 2025 Code
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
Leqi Shen, Tao He, Guoqiang Gong et al.
Training-free video large language models (LLMs) leverage pretrained Image LLMs to process video content without the need for further training. A key challenge in such approaches is the difficulty of retaining essential visual and temporal information, constrained by the token limits in Image LLMs. To address this, we propose a two-stage method for selecting query-relevant tokens based on the LLM attention scores: compressing the video sequence and then expanding the sequence. However, during the compression stage, Image LLMs often exhibit a positional attention bias in video sequences, where attention is overly concentrated on later frames, causing early-frame information to be underutilized. To alleviate this attention bias during sequence compression, we propose Gridded Attention Pooling for preserving spatiotemporal structure. Additionally, we introduce Visual Summarization Tail to effectively utilize this bias, facilitating overall video understanding during sequence expansion. In this way, our method effectively Mitigates and Leverages attention Bias (LLaVA-MLB), enabling the frozen Image LLM for detailed video understanding. Experiments on several benchmarks demonstrate that our approach outperforms state-of-the-art methods, achieving superior performance in both efficiency and accuracy. Our code will be released.
29.8 RO Apr 2
From Impact to Insight: Dynamics-Aware Proprioceptive Terrain Sensing on Granular Media
Yifeng Zhang, Yue Wu, Jake Futterman et al.
Robots that traverse natural terrain must interpret contact forces generated under highly dynamic conditions. However, most terrain characterization approaches rely on quasi-static assumptions that neglect velocity- and acceleration-dependent effects arising during impact and rapid stance transitions. In this work, we investigate granular terrain interaction during high-speed hopping and develop a physics-based framework for dynamic terrain characterization using proprioceptive sensing alone. Through controlled hopping experiments with systematically varied impact speed and leg compliance, our measurements reveal that quasi-static assumptions lead to large discrepancies in granular terrain property estimation during high-speed hopping, particularly upon touchdown and controller-induced stiffness transitions. Velocity-dependent drag alone cannot explain these discrepancies. Instead, acceleration-dependent added-mass effects, associated with grain entrainment beneath the foot, dominate transient force responses. We integrate this force decomposition with a momentum-observer-based estimator that compensates for rigid-body inertia and gravity, and introduce an acceleration-aware weighted regression to account for increased force variance during high-acceleration events. Together, these methods enable consistent recovery of granular stiffness parameters across locomotion conditions, closely matching linear-actuator ground truth. Our results demonstrate that accurate terrain inference during high-speed locomotion requires explicit treatment of acceleration-dependent granular effects, and provide a foundation for robots to characterize complex deformable terrain during dynamic exploration of terrestrial and planetary environments.
CV May 11, 2024
Solving Energy-Independent Density for CT Metal Artifact Reduction via Neural Representation
Qing Wu, Xu Guo, Lixuan Chen et al.
X-ray CT often suffers from shadowing and streaking artifacts in the presence of metallic materials, which severely degrade imaging quality. Physically, the linear attenuation coefficients (LACs) of metals vary significantly with X-ray energy, causing a nonlinear beam hardening effect (BHE) in CT measurements. Reconstructing CT images from metal-corrupted measurements consequently becomes a challenging nonlinear inverse problem. Existing state-of-the-art (SOTA) metal artifact reduction (MAR) algorithms rely on supervised learning with numerous paired CT samples. While promising, these supervised methods often assume that the unknown LACs are energy-independent, ignoring the energy-induced BHE, which results in limited generalization. Moreover, the requirement for large datasets also limits their applications in real-world scenarios. In this work, we propose Density neural representation (Diner), a novel unsupervised MAR method. Our key innovation lies in formulating MAR as an energy-independent density reconstruction problem that strictly adheres to the photon-tissue absorption physical model. This model is inherently nonlinear and complex, making it a rarely considered approach in inverse imaging problems. By introducing the water-equivalent tissues approximation and a new polychromatic model to characterize the nonlinear CT acquisition process, we directly learn the neural representation of the density map from raw measurements without using external training data. This energy-independent density reconstruction framework fundamentally resolves the nonlinear BHE, enabling superior MAR performance across a wide range of scanning scenarios. Extensive experiments on both simulated and real-world datasets demonstrate the superiority of our unsupervised Diner over popular supervised methods in terms of MAR performance and robustness.
RO Mar 8
Inverse Resistive Force Theory (I-RFT): Learning granular properties through robot-terrain physical interactions
Shipeng Liu, Feng Xue, Yifeng Zhang et al.
For robots to navigate safely and efficiently on soft, granular terrains, it is crucial to gather information about the terrain's mechanical properties, which directly affect locomotion performance. Recent research has developed robotic legs that can accurately sense ground reaction forces during locomotion. However, existing tests of granular property estimation often rely on specific foot trajectories, such as vertical penetration or horizontal shear, limiting their applicability during natural locomotion. To address this limitation, we introduce a physics-informed machine learning framework, Inverse Resistive Force Theory (I-RFT), which integrates the Granular Resistive Force Theory model with Gaussian Processes to infer terrain properties from proprioceptively measured contact forces under arbitrary gait trajectories. By embedding the granular force model within the learning process, I-RFT preserves physical consistency while enabling generalization across diverse motion primitives. Experimental results demonstrate that I-RFT accurately estimates terrain properties across multiple gait trajectories and toe shapes. Moreover, we show that the quantified uncertainty over the terrain resistance stress map could enable robots to optimize foot design and gait trajectories for efficient information gathering. This approach establishes a new foundation for data-efficient characterization of complex granular environments and opens new avenues for locomotion strategies that actively adapt gait for autonomous terrain exploration.
CV Nov 25, 2025
DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination
Mingyang Ou, Haojin Li, Yifeng Zhang et al.
Self-supervised monocular depth estimation serves as a key task in the development of endoscopic navigation systems. However, performance degradation persists due to uneven illumination inherent in endoscopic images, particularly in low-intensity regions. Existing low-light enhancement techniques fail to effectively guide the depth network. Furthermore, solutions from other fields, like autonomous driving, require well-lit images, making them unsuitable and increasing data collection burdens. To this end, we present DeLight-Mono, a novel self-supervised monocular depth estimation framework with illumination decoupling. Specifically, endoscopic images are represented by a designed illumination-reflectance-depth model, and are decomposed with auxiliary networks. Moreover, a self-supervised joint-optimizing framework with novel losses leveraging the decoupled components is proposed to mitigate the effects of uneven illumination on depth estimation. The effectiveness of the proposed methods was rigorously verified through extensive comparisons and an ablation study performed on two public datasets.
LG Oct 3, 2025
A Unified Deep Reinforcement Learning Approach for Close Enough Traveling Salesman Problem
Mingfeng Fan, Jiaqi Cheng, Yaoxin Wu et al.
In recent years, deep reinforcement learning (DRL) has gained traction for solving the NP-hard traveling salesman problem (TSP). However, limited attention has been given to the close-enough TSP (CETSP), primarily due to the challenge introduced by its neighborhood-based visitation criterion, wherein a node is considered visited if the agent enters a compact neighborhood around it. In this work, we formulate a Markov decision process (MDP) for CETSP using a discretization scheme and propose a novel unified dual-decoder DRL (UD3RL) framework that separates decision-making into node selection and waypoint determination. Specifically, an adapted encoder is employed for effective feature extraction, followed by a node-decoder and a loc-decoder to handle the two sub-tasks, respectively. A k-nearest neighbors subgraph interaction strategy is further introduced to enhance spatial reasoning during location decoding. Furthermore, we customize the REINFORCE algorithm to train UD3RL as a unified model capable of generalizing across different problem sizes and varying neighborhood radius types (i.e., constant and random radii). Experimental results show that UD3RL outperforms conventional methods in both solution quality and runtime, while exhibiting strong generalization across problem scales, spatial distributions, and radius ranges, as well as robustness to dynamic environments.
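The neighborhood-based visitation criterion and the idea of discretizing each neighborhood into candidate waypoints can be sketched as follows. The disk-shaped neighborhood, the uniform sampling of `k` boundary points, and all function names are illustrative assumptions; the abstract does not specify the discretization scheme used by UD3RL.

```python
import math


def is_visited(agent_xy, node_xy, radius, tol=1e-9):
    """A node counts as visited once the agent is inside its disk neighborhood."""
    return math.dist(agent_xy, node_xy) <= radius + tol


def boundary_waypoints(node_xy, radius, k=8):
    """Discretize the neighborhood boundary into k evenly spaced candidate waypoints."""
    cx, cy = node_xy
    return [
        (cx + radius * math.cos(2 * math.pi * i / k),
         cy + radius * math.sin(2 * math.pi * i / k))
        for i in range(k)
    ]


def nearest_waypoint(agent_xy, node_xy, radius, k=8):
    """Pick the candidate waypoint closest to the agent's current position."""
    return min(boundary_waypoints(node_xy, radius, k),
               key=lambda p: math.dist(agent_xy, p))


agent, node, r = (5.0, 0.0), (0.0, 0.0), 1.0
wp = nearest_waypoint(agent, node, r)
print(wp, is_visited(wp, node, r))
```

Splitting the decision this way mirrors the two sub-tasks described above: first choose which node to serve, then choose where inside its neighborhood to go.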
LGJul 9, 2025
The Primacy of Magnitude in Low-Rank AdaptationZicheng Zhang, Haoran Li, Yifeng Zhang et al.
Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive "Noise & Zeros" scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven "Basis & Basis" initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) Magnitude of weight updates determines convergence. We prove low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We demystify that the presumed knowledge-driven benefit of the spectral component essentially arises from the boost in the weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.
CVJun 22, 2025
Damba-ST: Domain-Adaptive Mamba for Efficient Urban Spatio-Temporal PredictionRui An, Yifeng Zhang, Ziran Liang et al.
Training urban spatio-temporal foundation models that generalize well across diverse regions and cities is critical for deploying urban services in unseen or data-scarce regions. Recent studies have typically focused on fusing cross-domain spatio-temporal data to train unified Transformer-based models. However, these models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment. Inspired by the efficiency of Mamba, a state space model with linear time complexity, we explore its potential for efficient urban spatio-temporal prediction. However, directly applying Mamba as a spatio-temporal backbone leads to negative transfer and severe performance degradation. This is primarily due to spatio-temporal heterogeneity and the recursive mechanism of Mamba's hidden state updates, which limit cross-domain generalization. To overcome these challenges, we propose Damba-ST, a novel domain-adaptive Mamba-based model for efficient urban spatio-temporal prediction. Damba-ST retains Mamba's linear complexity advantage while significantly enhancing its adaptability to heterogeneous domains. Specifically, we introduce two core innovations: (1) a domain-adaptive state space model that partitions the latent representation space into a shared subspace for learning cross-domain commonalities and independent, domain-specific subspaces for capturing intra-domain discriminative features; (2) three distinct Domain Adapters, which serve as domain-aware proxies to bridge disparate domain distributions and facilitate the alignment of cross-domain commonalities. Extensive experiments demonstrate the generalization and efficiency of Damba-ST. It achieves state-of-the-art performance on prediction tasks and demonstrates strong zero-shot generalization, enabling seamless deployment in new urban environments without extensive retraining or fine-tuning.
AIJun 10, 2025
Preference-Driven Multi-Objective Combinatorial Optimization with Conditional ComputationMingfeng Fan, Jianan Zhou, Yifeng Zhang et al.
Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance. To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values. Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures. Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions. We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs. Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization.
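As an illustration of decomposition by weight vectors and of pairwise preferences between solutions, here is a minimal sketch. It assumes two minimization objectives and a weighted-sum scalarization, which the abstract does not specify; `weight_vectors`, `scalarize`, and `preference_pair` are hypothetical names, not part of POCCO.

```python
def weight_vectors(n):
    """Return n evenly spaced weight vectors for two objectives (n >= 2)."""
    return [(i / (n - 1), 1 - i / (n - 1)) for i in range(n)]

def scalarize(objectives, weights):
    """Weighted-sum cost of one solution (assumed scalarization)."""
    return sum(w * f for w, f in zip(weights, objectives))

def preference_pair(obj_a, obj_b, weights):
    """Return (winner, loser): the lower scalarized cost wins."""
    if scalarize(obj_a, weights) <= scalarize(obj_b, weights):
        return obj_a, obj_b
    return obj_b, obj_a

# Two candidate solutions, each described by its two objective values.
sol_a = (10.0, 2.0)
sol_b = (4.0, 6.0)

# Each weight vector defines one subproblem; the preferred solution
# can change from one subproblem to the next.
pairs = [preference_pair(sol_a, sol_b, w) for w in weight_vectors(3)]
```

A preference-driven learner would consume only the (winner, loser) ordering from each pair, not the scalarized cost values themselves.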
LGMar 14, 2025
Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal ControlYifeng Zhang, Yilin Liu, Ping Gong et al.
Adaptive traffic signal control (ATSC) is crucial in reducing congestion, maximizing throughput, and improving mobility in rapidly growing urban areas. Recent advancements in parameter-sharing multi-agent reinforcement learning (MARL) have greatly enhanced the scalable and adaptive optimization of complex, dynamic flows in large-scale homogeneous networks. However, the inherent heterogeneity of real-world traffic networks, with their varied intersection topologies and interaction dynamics, poses substantial challenges to achieving scalable and effective ATSC across different traffic scenarios. To address these challenges, we present Unicorn, a universal and collaborative MARL framework designed for efficient and adaptable network-wide ATSC. Specifically, we first propose a unified approach to map the states and actions of intersections with varying topologies into a common structure based on traffic movements. Next, we design a Universal Traffic Representation (UTR) module with a decoder-only network for general feature extraction, enhancing the model's adaptability to diverse traffic scenarios. Additionally, we incorporate an Intersection Specifics Representation (ISR) module, designed to identify key latent vectors that represent the unique intersection's topology and traffic dynamics through variational inference techniques. To further refine these latent representations, we employ a contrastive learning approach in a self-supervised manner, which enables better differentiation of intersection-specific features. Moreover, we integrate the state-action dependencies of neighboring agents into policy optimization, which effectively captures dynamic agent interactions and facilitates efficient regional collaboration. Our results show that Unicorn outperforms other methods across various evaluation metrics, highlighting its potential in complex, dynamic traffic networks.
LGJan 6, 2025
Detecting Defective Wafers Via Modular NetworksYifeng Zhang, Bryan Baker, Shi Chen et al.
The growing availability of sensors within semiconductor manufacturing processes makes it feasible to detect defective wafers with data-driven models. Without directly measuring the quality of semiconductor devices, they capture the modalities between diverse sensor readings and can be used to predict key quality indicators (KQI, e.g., roughness, resistance) to detect faulty products, significantly reducing the capital and human cost in maintaining physical metrology steps. Nevertheless, existing models pay little attention to the correlations among different processes for diverse wafer products and commonly struggle with generalizability issues. To enable generic fault detection, in this work, we propose a modular network (MN) trained using time series stage-wise datasets that embodies the structure of the manufacturing process. It decomposes KQI prediction as a combination of stage modules to simulate compositional semiconductor manufacturing, universally enhancing faulty wafer detection among different wafer types and manufacturing processes. Extensive experiments demonstrate the usefulness of our approach, and shed light on how the compositional design provides an interpretable interface for more practical applications.
CVAug 9, 2021
Two-stream Convolutional Networks for Multi-frame Face Anti-spoofingZhuoyi Zhang, Cheng Jiang, Xiya Zhong et al.
Face anti-spoofing is an important task for protecting the security of face recognition. Most previous work either struggles to capture discriminative and generalizable features or relies on auxiliary information that is unavailable in most industrial products. Inspired by work on video classification, we propose an efficient two-stream model to capture the key differences between live and spoof faces, which takes multiple frames and RGB differences as its respective inputs. Feature pyramid modules with two opposite fusion directions and pyramid pooling modules are applied to enhance feature representation. We evaluate the proposed method on the SiW, Oulu-NPU, CASIA-MFSD and Replay-Attack datasets. The results show that our model achieves state-of-the-art results on most of the datasets' protocols with far fewer parameters.
CVJul 27, 2020
Saliency Prediction with External KnowledgeYifeng Zhang, Ming Jiang, Qi Zhao
The last decades have seen great progress in saliency prediction, with the success of deep neural networks that are able to encode high-level semantics. Yet, while humans have the innate capability to leverage their knowledge to decide where to look (e.g., people pay more attention to familiar faces such as celebrities), saliency prediction models have only been trained with large eye-tracking datasets. This work proposes to bridge this gap by explicitly incorporating external knowledge into saliency models, as humans do. We develop networks that learn to highlight regions by incorporating prior knowledge of semantic relationships, be it general or domain-specific, depending on the task of interest. At the core of the method is a new Graph Semantic Saliency Network (GraSSNet) that constructs a graph that encodes semantic relationships learned from external knowledge. A Spatial Graph Attention Network is then developed to update saliency features based on the learned graph. Experiments show that the proposed model learns to predict saliency from the external knowledge and outperforms the state-of-the-art on four saliency benchmarks.
NEJan 18, 2019
DA-LSTM: A Long Short-Term Memory with Depth Adaptive to Non-uniform Information Flow in Sequential DataYifeng Zhang, Ka-Ho Chow, S. -H. Gary Chan
Much sequential data exhibits highly non-uniform information distribution. This cannot be correctly modeled by traditional Long Short-Term Memory (LSTM). To address that, recent works have extended LSTM by adding more activations between adjacent inputs. However, these approaches often use a fixed depth, set by the step with the most information content. This one-size-fits-all worst-case approach is not satisfactory, because when little information is distributed to some steps, shallow structures can achieve faster convergence and consume fewer computational resources. In this paper, we develop a Depth-Adaptive Long Short-Term Memory (DA-LSTM) architecture, which can dynamically adjust the structure depending on information distribution without prior knowledge. Experimental results on real-world datasets show that DA-LSTM consumes far fewer computational resources and reduces convergence time by 41.78% and 46.01% compared with Stacked LSTM and Deep Transition LSTM, respectively.
CVNov 28, 2018
Multi-granularity Generator for Temporal Action ProposalYuan Liu, Lin Ma, Yifeng Zhang et al.
Temporal action proposal generation is an important task, aiming to localize the video segments containing human actions in an untrimmed video. In this paper, we propose a multi-granularity generator (MGG) to perform temporal action proposal from different granularity perspectives, relying on video visual features equipped with position embedding information. First, we propose to use a bilinear matching model to exploit the rich local information within the video sequence. Afterwards, two components, namely the segment proposal producer (SPP) and the frame actionness producer (FAP), are combined to perform the task of temporal action proposal at two distinct granularities. SPP considers the whole video in the form of a feature pyramid and generates segment proposals from a coarse perspective, while FAP carries out a finer actionness evaluation for each video frame. Our proposed MGG can be trained in an end-to-end fashion. By temporally adjusting the segment proposals with fine-grained frame actionness information, MGG achieves superior performance over state-of-the-art methods on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, we employ existing action classifiers to classify the proposals generated by MGG, leading to significant improvements over the competing methods on the video detection task.
LGNov 20, 2018
Representation Learning of Pedestrian Trajectories Using Actor-Critic Sequence-to-Sequence AutoencoderKa-Ho Chow, Anish Hiranandani, Yifeng Zhang et al.
Representation learning of pedestrian trajectories transforms variable-length timestamp-coordinate tuples of a trajectory into a fixed-length vector representation that summarizes spatiotemporal characteristics. It is a crucial technique to connect feature-based data mining with trajectory data. Trajectory representation is a challenging problem, because both environmental constraints (e.g., wall partitions) and temporal user dynamics should be meticulously considered and accounted for. Furthermore, traditional sequence-to-sequence autoencoders using maximum log-likelihood often require a dataset covering all the possible spatiotemporal characteristics to perform well. This is infeasible or impractical in reality. We propose TREP, a practical pedestrian trajectory representation learning algorithm which captures the environmental constraints and the pedestrian dynamics without the need for any training dataset. By formulating a sequence-to-sequence autoencoder with a spatial-aware objective function under the paradigm of actor-critic reinforcement learning, TREP intelligently encodes spatiotemporal characteristics of trajectories with the capability of handling diverse trajectory patterns. Extensive experiments on both synthetic and real datasets validate the high fidelity of TREP in representing trajectories.
CVJul 18, 2018
An Attention-Based Approach for Single Image Super ResolutionYuan Liu, Yuancheng Wang, Nan Li et al.
The main challenge of single image super resolution (SISR) is the recovery of high frequency details such as tiny textures. However, most state-of-the-art methods lack specific modules to identify high frequency areas, causing the output image to be blurred. We propose an attention-based approach to discriminate between texture areas and smooth areas. After the positions of high frequency details are located, high frequency compensation is carried out. This approach can be incorporated into previously proposed SISR networks. By providing high frequency enhancement, better performance and visual effect are achieved. We also propose our own SISR network composed of DenseRes blocks. The block provides an effective way to combine low level and high level features. Extensive benchmark evaluation shows that our proposed method achieves significant improvement over state-of-the-art works in SISR.