CVJul 13, 2024Code
IFTR: An Instance-Level Fusion Transformer for Visual Collaborative PerceptionShaohong Wang, Lu Bin, Xinyu Xiao et al.
Multi-agent collaborative perception has emerged as a widely recognized technology in the field of autonomous driving in recent years. However, current collaborative perception predominantly relies on LiDAR point clouds, with significantly less attention given to methods using camera images. This severely impedes the development of budget-constrained collaborative systems and the exploitation of the advantages offered by the camera modality. This work proposes an instance-level fusion transformer for visual collaborative perception (IFTR), which enhances the detection performance of camera-only collaborative perception systems through the communication and sharing of visual features. To capture the visual information from multiple agents, we design an instance feature aggregation that interacts with the visual features of individual agents using predefined grid-shaped bird eye view (BEV) queries, generating more comprehensive and accurate BEV features. Additionally, we devise a cross-domain query adaptation as a heuristic to fuse 2D priors, implicitly encoding the candidate positions of targets. Furthermore, IFTR optimizes communication efficiency by sending instance-level features, achieving an optimal performance-bandwidth trade-off. We evaluate the proposed IFTR on a real dataset, DAIR-V2X, and two simulated datasets, OPV2V and V2XSet, achieving performance improvements of 57.96%, 9.23% and 12.99% in AP@70 metrics compared to the previous SOTAs, respectively. Extensive experiments demonstrate the superiority of IFTR and the effectiveness of its key components. The code is available at https://github.com/wangsh0111/IFTR.
LGDec 4, 2025
SHAP-Guided Kernel Actor-Critic for Explainable Reinforcement LearningNa Li, Hangguan Shan, Wei Ni et al.
Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose RKHS-SHAP-based Advanced Actor-Critic (RSA2C), an attribution-aware, kernelized, two-timescale AC algorithm, including Actor, Value Critic, and Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS-SHAP (kernel mean embedding for on-manifold and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. We derive a global, non-asymptotic convergence bound under state perturbations, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three continuous-control environments show that RSA2C achieves efficiency, stability, and interpretability.
CVMar 24, 2023
Adaptive Base-class Suppression and Prior Guidance Network for One-Shot Object DetectionWenwen Zhang, Xinyu Xiao, Hangguan Shan et al.
One-shot object detection (OSOD) aims to detect all object instances towards the given category specified by a query image. Most existing studies in OSOD endeavor to explore effective cross-image correlation and alleviate the semantic feature misalignment, however, ignoring the phenomenon of the model bias towards the base classes and the generalization degradation on the novel classes. Observing this, we propose a novel framework, namely Base-class Suppression and Prior Guidance (BSPG) network to overcome the problem. Specifically, the objects of base categories can be explicitly detected by a base-class predictor and adaptively eliminated by our base-class suppression module. Moreover, a prior guidance module is designed to calculate the correlation of high-level features in a non-parametric manner, producing a class-agnostic prior map to provide the target features with rich semantic cues and guide the subsequent detection process. Equipped with the proposed two modules, we endow the model with a strong discriminative ability to distinguish the target objects from distractors belonging to the base classes. Extensive experiments show that our method outperforms the previous techniques by a large margin and achieves new state-of-the-art performance under various evaluation settings.
CVOct 9, 2025Code
RayFusion: Ray Fusion Enhanced Collaborative Visual PerceptionShaohong Wang, Bin Lu, Xinyu Xiao et al.
Collaborative visual perception methods have gained widespread attention in the autonomous driving community in recent years due to their ability to address sensor limitation problems. However, the absence of explicit depth information often makes it difficult for camera-based perception systems, e.g., 3D object detection, to generate accurate predictions. To alleviate the ambiguity in depth estimation, we propose RayFusion, a ray-based fusion method for collaborative visual perception. Using ray occupancy information from collaborators, RayFusion reduces redundancy and false positive predictions along camera rays, enhancing the detection performance of purely camera-based collaborative perception systems. Comprehensive experiments show that our method consistently outperforms existing state-of-the-art models, substantially advancing the performance of collaborative visual perception. The code is available at https://github.com/wangsh0111/RayFusion.
LGDec 16, 2025
Multivariate Time Series Forecasting with Hybrid Euclidean-SPD Manifold Graph Neural NetworksYong Fang, Na Li, Hangguan Shan et al.
Multivariate Time Series (MTS) forecasting plays a vital role in various real-world applications, such as traffic management and predictive maintenance. Existing approaches typically model MTS data in either Euclidean or Riemannian space, limiting their ability to capture the diverse geometric structures and complex spatio-temporal dependencies inherent in real-world data. To overcome this limitation, we propose the Hybrid Symmetric Positive-Definite Manifold Graph Neural Network (HSMGNN), a novel graph neural network-based model that captures data geometry within a hybrid Euclidean-Riemannian framework. To the best of our knowledge, this is the first work to leverage hybrid geometric representations for MTS forecasting, enabling expressive and comprehensive modeling of geometric properties. Specifically, we introduce a Submanifold-Cross-Segment (SCS) embedding to project input MTS into both Euclidean and Riemannian spaces, thereby capturing spatio-temporal variations across distinct geometric domains. To alleviate the high computational cost of Riemannian distance, we further design an Adaptive-Distance-Bank (ADB) layer with a trainable memory mechanism. Finally, a Fusion Graph Convolutional Network (FGCN) is devised to integrate features from the dual spaces via a learnable fusion operator for accurate prediction. Experiments on three benchmark datasets demonstrate that HSMGNN achieves up to a 13.8 percent improvement over state-of-the-art baselines in forecasting accuracy.
LGMar 20, 2024
Federated Learning Resilient to Byzantine Attacks and Data HeterogeneityShiyuan Zuo, Xingrun Yan, Rongfei Fan et al.
This paper addresses federated learning (FL) in the context of malicious Byzantine attacks and data heterogeneity. We introduce a novel Robust Average Gradient Algorithm (RAGA), which uses the geometric median for aggregation and {allows flexible round number for local updates.} Unlike most existing resilient approaches, which base their convergence analysis on strongly-convex loss functions or homogeneously distributed datasets, this work conducts convergence analysis for both strongly-convex and non-convex loss functions over heterogeneous datasets. The theoretical analysis indicates that as long as the fraction of the {data} from malicious users is less than half, RAGA can achieve convergence at a rate of $\mathcal{O}({1}/{T^{2/3- δ}})$ for non-convex loss functions, where $T$ is the iteration number and $δ\in (0, 2/3)$. For strongly-convex loss functions, the convergence rate is linear. Furthermore, the stationary point or global optimal solution is shown to be attainable as data heterogeneity diminishes. Experimental results validate the robustness of RAGA against Byzantine attacks and demonstrate its superior convergence performance compared to baselines under varying intensities of Byzantine attacks on heterogeneous datasets.
LGNov 19, 2018
Three Dimensional Convolutional Neural Network Pruning with Regularization-Based MethodYuxin Zhang, Huan Wang, Yang Luo et al.
Despite enjoying extensive applications in video analysis, three-dimensional convolutional neural networks (3D CNNs)are restricted by their massive computation and storage consumption. To solve this problem, we propose a threedimensional regularization-based neural network pruning method to assign different regularization parameters to different weight groups based on their importance to the network. Further we analyze the redundancy and computation cost for each layer to determine the different pruning ratios. Experiments show that pruning based on our method can lead to 2x theoretical speedup with only 0.41% accuracy loss for 3DResNet18 and 3.28% accuracy loss for C3D. The proposed method performs favorably against other popular methods for model compression and acceleration.