LG · Jun 1 · Code
VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting
Xudong Zhang, Jierui Lei, Jiacheng Li et al.
Out-of-distribution (OOD) events in multivariate time series forecasting are rare but often dominate real-world risk, making average-case forecasting insufficient for reliable deployment. Under standard average-risk training on mixed ID/OOD distributions, optimization signals from rare OOD events can be overwhelmed by frequent in-distribution (ID) patterns, so strong benchmark accuracy may not translate into reliability under high-impact shifts. To address this issue, we propose VLBM (Variational Latent Basis Model), a theory-guided latent forecasting framework that separates stable dynamics from OOD-induced deviations. VLBM learns a shared latent basis that defines a low-rank subspace for stable ID dynamics, explicitly decomposes inputs into basis-subspace components and orthogonal residual components, and aligns a future-aware posterior with a future-blind prior so that test-time latent inference depends only on historical input. Across 12 benchmark tasks spanning transportation, weather, power systems, and other real-world domains, including newly constructed real-world OOD traffic datasets, VLBM achieves state-of-the-art OOD robustness and ID accuracy, with average MAE and MSE gains of 15.08% and 7.74% over the strongest baseline. On a synthetic simulation dataset, VLBM also consistently achieves the best performance and better tracks OOD pulse recovery. These results support latent structured forecasting as a principled route to robust prediction under mixed ID and OOD conditions. The code is available at https://github.com/leijieruilq/VLBM_OOD_forecast.
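The split into basis-subspace and orthogonal residual components mentioned in the abstract can be illustrated with a plain linear projection. This is only a minimal sketch under stated assumptions: the basis here is a fixed, hand-picked pair of vectors rather than a learned latent basis, and the helper names (`orthonormalize`, `decompose`) are hypothetical, not taken from the VLBM code.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: turn a list of vectors into an orthonormal basis of their span."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # drop vectors already spanned by earlier ones
            basis.append([wi / norm for wi in w])
    return basis

def decompose(x, basis):
    """Split x into its projection onto span(basis) and the orthogonal residual."""
    in_span = [0.0] * len(x)
    for q in basis:
        c = dot(x, q)
        in_span = [s + c * qi for s, qi in zip(in_span, q)]
    residual = [xi - si for xi, si in zip(x, in_span)]
    return in_span, residual

# Hypothetical rank-2 basis in a 4-dimensional input space (an assumption for illustration).
basis = orthonormalize([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
x = [3.0, 1.0, 2.0, 2.0]
in_span, residual = decompose(x, basis)
print(in_span)   # component inside the low-rank subspace
print(residual)  # component orthogonal to the subspace
```

In this toy setting the residual is whatever the low-rank basis cannot explain, which is the role the abstract assigns to OOD-induced deviations.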
MM · May 29
Dynamic Interaction-Aware and Causality-Disentangled Framework for Multimodal Sentiment Analysis
Guangyuan Dong, Ziwei Hong, Shenghao Liu et al.
Although Multimodal Sentiment Analysis (MSA) effectively leverages rich information from language, visual, and acoustic modalities, existing methods still face two core challenges: 1) static conflict suppression mechanisms fail to adapt to dynamic variations across samples, and 2) the inherent sentimental bias within the language modality, which can misguide learning from other modalities, remains entangled. To this end, we propose a Dynamic Multimodal Causal Disentanglement and Adaptive Fusion Framework (MCAF). Its cornerstones are the Multi-Granularity Causal Dynamic Router and a Conditional Diffusion Denoising Module. First, we introduce a causal intervention module based on the information bottleneck principle, which builds a Structural Causal Model to disentangle sentimental bias from language features, yielding a "de-confounded" language representation as a pure guiding signal. Second, we devise a Dynamic Multimodal Router that evaluates the interaction states (complementary, conflicting, or redundant) among visual, acoustic, and de-confounded language signals in real time across three levels (feature, temporal, and modality), then adaptively allocates weights and routes information flow for fine-grained regulation. Finally, a lightweight Conditional Diffusion Denoising Module performs iterative denoising on the fused joint representation to explicitly filter out residual irrelevant information, generating a robust hyper-modality representation. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MCAF sets a new state of the art on key classification metrics, achieving an Acc-2/F1 of 86.52%/86.51% on MOSI and 86.72%/86.65% on MOSEI, while remaining highly competitive on others. Comprehensive analyses and visualizations further validate its efficacy in dynamically perceiving interactions, disentangling bias, and enhancing interpretability.
CL · May 29 · Code
The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning
Xudong Zhang, Jian Yang, Shengkai Wang et al.
Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text-based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual-interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double-edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel -- incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text-based spatial representations in LLM-based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias.
IT · Nov 5, 2022 · Code
Quantization Adaptor for Bit-Level Deep Learning-Based Massive MIMO CSI Feedback
Xudong Zhang, Zhilin Lu, Rui Zeng et al.
In massive multiple-input multiple-output (MIMO) systems, the user equipment (UE) needs to feed the channel state information (CSI) back to the base station (BS) for subsequent beamforming. However, the large number of antennas in massive MIMO systems causes huge feedback overhead. Deep learning (DL) based methods can compress the CSI at the UE and recover it at the BS, which reduces the feedback cost significantly. The compressed CSI must nevertheless be quantized into bit streams for transmission. In this paper, we propose an adaptor-assisted quantization strategy for bit-level DL-based CSI feedback. First, we design a network-aided adaptor and an advanced training scheme to adaptively improve the quantization and reconstruction accuracy. Moreover, for easy practical deployment, we introduce expert knowledge of the data distribution and propose a pluggable and cost-free adaptor scheme. Experiments show that, compared with the state-of-the-art feedback quantization method, this adaptor-aided quantization strategy achieves better quantization accuracy and reconstruction performance with little or no additional cost. The open-source code is available at https://github.com/zhang-xd18/QCRNet.
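To make the "quantized into bit streams" step concrete, here is a minimal sketch of a plain uniform scalar quantizer. It is not the adaptor scheme proposed in the paper: the value range `[lo, hi]`, the bit width, and the function names are assumptions chosen for illustration.

```python
def quantize(values, bits, lo=0.0, hi=1.0):
    """Map each value in [lo, hi] to a fixed-width binary code and concatenate the codes."""
    levels = (1 << bits) - 1  # highest integer index, e.g. 3 for 2 bits
    chunks = []
    for v in values:
        v = min(max(v, lo), hi)                      # clip to the quantizer range
        scaled = (v - lo) / (hi - lo) * levels
        idx = int(scaled + 0.5)                      # round to the nearest level
        chunks.append(format(idx, "0{}b".format(bits)))
    return "".join(chunks)

def dequantize(bitstream, bits, lo=0.0, hi=1.0):
    """Invert quantize(): split the stream into codes and map them back to values."""
    levels = (1 << bits) - 1
    return [
        lo + int(bitstream[i:i + bits], 2) / levels * (hi - lo)
        for i in range(0, len(bitstream), bits)
    ]

# Hypothetical compressed codeword with entries already normalized to [0, 1].
codeword = [0.0, 0.4, 1.0]
stream = quantize(codeword, bits=2)
recovered = dequantize(stream, bits=2)
print(stream, recovered)
```

With a uniform quantizer the reconstruction error per entry is bounded by half a quantization step, which is the baseline loss that a learned adaptor would try to reduce.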
CV · Oct 18, 2022 · Code
A Real-Time Fusion Framework for Long-term Visual Localization
Yuchen Yang, Xudong Zhang, Shuang Gao et al.
Visual localization is a fundamental task that regresses 6-degree-of-freedom (6DoF) poses from image features to serve the high-precision localization requests of many robotics applications. Degenerate conditions such as motion blur, illumination changes, and environment variations pose great challenges to this task. Fusion with additional information, such as sequential information and Inertial Measurement Unit (IMU) inputs, can greatly help in such conditions. In this paper, we present an efficient client-server visual localization architecture that fuses global and local pose estimations to achieve promising precision and efficiency. We include additional geometry hints in the mapping and global pose regression modules to improve the measurement quality. A loosely coupled fusion policy is adopted to balance computational complexity and accuracy. We conduct evaluations on two typical open-source benchmarks, 4Seasons and OpenLORIS. Quantitative results show that our framework performs competitively with other state-of-the-art visual localization solutions.
LG · Apr 15 · Code
Scalable unsupervised feature selection via weight stability
Xudong Zhang, Renato Cordeiro de Amorim
Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted k-means++, a novel initialisation strategy for the Minkowski Weighted k-means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical analysis, demonstrating that, under explicit assumptions on noise features and cluster structure, relevant features are assigned consistently higher weights than noise features across a range of Minkowski exponents. Our software can be found at https://github.com/xzhang4-ops1/FSMWK.
AI · May 27
Global Policy-Space Response Oracles for Two-Player Zero-Sum Games
Junyu Zhang, Feihong Yang, Jian Wang et al.
The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration-selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
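For readers unfamiliar with the exploitability metric, the following minimal sketch computes it for a strategy pair in a small two-player zero-sum matrix game. It uses the common "sum of best-response gains" convention (some papers halve this value) and is not the Population Exploitability estimator from the paper; the payoff matrix and function name are illustrative assumptions.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """Exploitability of a strategy pair in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives its negative.
    Returns (row best-response value) - (column best-response value), which is 0
    exactly at a Nash equilibrium and positive otherwise.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Best the row player can do against the fixed column strategy.
    row_br = max(
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    )
    # Best the column player can do (it minimizes the row player's payoff).
    col_br = min(
        sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    )
    return row_br - col_br

# Rock-paper-scissors from the row player's point of view.
rps = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]
print(exploitability(rps, uniform, uniform))          # equilibrium profile
print(exploitability(rps, always_rock, always_rock))  # highly exploitable profile
```

A restricted population is good, in this sense, when the best strategy mixture it supports leaves little for a full-game best response to gain.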
IT · Oct 29, 2022
Better Lightweight Network for Free: Codeword Mimic Learning for Massive MIMO CSI feedback
Zhilin Lu, Xudong Zhang, Rui Zeng et al.
The channel state information (CSI) needs to be fed back from the user equipment (UE) to the base station (BS) in frequency division duplexing (FDD) multiple-input multiple-output (MIMO) systems. Recently, neural networks have been widely applied to compressed CSI feedback since the original overhead is too large for massive MIMO systems. Notably, lightweight feedback networks attract special attention due to their practicality for deployment. However, feedback accuracy is likely to be harmed by network compression. In this paper, a cost-free distillation technique named codeword mimic (CM) is proposed to train better feedback networks with a practical lightweight encoder. A mimic-explore training strategy with a special distillation scheduler is designed to enhance the CM learning. Experiments show that the proposed CM learning outperforms the previous state-of-the-art feedback distillation method, boosting the performance of the lightweight feedback network without any extra inference cost.
CR · Mar 4, 2022
Targeted Data Poisoning Attack on News Recommendation System by Content Perturbation
Xudong Zhang, Zan Wang, Jingke Zhao et al.
News recommendation systems (NRSs) have become a fundamental technology for many online news services. Meanwhile, several studies show that recommendation systems (RSs) are vulnerable to data poisoning attacks, in which attackers can mislead the system into behaving as they desire. A widely studied attack approach, injecting fake users, can be applied to an NRS when it is treated like other systems whose items are fixed. However, because each item (i.e., a news article) in an NRS is more informative, we propose a novel approach to poisoning the NRS: perturbing the contents of some browsed news to manipulate the rank of the target news. Intuitively, an attack is useless if it is highly likely to be caught, i.e., exposed. To address this, we introduce a notion of exposure risk and propose a novel problem of attacking a historical news dataset by means of perturbations, where the goal is to maximize the manipulation of the target news rank while keeping the risk of exposure under a given budget. We design a reinforcement learning framework, called TDP-CP, which contains a two-stage hierarchical model to reduce the search space. Influence estimation is also applied to save the time spent retraining the NRS for rewards. We test the performance of TDP-CP on three NRSs and on different target news. Our experiments show that TDP-CP can successfully increase the rank of the target news with a limited exposure budget.
IR · Feb 8, 2023
SimCGNN: Simple Contrastive Graph Neural Network for Session-based Recommendation
Yuan Cao, Xudong Zhang, Fan Zhang et al.
The session-based recommendation (SBR) problem, which focuses on next-item prediction for anonymous users, has received increasing attention from researchers. Existing graph-based SBR methods all lack the ability to differentiate between sessions with the same last item, and suffer from severe popularity bias. Inspired by recently emerging contrastive learning methods, this paper presents a Simple Contrastive Graph Neural Network for Session-based Recommendation (SimCGNN). In SimCGNN, we first obtain normalized session embeddings on constructed session graphs. We next construct positive and negative samples of the sessions by two forward propagations and a novel negative sample selection strategy, and then calculate the contrastive loss. Finally, session embeddings are used to make predictions. Extensive experiments conducted on two real-world datasets show that SimCGNN achieves a significant improvement over state-of-the-art methods.
CV · Oct 27, 2022
MMFL-Net: Multi-scale and Multi-granularity Feature Learning for Cross-domain Fashion Retrieval
Chen Bao, Xudong Zhang, Jiazhou Chen et al.
Instance-level image retrieval in fashion is a challenging issue of increasing importance in real-scenario visual fashion search. Cross-domain fashion retrieval aims to match unconstrained customer images, used as queries, to photographs provided by retailers; however, it is a difficult task due to a wide range of consumer-to-shop (C2S) domain discrepancies and because clothing images are vulnerable to various non-rigid deformations. To this end, we propose a novel multi-scale and multi-granularity feature learning network (MMFL-Net), which can jointly learn global-local aggregation feature representations of clothing images in a unified framework, aiming to train a cross-domain model for C2S fashion visual similarity. First, a new semantic-spatial feature fusion part is designed to bridge the semantic-spatial gap by applying top-down and bottom-up bidirectional multi-scale feature fusion. Next, a multi-branch deep network architecture is introduced to capture global salient, part-informed, and local detailed information, and to extract robust and discriminative feature embeddings by integrating the similarity learning of coarse-to-fine embeddings with multiple granularities. Finally, the improved trihard loss, center loss, and multi-task classification loss are adopted for MMFL-Net, which jointly optimize intra-class and inter-class distances and thus explicitly improve the intra-class compactness and inter-class discriminability of its visual representations. Furthermore, our proposed model also combines the multi-task attribute recognition and classification module with multi-label semantic attributes and product ID labels. Experimental results demonstrate that our proposed MMFL-Net achieves significant improvement over the state-of-the-art methods on the two datasets, DeepFashion-C2S and Street2Shop.
LG · Sep 17, 2022
Sample-Efficient Multi-Agent Reinforcement Learning with Demonstrations for Flocking Control
Yunbo Qiu, Yuzhu Zhan, Yue Jin et al.
Flocking control is a significant problem in multi-agent systems such as multi-agent unmanned aerial vehicles and multi-agent autonomous underwater vehicles, which enhances the cooperativity and safety of agents. In contrast to traditional methods, multi-agent reinforcement learning (MARL) solves the problem of flocking control more flexibly. However, methods based on MARL suffer from sample inefficiency, since they require a huge number of experiences to be collected from interactions between agents and the environment. We propose a novel method Pretraining with Demonstrations for MARL (PwD-MARL), which can utilize non-expert demonstrations collected in advance with traditional methods to pretrain agents. During the process of pretraining, agents learn policies from demonstrations by MARL and behavior cloning simultaneously, and are prevented from overfitting demonstrations. By pretraining with non-expert demonstrations, PwD-MARL improves sample efficiency in the process of online MARL with a warm start. Experiments show that PwD-MARL improves sample efficiency and policy performance in the problem of flocking control, even with bad or few demonstrations.
LG · Nov 14, 2022
Renewing Iterative Self-labeling Domain Adaptation with Application to the Spine Motion Prediction
Gecheng Chen, Yu Zhou, Xudong Zhang et al.
The area of transfer learning comprises supervised machine learning methods that cope with the issue when the training and testing data have different input feature spaces or distributions. In this work, we propose a novel transfer learning algorithm called Renewing Iterative Self-labeling Domain Adaptation (Re-ISDA).
CV · Apr 12, 2023
SGL: Structure Guidance Learning for Camera Localization
Xudong Zhang, Shuang Gao, Xiaohu Nan et al.
Camera localization is a classical computer vision task that serves various Artificial Intelligence and Robotics applications. With the rapid development of Deep Neural Networks (DNNs), end-to-end visual localization methods have prospered in recent years. In this work, we focus on scene coordinate prediction methods and propose a network architecture named Structure Guidance Learning (SGL), which uses a receptive branch and a structure branch to extract both high-level and low-level features to estimate the 3D coordinates. We design a confidence strategy to refine and filter the predicted 3D observations, which enables us to estimate the camera poses by employing Perspective-n-Point (PnP) with RANSAC. For training, we design a Bundle Adjustment trainer to help the network fit the scenes better. Comparisons with state-of-the-art (SOTA) methods and extensive ablation experiments confirm the validity of our proposed architecture.
MA · Feb 10, 2023
Low Entropy Communication in Multi-Agent Reinforcement Learning
Lebin Yu, Yunbo Qiu, Qiexiang Wang et al.
Communication in multi-agent reinforcement learning has been drawing attention recently for its significant role in cooperation. However, multi-agent systems may suffer from limitations on communication resources and thus need efficient communication techniques in real-world scenarios. According to the Shannon-Hartley theorem, messages to be transmitted reliably in worse channels require lower entropy. Therefore, we aim to reduce message entropy in multi-agent communication. A fundamental challenge is that the gradients of entropy are either 0 or infinity, disabling gradient-based methods. To handle it, we propose a pseudo gradient descent scheme, which reduces entropy by adjusting the distributions of messages wisely. We conduct experiments on two base communication frameworks with six environment settings and find that our scheme can reduce message entropy by up to 90% with nearly no loss of cooperation performance.
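The quantity being reduced, message entropy, can be estimated empirically from a batch of discrete messages. The sketch below shows only that measurement step, using the standard Shannon entropy in bits; it does not implement the paper's pseudo gradient descent scheme, and the toy message batches are made-up examples.

```python
import math
from collections import Counter

def message_entropy(messages):
    """Empirical Shannon entropy (in bits) of a sequence of discrete messages."""
    counts = Counter(messages)
    total = len(messages)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy batches of discrete messages (hypothetical symbols).
spread_out = ["a", "b", "c", "d"]      # every symbol equally likely
concentrated = ["a", "a", "a", "b"]    # one symbol dominates
print(message_entropy(spread_out))
print(message_entropy(concentrated))
```

Concentrating probability mass on fewer symbols lowers the entropy, which is the direction in which a low-entropy communication scheme pushes the message distribution.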
IT · Feb 5, 2023
Towards Efficient Subarray Hybrid Beamforming: Attention Network-based Practical Feedback in FDD Massive MU-MIMO Systems
Zhilin Lu, Xudong Zhang, Rui Zeng et al.
Channel state information (CSI) feedback is necessary for the frequency division duplexing (FDD) multiple input multiple output (MIMO) systems due to the channel non-reciprocity. With the help of deep learning, many works have succeeded in rebuilding the compressed ideal CSI for massive MIMO. However, simple CSI reconstruction is of limited practicality since the channel estimation and the targeted beamforming design are not considered. In this paper, a jointly optimized network is introduced for channel estimation and feedback so that a spectral-efficient beamformer can be learned. Moreover, the deployment-friendly subarray hybrid beamforming architecture is applied and a practical lightweight end-to-end network is specially designed. Experiments show that the proposed network is over 10 times lighter at the resource-sensitive user equipment compared with the previous state-of-the-art method with only a minor performance loss.
LG · Sep 17, 2022
Sub-optimal Policy Aided Multi-Agent Reinforcement Learning for Flocking Control
Yunbo Qiu, Yue Jin, Jian Wang et al.
Flocking control is a challenging problem, where multiple agents, such as drones or vehicles, need to reach a target position while maintaining the flock and avoiding collisions with obstacles and collisions among agents in the environment. Multi-agent reinforcement learning has achieved promising performance in flocking control. However, methods based on traditional reinforcement learning require a considerable number of interactions between agents and the environment. This paper proposes a sub-optimal policy aided multi-agent reinforcement learning algorithm (SPA-MARL) to boost sample efficiency. SPA-MARL directly leverages a prior policy that can be manually designed or solved with a non-learning method to aid agents in learning, where the performance of the policy can be sub-optimal. SPA-MARL recognizes the difference in performance between the sub-optimal policy and itself, and then imitates the sub-optimal policy if the sub-optimal policy is better. We leverage SPA-MARL to solve the flocking control problem. A traditional control method based on artificial potential fields is used to generate a sub-optimal policy. Experiments demonstrate that SPA-MARL can speed up the training process and outperform both the MARL baseline and the used sub-optimal policy.
LG · Feb 18, 2023
Promoting Cooperation in Multi-Agent Reinforcement Learning via Mutual Help
Yunbo Qiu, Yue Jin, Lebin Yu et al.
Multi-agent reinforcement learning (MARL) has achieved great progress in cooperative tasks in recent years. However, in the local reward scheme, where only local rewards for each agent are given without global rewards shared by all the agents, traditional MARL algorithms lack sufficient consideration of agents' mutual influence. In cooperative tasks, agents' mutual influence is especially important since agents are supposed to coordinate to achieve better performance. In this paper, we propose a novel algorithm Mutual-Help-based MARL (MH-MARL) to instruct agents to help each other in order to promote cooperation. MH-MARL utilizes an expected action module to generate expected other agents' actions for each particular agent. Then, the expected actions are delivered to other agents for selective imitation during training. Experimental results show that MH-MARL improves the performance of MARL both in success rate and cumulative reward.
IT · Feb 15, 2023
Deep Learning for Hybrid Beamforming with Finite Feedback in GSM Aided mmWave MIMO Systems
Zhilin Lu, Xudong Zhang, Rui Zeng et al.
Hybrid beamforming is widely recognized as an important technique for millimeter wave (mmWave) multiple input multiple output (MIMO) systems. Generalized spatial modulation (GSM) is further introduced to improve the spectrum efficiency. However, most of the existing works on beamforming assume the perfect channel state information (CSI), which is unrealistic in practical systems. In this paper, joint optimization of downlink pilot training, channel estimation, CSI feedback, and hybrid beamforming is considered in GSM aided frequency division duplexing (FDD) mmWave MIMO systems. With the help of deep learning, the GSM hybrid beamformers are designed via unsupervised learning in an end-to-end way. Experiments show that the proposed multi-resolution network named GsmEFBNet can reach a better achievable rate with fewer feedback bits compared with the conventional algorithm.
LG · Nov 14, 2025
Efficient Reinforcement Learning for Zero-Shot Coordination in Evolving Games
Bingyu Hui, Lebin Yu, Quanming Yao et al.
Zero-shot coordination (ZSC), a key challenge in multi-agent game theory, has recently become a hot topic in reinforcement learning (RL) research, especially in complex evolving games. It focuses on the generalization ability of agents, requiring them to coordinate well, without any fine-tuning, with collaborators from a diverse, potentially evolving pool of previously unseen partners. Population-based training, which approximates such an evolving partner pool, has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes Scalable Population Training (ScaPT), an efficient RL training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representative frameworks in the cooperative game Hanabi and confirms its superiority.
MA · May 23, 2022
Learning to Advise and Learning from Advice in Cooperative Multi-Agent Reinforcement Learning
Yue Jin, Shuangqing Wei, Jian Yuan et al.
Learning to coordinate is a daunting problem in multi-agent reinforcement learning (MARL). Previous works have explored it from many facets, including cognition between agents, credit assignment, communication, expert demonstration, etc. However, less attention has been paid to agents' decision structure and the hierarchy of coordination. In this paper, we explore the spatiotemporal structure of agents' decisions and consider the hierarchy of coordination from the perspective of multilevel emergence dynamics, based on which a novel approach, Learning to Advise and Learning from Advice (LALA), is proposed to improve MARL. Specifically, by distinguishing the hierarchy of coordination, we propose to enhance decision coordination at the meso level with an advisor and leverage a policy discriminator to advise agents' learning at the micro level. The advisor learns to aggregate decision information in both spatial and temporal domains and generates coordinated decisions by employing a spatiotemporal dual graph convolutional neural network with a task-oriented objective function. Each agent learns from the advice via a policy generative adversarial learning method where a discriminator distinguishes between the policies of the agent and the advisor and boosts both of them based on its judgement. Experimental results indicate the advantage of LALA over baseline approaches in terms of both learning efficiency and coordination capability. The coordination mechanism is investigated from the perspectives of multilevel emergence dynamics and mutual information, which provides a novel perspective and method to analyze and improve MARL algorithms.
CV · Nov 28, 2025 · Code
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Siqi Chen, Ke Hong, Tianchen Zhao et al.
Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at https://github.com/thu-nics/db-SP.
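The workload imbalance described above can be illustrated with a toy calculation. The sketch below assumes one plausible definition of an imbalance ratio (maximum per-device load divided by mean load) and balances heads across devices with a generic greedy longest-processing-time heuristic. Both are assumptions for illustration: the paper's formal sparse imbalance ratio and its dual-level partitioning may differ, and the per-head dense-block counts are invented.

```python
def imbalance_ratio(loads):
    """Max load over mean load; 1.0 means perfectly balanced (assumed definition)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

def greedy_balance(work, num_devices):
    """Greedy LPT: assign items, heaviest first, to the currently least-loaded device."""
    assignment = [[] for _ in range(num_devices)]
    loads = [0] * num_devices
    order = sorted(range(len(work)), key=lambda i: work[i], reverse=True)
    for idx in order:
        target = loads.index(min(loads))  # least-loaded device so far
        assignment[target].append(idx)
        loads[target] += work[idx]
    return assignment, loads

# Hypothetical number of dense attention blocks per head.
dense_blocks_per_head = [10, 2, 9, 1, 8, 2]

# Naive contiguous split of heads across 2 devices.
naive_loads = [sum(dense_blocks_per_head[:3]), sum(dense_blocks_per_head[3:])]
assignment, balanced_loads = greedy_balance(dense_blocks_per_head, num_devices=2)

print(naive_loads, imbalance_ratio(naive_loads))
print(balanced_loads, imbalance_ratio(balanced_loads))
```

This toy version ignores constraints a real sequence-parallel system has to respect, such as equal head counts per device and communication cost, so it should be read only as an intuition for why sparsity-aware partitioning helps.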
AI · Sep 25, 2024
Non-stationary BERT: Exploring Augmented IMU Data For Robust Human Activity Recognition
Ning Sun, Yufei Wang, Yuwei Zhang et al.
Human Activity Recognition (HAR) has gained great attention from researchers due to the popularity of mobile devices and the need to observe users' daily activity data for better human-computer interaction. In this work, we collect a human activity recognition dataset called OPPOHAR consisting of phone IMU data. To facilitate the deployment of HAR systems on mobile phones and to achieve user-specific activity recognition, we propose a novel lightweight network called Non-stationary BERT with a two-stage training method. We also propose a simple yet effective data augmentation method to explore the deeper relationship between the accelerometer and gyroscope data from the IMU. The network achieves state-of-the-art performance on various activity recognition datasets, and the data augmentation method demonstrates wide applicability.
CV · Dec 23, 2025
LiDARDraft: Generating LiDAR Point Cloud from Versatile Inputs
Haiyun Wei, Fan Lu, Yunwei Zhu et al.
Generating realistic and diverse LiDAR point clouds is crucial for autonomous driving simulation. Although previous methods achieve LiDAR point cloud generation from user inputs, they struggle to attain high-quality results while enabling versatile controllability, due to the imbalance between the complex distribution of LiDAR point clouds and the simple control signals. To address the limitation, we propose LiDARDraft, which utilizes the 3D layout to build a bridge between versatile conditional signals and LiDAR point clouds. The 3D layout can be trivially generated from various user inputs such as textual descriptions and images. Specifically, we represent text, images, and point clouds as unified 3D layouts, which are further transformed into semantic and depth control signals. Then, we employ a rangemap-based ControlNet to guide LiDAR point cloud generation. This pixel-level alignment approach demonstrates excellent performance in controllable LiDAR point clouds generation, enabling "simulation from scratch", allowing self-driving environments to be created from arbitrary textual descriptions, images and sketches.
LG Jun 18, 2024 Code
Scale-Translation Equivariant Network for Oceanic Internal Solitary Wave Localization
Zhang Wan, Shuo Wang, Xudong Zhang
Internal solitary waves (ISWs) are gravity waves that are often observed in the ocean interior rather than at the surface. They are important because they can carry substantial energy and thus influence pollutant transport, oil platform operations, submarine navigation, etc. Researchers have studied ISWs through optical images, synthetic aperture radar (SAR) images, and altimeter data from remote sensing instruments. However, cloud cover in optical remote sensing images variably obscures ground information, leading to blurred or missing surface observations. As such, this paper targets altimeter-based machine learning solutions to automatically locate ISWs. The challenges lie in two aspects: 1) the altimeter data has low resolution, which requires a strong machine learner; 2) labeling data is extremely labor-intensive, leading to very limited data for training. In recent years, deep learning has demonstrated strong learning capacity given abundant data. In addition, more recent studies on efficient learning and self-supervised learning have laid solid foundations for tackling the aforementioned challenges. In this paper, we propose to inject prior knowledge to achieve a strong and efficient learner. Specifically, intrinsic patterns in altimetry data are efficiently captured using a scale-translation equivariant convolutional neural network (ST-ECNN). By considering inherent symmetries in neural network design, ST-ECNN achieves higher efficiency and better performance than baseline models. Furthermore, we introduce prior knowledge from massive unsupervised data by using the SimCLR framework for pre-training. Our final solution achieves overall better performance than baselines on our handcrafted altimetry dataset. Data and code are available at https://github.com/ZhangWan-byte/Internal_Solitary_Wave_Localization .
CV Jul 9, 2020 Code
EPI-based Oriented Relation Networks for Light Field Depth Estimation
Kunyuan Li, Jun Zhang, Rui Sun et al.
Light field cameras record not only the spatial information of observed scenes but also the directions of all incoming light rays. The spatial and angular information implicitly contains geometrical characteristics such as multi-view or epipolar geometry, which can be exploited to improve the performance of depth estimation. An Epipolar Plane Image (EPI), the unique 2D spatial-angular slice of the light field, contains patterns of oriented lines. The slope of these lines is associated with the disparity. Benefiting from this property of EPIs, some representative methods estimate depth maps by analyzing the disparity of each line in EPIs. However, these methods often extract the optimal slope of the lines from EPIs while ignoring the relationship between neighboring pixels, which leads to inaccurate depth map predictions. Based on the observation that an oriented line and its neighboring pixels in an EPI share a similar linear structure, we propose an end-to-end fully convolutional network (FCN) to estimate the depth value of the intersection point on the horizontal and vertical EPIs. Specifically, we present a new feature-extraction module, called Oriented Relation Module (ORM), that constructs the relationship between the line orientations. To facilitate training, we also propose a refocusing-based data augmentation method to obtain different slopes from EPIs of the same scene point. Extensive experiments verify the efficacy of learning relations and show that our approach is competitive with other state-of-the-art methods. The code and the trained models are available at https://github.com/lkyahpu/EPI_ORM.git.
MA Dec 16, 2023
Robust Communicative Multi-Agent Reinforcement Learning with Active Defense
Lebin Yu, Yunbo Qiu, Quanming Yao et al.
Communication in multi-agent reinforcement learning (MARL) has recently been shown to effectively promote cooperation among agents. Since communication in real-world scenarios is vulnerable to noise and adversarial attacks, it is crucial to develop robust communicative MARL techniques. However, existing research in this domain has predominantly focused on passive defense strategies, where agents receive all messages equally, making it hard to balance performance and robustness. We propose an active defense strategy, where agents automatically reduce the impact of potentially harmful messages on the final decision. Implementing this strategy poses two challenges: defining unreliable messages and properly adjusting their impact on the final decision. To address them, we design an Active Defense Multi-Agent Communication framework (ADMAC), which estimates the reliability of received messages and adjusts their impact on the final decision accordingly with the help of a decomposable decision structure. The superiority of ADMAC over existing methods is validated by experiments in three communication-critical tasks under four types of attacks.
CL Apr 25, 2024
Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations
Shen Zhang, Haojie Zhang, Jing Zhang et al.
In human-computer interaction, it is crucial for agents to respond to humans by understanding their emotions. Unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotions and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotions and extract the emotion causal pairs given the target emotion. In the first stage, Llama-2-based InstructERC is used to extract the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract the emotion causal pairs given the target emotion for subtask 2, while MuTEC is employed to extract the causal span for subtask 1. Our approach achieved first place in both subtasks of the competition.
CL Nov 20, 2025
ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models
Qing Zhang, Bing Xu, Xudong Zhang et al.
The remarkable performance of Large Language Models (LLMs) relies heavily on carefully crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for the practical application of LLMs. This has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies rely on a single model or algorithm for the generation strategy and optimization process, which limits their performance on complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO employs a voting mechanism and introduces shared generation strategies along with different search methods for finding superior prompts. Moreover, ELPO presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving the F1 score by 7.6 on the ArSarcasm dataset.
SP Sep 15, 2025
CSIYOLO: An Intelligent CSI-based Scatter Sensing Framework for Integrated Sensing and Communication Systems
Xudong Zhang, Jingbo Tan, Zhizhen Ren et al.
ISAC is regarded as a promising technology for next-generation communication systems, enabling simultaneous data transmission and target sensing. Among various tasks in ISAC, scatter sensing plays a crucial role in exploiting the full potential of ISAC and supporting applications such as autonomous driving and low-altitude economy. However, most existing methods rely on either waveform and hardware modifications or traditional signal processing schemes, leading to poor compatibility with current communication systems and limited sensing accuracy. To address these challenges, we propose CSIYOLO, a framework that performs scatter localization only using estimated CSI from a single base station-user equipment pair. This framework comprises two main components: anchor-based scatter parameter detection and CSI-based scatter localization. First, by formulating scatter parameter extraction as an image detection problem, we propose an anchor-based scatter parameter detection method inspired by You Only Look Once architectures. After that, a CSI-based localization algorithm is derived to determine scatter locations with extracted parameters. Moreover, to improve localization accuracy and implementation efficiency, we design an extendable network structure with task-oriented optimizations, enabling multi-scale anchor detection and better adaptation to CSI characteristics. A noise injection training strategy is further designed to enhance robustness against channel estimation errors. Since the proposed framework operates solely on estimated CSI without modifying waveforms or signal processing pipelines, it can be seamlessly integrated into existing communication systems as a plugin. Experiments show that our proposed method can significantly outperform existing methods in scatter localization accuracy with relatively low complexities under varying numbers of scatters and estimation errors.
AI Aug 12, 2025
Efficient Agent: Optimizing Planning Capability for Multimodal Retrieval Augmented Generation
Yuechen Wang, Yuming Qiao, Dan Meng et al.
Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: a mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, systematically annotated with essential retrieval tools required for each instance. The benchmark's explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent's superiority: 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%.
CV Jun 6, 2024
Monocular Localization with Semantics Map for Autonomous Vehicles
Jixiang Wan, Xudong Zhang, Shuzhou Dong et al.
Accurate and robust localization remains a significant challenge for autonomous vehicles. The cost of sensors and limitations in local computational efficiency make it difficult to scale to large commercial applications. Traditional vision-based approaches focus on texture features that are susceptible to changes in lighting, season, perspective, and appearance. Additionally, the large storage size of maps with descriptors and complex optimization processes hinder system performance. To balance efficiency and accuracy, we propose a novel lightweight visual semantic localization algorithm that employs stable semantic features instead of low-level texture features. First, semantic maps are constructed offline by detecting semantic objects, such as ground markers, lane lines, and poles, using cameras or LiDAR sensors. Then, online visual localization is performed through data association of semantic features and map objects. We evaluated our proposed localization framework on the publicly available KAIST Urban dataset and in scenarios that we recorded ourselves. The experimental results demonstrate that our method is a reliable and practical localization solution in various autonomous driving localization tasks.
IV Nov 26, 2021
Exploiting full Resolution Feature Context for Liver Tumor and Vessel Segmentation via Integrate Framework: Application to Liver Tumor and Vessel 3D Reconstruction under embedded microprocessor
Xiangyu Meng, Xudong Zhang, Gan Wang et al.
Liver cancer is one of the most common malignant diseases in the world. Segmentation and labeling of liver tumors and blood vessels in CT images can assist doctors in liver tumor diagnosis and surgical intervention. Over the past decades, many state-of-the-art medical image segmentation algorithms have appeared. With the development of embedded devices, embedded deployment of medical segmentation and automatic reconstruction opens prospects for future automated surgical tasks. Yet most existing segmentation methods focus on spatial feature context and are weak at perceiving the semantic relevance of medical images, which significantly affects the segmentation accuracy of liver tumors and blood vessels. Moreover, deploying large and complex models on embedded devices requires a reasonable trade-off between model accuracy, inference speed, and model capacity. Given these problems, we introduce a Transformer-based multi-scale feature fusion network called TransFusionNet. This network achieves very competitive performance on liver vessel and liver tumor segmentation tasks, and it improves the recognition of morphologic margins of liver tumors by exploiting the global information of CT images. Experiments show that TransFusionNet achieves a mean Dice coefficient of 0.899 on the vessel segmentation task and 0.961 on the liver tumor segmentation task. Compared with state-of-the-art frameworks, our model achieves the best segmentation results. In addition, we deployed the model on an embedded microprocessor and constructed an integrated model for liver tumor and vessel segmentation and reconstruction. This proprietary structure is intended to serve as a dedicated component in future medical applications.
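The Dice coefficient reported in this abstract measures the overlap between a predicted mask and a ground-truth mask. A minimal sketch of the standard definition follows, assuming flat binary masks given as Python lists; the function name and the toy masks are illustrative and are not taken from the TransFusionNet code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count positions where both masks mark the pixel as foreground.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 2 overlapping foreground pixels, 3 foreground pixels in each mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they share no foreground pixels.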
CV Oct 8, 2021
Pose Refinement with Joint Optimization of Visual Points and Lines
Shuang Gao, Jixiang Wan, Yishan Ping et al.
High-precision camera re-localization in a pre-established 3D environment map is the basis for many tasks, such as augmented reality, robotics, and autonomous driving. Point-based visual re-localization approaches have been well developed in recent decades, but they are insufficient in some feature-less cases. In this paper, we design a complete pipeline for camera pose refinement with points and lines, which contains a newly designed line-extracting CNN named VLSE, line matching, and pose optimization. We adopt a novel line representation and customize a hybrid convolution block based on the Stacked Hourglass network to detect accurate and stable line features in images. Then we apply a geometry-based strategy to obtain precise 2D-3D line correspondences using the epipolar constraint and reprojection filtering. A point-line joint cost function is then constructed to optimize the camera pose, starting from the initial coarse pose given by pure point-based localization. Extensive experiments are conducted on open datasets, i.e., the line extractor on Wireframe and YorkUrban and localization performance on InLoc duc1 and duc2, to confirm the effectiveness of our point-line joint pose optimization method.
LG Sep 29, 2021
Information-Bottleneck-Based Behavior Representation Learning for Multi-agent Reinforcement learning
Yue Jin, Shuangqing Wei, Jian Yuan et al.
In multi-agent deep reinforcement learning, extracting sufficient and compact information about other agents is critical to attaining efficient convergence and scalability of an algorithm. In canonical frameworks, such information is often distilled in an implicit and uninterpretable manner, or explicitly with cost functions that cannot reflect the relationship between information compression and utility in the representation. In this paper, we present Information-Bottleneck-based Other agents' behavior Representation learning for Multi-agent reinforcement learning (IBORM) to explicitly seek a low-dimensional mapping encoder through which a compact and informative representation relevant to other agents' behaviors is established. IBORM leverages the information bottleneck principle to compress observation information while retaining sufficient information relevant to other agents' behaviors for cooperative decisions. Empirical results demonstrate that IBORM delivers the fastest convergence rate and the best performance of the learned policies, compared with implicit behavior representation learning and with explicit behavior representation learning that does not explicitly consider information compression and utility.
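For reference, the information bottleneck principle mentioned in this abstract is commonly written as the Lagrangian below. The symbols are notation chosen here for illustration ($X$ for the observation, $Z$ for the learned representation, $Y$ for other agents' behaviors, $\beta > 0$ for the trade-off weight); the abstract does not state IBORM's exact objective, so this is the generic form only.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term penalizes information kept about the observation (compression), and the second rewards information kept about the quantity to be predicted (utility).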
CV Aug 19, 2021
Retrieval and Localization with Observation Constraints
Yuhao Zhou, Huanhuan Fan, Shuang Gao et al.
Accurate visual re-localization is critical to many artificial intelligence applications, such as augmented reality, virtual reality, robotics, and autonomous driving. To accomplish this task, we propose an integrated visual re-localization method called RLOCS that combines image retrieval, semantic consistency, and geometry verification to achieve accurate estimations. The localization pipeline is designed as a coarse-to-fine paradigm. In the retrieval part, we cascade the architecture of ResNet101-GeM-ArcFace and employ DBSCAN followed by spatial verification to obtain a better initial coarse pose. We design a module called observation constraints, which combines geometry information and semantic consistency to filter outliers. Comprehensive experiments are conducted on open datasets, including retrieval on R-Oxford5k and R-Paris6k, semantic segmentation on Cityscapes, and localization on Aachen Day-Night and InLoc. By creatively modifying individual modules of the overall pipeline, our method achieves many performance improvements on the challenging localization benchmarks.
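The "GeM" in the retrieval backbone refers to generalized-mean pooling, which collapses a feature map channel into one value with a tunable exponent. The sketch below shows the standard formula on a single channel given as a plain list; the function name, the default `p`, and the clamping constant `eps` are assumptions for illustration, not RLOCS settings.

```python
def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized-mean pooling: (mean(x_i ** p)) ** (1 / p) over one channel."""
    if not activations:
        raise ValueError("activations must be non-empty")
    # Clamp to a small positive value so fractional powers stay well defined.
    clamped = [max(a, eps) for a in activations]
    mean_pow = sum(a ** p for a in clamped) / len(clamped)
    return mean_pow ** (1.0 / p)


channel = [1.0, 2.0, 3.0]
as_average = gem_pool(channel, p=1.0)   # p = 1 reduces to average pooling
near_max = gem_pool(channel, p=50.0)    # large p approaches max pooling
```

Intermediate values of `p` interpolate between average and max pooling, which is why the exponent is often treated as a tunable or learnable parameter.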
IR Aug 12, 2021
SR-HetGNN: Session-based Recommendation with Heterogeneous Graph Neural Network
Jinpeng Chen, Haiyang Li, Xudong Zhang et al.
Session-based recommendation systems aim to predict the user's next click based on their previous session sequence. Current studies generally learn user preferences from the transitions of items in the user's session sequence. However, other useful information in the session sequence, such as user profiles, is largely ignored, which may leave the model unable to learn the user's specific preferences. In this paper, we propose SR-HetGNN, a novel session recommendation method that uses a heterogeneous graph neural network (HetGNN) to learn session embeddings and capture the specific preferences of anonymous users. Specifically, SR-HetGNN first constructs heterogeneous graphs containing various types of nodes according to the session sequence, which can capture the dependencies among items, users, and sessions. Second, HetGNN captures the complex transitions between items and learns item embeddings containing user information. Finally, local and global session embeddings are combined with an attention network to obtain the final session embedding, considering the influence of users' long-term and short-term preferences. Extensive experiments on two large real-world datasets, Diginetica and Tmall, show that SR-HetGNN is superior to existing state-of-the-art session-based recommendation methods.
LG Jul 3, 2021
Supervised Off-Policy Ranking
Yue Jin, Yue Zhang, Tao Qin et al.
Off-policy evaluation (OPE) is to evaluate a target policy with data generated by other policies. Most previous OPE methods focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or multiple candidate policies and choose a good one, which is a much simpler task than precisely evaluating their true performance; and (2) there are usually multiple policies that have been deployed to serve users in real-world systems and thus the true performance of these policies can be known. Inspired by the two observations, in this work, we study a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of target policies based on supervised learning by leveraging off-policy data and policies with known performance. We propose a method to solve SOPR, which learns a policy scoring model by minimizing a ranking loss of the training policies rather than estimating the precise policy performance. The scoring model in our method, a hierarchical Transformer based model, maps a set of state-action pairs to a score, where the state of each pair comes from the off-policy data and the action is taken by a target policy on the state in an offline manner. Extensive experiments on public datasets show that our method outperforms baseline methods in terms of rank correlation, regret value, and stability. Our code is publicly available at GitHub.
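The idea of training a scoring model with a ranking loss, rather than regressing the exact policy performance, can be illustrated with a pairwise logistic loss. This is a sketch under assumptions: the abstract does not say which ranking loss SOPR uses, and the function name and toy numbers below are hypothetical.

```python
import math


def pairwise_ranking_loss(scores, performances):
    """Mean logistic loss over all policy pairs (i, j) where i truly outperforms j."""
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if performances[i] > performances[j]:
                # Penalize the model when the better policy does not score higher.
                margin = scores[i] - scores[j]
                loss += math.log1p(math.exp(-margin))
                pairs += 1
    return loss / pairs if pairs else 0.0


true_perf = [0.9, 0.5, 0.1]                                  # known performance of training policies
good_loss = pairwise_ranking_loss([3.0, 2.0, 1.0], true_perf)  # scores agree with the true order
bad_loss = pairwise_ranking_loss([1.0, 2.0, 3.0], true_perf)   # scores reverse the true order
```

Only the ordering of the known performances enters the loss, so the scoring model is never asked to reproduce their absolute values.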
IT May 1, 2021
Binarized Aggregated Network with Quantization: Flexible Deep Learning Deployment for CSI Feedback in Massive MIMO System
Zhilin Lu, Xudong Zhang, Hongyi He et al.
Massive multiple-input multiple-output (MIMO) is one of the key techniques for achieving better spectrum and energy efficiency in 5G systems. The channel state information (CSI) needs to be fed back from the user equipment to the base station in frequency division duplexing (FDD) mode. However, the overhead of direct feedback is unacceptable due to the large antenna array in massive MIMO systems. Recently, deep learning has been widely adopted for the compressed CSI feedback task and has proved effective. In this paper, a novel network named aggregated channel reconstruction network (ACRNet) is designed to boost feedback performance with network aggregation and parametric rectified linear unit (PReLU) activation. The practical deployment of the feedback network in the communication system is also considered. Specifically, an elastic feedback scheme is proposed to flexibly adapt the network to different resource limitations. In addition, network binarization is combined with feature quantization for lightweight and practical deployment. Experiments show that the proposed ACRNet outperforms many previous state-of-the-art networks, providing a neat feedback solution with high performance, low cost, and impressive flexibility.
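Feature quantization, as mentioned in this abstract, turns each real-valued feedback feature into a small integer so that it can be sent with a fixed number of bits. The sketch below shows plain uniform quantization, assuming the feature already lies in [0, 1]; the function names and bit width are illustrative, and the abstract does not specify the quantizer ACRNet actually uses.

```python
def quantize(x, bits):
    """Map x in [0, 1] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(x, 0.0), 1.0)   # assumption: features are bounded to [0, 1]
    return round(clipped * levels)


def dequantize(code, bits):
    """Map an integer code back to a value in [0, 1]."""
    return code / ((1 << bits) - 1)


bits = 4
features = [0.0, 0.1, 0.5, 0.9, 1.0]
codes = [quantize(x, bits) for x in features]
restored = [dequantize(c, bits) for c in codes]
```

With `bits` bits per feature, the reconstruction error per feature is at most half a quantization step, 0.5 / (2**bits - 1), which is the cost paid for the reduced feedback overhead.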
LG Mar 29, 2021
Robust Reinforcement Learning under model misspecification
Lebin Yu, Jian Wang, Xudong Zhang
Reinforcement learning has achieved remarkable performance in a wide range of tasks. Nevertheless, some unsolved problems limit its application in real-world control. One of them is model misspecification, a situation where an agent is trained and deployed in environments with different transition dynamics. We propose a novel framework that uses historical trajectories and Partially Observable Markov Decision Process modeling to deal with this problem. Additionally, we put forward an efficient adversarial attack method to assist robust training. Our experiments in four gym domains validate the effectiveness of our framework.
IR Feb 6, 2015
A General SIMD-based Approach to Accelerating Compression Algorithms
Wayne Xin Zhao, Xudong Zhang, Daniel Lemire et al.
Compression algorithms are important for data-oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets provide an opportunity to achieve better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerating compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art non-vectorized algorithms with respect to decoding speeds.
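Integer compression schemes of this family typically store a group of small integers with one shared bit width instead of a full machine word each. The scalar sketch below illustrates that basic bit-packing idea only; it is not Group-Simple, Group-Scheme, Group-AFOR, or Group-PFD, it is not vectorized, and the function names are hypothetical.

```python
def pack_group(values):
    """Pack non-negative integers into one big integer using a shared bit width."""
    if not values:
        raise ValueError("group must be non-empty")
    if min(values) < 0:
        raise ValueError("only non-negative integers are supported")
    # The widest value in the group decides the width used for every value.
    width = max(max(values).bit_length(), 1)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return width, len(values), packed


def unpack_group(width, count, packed):
    """Recover the original integers from a packed group."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


values = [3, 7, 1, 0, 5]
width, count, packed = pack_group(values)
decoded = unpack_group(width, count, packed)
```

Here five values fit in 15 bits instead of five 32-bit words. Decoding is a fixed sequence of shifts and masks, which is the kind of regular work that SIMD instructions can apply to several groups at once.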