Pin Lv

CV
h-index15
8papers
1,121citations
Novelty49%
AI Score46

8 Papers

CVApr 30, 2025Code
Is Intermediate Fusion All You Need for UAV-based Collaborative Perception?

Jiuwu Hao, Liguo Sun, Yuting Wan et al.

Collaborative perception enhances environmental awareness through inter-agent communication and is regarded as a promising solution to intelligent transportation systems. However, existing collaborative methods for Unmanned Aerial Vehicles (UAVs) overlook the unique characteristics of the UAV perspective, resulting in substantial communication overhead. To address this issue, we propose a novel communication-efficient collaborative perception framework based on late-intermediate fusion, dubbed LIF. The core concept is to exchange informative and compact detection results and shift the fusion stage to the feature representation level. In particular, we leverage vision-guided positional embedding (VPE) and box-based virtual augmented feature (BoBEV) to effectively integrate complementary information from various agents. Additionally, we innovatively introduce an uncertainty-driven communication mechanism that uses uncertainty evaluation to select high-quality and reliable shared areas. Experimental results demonstrate that our LIF achieves superior performance with minimal communication bandwidth, proving its effectiveness and practicality. Code and models are available at https://github.com/uestchjw/LIF.

CVFeb 16, 2025Code
FeaKM: Robust Collaborative Perception under Noisy Pose Conditions

Jiuwu Hao, Liguo Sun, Ti Xiang et al.

Collaborative perception is essential for networks of agents with limited sensing capabilities, enabling them to work together by exchanging information to achieve a robust and comprehensive understanding of their environment. However, localization inaccuracies often lead to significant spatial message displacement, which undermines the effectiveness of these collaborative efforts. To tackle this challenge, we introduce FeaKM, a novel method that employs Feature-level Keypoints Matching to effectively correct pose discrepancies among collaborating agents. Our approach begins by utilizing a confidence map to identify and extract salient points from intermediate feature representations, allowing for the computation of their descriptors. This step ensures that the system can focus on the most relevant information, enhancing the matching process. We then implement a target-matching strategy that generates an assignment matrix, correlating the keypoints identified by different agents. This is critical for establishing accurate correspondences, which are essential for effective collaboration. Finally, we employ a fine-grained transformation matrix to synchronize the features of all agents and ascertain their relative statuses, ensuring coherent communication among them. Our experimental results demonstrate that FeaKM significantly outperforms existing methods on the DAIR-V2X dataset, confirming its robustness even under severe noise conditions. The code and implementation details are available at https://github.com/uestchjw/FeaKM.

CVMar 2
physfusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection

Yuting Wan, Liguo Sun, Jiuwu Hao et al.

Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, with reflectivity attributes exhibiting heavy-tailed variations under scattering and multipath, making conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, transforming per-point radar attributes into compact scattering priors and predicting point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM) performing query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone including a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA) aggregating frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under radar+camera setting. Ablation studies quantify the contributions of PIR Encoder, SASA-based global reasoning, and RIFM.

SEDec 19, 2024
Tree-of-Code: A Tree-Structured Exploring Framework for End-to-End Code Generation and Execution in Complex Task Handling

Ziyi Ni, Yifan Li, Ning Yang et al.

Solving complex reasoning tasks is a key real-world application of agents. Thanks to the pretraining of Large Language Models (LLMs) on code data, recent approaches like CodeAct successfully use code as LLM agents' action, achieving good results. However, CodeAct greedily generates the next action's code block by relying on fragmented thoughts, resulting in inconsistency and instability. Moreover, CodeAct lacks action-related ground-truth (GT), making its supervision signals and termination conditions questionable in multi-turn interactions. To address these issues, we first introduce a simple yet effective end-to-end code generation paradigm, CodeProgram, which leverages code's systematic logic to align with global reasoning and enable cohesive problem-solving. Then, we propose Tree-of-Code (ToC), which self-grows CodeProgram nodes based on the executable nature of the code and enables self-supervision in a GT-free scenario. Experimental results on two datasets using ten popular zero-shot LLMs show ToC remarkably boosts accuracy by nearly 20% over CodeAct with less than 1/4 turns. Several LLMs even perform better on one-turn CodeProgram than on multi-turn CodeAct. To further investigate the trade-off between efficacy and efficiency, we test different ToC tree sizes and exploration mechanisms. We also highlight the potential of ToC's end-to-end data generation for supervised and reinforced fine-tuning.

HCOct 14, 2024
Tracing Human Stress from Physiological Signals using UWB Radar

Jia Xu, Teng Xiao, Pin Lv et al.

Stress tracing is an important research domain that supports many applications, such as health care and stress management; and its closest related works are derived from stress detection. However, these existing works cannot well address two important challenges facing stress detection. First, most of these studies involve asking users to wear physiological sensors to detect their stress states, which has a negative impact on the user experience. Second, these studies have failed to effectively utilize multimodal physiological signals, which results in less satisfactory detection results. This paper formally defines the stress tracing problem, which emphasizes the continuous detection of human stress states. A novel deep stress tracing method, named DST, is presented. Note that DST proposes tracing human stress based on physiological signals collected by a noncontact ultrawideband radar, which is more friendly to users when collecting their physiological signals. In DST, a signal extraction module is carefully designed at first to robustly extract multimodal physiological signals from the raw RF data of the radar, even in the presence of body movement. Afterward, a multimodal fusion module is proposed in DST to ensure that the extracted multimodal physiological signals can be effectively fused and utilized. Extensive experiments are conducted on three real-world datasets, including one self-collected dataset and two publicity datasets. Experimental results show that the proposed DST method significantly outperforms all the baselines in terms of tracing human stress states. On average, DST averagely provides a 6.31% increase in detection accuracy on all datasets, compared with the best baselines.

LGAug 3, 2025
Proactive Constrained Policy Optimization with Preemptive Penalty

Ning Yang, Pengyu Wang, Guoqing Liu et al.

Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier items into the objective function as the policy nears the boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method's convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.

ASMay 31, 2023
VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition

Ziyi Ni, Minglun Han, Feilong Chen et al.

Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately, to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, empirical results are reported on the public Flickr8K and self-constructed VSDial datasets. We explore various cross-modal fusion schemes, analyze fine-grained crossmodal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition.

CLJul 23, 2017
Using Argument-based Features to Predict and Analyse Review Helpfulness

Haijing Liu, Yang Gao, Pin Lv et al.

We study the helpful product reviews identification problem in this paper. We observe that the evidence-conclusion discourse relations, also known as arguments, often appear in product reviews, and we hypothesise that some argument-based features, e.g. the percentage of argumentative sentences, the evidences-conclusions ratios, are good indicators of helpful reviews. To validate this hypothesis, we manually annotate arguments in 110 hotel reviews, and investigate the effectiveness of several combinations of argument-based features. Experiments suggest that, when being used together with the argument-based features, the state-of-the-art baseline features can enjoy a performance boost (in terms of F1) of 11.01\% in average.