Lijun He

CV
h-index10
18papers
205citations
Novelty57%
AI Score58

18 Papers

CVMay 18Code
Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection

Yixing Yong, Jian Wang, Ming Lei et al.

Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometry shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness.In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.

NIFeb 13, 2023
Computation Offloading for Uncertain Marine Tasks by Cooperation of UAVs and Vessels

Jiahao You, Ziye Jia, Chao Dong et al.

With the continuous increment of maritime applications, the development of marine networks for data offloading becomes necessary. However, the limited maritime network resources are very difficult to satisfy real-time demands. Besides, how to effectively handle multiple compute-intensive tasks becomes another intractable issue. Hence, in this paper, we focus on the decision of maritime task offloading by the cooperation of unmanned aerial vehicles (UAVs) and vessels. Specifically, we first propose a cooperative offloading framework, including the demands from marine Internet of Things (MIoTs) devices and resource providers from UAVs and vessels. Due to the limited energy and computation ability of UAVs, it is necessary to help better apply the vessels to computation offloading. Then, we formulate the studied problem into a Markov decision process, aiming to minimize the total execution time and energy cost. Then, we leverage Lyapunov optimization to convert the long-term constraints of the total execution time and energy cost into their short-term constraints, further yielding a set of per-time-slot optimization problems. Furthermore, we propose a Q-learning based approach to solve the short-term problem efficiently. Finally, simulation results are conducted to verify the correctness and effectiveness of the proposed algorithm.

CVJul 5, 2024
Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng, Lijun He, Le Yang et al.

Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.

DCMar 10
Hierarchical Observe-Orient-Decide-Act Enabled UAV Swarms in Uncertain Environments: Frameworks, Potentials, and Challenges

Ziye Jia, Yao Wu, Qihui Wu et al.

Unmanned aerial vehicle (UAV) swarms are increasingly explored for their potentials in various applications such as surveillance, disaster response, and military. However, UAV swarms face significant challenges of implementing effective and rapid decisions under dynamic and uncertain environments. The traditional decision-making frameworks, mainly relying on centralized control and rigid architectures, are limited by their adaptability and scalability especially in complex environments. To overcome these challenges, in this paper, we propose a hierarchical Observe-Orient-Decide-Act (H-OODA) loop based framework for the UAV swarm operation in uncertain environments, which is implemented by embedding the classical OODA loop across the cloud-edge-terminal layers, and leveraging the network function virtualization (NFV) technology to provide flexible and scalable decision-making functions. In addition, based on the proposed H-OODA framework, we joint autonomous decision-making and cooperative control to enhance the adaptability and efficiency of UAV swarms. Furthermore, we present some typical case studies to verify the improvement and efficiency of the proposed framework. Finally, the potential challenges and possible directions are analyzed to provide insights for the future H-OODA enabled UAV swarms.

CVDec 19, 2025
HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection

Zhaolin Cai, Fan Li, Ziwei Zheng et al.

Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.

LGJan 3, 2025Code
Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Ziwei Zheng, Junyao Zhao, Le Yang et al.

With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term ``safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at \url{https://github.com/Ziwei-Zheng/SAHs}.

CLApr 15, 2025Code
Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis

Hao Liu, Lijun He, Jiaxi Liang et al.

Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we designed a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improved the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as syntactic branch combining with semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise for addressing SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+2.3\% F1 and +3.5\% precision on Twitter2015). The source code is available at https://github.com/LHaoooo/DASCO .

AIFeb 3
Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration

Wei Dai, Haoyu Wang, Honghao Chang et al.

Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research primarily faces two dilemmas: Prompt-based methods struggle to restore missing yet indispensable features and impair generalization of VLMs. Imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining VLM generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment. Zero-shot evaluations across benchmark datasets demonstrate that our approach outperforms existing baseline methods. Extensive experiments and ablation studies confirm our model as a robust and scalable extension for VLMs in missing modality scenarios, ensuring reliability across diverse missing rates and environments. Our code and models will be publicly available.

CVNov 11, 2025
Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving

Jian Wang, Lijun He, Yixing Yong et al.

Modern autonomous driving (AD) systems leverage 3D object detection to perceive foreground objects in 3D environments for subsequent prediction and planning. Visual 3D detection based on RGB cameras provides a cost-effective solution compared to the LiDAR paradigm. While achieving promising detection accuracy, current deep neural network-based models remain highly susceptible to adversarial examples. The underlying safety concerns motivate us to investigate realistic adversarial attacks in AD scenarios. Previous work has demonstrated the feasibility of placing adversarial posters on the road surface to induce hallucinations in the detector. However, the unnatural appearance of the posters makes them easily noticeable by humans, and their fixed content can be readily targeted and defended. To address these limitations, we propose the AdvRoad to generate diverse road-style adversarial posters. The adversaries have naturalistic appearances resembling the road surface while compromising the detector to perceive non-existent objects at the attack locations. We employ a two-stage approach, termed Road-Style Adversary Generation and Scenario-Associated Adaptation, to maximize the attack effectiveness on the input scene while ensuring the natural appearance of the poster, allowing the attack to be carried out stealthily without drawing human attention. Extensive experiments show that AdvRoad generalizes well to different detectors, scenes, and spoofing locations. Moreover, physical attacks further demonstrate the practical threats in real-world environments.

CVNov 7, 2025
Learning Fourier shapes to probe the geometric world of deep neural networks

Jian Wang, Yixing Yong, Haixia Bi et al.

While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model's salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.

SPMay 8
Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference

Jingyi Liu, Cheng Yuan, Lijun He et al.

The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1\% and the system latency to around 20\% compared to video codec-based solutions, while delivering comparable action understanding accuracy.

CVApr 29
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification

Zuzheng Kuang, Honghao Chang, Boqiang Liang et al.

Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.

CVNov 19, 2025
What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs

Zhihan Ren, Lijun He, Jiaxi Liang et al.

Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.

MMJan 5, 2021
QoE-driven Secure Video Transmission in Cloud-edge Collaborative Networks

Tantan Zhao, Lijun He, Xinyu Huang et al.

Video transmission over the backhaul link in cloud-edge collaborative networks usually suffers security risks, which is ignored in most of the existing studies. The characteristics that video service can flexibly adjust the encoding rates and provide acceptable encoding qualities, make the security requirements more possible to be satisfied but tightly coupled with video encoding by introducing more restrictions on edge caching. In this paper, by considering the interaction between video encoding and edge caching, we investigate the quality of experience (QoE)-driven cross-layer optimization of secure video transmission over the wireless backhaul link in cloud-edge collaborative networks. First, we develop a secure transmission model based on video encoding and edge caching. By employing this model as the security constraint, then we formulate a QoE-driven joint optimization problem subject to limited available caching capacity. To solve the optimization problem, we propose two algorithms: a near-optimal iterative algorithm (EC-VE) and a greedy algorithm with low computational complexity (Greedy EC-VE). Simulation results show that our proposed EC-VE can greatly improve user QoE within security constraints, and the proposed Greedy EC-VE can obtain the tradeoff between QoE and computational complexity.

MMOct 16, 2020
Revenue and Energy Efficiency-Driven Delay Constrained Computing Task Offloading and Resource Allocation in a Vehicular Edge Computing Network: A Deep Reinforcement Learning Approach

Xinyu Huang, Lijun He, Xing Chen et al.

For in-vehicle application,task type and vehicle state information, i.e., vehicle speed, bear a significant impact on the task delay requirement. However, the joint impact of task type and vehicle speed on the task delay constraint has not been studied, and this lack of study may cause a mismatch between the requirement of the task delay and allocated computation and wireless resources. In this paper, we propose a joint task type and vehicle speed-aware task offloading and resource allocation strategy to decrease the vehicl's energy cost for executing tasks and increase the revenue of the vehicle for processing tasks within the delay constraint. First, we establish the joint task type and vehicle speed-aware delay constraint model. Then, the delay, energy cost and revenue for task execution in the vehicular edge computing (VEC) server, local terminal and terminals of other vehicles are calculated. Based on the energy cost and revenue from task execution,the utility function of the vehicle is acquired. Next, we formulate a joint optimization of task offloading and resource allocation to maximize the utility level of the vehicles subject to the constraints of task delay, computation resources and wireless resources. To obtain a near-optimal solution of the formulated problem, a joint offloading and resource allocation based on the multi-agent deep deterministic policy gradient (JORA-MADDPG) algorithm is proposed to maximize the utility level of vehicles. Simulation results show that our algorithm can achieve superior performance in task completion delay, vehicles' energy cost and processing revenue.

NIJul 2, 2020
Playback experience driven cross layer optimisation of APP, transport and MAC layer for video clients over long-term evolution system

Xinyu Huang, Lijun He

In traditional communication system, information of APP (Application) layer, transport layer and MAC (Media Access Control)layer has not been fully interacted,which inevitably leads to inconsistencies among TCP congestion state, clients'requirements and resource allocation. To solve the problem, we propose a joint optimization framework, which consists of APP layer, transport layer and MAC layer, to improve the video clients'playback experience and system throughput. First, a client requirement aware autonomous packet drop strategy, based on packet importance, channel condition and playback status, is developed to decrease the network load and the probability of rebuffering events. Further, TCP (Transmission Control Protocol) state aware downlink and uplink resource allocation schemes are proposed to achieve smooth video transmission and steady ACK (Acknowledgement) feedback respectively. For downlink scheme, maximum transmission capacity requirement for each client is calculated based on feedback ACK information from transport layer to avoid allocating excessive resource to the client, whose ACK feedback is blocked due to bad uplink channel condition. For uplink scheme, information of RTO (Retransmission Timeout) and TCP congestion window are utilized to indicate ACK scheduling priority. The simulation results show that our algorithm can signficantly improve the system throughput and the clients'playback continuity with acceptable video quality.

MMMay 15, 2020
Towards 5G: Joint Optimization of Video Segment Cache, Transcoding and Resource Allocation for Adaptive Video Streaming in a Muti-access Edge Computing Network

Xinyu Huang, Lijun He, Liejun Wang et al.

The cache and transcoding of the multi-access edge computing (MEC) server and wireless resource allocation in eNodeB interact and determine the quality of experience (QoE) of dynamic adaptive streaming over HTTP (DASH) clients in MEC networks. However, the relationship among the three factors has not been explored, which has led to limited improvement in clients' QoE. Therefore, we propose a joint optimization framework of video segment cache and transcoding in MEC servers and resource allocation to improve the QoE of DASH clients. Based on the established framework, we develop a MEC cache management mechanism that consists of the MEC cache partition, video segment deletion, and MEC cache space transfer. Then, a joint optimization algorithm that combines video segment cache and transcoding in the MEC server and resource allocation is proposed. In the algorithm, the clients' channel state and the playback status and cooperation among MEC servers are employed to estimate the client's priority, video segment presentation switch and continuous playback time. Considering the above four factors, we develop a utility function model of clients' QoE. Then, we formulate a mixed-integer nonlinear programming mathematical model to maximize the total utility of DASH clients, where the video segment cache and transcoding strategy and resource allocation strategy are jointly optimized. To solve this problem, we propose a low-complexity heuristic algorithm that decomposes the original problem into multiple subproblems. The simulation results show that our proposed algorithms efficiently improve client's throughput, received video quality and hit ratio of video segments while decreasing the playback rebuffering time, video segment presentation switch and system backhaul traffic.

MMSep 9, 2019
Hit Ratio Driven Mobile Edge Caching Scheme for Video on Demand Services

Xing Chen, Lijun He, Shang Xu et al.

More and more scholars focus on mobile edge computing (MEC) technology, because the strong storage and computing capabilities of MEC servers can reduce the long transmission delay, bandwidth waste, energy consumption, and privacy leaks in the data transmission process. In this paper, we study the cache placement problem to determine how to cache videos and which videos to be cached in a mobile edge computing system. First, we derive the video request probability by taking into account video popularity, user preference and the characteristic of video representations. Second, based on the acquired request probability, we formulate a cache placement problem with the objective to maximize the cache hit ratio subject to the storage capacity constraints. Finally, in order to solve the formulated problem, we transform it into a grouping knapsack problem and develop a dynamic programming algorithm to obtain the optimal caching strategy. Simulation results show that the proposed algorithm can greatly improve the cache hit ratio.