CV · Apr 17 · Code
NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition
Junguang Yao, Wenye Liu, Stjepan Picek et al.
Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal-aware Voxel Encoding module with adaptive event weighting, 2) Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.
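To make the idea of a voxel encoding of events concrete, here is a minimal sketch of a generic event voxel grid with linear temporal interpolation. It is not NeuroLip's Temporal-aware Voxel Encoding: the adaptive event weighting described in the abstract is not reproduced, and the function name, argument layout, and signed-polarity convention are assumptions made for illustration.

```python
import numpy as np

def events_to_voxel_grid(t, x, y, p, num_bins, height, width):
    """Accumulate events into a (num_bins, height, width) grid.

    Hypothetical helper: each event spreads its signed polarity over the two
    nearest temporal bins, weighted by its distance to each bin.
    """
    voxel = np.zeros((num_bins, height, width), dtype=np.float64)
    if len(t) == 0:
        return voxel
    t = np.asarray(t, dtype=np.float64)
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    # Map polarity {0, 1} (or {-1, +1}) to a signed contribution.
    pol = np.where(np.asarray(p) > 0, 1.0, -1.0)
    # Normalise timestamps to the range [0, num_bins - 1].
    span = max(t.max() - t.min(), 1e-9)
    tn = (t - t.min()) / span * (num_bins - 1)
    lo = np.floor(tn).astype(np.int64)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = tn - lo
    w_lo = 1.0 - w_hi
    # np.add.at handles repeated indices correctly (unbuffered accumulation).
    np.add.at(voxel, (lo, y, x), pol * w_lo)
    np.add.at(voxel, (hi, y, x), pol * w_hi)
    return voxel

# Three events on a 1 x 3 sensor, one per temporal bin.
grid = events_to_voxel_grid(
    t=[0.0, 0.5, 1.0], x=[0, 1, 2], y=[0, 0, 0], p=[1, 0, 1],
    num_bins=3, height=1, width=3,
)
```

Keeping polarity as a signed value preserves motion-direction information in the grid, which is the kind of cue the abstract's polarity regularization is meant to retain.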
CR · Dec 3, 2025 · Code
SELF: A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting
Hanxiu Zhang, Yue Zheng
The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods, whether behavior-based or structural, are vulnerable to false claim attacks or susceptible to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate that SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
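The appeal of singular values as a fingerprint can be illustrated with a small numerical sketch. It is not the SELF extraction procedure: the choice of the product `w_q @ w_k.T`, the normalisation, and all names are assumptions. It shows only that singular values of such a product survive a permutation of the model dimension combined with an orthogonal change of basis inside the head.

```python
import numpy as np

def singular_value_fingerprint(w_q, w_k, k=4):
    """Hypothetical fingerprint: top-k singular values of W_q W_k^T, unit-normalised."""
    product = w_q @ w_k.T
    s = np.linalg.svd(product, compute_uv=False)  # sorted in descending order
    top = s[:k]
    return top / np.linalg.norm(top)

rng = np.random.default_rng(0)
d_model, d_head = 16, 4
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

# Simulated weight manipulation (an assumption, for illustration only):
# permute the model dimension and rotate the head dimension orthogonally.
perm = rng.permutation(d_model)
rotation, _ = np.linalg.qr(rng.standard_normal((d_head, d_head)))
w_q_mod = w_q[perm] @ rotation
w_k_mod = w_k[perm] @ rotation

fp_original = singular_value_fingerprint(w_q, w_k)
fp_modified = singular_value_fingerprint(w_q_mod, w_k_mod)
print(np.allclose(fp_original, fp_modified))
```

The modified product equals P (W_q W_k^T) P^T for a permutation matrix P, and multiplying by orthogonal matrices on either side leaves singular values unchanged, so the two fingerprints agree up to floating-point error.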
CR · May 8
A Unified Open-Set Framework for Scalable PUF-Based Authentication of Heterogeneous IoT Devices
Xin Wang, Peichun Hua, Chip Hong Chang et al.
As modern cyber systems scale to include large populations of heterogeneous IoT devices, securing them against impersonation and forgery is a critical cybersecurity challenge. Physical Unclonable Functions (PUFs) offer a lightweight, hardware-rooted trust anchor for IoT security. However, different PUF architectures possess distinct challenge-response spaces and raw response reliabilities, making existing authentication protocols PUF-type specific. To bridge this interoperability bottleneck, this paper proposes a scalable, helper-data-free, open-set PUF authentication framework that leverages an OpenGAN-based classifier to manage heterogeneous fleets of IoT devices. Our method addresses the limitations of traditional database-centric and digital-twin modeling methods by encoding raw responses from diverse PUF types, including strong, weak and hybrid PUFs, into a unified image representation. This enables robust, single-pass classification and impostor rejection. We integrate the classifier into a generic protocol employing hybrid encryption and Bloom filter-based replay detection. Evaluated across four different types of noisy PUF data (Arbiter, SRAM, DRAM, and heterogeneous PUFs), our framework achieves 100% closed-set accuracy and near-zero open-set error rates with up to 45 devices, a significant improvement over the 3 to 5 devices in prior classification-based approaches. Prototyped on a Raspberry Pi, our framework completes one authentication cycle within 0.67 s, approximately 30x faster than the state-of-the-art open-set baselines.
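Bloom filter-based replay detection can be sketched in a few lines. The class below is a generic Bloom filter built on `hashlib`, not the protocol from the paper: the filter size, the number of hash functions, the use of SHA-256, and the helper `accept_nonce` are all assumptions chosen for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over byte strings (illustrative parameters)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by prefixing a counter to the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def accept_nonce(seen: BloomFilter, nonce: bytes) -> bool:
    """Hypothetical check: reject a message whose nonce was (probably) seen before."""
    if nonce in seen:
        return False
    seen.add(nonce)
    return True

seen = BloomFilter()
first = accept_nonce(seen, b"nonce-1")    # fresh nonce
replayed = accept_nonce(seen, b"nonce-1")  # same nonce again
```

A Bloom filter never misses an item that was added, so a replayed nonce is always caught; the cost is a small false-positive rate, meaning a fresh nonce is occasionally rejected and the device must retry.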
CV · Oct 18, 2025
Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models
Yue Zheng, Xiufang Shi, Jiming Chen et al.
Video anomaly detection (VAD) has advanced rapidly with the recent development of Vision-Language Models (VLMs). While these models offer superior zero-shot detection capabilities, their immense computational cost and unstable visual grounding performance hinder real-time deployment. To overcome these challenges, we introduce Cerberus, a two-stage cascaded system designed for efficient yet accurate real-time VAD. Cerberus learns normal behavioral rules offline, and combines lightweight filtering with fine-grained VLM reasoning during online inference. The performance gains of Cerberus come from two key innovations: motion mask prompting and rule-based deviation detection. The former directs the VLM's attention to regions relevant to motion, while the latter identifies anomalies as deviations from learned norms rather than enumerating possible anomalies. Extensive evaluations on four datasets show that Cerberus achieves an average of 57.68 fps on an NVIDIA L40S GPU, a 151.79× speedup, with 97.2% accuracy, comparable to state-of-the-art VLM-based VAD methods, establishing it as a practical solution for real-time video analytics.
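The cascade idea, a cheap filter that decides whether the expensive model runs at all, can be sketched as follows. The abstract does not specify how Cerberus builds its motion mask or its filter, so simple frame differencing is swapped in here; `heavy_model`, the thresholds, and the function names are hypothetical stand-ins.

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Boolean mask of pixels whose grayscale intensity changed by more than threshold."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

def cascade_step(prev_frame, frame, heavy_model, min_moving_fraction=0.01):
    """Stage 1: skip near-static frames. Stage 2: run the costly model on moving regions."""
    mask = motion_mask(prev_frame, frame)
    if mask.mean() < min_moving_fraction:
        return None  # nothing moved enough; the heavy model is never invoked
    masked = np.where(mask, frame, 0)  # keep only motion-relevant pixels
    return heavy_model(masked)

# Stand-in for an expensive model call (assumption): sums the visible pixels.
def heavy_model(image):
    return int(image.sum())

prev = np.zeros((8, 8), dtype=np.uint8)
cur = prev.copy()
cur[2:4, 2:4] = 200  # a small block appears between frames

static_result = cascade_step(prev, prev, heavy_model)
moving_result = cascade_step(prev, cur, heavy_model)
```

Most frames in surveillance-style video are nearly static, so a filter of this kind lets the heavy stage run on only a fraction of the stream; that is one plausible source of the throughput gain, though the reported speedup depends on the actual system.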
IR · Jun 1, 2025
Breaker: Removing Shortcut Cues with User Clustering for Single-slot Recommendation System
Chao Wang, Yue Zheng, Yujing Zhang et al.
In a single-slot recommendation system, users are only exposed to one item at a time, and the system cannot collect user feedback on multiple items simultaneously. Therefore, only pointwise modeling solutions can be adopted: they model the likelihood that a user clicks or converts on an item in order to learn user-item preferences, but cannot directly capture ranking information among different items. However, since user-side information is often much more abundant than item-side information, the model can quickly learn the differences in user intrinsic tendencies, which are independent of the items they are exposed to. These intrinsic tendencies can then become a shortcut bias for the model, so that the user-item preferences that matter most are insufficiently mined. To solve this challenge, we introduce the Breaker model. Breaker integrates an auxiliary task of user representation clustering with a multi-tower structure for cluster-specific preference modeling. By clustering user representations, we ensure that users within each cluster exhibit similar characteristics, which increases the complexity of the pointwise recommendation task on the user side. This forces the multi-tower structure with cluster-driven parameter learning to better model user-item preferences, ultimately eliminating shortcut biases related to user intrinsic tendencies. In terms of training, we propose a delayed parameter update mechanism to enhance training stability and convergence, enabling end-to-end joint training of the auxiliary clustering and classification tasks. Both offline and online experiments demonstrate that our method surpasses the baselines. It has already been deployed and is actively serving tens of millions of users daily on Meituan, one of the most popular e-commerce platforms for services.
ME · Dec 14, 2023
Temporal-Spatial Entropy Balancing for Causal Continuous Treatment-Effect Estimation
Tao Hu, Honglong Zhang, Fan Zeng et al.
In the field of intracity freight transportation, changes in order volume are significantly influenced by temporal and spatial factors. When building subsidy and pricing strategies, predicting the causal effects of these strategies on order volume is crucial. Confounding variables can bias the estimation of these causal effects. Traditional methods for controlling confounding variables treat the data as a whole, which cannot ensure the precision of causal effects in specific temporal and spatial dimensions. Because temporal and spatial dimensions are critical in the logistics field, this limitation may directly affect the precision of subsidy and pricing strategies. To address these issues, this study proposes a technique based on flexible temporal-spatial grid partitioning. Building on this grid partitioning technique, we further propose a continuous entropy balancing method in the temporal-spatial domain, which we name TS-EBCT (Temporal-Spatial Entropy Balancing for Causal Continuous Treatments). The proposed method has been tested on two simulation datasets and two real datasets, and it performs well on all of them. After TS-EBCT was applied to intracity freight transportation, the prediction accuracy of the causal effect improved significantly, bringing business benefits to the company's subsidy and pricing strategies.
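For readers unfamiliar with entropy balancing for a continuous treatment, the sketch below shows the generic idea on synthetic data: find maximum-entropy weights under which the covariate and treatment means are preserved and their weighted covariance is zero. It is not TS-EBCT. The temporal-spatial grid partitioning is omitted, the data are simulated, and the function name and the damped Newton solver are assumptions; one could imagine running such a routine separately within each grid cell.

```python
import numpy as np

def entropy_balance(constraints, max_iter=100, tol=1e-9):
    """Weights w (w > 0, sum 1) closest to uniform with constraints.T @ w = 0.

    Solves the convex dual, minimising log(sum(exp(constraints @ lam))),
    with a damped Newton method. Hypothetical helper for illustration.
    """
    n, d = constraints.shape
    lam = np.zeros(d)

    def dual(l):
        z = constraints @ l
        m = z.max()
        return m + np.log(np.exp(z - m).sum())

    def weights(l):
        z = constraints @ l
        w = np.exp(z - z.max())
        return w / w.sum()

    for _ in range(max_iter):
        w = weights(lam)
        grad = constraints.T @ w            # current imbalance
        if np.linalg.norm(grad) < tol:
            break
        centered = constraints - grad
        hess = (centered * w[:, None]).T @ centered
        step = np.linalg.solve(hess + 1e-12 * np.eye(d), grad)
        scale, current = 1.0, dual(lam)
        # Backtrack until the dual objective does not increase.
        while dual(lam - scale * step) > current + 1e-12 and scale > 1e-8:
            scale *= 0.5
        lam = lam - scale * step
    return weights(lam)

# Synthetic confounded data (assumption): treatment depends on covariate x.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
t = 0.8 * x + rng.standard_normal(500)
xc, tc = x - x.mean(), t - t.mean()

# Balance targets: keep both means, remove the covariate-treatment covariance.
w = entropy_balance(np.column_stack([xc, tc, xc * tc]))
weighted_cov = float(np.sum(w * xc * tc))
```

After reweighting, the treatment is uncorrelated with the covariate in the weighted sample, so a weighted regression of the outcome on the treatment is no longer confounded by that covariate, at least linearly.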
RO · Jun 5, 2021
A Split-face Study of Novel Robotic Prototype vs Human Operator in Skin Rejuvenation Using Q-switched Nd:Yag Laser: Accuracy, Efficacy and Safety
Si Un Chan, Cheong Cheong Ip, Chengxiang Lian et al.
Background: Robotic technologies for skin laser treatment are emerging. Objective: To compare the accuracy, efficacy and safety of a novel robotic prototype with those of a human operator in laser operation for skin photo-rejuvenation. Methods: Seventeen subjects were enrolled in a prospective, comparative split-face trial. Q-switched 1064 nm laser treatment was delivered by the robotic prototype on the right side of the face and by a professional practitioner on the left. Each subject underwent a single, one-pass, non-overlapping treatment on equal-sized areas of the forehead and cheek. Objective assessments included treatment duration, number of laser irradiation shots, laser coverage percentage, VISIA parameters, skin temperature and the VAS pain scale. Results: The average time taken by the robotic manipulator was longer than that of the human operator; the average number of irradiation shots did not differ significantly between the two sides. The laser coverage rate of the robotic manipulator (60.2 ± 15.1%) was greater than that of the human operator (43.6 ± 12.9%). The VISIA parameters showed no significant differences between the robotic manipulator and the human operator. No short-term or long-term side effects were observed, with a maximum VAS score of 1 point. Limitations: Only one session of laser treatment was performed. Conclusion: Laser operation by the novel robotic prototype is more reliable, stable and accurate than human operation.
LG · Sep 10, 2019
Reinforcement Learning and Video Games
Yue Zheng
Reinforcement learning combined with deep learning methods has exceeded human-level performance in game-playing AI, according to DeepMind's experiments on Go and Atari games. Deep learning solves the high-dimensional input problems that stalled the development of reinforcement learning for many years. This study combines the two techniques to create several agents with different algorithms that successfully learn to play T-rex Runner. The Deep Q-network algorithm and three types of improvements are implemented to train the agent. The results from some of them are far from satisfactory, but others are better than human experts. Batch normalization is a method for addressing internal covariate shift in deep neural networks, and this study also demonstrates its positive influence on reinforcement learning.
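As a reminder of what the Deep Q-network algorithm optimises, the sketch below computes the standard one-step Q-learning targets for a batch of transitions. It is a generic illustration, not the study's code: the function name, the batch layout, and the discount factor are assumptions, and the three improvements mentioned in the abstract are not shown.

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step targets: r + gamma * max_a Q(s', a), with no bootstrap on terminal steps.

    rewards:       shape (batch,)
    next_q_values: shape (batch, num_actions), Q-values for the next state
    dones:         shape (batch,), 1.0 where the episode ended, else 0.0
    """
    max_next = next_q_values.max(axis=1)
    return rewards + gamma * max_next * (1.0 - dones)

# Two transitions: the second one ends the episode, so its target is the reward alone.
targets = dqn_targets(
    rewards=np.array([1.0, 0.0]),
    next_q_values=np.array([[0.5, 2.0], [1.0, 3.0]]),
    dones=np.array([0.0, 1.0]),
    gamma=0.9,
)
```

The network is then trained to move its prediction Q(s, a) for the action actually taken toward these targets, typically with a squared or Huber loss.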
CV · Nov 19, 2018
Intention Oriented Image Captions with Guiding Objects
Yue Zheng, Yali Li, Shengjin Wang
Although existing image caption models can produce promising results using recurrent neural networks (RNNs), it is difficult to guarantee that an object we care about is contained in generated descriptions, for example when the object is inconspicuous in the image. The problem becomes even harder when these objects do not appear in the training stage. In this paper, we propose a novel approach for generating image captions with guiding objects (CGO). The CGO constrains the model to involve a human-concerned object when the object is in the image. CGO ensures that the object is in the generated description while maintaining fluency. Instead of generating the sequence from left to right, we start the description with a selected object and generate other parts of the sequence based on this object. To achieve this, we design a novel framework combining two LSTMs in opposite directions. We demonstrate the characteristics of our method on MSCOCO, where we generate descriptions for each detected object in the images. With CGO, we can extend the ability of description to the objects being neglected in image caption labels and provide a set of more comprehensive and diverse descriptions for an image. CGO shows advantages when applied to the task of describing novel objects. We show experimental results on both MSCOCO and ImageNet datasets. Evaluations show that our method outperforms the state-of-the-art models in the task with an average F1 of 75.8, leading to better descriptions in terms of both content accuracy and fluency.