CVApr 7, 2022Code
HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detectionJiashun Suo, Tianyi Wang, Xingzhou Zhang et al.
We present the HIT-UAV dataset, a high-altitude infrared thermal dataset for object detection applications on Unmanned Aerial Vehicles (UAVs). The dataset comprises 2,898 infrared thermal images extracted from 43,470 frames in hundreds of videos captured by UAVs in various scenarios including schools, parking lots, roads, and playgrounds. Moreover, the HIT-UAV provides essential flight data for each image, such as flight altitude, camera perspective, date, and daylight intensity. For each image, we have manually annotated object instances with bounding boxes of two types (oriented and standard) to tackle the challenge of significant overlap of object instances in aerial images. To the best of our knowledge, the HIT-UAV is the first publicly available high-altitude UAV-based infrared thermal dataset for detecting persons and vehicles. We have trained and evaluated well-established object detection algorithms on the HIT-UAV. Our results demonstrate that the detection algorithms perform exceptionally well on the HIT-UAV compared to visual light datasets since infrared thermal images do not contain significant irrelevant information about objects. We believe that the HIT-UAV will contribute to various UAV-based applications and researches. The dataset is freely available at https://github.com/suojiashun/HIT-UAV-Infrared-Thermal-Dataset.
CVSep 12, 2022
Deep Convolutional Pooling Transformer for Deepfake DetectionTianyi Wang, Harry Cheng, Kam Pui Chow et al.
Recently, Deepfake has drawn considerable public attention due to security and privacy concerns in social media digital forensics. As the wildly spreading Deepfake videos on the Internet become more realistic, traditional detection techniques have failed in distinguishing between real and fake. Most existing deep learning methods mainly focus on local features and relations within the face image using convolutional neural networks as a backbone. However, local features and relations are insufficient for model training to learn enough general information for Deepfake detection. Therefore, the existing Deepfake detection methods have reached a bottleneck to further improve the detection performance. To address this issue, we propose a deep convolutional Transformer to incorporate the decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ the barely discussed image keyframes in model training for performance improvement and visualize the feature quantity gap between the key and normal image frames caused by video compression. We finally illustrate the transferability with extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines on both within- and cross-dataset experiments.
CVMar 4, 2022
Voice-Face Homogeneity Tells DeepfakeHarry Cheng, Yangyang Guo, Tianyi Wang et al.
Detecting forgery videos is highly desirable due to the abuse of deepfake. Existing detection approaches contribute to exploring the specific artifacts in deepfake videos and fit well on certain data. However, the growing technique on these artifacts keeps challenging the robustness of traditional deepfake detectors. As a result, the development of generalizability of these approaches has reached a blockage. To address this issue, given the empirical results that the identities behind voices and faces are often mismatched in deepfake videos, and the voices and faces have homogeneity to some extent, in this paper, we propose to perform the deepfake detection from an unexplored voice-face matching view. To this end, a voice-face matching method is devised to measure the matching degree of these two. Nevertheless, training on specific deepfake datasets makes the model overfit certain traits of deepfake algorithms. We instead, advocate a method that quickly adapts to untapped forgery, with a pre-training then fine-tuning paradigm. Specifically, we first pre-train the model on a generic audio-visual dataset, followed by the fine-tuning on downstream deepfake data. We conduct extensive experiments over three widely exploited deepfake datasets - DFDC, FakeAVCeleb, and DeepfakeTIMIT. Our method obtains significant performance gains as compared to other state-of-the-art competitors. It is also worth noting that our method already achieves competitive results when fine-tuned on limited deepfake data.
HCMar 28, 2023
XAIR: A Framework of Explainable AI in Augmented RealityXuhai Xu, Mengjie Yu, Tanya R. Jonker et al.
Explainable AI (XAI) has established itself as an important component of AI-driven interactive systems. With Augmented Reality (AR) becoming more integrated in daily lives, the role of XAI also becomes essential in AR because end-users will frequently interact with intelligent services. However, it is unclear how to design effective XAI experiences for AR. We propose XAIR, a design framework that addresses "when", "what", and "how" to provide explanations of AI output in AR. The framework was based on a multi-disciplinary literature review of XAI and HCI research, a large-scale survey probing 500+ end-users' preferences for AR-based explanations, and three workshops with 12 experts collecting their insights about XAI design in AR. XAIR's utility and effectiveness was verified via a study with 10 designers and another study with 12 end-users. XAIR can provide guidelines for designers, inspiring them to identify new design opportunities and achieve effective XAI designs in AR.
CVJul 14, 2022
Detecting Volunteer Cotton Plants in a Corn Field with Deep Learning on UAV Remote-Sensing ImageryPappu Kumar Yadav, J. Alex Thomasson, Robert Hardin et al.
The cotton boll weevil, Anthonomus grandis Boheman is a serious pest to the U.S. cotton industry that has cost more than 16 billion USD in damages since it entered the United States from Mexico in the late 1800s. This pest has been nearly eradicated; however, southern part of Texas still faces this issue and is always prone to the pest reinfestation each year due to its sub-tropical climate where cotton plants can grow year-round. Volunteer cotton (VC) plants growing in the fields of inter-seasonal crops, like corn, can serve as hosts to these pests once they reach pin-head square stage (5-6 leaf stage) and therefore need to be detected, located, and destroyed or sprayed . In this paper, we present a study to detect VC plants in a corn field using YOLOv3 on three band aerial images collected by unmanned aircraft system (UAS). The two-fold objectives of this paper were : (i) to determine whether YOLOv3 can be used for VC detection in a corn field using RGB (red, green, and blue) aerial images collected by UAS and (ii) to investigate the behavior of YOLOv3 on images at three different scales (320 x 320, S1; 416 x 416, S2; and 512 x 512, S3 pixels) based on average precision (AP), mean average precision (mAP) and F1-score at 95% confidence level. No significant differences existed for mAP among the three scales, while a significant difference was found for AP between S1 and S3 (p = 0.04) and S2 and S3 (p = 0.02). A significant difference was also found for F1-score between S2 and S3 (p = 0.02). The lack of significant differences of mAP at all the three scales indicated that the trained YOLOv3 model can be used on a computer vision-based remotely piloted aerial application system (RPAAS) for VC detection and spray application in near real-time.
CVJul 31, 2022
Assessing The Performance of YOLOv5 Algorithm for Detecting Volunteer Cotton Plants in Corn Fields at Three Different Growth StagesPappu Kumar Yadav, J. Alex Thomasson, Stephen W. Searcy et al.
The boll weevil (Anthonomus grandis L.) is a serious pest that primarily feeds on cotton plants. In places like Lower Rio Grande Valley of Texas, due to sub-tropical climatic conditions, cotton plants can grow year-round and therefore the left-over seeds from the previous season during harvest can continue to grow in the middle of rotation crops like corn (Zea mays L.) and sorghum (Sorghum bicolor L.). These feral or volunteer cotton (VC) plants when reach the pinhead squaring phase (5-6 leaf stage) can act as hosts for the boll weevil pest. The Texas Boll Weevil Eradication Program (TBWEP) employs people to locate and eliminate VC plants growing by the side of roads or fields with rotation crops but the ones growing in the middle of fields remain undetected. In this paper, we demonstrate the application of computer vision (CV) algorithm based on You Only Look Once version 5 (YOLOv5) for detecting VC plants growing in the middle of corn fields at three different growth stages (V3, V6, and VT) using unmanned aircraft systems (UAS) remote sensing imagery. All the four variants of YOLOv5 (s, m, l, and x) were used and their performances were compared based on classification accuracy, mean average precision (mAP), and F1-score. It was found that YOLOv5s could detect VC plants with a maximum classification accuracy of 98% and mAP of 96.3 % at the V6 stage of corn while YOLOv5s and YOLOv5m resulted in the lowest classification accuracy of 85% and YOLOv5m and YOLOv5l had the least mAP of 86.5% at the VT stage on images of size 416 x 416 pixels. The developed CV algorithm has the potential to effectively detect and locate VC plants growing in the middle of corn fields as well as expedite the management aspects of TBWEP.
LGJan 3, 2023
Metalearning generalizable dynamics from trajectoriesQiaofeng Li, Tianyi Wang, Vwani Roychowdhury et al.
We present the interpretable meta neural ordinary differential equation (iMODE) method to rapidly learn generalizable (i.e., not parameter-specific) dynamics from trajectories of multiple dynamical systems that vary in their physical parameters. The iMODE method learns meta-knowledge, the functional variations of the force field of dynamical system instances without knowing the physical parameters, by adopting a bi-level optimization framework: an outer level capturing the common force field form among studied dynamical system instances and an inner level adapting to individual system instances. A priori physical knowledge can be conveniently embedded in the neural network architecture as inductive bias, such as conservative force field and Euclidean symmetry. With the learned meta-knowledge, iMODE can model an unseen system within seconds, and inversely reveal knowledge on the physical parameters of a system, or as a Neural Gauge to "measure" the physical parameters of an unseen system with observed trajectories. We test the validity of the iMODE method on bistable, double pendulum, Van der Pol, Slinky, and reaction-diffusion systems.
CVJul 24, 2023
Towards Generalizable Deepfake Detection by Primary Region RegularizationHarry Cheng, Yangyang Guo, Tianyi Wang et al.
The existing deepfake detection methods have reached a bottleneck in generalizing to unseen forgeries and manipulation approaches. Based on the observation that the deepfake detectors exhibit a preference for overfitting the specific primary regions in input, this paper enhances the generalization capability from a novel regularization perspective. This can be simply achieved by augmenting the images through primary region removal, thereby preventing the detector from over-relying on data bias. Our method consists of two stages, namely the static localization for primary region maps, as well as the dynamic exploitation of primary region masks. The proposed method can be seamlessly integrated into different backbones without affecting their inference efficiency. We conduct extensive experiments over three widely used deepfake datasets - DFDC, DF-1.0, and Celeb-DF with five backbones. Our method demonstrates an average performance improvement of 6% across different backbones and performs competitively with several state-of-the-art baselines.
CLMar 8, 2022
Which side are you on? Insider-Outsider classification in conspiracy-theoretic social mediaPavan Holur, Tianyi Wang, Shadi Shahsavari et al.
Social media is a breeding ground for threat narratives and related conspiracy theories. In these, an outside group threatens the integrity of an inside group, leading to the emergence of sharply defined group identities: Insiders -- agents with whom the authors identify and Outsiders -- agents who threaten the insiders. Inferring the members of these groups constitutes a challenging new NLP task: (i) Information is distributed over many poorly-constructed posts; (ii) Threats and threat agents are highly contextual, with the same post potentially having multiple agents assigned to membership in either group; (iii) An agent's identity is often implicit and transitive; and (iv) Phrases used to imply Outsider status often do not follow common negative sentiment patterns. To address these challenges, we define a novel Insider-Outsider classification task. Because we are not aware of any appropriate existing datasets or attendant models, we introduce a labeled dataset (CT5K) and design a model (NP2IO) to address this task. NP2IO leverages pretrained language modeling to classify Insiders and Outsiders. NP2IO is shown to be robust, generalizing to noun phrases not seen during training, and exceeding the performance of non-trivial baseline models by $20\%$.
CVMay 18Code
EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal PredictionAhmad Yehia, Abduallah Mohamed, Tianyi Wang et al.
Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at https://github.com/yehiahmad/EgoTraj.
CVSep 23, 2024Code
Revisiting Video Quality Assessment from the Perspective of GeneralizationXinli Yue, Jianhui Sun, Liangchao Yao et al.
The increasing popularity of short video platforms such as YouTube Shorts, TikTok, and Kwai has led to a surge in User-Generated Content (UGC), which presents significant challenges for the generalization performance of Video Quality Assessment (VQA) tasks. These challenges not only affect performance on test sets but also impact the ability to generalize across different datasets. While prior research has primarily focused on enhancing feature extractors, sampling methods, and network branches, it has largely overlooked the generalization capabilities of VQA tasks. In this work, we reevaluate the VQA task from a generalization standpoint. We begin by analyzing the weight loss landscape of VQA models, identifying a strong correlation between this landscape and the generalization gaps. We then investigate various techniques to regularize the weight loss landscape. Our results reveal that adversarial weight perturbations can effectively smooth this landscape, significantly improving the generalization performance, with cross-dataset generalization and fine-tuning performance enhanced by up to 1.8% and 3%, respectively. Through extensive experiments across various VQA methods and datasets, we validate the effectiveness of our approach. Furthermore, by leveraging our insights, we achieve state-of-the-art performance in Image Quality Assessment (IQA) tasks. Our code is available at https://github.com/XinliYue/VQA-Generalization.
LGMay 6, 2022
Diverse Imitation Learning via Self-Organizing Generative ModelsArash Vahabpour, Tianyi Wang, Qiujing Lu et al.
Imitation learning is the task of replicating expert policy from demonstrations, without access to a reward function. This task becomes particularly challenging when the expert exhibits a mixture of behaviors. Prior work has introduced latent variables to model variations of the expert policy. However, our experiments show that the existing works do not exhibit appropriate imitation of individual modes. To tackle this problem, we adopt an encoder-free generative model for behavior cloning (BC) to accurately distinguish and imitate different modes. Then, we integrate it with GAIL to make the learning robust towards compounding errors at unseen states. We show that our method significantly outperforms the state of the art across multiple experiments.
AIFeb 11Code
Found-RL: foundation model-enhanced reinforcement learning for autonomous drivingYansong Qu, Zihao Sheng, Zilin Huang et al.
Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms: Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP's dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can achieve near-VLM performance compared with billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
CVNov 20, 2022
Deepfake Detection: A Comprehensive Survey from the Reliability PerspectiveTianyi Wang, Xin Liao, Kam Pui Chow et al.
The mushroomed Deepfake synthetic materials circulated on the internet have raised a profound social impact on politicians, celebrities, and individuals worldwide. In this survey, we provide a thorough review of the existing Deepfake detection studies from the reliability perspective. We identify three reliability-oriented research challenges in the current Deepfake detection domain: transferability, interpretability, and robustness. Moreover, while solutions have been frequently addressed regarding the three challenges, the general reliability of a detection model has been barely considered, leading to the lack of reliable evidence in real-life usages and even for prosecutions on Deepfake-related cases in court. We, therefore, introduce a model reliability study metric using statistical random sampling knowledge and the publicly available benchmark datasets to review the reliability of the existing detection models on arbitrary Deepfake candidate suspects. Case studies are further executed to justify the real-life Deepfake cases including different groups of victims with the help of the reliably qualified detection models as reviewed in this survey. Reviews and experiments on the existing approaches provide informative discussions and future research directions for Deepfake detection.
IVJul 15, 2022
Computer Vision for Volunteer Cotton Detection in a Corn Field with UAS Remote Sensing Imagery and Spot Spray ApplicationsPappu Kumar Yadav, J. Alex Thomasson, Stephen W. Searcy et al.
To control boll weevil (Anthonomus grandis L.) pest re-infestation in cotton fields, the current practices of volunteer cotton (VC) (Gossypium hirsutum L.) plant detection in fields of rotation crops like corn (Zea mays L.) and sorghum (Sorghum bicolor L.) involve manual field scouting at the edges of fields. This leads to many VC plants growing in the middle of fields remain undetected that continue to grow side by side along with corn and sorghum. When they reach pinhead squaring stage (5-6 leaves), they can serve as hosts for the boll weevil pests. Therefore, it is required to detect, locate and then precisely spot-spray them with chemicals. In this paper, we present the application of YOLOv5m on radiometrically and gamma-corrected low resolution (1.2 Megapixel) multispectral imagery for detecting and locating VC plants growing in the middle of tasseling (VT) growth stage of cornfield. Our results show that VC plants can be detected with a mean average precision (mAP) of 79% and classification accuracy of 78% on images of size 1207 x 923 pixels at an average inference speed of nearly 47 frames per second (FPS) on NVIDIA Tesla P100 GPU-16GB and 0.4 FPS on NVIDIA Jetson TX2 GPU. We also demonstrate the application of a customized unmanned aircraft systems (UAS) for spot-spray applications based on the developed computer vision (CV) algorithm and how it can be used for near real-time detection and mitigation of VC plants growing in corn fields for efficient management of the boll weevil pests.
CVMay 6
Detecting Deepfakes via Hamiltonian DynamicsHarry Cheng, Ming-Hui Liu, Tianyi Wang et al.
Driven by the rapid development of generative AI models, deepfake detectors are compelled to undergo periodic recalibration to capture newly developed synthetic artifacts. To break this cycle, we propose a new perspective on deepfake detection: moving from static pattern recognition to dynamical stability analysis. Specifically, our approach is motivated by physics-inspired priors: we hypothesize that natural images, as products of dissipative physical processes, tend to settle near stable, low-energy equilibria. In contrast, generative models optimize for statistical similarity to real images but do not explicitly enforce structural constraints such as geometric smoothness, leaving deepfakes more likely to occupy unstable, high-energy states. To operationalize this, we introduce Hamiltonian Action Anomaly Detection (HAAD), comprising three contributions: \textbf{i)} We model the image latent manifold as a potential energy surface. Under this hypothesis, real images are expected to produce basin-like low-energy responses, whereas fake images are more likely to induce high-potential, high-gradient responses. \textbf{ii)} We employ Hamiltonian-inspired dynamics as a stability probe. By releasing latent states from rest, samples near stable regions remain bounded, while high-gradient samples produce larger trajectory responses. \textbf{iii)} We quantify these dynamic behaviors through two trajectory statistics, \ie, Hamiltonian action and energy dissipation. Extensive experiments show that HAAD outperforms evaluated state-of-the-art baselines on challenging cross-dataset transfer benchmarks, supporting a physics-inspired stability prior for digital forensics.
CVMay 12Code
RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq PredictionYaxuan Song, Jianan Fan, Tianyi Wang et al.
Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
LGJul 22, 2024
Spatial-Temporal Cross-View Contrastive Pre-training for Check-in Sequence Representation LearningLetian Gong, Huaiyu Wan, Shengnan Guo et al.
The rapid growth of location-based services (LBS) has yielded massive amounts of data on human mobility. Effectively extracting meaningful representations for user-generated check-in sequences is pivotal for facilitating various downstream services. However, the user-generated check-in data are simultaneously influenced by the surrounding objective circumstances and the user's subjective intention. Specifically, the temporal uncertainty and spatial diversity exhibited in check-in data make it difficult to capture the macroscopic spatial-temporal patterns of users and to understand the semantics of user mobility activities. Furthermore, the distinct characteristics of the temporal and spatial information in check-in sequences call for an effective fusion method to incorporate these two types of information. In this paper, we propose a novel Spatial-Temporal Cross-view Contrastive Representation (STCCR) framework for check-in sequence representation learning. Specifically, STCCR addresses the above challenges by employing self-supervision from "spatial topic" and "temporal intention" views, facilitating effective fusion of spatial and temporal information at the semantic level. Besides, STCCR leverages contrastive clustering to uncover users' shared spatial topics from diverse mobility activities, while employing angular momentum contrast to mitigate the impact of temporal uncertainty and noise. We extensively evaluate STCCR on three real-world datasets and demonstrate its superior performance across three downstream tasks.
RODec 4, 2025
XR-DT: Extended Reality-Enhanced Digital Twin for Agentic Mobile RobotsTianyi Wang, Jiseop Byeon, Ahmad Yehia et al.
As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial progress has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences, impeding deployment in safety-critical and socially embedded environments. This paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for agentic mobile robots, that bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates virtual-, augmented-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable AR devices. Within this framework, we design an agentic mobile robot system with a unified diffusion policy for context-aware task adaptation. We further propose a chain-of-thought prompting mechanism that allows multimodal large language models to reason over human instructions and environmental context, while leveraging an AutoGen-based multi-agent coordination layer to enhance robustness and collaboration in dynamic tasks. Initial experimental results demonstrate accurate human and robot trajectory prediction, validating the XR-DT framework's effectiveness in HRI tasks. By embedding human intention, environmental dynamics, and robot cognition into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.
SYMay 24
DBPnet: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Wheel Load EstimationTianyi Wang, Tianyi Zeng, Zimo Zeng et al.
Advanced driver assistance systems (ADAS) play an important role in modern automotive intelligence, significantly enhancing vehicle safety and stability. The performance of ADAS critically relies on accurate and reliable vehicle state estimation, particularly from vehicle dynamic sensors. Among these signals, wheel load is a key variable for chassis control and safety-critical functions, yet it remains difficult to estimate robustly due to complex suspension geometry, nonlinear dynamics, and measurement noise. To address this issue, we propose DBPnet, a Bayesian physics-informed neural network (PINN) with a physics-aware embedding module inspired by damper characteristics. First, this paper presents a suspension linkage-level modeling (SLLM) approach that constructs a nonlinear instantaneous dynamic model by explicitly considering the complex geometric structure of the suspension. Building upon SLLM, Bayesian inference is integrated into the PINN to effectively cope with noise and uncertainty in the vehicle chassis system, thereby improving the model's robustness. Then, a physics-informed loss function is employed to ensure consistency with fundamental physical principles, while the damper characteristics-inspired embedding module extracts temporal variation features of input signals and incorporates them into each layer of the PINN, ensuring that physical observations guide the neural network without being constrained by fixed physical models. Extensive evaluations on high-fidelity simulations and real-world experiments demonstrate that our DBPnet consistently achieves lower RMSE and MaxError than baseline methods. These results highlight the potential of our DBPnet to advance wheel load estimation and contribute to the development of more reliable ADAS actuator functions.
SYMay 24
Solar phased arrays-based wireless power transfer for commercial airlines can reduce energy costs and carbon emissions in the United StatesTianyi Wang, Yiming Xu, Jiseop Byeon et al.
Decarbonizing aviation remains challenging because energy-dense jet fuels dominate beyond short-range operations, while batteries impose severe range and payload penalties. Here we evaluate a new infrastructure pathway in which utility-scale solar farms equipped with solar phased arrays wirelessly beam microwave power to hybrid-electric aircraft during cruise. Integrating 143,152 U.S. flight trajectories, 5,712 solar farms and wireless power transfer models, we quantify the spatial, temporal, and operational potential of this concept at continental scale. We find that benefits are highly concentrated in solar-rich, traffic-dense states and are dominated by short- and medium-range flights, accounting for nearly all delivered energy and cost savings. Schedule optimization and higher cruise altitudes further increase value by improving alignment between aircraft demand and beaming availability. Market penetration analysis reveals non-linear scaling between solar farm and flight adoption. These results show that wireless power beaming is best understood as a corridor-specific strategy complementing other aviation decarbonization pathways.
CVAug 4, 2023
CTP-Net: Character Texture Perception Network for Document Image Forgery LocalizationXin Liao, Siliang Chen, Jiaxin Chen et al.
Due to the progression of information technology in recent years, document images have been widely disseminated on social networks. With the help of powerful image editing tools, document images are easily forged without leaving visible manipulation traces, which leads to severe issues if significant information is falsified for malicious use. Therefore, the research of document image forensics is worth further exploring. In this paper, we propose a Character Texture Perception Network (CTP-Net) to localize the forged regions in document images. Specifically, considering the characters with semantics in a document image are highly vulnerable, capturing the forgery traces is the key to localize the forged regions. We design a Character Texture Stream (CTS) based on optical character recognition to capture features of text areas that are essential components of a document image. Meanwhile, texture features of the whole document image are exploited by an Image Texture Stream (ITS). Combining the features extracted from the CTS and the ITS, the CTP-Net can reveal more subtle forgery traces from document images. Moreover, to overcome the challenge caused by the lack of fake document images, we design a data generation strategy that is utilized to construct a Fake Chinese Trademark dataset (FCTM). Experimental results on different datasets demonstrate that the proposed CTP-Net is able to localize multi-scale forged areas in document images, and outperform the state-of-the-art forgery localization methods, even though post-processing operations are applied.
CVNov 2, 2023
Robust Identity Perceptual Watermark Against Deepfake Face SwappingTianyi Wang, Mengxiao Huang, Harry Cheng et al.
Notwithstanding offering convenience and entertainment to society, Deepfake face swapping has caused critical privacy issues with the rapid development of deep generative models. Due to imperceptible artifacts in high-quality synthetic images, passive detection models against face swapping in recent years usually suffer performance damping regarding the generalizability issue in cross-domain scenarios. Therefore, several studies have been attempted to proactively protect the original images against malicious manipulations by inserting invisible signals in advance. However, existing proactive defense approaches demonstrate unsatisfactory results with respect to visual quality, detection accuracy, and source tracing ability. In this study, to fulfill the research gap, we propose a robust identity perceptual watermarking framework that concurrently performs detection and source tracing against Deepfake face swapping proactively. We innovatively assign identity semantics regarding the image contents to the watermarks and devise an unpredictable and nonreversible chaotic encryption system to ensure watermark confidentiality. The watermarks are robustly encoded and recovered by jointly training an encoder-decoder framework along with adversarial image manipulations. For a suspect image, falsification is accomplished by justifying the consistency between the content-matched identity perceptual watermark and the recovered robust watermark, without requiring the ground-truth. Moreover, source tracing can be accomplished based on the identity semantics that the recovered watermark carries. Extensive experiments demonstrate state-of-the-art detection and source tracing performance against Deepfake face swapping with promising watermark robustness for both cross-dataset and cross-manipulation settings.
LGJan 25, 2023
Recursive deep learning framework for forecasting the decadal world economic outlookTianyi Wang, Rodney Beard, John Hawkins et al.
The gross domestic product (GDP) is the most widely used indicator in macroeconomics and the main tool for measuring a country's economic output. Due to the diversity and complexity of the world economy, a wide range of models have been used, but there are challenges in making decadal GDP forecasts given unexpected changes such as emergence of catastrophic world events including pandemics and wars. Deep learning models are well suited for modelling temporal sequences and time series forecasting. In this paper, we develop a deep learning framework to forecast the GDP growth rate of the world economy over a decade. We use the Penn World Table as the data source featuring 13 countries prior to the COVID-19 pandemic, such as Australia, China, India, and the United States. We present a recursive deep learning framework to predict the GDP growth rate in the next ten years. We test prominent deep learning models and compare their results with traditional econometric models for selected developed and developing countries. Our decadal forecasts reveal that that most of the developed countries would experience economic growth slowdown, stagnation and even recession within five years (2020-2024). Furthermore, our model forecasts show that only China, France, and India would experience stable GDP growth.
IVApr 17, 2024Code
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and ResultsXin Li, Kun Yuan, Yajing Pei et al.
This paper reviews the NTIRE 2024 Challenge on Shortform UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions are submitted and evaluated on the collected dataset KVQ from popular short-form video platform, i.e., Kuaishou/Kwai Platform. The KVQ database is divided into three parts, including 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performances for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.
CVSep 15, 2024
MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery DetectionYaning Zhang, Tianyi Wang, Zitong Yu et al.
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods tend to identify facial images generated by GAN, but struggle to detect unseen diffusion-synthesized ones. To address the limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method to emphasize relevant negative pairs and suppress irrelevant ones, allowing cross-modality sample pairs to conduct more flexible alignment. Extensive experiments and visualizations show that our model outperforms the state of the arts on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.
MAMay 19
PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent SocietiesAhmad Yehia, Abduallah Mohamed, Kun Qian et al.
Generative agents based on large language models reproduce believable human behavior in cooperative settings, but how they should reason in situations where rule-breaking may be required, such as fire evacuation or authority-supervised emergency, remains poorly characterized. We propose PAVE (Perception, Assessment, Verdict, Emulation), a novel four-module cognitive architecture that addresses this gap end to end: (i) Perception extracts a structured context with explicit authority distance, peer behaviors, and severity-tagged situational cues; (ii) Assessment scores the context along five scalars including an explicit legitimacy judgment that checks necessity, proportionality, and absence of alternatives; (iii) Verdict decides to comply or violate under a hard legitimacy gate, with a per-agent threshold elicited from the persona; (iv) Emulation enacts the verdict and scopes the violation to the rule the trigger justifies. We instantiate PAVE in Voville, a tile-based traffic environment forked from Smallville, and evaluate across three scenarios, four LLM backbones, and a focused ablation. PAVE agents satisfy four properties simultaneously: legitimate violation (only when a trigger justifies it), authority deference (officer instructions override even high legitimacy), bounded scope (violations confined to the targeted rule), and recovery (baseline restored once the trigger ends). PAVE agents make more structured and interpretable decisions than vanilla across all four properties, and human evaluators rate them as more plausible. Ablating the legitimacy gate reproduces vanilla-like failures. We release Voville, the PAVE prompts and code, and the evaluation pipeline.
CVMay 19
Stage-adaptive Token Selection for Efficient Omni-modal LLMsZijie Xin, Jie Yang, Ruixiang Zhao et al.
Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.
CVApr 24
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal EmbeddingsPeixi Wu, Ke Mei, Feipeng Ma et al.
Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
AIMar 3
LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language ModelXiangyu Li, Tianyi Wang, Xi Cheng et al.
Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling and often lack semantic abstraction, limiting interpretability and robustness in complex traffic environments. This paper presents LLM-MLFFN, a novel large language model (LLM)-enhanced multi-level feature fusion network designed to address the complexities of multi-dimensional driving data. The proposed LLM-MLFFN framework integrates priors from largescale pre-trained models and employs a multi-level approach to enhance classification accuracy. LLM-MLFFN comprises three core components: (1) a multi-level feature extraction module that extracts statistical, behavioral, and dynamic features to capture the quantitative aspects of driving behaviors; (2) a semantic description module that leverages LLMs to transform raw data into high-level semantic features; and (3) a dual-channel multi-level feature fusion network that combines numerical and semantic features using weighted attention mechanisms to improve robustness and prediction accuracy. Evaluation on the Waymo open trajectory dataset demonstrates the superior performance of the proposed LLM-MLFFN, achieving a classification accuracy of over 94%, surpassing existing machine learning models. Ablation studies further validate the critical contributions of multi-level fusion, feature extraction strategies, and LLM-derived semantic reasoning. These results suggest that integrating structured feature modeling with language-driven semantic abstraction provides a principled and interpretable pathway for robust autonomous driving behavior classification.
LGJan 29
PILD: Physics-Informed Learning via DiffusionTianyi Zeng, Tianyi Wang, Jiaru Zhang et al.
Diffusion models have emerged as powerful generative tools for modeling complex data distributions, yet their purely data-driven nature limits applicability in practical engineering and scientific problems where physical laws need to be followed. This paper proposes Physics-Informed Learning via Diffusion (PILD), a framework that unifies diffusion modeling and first-principles physical constraints by introducing a virtual residual observation sampled from a Laplace distribution to supervise generation during training. To further integrate physical laws, a conditional embedding module is incorporated to inject physical information into the denoising network at multiple layers, ensuring consistent guidance throughout the diffusion process. The proposed PILD framework is concise, modular, and broadly applicable to problems governed by ordinary differential equations, partial differential equations, as well as algebraic equations or inequality constraints. Extensive experiments across engineering and scientific tasks including estimating vehicle trajectories, tire forces, Darcy flow and plasma dynamics, demonstrate that our PILD substantially improves accuracy, stability, and generalization over existing physics-informed and diffusion-based baselines.
LGMar 17
DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon ReplayLong Li, Zhijian Zhou, Tianyi Wang et al.
While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) A Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-$k$ token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.
CVMay 18
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video UnderstandingRuixiang Zhao, Jie Yang, Zijie Xin et al.
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
CVFeb 3, 2024Code
GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge LearningYaning Zhang, Zitong Yu, Tianyi Wang et al.
The rapid advancement of photorealistic generators has reached a critical juncture where the discrepancy between authentic and manipulated images is increasingly indistinguishable. Thus, benchmarking and advancing techniques detecting digital manipulation become an urgent issue. Although there have been a number of publicly available face forgery datasets, the forgery faces are mostly generated using GAN-based synthesis technology, which does not involve the most recent technologies like diffusion. The diversity and quality of images generated by diffusion models have been significantly improved and thus a much more challenging face forgery dataset shall be used to evaluate SOTA forgery detection literature. In this paper, we propose a large-scale, diverse, and fine-grained high-fidelity dataset, namely GenFace, to facilitate the advancement of deepfake detection, which contains a large number of forgery faces generated by advanced generators such as the diffusion-based model and more detailed labels about the manipulation approaches and adopted generators. In addition to evaluating SOTA approaches on our benchmark, we design an innovative cross appearance-edge learning (CAEL) detector to capture multi-grained appearance and edge global representations, and detect discriminative and general forgery traces. Moreover, we devise an appearance-edge cross-attention (AECA) module to explore the various integrations across two domains. Extensive experiment results and visualizations show that our detection model outperforms the state of the arts on different settings like cross-generator, cross-forgery, and cross-dataset evaluations. Code and datasets will be available at \url{https://github.com/Jenine-321/GenFace
CVMar 10
When to Lock Attention: Training-Free KV Control in Video DiffusionTianyi Zeng, Jincheng Gao, Tianyi Wang et al.
Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks.
CVSep 23, 2024
Advancing Video Quality Assessment for AIGCXinli Yue, Jianhui Sun, Han Kong et al.
In recent years, AI generative models have made remarkable progress across various domains, including text generation, image generation, and video generation. However, assessing the quality of text-to-video generation is still in its infancy, and existing evaluation frameworks fall short when compared to those for natural videos. Current video quality assessment (VQA) methods primarily focus on evaluating the overall quality of natural videos and fail to adequately account for the substantial quality discrepancies between frames in generated videos. To address this issue, we propose a novel loss function that combines mean absolute error with cross-entropy loss to mitigate inter-frame quality inconsistencies. Additionally, we introduce the innovative S2CNet technique to retain critical content, while leveraging adversarial training to enhance the model's generalization capabilities. Experimental results demonstrate that our method outperforms existing VQA techniques on the AIGC Video dataset, surpassing the previous state-of-the-art by 3.1% in terms of PLCC.
IVJun 30, 2022
PVT-COV19D: Pyramid Vision Transformer for COVID-19 DiagnosisLilang Zheng, Jiaxuan Fang, Xiaorun Tang et al.
With the outbreak of COVID-19, a large number of relevant studies have emerged in recent years. We propose an automatic COVID-19 diagnosis framework based on lung CT scan images, the PVT-COV19D. In order to accommodate the different dimensions of the image input, we first classified the images using Transformer models, then sampled the images in the dataset according to normal distribution, and fed the sampling results into the modified PVTv2 model for training. A large number of experiments on the COV19-CT-DB dataset demonstrate the effectiveness of the proposed method.
CYJan 23
Do VLMs Have a Moral Backbone? A Study on the Fragile Morality of Vision-Language ModelsZhining Liu, Tianyi Wang, Xiao Lin et al.
Despite substantial efforts toward improving the moral alignment of Vision-Language Models (VLMs), it remains unclear whether their ethical judgments are stable in realistic settings. This work studies moral robustness in VLMs, defined as the ability to preserve moral judgments under textual and visual perturbations that do not alter the underlying moral context. We systematically probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile, frequently flipping under simple manipulations. Our analysis reveals systematic vulnerabilities across perturbation types, moral domains, and model scales, including a sycophancy trade-off where stronger instruction-following models are more susceptible to persuasion. We further show that lightweight inference-time interventions can partially restore moral stability. These results demonstrate that moral alignment alone is insufficient and that moral robustness is a necessary criterion for the responsible deployment of VLMs.
CLApr 18, 2025Code
LLM Sensitivity Evaluation Framework for Clinical DiagnosisChenwei Yan, Xiangling Fu, Yuxuan Xiong et al.
Large language models (LLMs) have demonstrated impressive performance across various domains. However, for clinical diagnosis, higher expectations are required for LLM's reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, i.e. GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
CVJan 27
ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous DrivingYujin Wang, Yutong Zheng, Wenxian Fan et al.
In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
IRMar 18
SF-RAG: Structure-Fidelity Retrieval-Augmented Generation for Academic Question AnsweringRui Yu, Tianyi Wang, Ruixia Liu et al.
Efficient question-answering (QA) over extensive scientific literature is essential for evidence-based engineering decision-making. Retrieval-augmented generation (RAG) is increasingly applied to question-answering over long academic papers, where accurate evidence allocation under a fixed token budget is critical. However, existing approaches flatten papers into unstructured chunks, destroying the native hierarchical structure and forcing retrieval to operate in a disordered space. This produces fragmented contexts, misallocates tokens to non-evidential regions, and increases the reasoning burden for downstream language models.To address these issues, we propose SF-RAG, an RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior.SF-RAG first inherits the native hierarchy to construct a structure-fidelity index, which prevents entropy increase at the source.It then designs a path-guided retrieval mechanism that aligns query semantics to relevant sections and selects high relevance root-to-leaf paths under a fixed token budget, yielding compact, coherent, and low-entropy retrieval contexts.In contrast to existing RAG approaches, SF-RAG avoids entropy increase caused by destructive preprocessing and provides a native low-entropy structural basis for subsequent retrieval. We further introduce entropy-based structural diagnostics to quantify retrieval fragmentation and evidence allocation accuracy.Evaluations across three QA benchmarks show that SF-RAG significantly reduces retrieval fragmentation and improves evidence allocation. These structural benefits drive superior answer quality, establishing a scalable foundation for intelligent engineering document systems and future applications in technical specifications.
SYDec 4, 2025
ARCAS: An Augmented Reality Collision Avoidance System with SLAM-Based Tracking for Enhancing VRU SafetyAhmad Yehia, Jiseop Byeon, Tianyi Wang et al.
Vulnerable road users (VRUs) face high collision risks in mixed traffic, yet most existing safety systems prioritize driver or vehicle assistance over direct VRU support. This paper presents ARCAS, a real-time augmented reality (AR) collision avoidance system that provides personalized spatial alerts to VRUs via wearable AR headsets. By fusing roadside 360° 3D LiDAR with SLAM-based headset tracking and an automatic 3D calibration procedure, ARCAS accurately overlays world-locked 3D bounding boxes and directional arrows onto approaching hazards in the user's passthrough view. The system also enables multi-headset coordination through shared world anchoring. Evaluated in real-world pedestrian interactions with e-scooters and vehicles (180 trials), ARCAS nearly doubles pedestrians' time to collision and increases counterparts' reaction margins by up to 4x compared to unaided eye conditions. Results validate the feasibility and effectiveness of LiDAR-driven AR guidance and highlight the potential of wearable AR as a promising next generation safety tool for urban mobility.
CVMay 11
MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and LocalizationYaning Zhang, Tianyi Wang, Zan Gao et al.
The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the requirement of generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using image modality, other modalities like fine-grained texts are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GAN, but struggle to identify and localize those synthesized by diffusion. To solve the problems, in this paper, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.
CVFeb 23
Physics-informed Active Polarimetric 3D Imaging for Specular SurfacesJiazhang Wang, Hyelim Yang, Tianyi Wang et al.
3D imaging of specular surfaces remains challenging in real-world scenarios, such as in-line inspection or hand-held scanning, requiring fast and accurate measurement of complex geometries. Optical metrology techniques such as deflectometry achieve high accuracy but typically rely on multi-shot acquisition, making them unsuitable for dynamic environments. Fourier-based single-shot approaches alleviate this constraint, yet their performance deteriorates when measuring surfaces with high spatial frequency structure or large curvature. Alternatively, polarimetric 3D imaging in computer vision operates in a single-shot fashion and exhibits robustness to geometric complexity. However, its accuracy is fundamentally limited by the orthographic imaging assumption. In this paper, we propose a physics-informed deep learning framework for single-shot 3D imaging of complex specular surfaces. Polarization cues provide orientation priors that assist in interpreting geometric information encoded by structured illumination. These complementary cues are processed through a dual-encoder architecture with mutual feature modulation, allowing the network to resolve their nonlinear coupling and directly infer surface normals. The proposed method achieves accurate and robust normal estimation in single-shot with fast inference, enabling practical 3D imaging of complex specular surfaces.
CVSep 26, 2025Code
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLMChangli Tang, Qinfan Xiao, Ke Mei et al.
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (\textbf{u}nified \& \textbf{v}ersatile \textbf{a}udio-\textbf{v}isual \textbf{e}mbeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code, checkpoints, and data will be released.
CVMay 27, 2025Code
VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy ImagesMingxuan Sun, Juntao Jiang, Zhiqiang Yang et al.
Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second "Vision Meets Algae" (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring microalgae of varying sizes and distinct features. Participants faced tasks such as detecting small targets, handling motion blur, and complex backgrounds. The top 10 methods, outlined here, offer insights into overcoming these challenges and maximizing detection accuracy. This intersection of algae research and computer vision offers promise for ecological understanding and technological advancement. The dataset can be accessed at: https://github.com/juntaoJianggavin/Visalgae2023/.
CVMay 25, 2023Code
A Similarity Alignment Model for Video Copy Segment MatchingZhenhua Liu, Feipeng Ma, Tianyi Wang et al.
With the development of multimedia technology, Video Copy Detection has been a crucial problem for social media platforms. Meta AI hold Video Similarity Challenge on CVPR 2023 to push the technology forward. In this report, we share our winner solutions on Matching Track. We propose a Similarity Alignment Model(SAM) for video copy segment matching. Our SAM exhibits superior performance compared to other competitors, with a 0.108 / 0.144 absolute improvement over the second-place competitor in Phase 1 / Phase 2. Code is available at https://github.com/FeipengMa6/VSC22-Submission/tree/main/VSC22-Matching-Track-1st.
CVMay 21, 2023Code
A Dual-level Detection Method for Video Copy DetectionTianyi Wang, Feipeng Ma, Zhenhua Liu et al.
With the development of multimedia technology, Video Copy Detection has been a crucial problem for social media platforms. Meta AI hold Video Similarity Challenge on CVPR 2023 to push the technology forward. In this paper, we share our winner solutions on both tracks to help progress in this area. For Descriptor Track, we propose a dual-level detection method with Video Editing Detection (VED) and Frame Scenes Detection (FSD) to tackle the core challenges on Video Copy Detection. Experimental results demonstrate the effectiveness and efficiency of our proposed method. Code is available at https://github.com/FeipengMa6/VSC22-Submission.
CVAug 26, 2021Code
An Underwater Image Semantic Segmentation Method Focusing on Boundaries and a Real Underwater Scene Semantic Segmentation DatasetZhiwei Ma, Haojie Li, Zhihui Wang et al.
With the development of underwater object grabbing technology, underwater object recognition and segmentation of high accuracy has become a challenge. The existing underwater object detection technology can only give the general position of an object, unable to give more detailed information such as the outline of the object, which seriously affects the grabbing efficiency. To address this problem, we label and establish the first underwater semantic segmentation dataset of real scene(DUT-USEG:DUT Underwater Segmentation Dataset). The DUT- USEG dataset includes 6617 images, 1487 of which have semantic segmentation and instance segmentation annotations, and the remaining 5130 images have object detection box annotations. Based on this dataset, we propose a semi-supervised underwater semantic segmentation network focusing on the boundaries(US-Net: Underwater Segmentation Network). By designing a pseudo label generator and a boundary detection subnetwork, this network realizes the fine learning of boundaries between underwater objects and background, and improves the segmentation effect of boundary areas. Experiments show that the proposed method improves by 6.7% in three categories of holothurian, echinus, starfish in DUT-USEG dataset, and achieves state-of-the-art results. The DUT- USEG dataset will be released at https://github.com/baxiyi/DUT-USEG.
AIFeb 5
Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained RectificationTianyi Wang, Long Li, Hongcan Guo et al.
Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.