Yuxin Lin

CV
h-index42
10papers
22citations
Novelty59%
AI Score58

10 Papers

68.7CVMay 26Code
CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

Huidong Feng, Wentao Chen, Jie Chen et al.

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.

CVFeb 25
Learning to Fuse and Reconstruct Multi-View Graphs for Diabetic Retinopathy Grading

Haoran Li, Yuxin Lin, Huan Wang et al.

Diabetic retinopathy (DR) is one of the leading causes of vision loss worldwide, making early and accurate DR grading critical for timely intervention. Recent clinical practices leverage multi-view fundus images for DR detection with a wide coverage of the field of view (FOV), motivating deep learning methods to explore the potential of multi-view learning for DR grading. However, existing methods often overlook the inter-view correlations when fusing multi-view fundus images, failing to fully exploit the inherent consistency across views originating from the same patient. In this work, we present MVGFDR, an end-to-end Multi-View Graph Fusion framework for DR grading. Different from existing methods that directly fuse visual features from multiple views, MVGFDR is equipped with a novel Multi-View Graph Fusion (MVGF) module to explicitly disentangle the shared and view-specific visual features. Specifically, MVGF comprises three key components: (1) Multi-view Graph Initialization, which constructs visual graphs via residual-guided connections and employs Discrete Cosine Transform (DCT) coefficients as frequency-domain anchors; (2) Multi-view Graph Fusion, which integrates selective nodes across multi-view graphs based on frequency-domain relevance to capture complementary view-specific information; and (3) Masked Cross-view Reconstruction, which leverages masked reconstruction of shared information across views to facilitate view-invariant representation learning. Extensive experimental results on MFIDDR, by far the largest multi-view fundus image dataset, demonstrate the superiority of our proposed approach over existing state-of-the-art approaches in diabetic retinopathy grading.

AIDec 1, 2025
RoboDriveVLM: A Novel Benchmark and Baseline towards Robust Vision-Language Models for Autonomous Driving

Dacheng Liao, Mengshi Qi, Peng Shu et al.

Current Vision-Language Model (VLM)-based end-to-end autonomous driving systems often leverage large language models to generate driving decisions directly based on their understanding of the current scene. However, such systems introduce multiple risks in real-world driving scenarios. To evaluate whether VLMs are truly viable for autonomous driving, we introduce RoboDriveBench, the first robustness benchmark focused on end-to-end trajectory prediction tasks. This benchmark systematically evaluates two critical categories of real-world challenges for VLM-based end-to-end autonomous driving systems through 11 simulated scenarios encompassing various corruption types, including 6 scenarios of sensor corruption caused by environmental variations, along with 5 cases of prompt corruption resulting from human intervention and data transmission failures. Each corruption type includes 250 unique driving scenarios and 5,689 frames, resulting in 64,559 total trajectory prediction cases per evaluation. To overcome these real-world challenges, we propose a novel VLM-based autonomous driving framework called RoboDriveVLM, which enhances robustness by mapping more multimodal data-e.g., lidar and radar-into a unified latent space. Furthermore, we introduce a new Test-Time Adaptation (TTA) method based on cross-modal knowledge distillation to improve the robustness of VLM-based autonomous driving systems. Through extensive experiments, our work highlights the limitations of current VLM-based end-to-end autonomous driving systems and provides a more reliable solution for real-world deployment. Source code and datasets will be released.

IVMar 25, 2025Code
Wavelet-based Global-Local Interaction Network with Cross-Attention for Multi-View Diabetic Retinopathy Detection

Yongting Hu, Yuxin Lin, Chengliang Liu et al.

Multi-view diabetic retinopathy (DR) detection has recently emerged as a promising method to address the issue of incomplete lesions faced by single-view DR. However, it is still challenging due to the variable sizes and scattered locations of lesions. Furthermore, existing multi-view DR methods typically merge multiple views without considering the correlations and redundancies of lesion information across them. Therefore, we propose a novel method to overcome the challenges of difficult lesion information learning and inadequate multi-view fusion. Specifically, we introduce a two-branch network to obtain both local lesion features and their global dependencies. The high-frequency component of the wavelet transform is used to exploit lesion edge information, which is then enhanced by global semantic to facilitate difficult lesion learning. Additionally, we present a cross-view fusion module to improve multi-view fusion and reduce redundancy. Experimental results on large public datasets demonstrate the effectiveness of our method. The code is open sourced on https://github.com/HuYongting/WGLIN.

LGNov 10, 2024
CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction

Shuqi Li, Yuebo Sun, Yuxin Lin et al.

There are two issues in news-driven multi-stock movement prediction tasks that are not well solved in the existing works. On the one hand, "relation discovery" is a pivotal part when leveraging the price information of other stocks to achieve accurate stock movement prediction. Given that stock relations are often unidirectional, such as the "supplier-consumer" relationship, causal relations are more appropriate to capture the impact between stocks. On the other hand, there is substantial noise existing in the news data leading to extracting effective information with difficulty. With these two issues in mind, we propose a novel framework called CausalStock for news-driven multi-stock movement prediction, which discovers the temporal causal relations between stocks. We design a lag-dependent temporal causal discovery mechanism to model the temporal causal graph distribution. Then a Functional Causal Model is employed to encapsulate the discovered causal relations and predict the stock movements. Additionally, we propose a Denoised News Encoder by taking advantage of the excellent text evaluation ability of large language models (LLMs) to extract useful information from massive news data. The experiment results show that CausalStock outperforms the strong baselines for both news-driven multi-stock movement prediction and multi-stock movement prediction tasks on six real-world datasets collected from the US, China, Japan, and UK markets. Moreover, getting benefit from the causal relations, CausalStock could offer a clear prediction mechanism with good explainability.

CVFeb 2, 2025
VLM-Assisted Continual learning for Visual Question Answering in Self-Driving

Yuxin Lin, Mengshi Qi, Liang Liu et al.

In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. The knowledge distillation allows a previously trained model to act as a "teacher" to guide the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers calculate the loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 21.40% to 32.28% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.

ROMar 8
Uncertainty Mitigation and Intent Inference: A Dual-Mode Human-Machine Joint Planning System

Zeyu Fang, Yuxin Lin, Cheng Liu et al.

Effective human-robot collaboration in open-world environments requires joint planning under uncertain conditions. However, existing approaches often treat humans as passive supervisors, preventing autonomous agents from becoming human-like teammates that can actively model teammate behaviors, reason about knowledge gaps, query, and elicit responses through communication to resolve uncertainties. To address these limitations, we propose a unified human-robot joint planning system designed to tackle dual sources of uncertainty: task-relevant knowledge gaps and latent human intent. Our system operates in two complementary modes. First, an uncertainty-mitigation joint planning module enables two-way conversations to resolve semantic ambiguity and object uncertainty. It utilizes an LLM-assisted active elicitation mechanism and a hypothesis-augmented A^* search, subsequently computing an optimal querying policy via dynamic programming to minimize interaction and verification costs. Second, a real-time intent-aware collaboration module maintains a probabilistic belief over the human's latent task intent via spatial and directional cues, enabling dynamic, coordination-aware task selection for agents without explicit communication. We validate the proposed system in both Gazebo simulations and real-world UAV deployments integrated with a Vision-Language Model (VLM)-based 3D semantic perception pipeline. Experimental results demonstrate that the system significantly cuts the interaction cost by 51.9% in uncertainty-mitigation planning and reduces the task execution time by 25.4% in intent-aware cooperation compared to the baselines.

CLMay 19, 2025
Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals

Yuxin Lin, Yinglin Zheng, Ming Zeng et al.

This paper addresses the gap in predicting turn-taking and backchannel actions in human-machine conversations using multi-modal signals (linguistic, acoustic, and visual). To overcome the limitation of existing datasets, we propose an automatic data collection pipeline that allows us to collect and annotate over 210 hours of human conversation videos. From this, we construct a Multi-Modal Face-to-Face (MM-F2F) human conversation dataset, including over 1.5M words and corresponding turn-taking and backchannel annotations from approximately 20M frames. Additionally, we present an end-to-end framework that predicts the probability of turn-taking and backchannel actions from multi-modal signals. The proposed model emphasizes the interrelation between modalities and supports any combination of text, audio, and video inputs, making it adaptable to a variety of realistic scenarios. Our experiments show that our approach achieves state-of-the-art performance on turn-taking and backchannel prediction tasks, achieving a 10% increase in F1-score on turn-taking and a 33% increase on backchannel prediction. Our dataset and code are publicly available online to ease of subsequent research.

ROMar 8
Reasoning Knowledge-Gap in Drone Planning via LLM-based Active Elicitation

Zeyu Fang, Beomyeol Yu, Cheng Liu et al.

Human-AI joint planning in Unmanned Aerial Vehicles (UAVs) typically relies on control handover when facing environmental uncertainties, which is often inefficient and cognitively demanding for non-expert operators. To address this, we propose a novel framework that shifts the collaboration paradigm from control takeover to active information elicitation. We introduce the Minimal Information Neuro-Symbolic Tree (MINT), a reasoning mechanism that explicitly structures knowledge gaps regarding obstacles and goals into a queryable format. By leveraging large language models, our system formulates optimal binary queries to resolve specific ambiguities with minimal human interaction. We demonstrate the efficacy of this approach through a comprehensive workflow integrating a vision-language model for perception, voice interfaces, and a low-level UAV control module in both high-fidelity NVIDIA Isaac simulations and real-world deployments. Experimental results show that our method achieves a significant improvement in the success rate for complex search-and-rescue tasks while significantly reducing the frequency of human interaction compared to exhaustive querying baselines.

CVJul 17, 2025
HairShifter: Consistent and High-Fidelity Video Hair Transfer via Anchor-Guided Animation

Wangzheng Shi, Yinglin Zheng, Yuxin Lin et al.

Hair transfer is increasingly valuable across domains such as social media, gaming, advertising, and entertainment. While significant progress has been made in single-image hair transfer, video-based hair transfer remains challenging due to the need for temporal consistency, spatial fidelity, and dynamic adaptability. In this work, we propose HairShifter, a novel "Anchor Frame + Animation" framework that unifies high-quality image hair transfer with smooth and coherent video animation. At its core, HairShifter integrates a Image Hair Transfer (IHT) module for precise per-frame transformation and a Multi-Scale Gated SPADE Decoder to ensure seamless spatial blending and temporal coherence. Our method maintains hairstyle fidelity across frames while preserving non-hair regions. Extensive experiments demonstrate that HairShifter achieves state-of-the-art performance in video hairstyle transfer, combining superior visual quality, temporal consistency, and scalability. The code will be publicly available. We believe this work will open new avenues for video-based hairstyle transfer and establish a robust baseline in this field.