90.3CVApr 13Code
NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the WildAleksandr Gushchin, Khaled Abud, Ekaterina Shumitskaya et al.
This paper presents an overview of the NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild, held in conjunction with the NTIRE workshop at CVPR 2026. The goal of this challenge was to develop detection models capable of distinguishing real images from generated ones in realistic scenarios: the images are often transformed (cropped, resized, compressed, blurred) for practical usage, and therefore, the detection models should be robust to such transformations. The challenge is based on a novel dataset consisting of 108,750 real and 185,750 AI-generated images from 42 generators comprising a large variety of open-source and closed-source models of various architectures, augmented with 36 image transformations. Methods were evaluated using ROC AUC on the full test set, including both transformed and untransformed images. A total of 511 participants registered, with 20 teams submitting valid final solutions. This report provides a comprehensive overview of the challenge, describes the proposed solutions, and can be used as a valuable reference for researchers and practitioners in increasing the robustness of the detection models to real-world transformations.
CLNov 9, 2023Code
Mirror: A Universal Framework for Various Information Extraction TasksTong Zhu, Junfei Ren, Zijian Yu et al.
Sharing knowledge between information extraction tasks has always been a challenge due to the diverse data formats and task variations. Meanwhile, this divergence leads to information waste and increases difficulties in building complex applications in real scenarios. Recent studies often formulate IE tasks as a triplet extraction problem. However, such a paradigm does not support multi-span and n-ary extraction, leading to weak versatility. To this end, we reorganize IE problems into unified multi-slot tuples and propose a universal framework for various IE tasks, namely Mirror. Specifically, we recast existing IE tasks as a multi-span cyclic graph extraction problem and devise a non-autoregressive graph decoding algorithm to extract all spans in a single step. It is worth noting that this graph structure is incredibly versatile, and it supports not only complex IE tasks, but also machine reading comprehension and classification tasks. We manually construct a corpus containing 57 datasets for model pretraining, and conduct experiments on 30 datasets across 8 downstream tasks. The experimental results demonstrate that our model has decent compatibility and outperforms or reaches competitive performance with SOTA systems under few-shot and zero-shot settings. The code, model weights, and pretraining corpus are available at https://github.com/Spico197/Mirror .
CLApr 28, 2023Code
CED: Catalog Extraction from DocumentsTong Zhu, Guoliang Zhang, Zechang Li et al.
Sentence-by-sentence information extraction from long documents is an exhausting and error-prone task. As the indicator of document skeleton, catalogs naturally chunk documents into segments and provide informative cascade semantics, which can help to reduce the search space. Despite their usefulness, catalogs are hard to be extracted without the assist from external knowledge. For documents that adhere to a specific template, regular expressions are practical to extract catalogs. However, handcrafted heuristics are not applicable when processing documents from different sources with diverse formats. To address this problem, we build a large manually annotated corpus, which is the first dataset for the Catalog Extraction from Documents (CED) task. Based on this corpus, we propose a transition-based framework for parsing documents into catalog trees. The experimental results demonstrate that our proposed method outperforms baseline systems and shows a good ability to transfer. We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents. Data and code are available at \url{https://github.com/Spico197/CatalogExtraction}
CVFeb 9
VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement LearningHao Tan, Jun Lan, Senyuan Shi et al.
The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.
CLJul 12, 2025Code
RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-CheckingShuo Yang, Zijian Yu, Zhenzhe Ying et al.
The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.
CVDec 30, 2024Code
UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language ModelsYujie Li, Wenjia Xu, Guangzuo Li et al.
The domain gap between remote sensing imagery and natural images has recently received widespread attention and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce \textbf{UniRS}, the first vision-language model \textbf{uni}fying multi-temporal \textbf{r}emote \textbf{s}ensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.
CVJun 11, 2024Code
RS-Agent: Automating Remote Sensing Tasks through Intelligent AgentWenjia Xu, Zijian Yu, Boyang Mu et al.
The unprecedented advancements in Multimodal Large Language Models (MLLMs) have demonstrated strong potential in interacting with humans through both language and visual inputs to perform downstream tasks such as visual question answering and scene understanding. However, these models are constrained to basic instruction-following or descriptive tasks, facing challenges in complex real-world remote sensing applications that require specialized tools and knowledge. To address these limitations, we propose RS-Agent, an AI agent designed to interact with human users and autonomously leverage specialized models to address the demands of real-world remote sensing applications. RS-Agent integrates four key components: a Central Controller based on large language models, a dynamic toolkit for tool execution, a Solution Space for task-specific expert guidance, and a Knowledge Space for domain-level reasoning, enabling it to interpret user queries and orchestrate tools for accurate remote sensing task. We introduce two novel mechanisms: Task-Aware Retrieval, which improves tool selection accuracy through expert-guided planning, and DualRAG, a retrieval-augmented generation method that enhances knowledge relevance through weighted, dual-path retrieval. RS-Agent supports flexible integration of new tools and is compatible with both open-source and proprietary LLMs. Extensive experiments across 9 datasets and 18 remote sensing tasks demonstrate that RS-Agent significantly outperforms state-of-the-art MLLMs, achieving over 95% task planning accuracy and delivering superior performance in tasks such as scene classification, object counting, and remote sensing visual question answering. Our work presents RS-Agent as a robust and extensible framework for advancing intelligent automation in remote sensing analysis.
45.3CVMay 9
Few-Click-Driven Interactive 3D Segmentation with Semantic EmbeddingXueyang Kang, Zijian Yu, Kourosh Khoshelham et al.
Interactive segmentation allows efficient label generation by leveraging user-provided clicks to progressively refine predictions, which is critical when fully supervised labels are costly or generalization to unseen classes is needed. Existing 3D interactive methods are limited: most operate sequentially, predicting only one object per iteration with binary masks, while several recent approaches depend on 2D foundation models and camera alignment to bridge the 2D-3D gap. To address these limitations, we propose a novel interactive segmentation framework that operates directly on sparse, randomly downsampled 3D points and processes multiple object clicks in a single forward pass. Our framework consists of a point Transformer-based encoder and a hierarchical mask decoder, which integrates multi-level crop-and-merge operations conditioned on learnable semantic embeddings. Unlike prior interactive approaches that require repeated model updates after each manually corrective click, our method jointly reasons over all click queries, modeling inter-instance relationships and refining both spatial masks and semantic predictions through spatial and semantic embeddings. Extensive experiments demonstrate that our model improves the mIoU metric by over 20 percent compared to strong baselines and achieves 8-10 percent gains under cross-dataset evaluation for a one-click per instance setting, often requiring only a single click per object. Our approach provides a generalizable and efficient solution for interactive 3D instance segmentation, particularly suitable for real-time applications such as robotic manipulation, navigation, and rapid 3D semantic annotation.
21.9CLMar 16
Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce TasksZijian Yu, Kejun Xiao, Huaipeng Zhao et al.
In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
CVMay 6, 2021
Cascade Image Matting with Deformable Graph RefinementZijian Yu, Xuhui Li, Huijuan Huang et al.
Image matting refers to the estimation of the opacity of foreground objects. It requires correct contours and fine details of foreground objects for the matting results. To better accomplish human image matting tasks, we propose the Cascade Image Matting Network with Deformable Graph Refinement, which can automatically predict precise alpha mattes from single human images without any additional inputs. We adopt a network cascade architecture to perform matting from low-to-high resolution, which corresponds to coarse-to-fine optimization. We also introduce the Deformable Graph Refinement (DGR) module based on graph neural networks (GNNs) to overcome the limitations of convolutional neural networks (CNNs). The DGR module can effectively capture long-range relations and obtain more global and local information to help produce finer alpha mattes. We also reduce the computation complexity of the DGR module by dynamically predicting the neighbors and apply DGR module to higher--resolution features. Experimental results demonstrate the ability of our CasDGR to achieve state-of-the-art performance on synthetic datasets and produce good results on real human images.