40.6CVJun 2Code
SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual MisinformationJunxiao Yang, Minghao Zhang, Xiaoce Wang et al.
Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SYNCRED-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fine-grained circulation styles, together with FP450, a real-image negative set for measuring false positives. Extensive evaluation shows that existing systems remain unreliable: under a 5% false-positive-rate constraint, 15 MLLMs achieve only 10.5% true positive rate (TPR), open-source AIGC detectors achieve less than 5%, and commercial APIs reach 57.6%. Human annotators also struggled to identify synthetic credibility, reaching only 63% TPR. These findings establish synthetic credibility as a severe and underexplored visual misinformation challenge, and provide a benchmark for developing detectors that reason beyond superficial credibility cues.
LGJan 30Code
ToolTok: Tool Tokenization for Efficient and Generalizable GUI AgentsXiaoce Wang, Guibin Zhang, Junzhe Li et al.
Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open-source at https://github.com/ZephinueCode/ToolTok.
69.4CVMay 7
Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent SpaceXiaoce Wang, Sifan Zhou, Kaifei Wang et al.
Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation.In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias.Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details.Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment.Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
AIMay 18, 2025
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMsJunxiao Yang, Jinzhe Tu, Haoran Liu et al. · tsinghua
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning. However, current LRMs rarely admit ignorance or respond with "I don't know". Instead, they often produce incorrect answers while showing undue confidence, raising concerns about their factual reliability. In this work, we identify two pathological reasoning patterns characterized by overthinking that contribute to the overconfident and incorrect answers: last-minute guessing and second-thought spiraling. To address these issues, we propose BARREL-a novel framework that promotes concise and boundary-aware factual reasoning. Our experiments show that BARREL-training increases the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48%, while still achieving accuracy comparable to models finetuned on reasoning data generated by R1. These results demonstrate that our pilot study is inspiring to build more reliable and factual System 2 LRMs.
CVAug 6, 2025
StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative FeedbackHongbo Ma, Fei Shen, Hongbin Xu et al.
The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet solutions for personalized fashion styling remain underexplored, which holds immense promise for promoting shopping experiences. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess the performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor's superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.