58.1CVMay 30Code
CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative PreferencesFangzhou Lin, Peiran Li, Lingyu Xu et al.
Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
44.6CLApr 5Code
AdaptFuse: Training-Free Sequential Preference Learning via Externalized Bayesian InferenceFangzhou Lin, Peiran Li, Shuo Xing et al.
Large language models struggle to accumulate evidence across multiple rounds of user interaction, failing to update their beliefs in a manner consistent with Bayesian inference. Existing solutions require fine-tuning on sensitive user interaction data, limiting their applicability in privacy-conscious settings. We propose AdaptFuse, a training-free framework that externalizes probabilistic computation entirely from the LLM: a symbolic module maintains a Bayesian posterior over a discrete hypothesis set, while a frozen LLM contributes semantic reasoning via multi-sample Dirichlet aggregation. The two signals are combined through entropy-adaptive fusion, which automatically weights each source by its predictive confidence, shifting reliance from the LLM to the symbolic posterior as evidence accumulates. We evaluate across three domains: flight recommendation, hotel recommendation, and web shopping; on Gemma 2 9B, Llama 3 8B, and Qwen 2.5 7B. AdaptFuse consistently outperforms both prompting baselines and fine-tuned Bayesian Teaching models on all tasks, with accuracy improving monotonically over interaction rounds. These results demonstrate that principled inference-time algorithms can substitute for fine-tuning in personalized recommendation, without storing or training on sensitive user data. All the code and materials will be open-sourced.
20.8CVMar 13
NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow NetworksFangzhou Lin, Yuping Wang, Yuliang Guo et al.
Partially Supervised Multi-Task Learning (PS-MTL) aims to leverage knowledge across tasks when annotations are incomplete. Existing approaches, however, have largely focused on the simpler setting of homogeneous, dense prediction tasks, leaving the more realistic challenge of learning from structurally diverse tasks unexplored. To this end, we introduce NexusFlow, a novel, lightweight, and plug-and-play framework effective in both settings. NexusFlow introduces a set of surrogate networks with invertible coupling layers to align the latent feature distributions of tasks, creating a unified representation that enables effective knowledge transfer. The coupling layers are bijective, preserving information while mapping features into a shared canonical space. This invertibility avoids representational collapse and enables alignment across structurally different tasks without reducing expressive capacity. We first evaluate NexusFlow on the core challenge of domain-partitioned autonomous driving, where dense map reconstruction and sparse multi-object tracking are supervised in different geographic regions, creating both structural disparity and a strong domain gap. NexusFlow sets a new state-of-the-art result on nuScenes, outperforming strong partially supervised baselines. To demonstrate generality, we further test NexusFlow on NYUv2 using three homogeneous dense prediction tasks, segmentation, depth, and surface normals, as a representative N-task PS-MTL scenario. NexusFlow yields consistent gains across all tasks, confirming its broad applicability.
40.2AIMay 21
PathCal: State-Aware Reflection-Marker Calibration for Efficient ReasoningLingyu Jiang, Zirui Li, Shuo Xing et al.
The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by generating long-form Chain-of-Thought (CoT) trajectories during inference. Meanwhile, these trajectories often contain explicit reflection markers such as ``wait'', ``but'', and ``alternatively'', signaling hesitation, revision, and the consideration of alternative explorations, respectively. Recent studies on test-time control leverage such markers as lightweight handles for steering reasoning, typically treating them as a single coarse-grained category rather than distinguishing their distinct functional roles. In this paper, we conduct type-wise suppression and fixed-prefix intervention, revealing that reflection markers differ not only in their functional roles but also in when they exert the greatest influence. Specifically, different marker classes affect accuracy and generation length in distinct ways, and marker choices are most consequential before the model settles into a stable reasoning trajectory. Motivated by these findings, we introduce PathCal, a novel training-free decoding controller that calibrates reasoning paths by distinguishing marker types and intervening only at locally uncertain states. At each decoding step, PathCal utilizes the distribution over reflection-markers to estimate local competition between maintaining the current reasoning trajectory and initiating a competing branch, and softly rebalances marker logits when competing-branch evidence becomes excessive. Experiments across six reasoning benchmarks demonstrate that PathCal achieves a better efficiency--performance trade-off, improving or preserving accuracy while reducing generation length, without relying on external verifiers or additional sampling.
36.5AIMay 15
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel ReasoningFangzhou Lin, Shuo Xing, Peiran Li et al.
Parallel reasoning, where a generator samples many candidate solutions and an aggregator selects the best, is one of the most effective forms of test-time scaling in large language models, and pairwise self-verification has become its strongest aggregation primitive. Yet pairwise verification carries a heavy cost: each judgment reads two complete solutions in full, and existing methods perform tens of such judgments per problem regardless of whether the comparison is informative. We introduce CAPS (Cascaded Adaptive Pairwise Selection), an inference-only framework that allocates verifier compute non-uniformly along two orthogonal axes: an evidence axis that adapts how much of each candidate the judge sees, and a distribution axis that adapts how comparisons are spread across the pool. CAPS instantiates these into a four-stage cascade with an optional rescue subroutine, and admits a closed-form verifier-token cost in which the per-candidate marginal cost is roughly halved relative to uniform full-evidence schedules. On four self-verifying models (Qwen3-14B, GPT-OSS-20B, Qwen3-4B-Instruct/Thinking) and five reasoning benchmarks spanning code (LiveCodeBench-v5/v6, CodeContests) and math (AIME 2025, HMMT 2025), CAPS outperforms the leading pairwise verifier on 14 of 20 suites while using 25.4% of its verifier-token budget on code, and outperforms pointwise self-verification on all 20. The trade-off suites admit an interpretable diagnostic in terms of the verifier's accuracy at partial versus full evidence, providing a concrete pre-deployment check for cascade suitability.
32.6CVApr 7
Physics-Aware Video Instance Removal BenchmarkZirui Li, Xinghao Chen, Lingyu Jiang et al.
Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities, such as lingering shadows, triggered by object removal. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods, PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo, using a decoupled human evaluation protocol across three dimensions to isolate semantic, visual, and spatial failures: instruction following, rendering quality, and edit exclusivity. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.
LGNov 23, 2025
TimePre: Bridging Accuracy, Efficiency, and Stability in Probabilistic Time-Series ForecastingLingyu Jiang, Lingyu Xu, Peiran Li et al.
Probabilistic Time-Series Forecasting (PTSF) is critical for uncertainty-aware decision making, but existing generative models, such as diffusion-based approaches, are computationally prohibitive due to expensive iterative sampling. Non-sampling frameworks like Multiple Choice Learning (MCL) offer an efficient alternative, but suffer from severe training instability and hypothesis collapse, which has historically hindered their performance. This problem is dramatically exacerbated when attempting to combine them with modern, efficient MLP-based backbones. To resolve this fundamental incompatibility, we propose TimePre, a novel framework that successfully unifies the efficiency of MLP-based models with the distributional flexibility of the MCL paradigm. The core of our solution is Stabilized Instance Normalization (SIN), a novel normalization layer that explicitly remedies this incompatibility. SIN stabilizes the hybrid architecture by correcting channel-wise statistical shifts, definitively resolving the catastrophic hypothesis collapse. Extensive experiments on six benchmark datasets demonstrate that TimePre achieves new state-of-the-art accuracy on key probabilistic metrics. Critically, TimePre achieves inference speeds orders of magnitude faster than sampling-based models and, unlike prior MCL work, demonstrates stable performance scaling. It thus bridges the long-standing gap between accuracy, efficiency, and stability in probabilistic forecasting.