LGDec 11, 2025
Classifier Reconstruction Through Counterfactual-Aware Wasserstein PrototypesXuan Zhao, Zhuo Cao, Arya Bangun et al.
Counterfactual explanations provide actionable insights by identifying minimal input changes required to achieve a desired model prediction. Beyond their interpretability benefits, counterfactuals can also be leveraged for model reconstruction, where a surrogate model is trained to replicate the behavior of a target model. In this work, we demonstrate that model reconstruction can be significantly improved by recognizing that counterfactuals, which typically lie close to the decision boundary, can serve as informative though less representative samples for both classes. This is particularly beneficial in settings with limited access to labeled data. We propose a method that integrates original data samples with counterfactuals to approximate class prototypes using the Wasserstein barycenter, thereby preserving the underlying distributional structure of each class. This approach enhances the quality of the surrogate model and mitigates the issue of decision boundary shift, which commonly arises when counterfactuals are naively treated as ordinary training instances. Empirical results across multiple datasets show that our method improves fidelity between the surrogate and target models, validating its effectiveness.
IVDec 7, 2025
Physics-Guided Diffusion Priors for Multi-Slice Reconstruction in Scientific ImagingLaurentius Valdy, Richard D. Paul, Alessio Quercia et al.
Accurate multi-slice reconstruction from limited measurement data is crucial to speed up the acquisition process in medical and scientific imaging. However, it remains challenging due to the ill-posed nature of the problem and the high computational and memory demands. We propose a framework that addresses these challenges by integrating partitioned diffusion priors with physics-based constraints. By doing so, we substantially reduce memory usage per GPU while preserving high reconstruction quality, outperforming both physics-only and full multi-slice reconstruction baselines for different modalities, namely Magnetic Resonance Imaging (MRI) and four-dimensional Scanning Transmission Electron Microscopy (4D-STEM). Additionally, we show that the proposed method improves in-distribution accuracy as well as strong generalization to out-of-distribution datasets.
CVSep 30, 2024
TokenBinder: Text-Video Retrieval with One-to-Many Alignment ParadigmBingqing Zhang, Zhuo Cao, Heming Du et al.
Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on six benchmark datasets confirm that TokenBinder substantially outperforms existing state-of-the-art methods. These results demonstrate its robustness and the effectiveness of its fine-grained alignment in bridging intra- and inter-modality information gaps in TVR tasks.
CVSep 30, 2022
Combining Efficient and Precise Sign Language Recognition: Good pose estimation library is all you needMatyáš Boháček, Zhuo Cao, Marek Hrúz
Sign language recognition could significantly improve the user experience for d/Deaf people with the general consumer technology, such as IoT devices or videoconferencing. However, current sign language recognition architectures are usually computationally heavy and require robust GPU-equipped hardware to run in real-time. Some models aim for lower-end devices (such as smartphones) by minimizing their size and complexity, which leads to worse accuracy. This highly scrutinizes accurate in-the-wild applications. We build upon the SPOTER architecture, which belongs to the latter group of light methods, as it came close to the performance of large models employed for this task. By substituting its original third-party pose estimation module with the MediaPipe library, we achieve an overall state-of-the-art result on the WLASL100 dataset. Significantly, our method beats previous larger architectures while still being twice as computationally efficient and almost $11$ times faster on inference when compared to a relevant benchmark. To demonstrate our method's combined efficiency and precision, we built an online demo that enables users to translate sign lemmas of American sign language in their browsers. This is the first publicly available online application demonstrating this task to the best of our knowledge.
CVDec 18, 2024Code
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal GroundingZhuo Cao, Bingqing Zhang, Heming Du et al.
Text-guided Video Temporal Grounding (VTG) aims to localize relevant segments in untrimmed videos based on textual descriptions, encompassing two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Although previous typical methods have achieved commendable results, it is still challenging to retrieve short video moments. This is primarily due to the reliance on sparse and limited decoder queries, which significantly constrain the accuracy of predictions. Furthermore, suboptimal outcomes often arise because previous methods rank predictions based on isolated predictions, neglecting the broader video context. To tackle these issues, we introduce FlashVTG, a framework featuring a Temporal Feature Layering (TFL) module and an Adaptive Score Refinement (ASR) module. The TFL module replaces the traditional decoder structure to capture nuanced video content variations across multiple temporal scales, while the ASR module improves prediction ranking by integrating context from adjacent moments and multi-temporal-scale features. Extensive experiments demonstrate that FlashVTG achieves state-of-the-art performance on four widely adopted datasets in both MR and HD. Specifically, on the QVHighlights dataset, it boosts mAP by 5.8% for MR and 3.3% for HD. For short-moment retrieval, FlashVTG increases mAP to 125% of previous SOTA performance. All these improvements are made without adding training burdens, underscoring its effectiveness. Our code is available at https://github.com/Zhuo-Cao/FlashVTG.
AIMar 4
AgentSelect: Benchmark for Narrative Query-to-Agent RecommendationYunxiao Shi, Wujiang Xu, Tingwei Chen et al.
LLM agents are rapidly becoming the practical interface for task automation, yet the ecosystem lacks a principled way to choose among an exploding space of deployable configurations. Existing LLM leaderboards and tool/agent benchmarks evaluate components in isolation and remain fragmented across tasks, metrics, and candidate pools, leaving a critical research gap: there is little query-conditioned supervision for learning to recommend end-to-end agent configurations that couple a backbone model with a toolkit. We address this gap with AgentSelect, a benchmark that reframes agent selection as narrative query-to-agent recommendation over capability profiles and systematically converts heterogeneous evaluation artifacts into unified, positive-only interaction data. AgentSelectcomprises 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from 40+ sources, spanning LLM-only, toolkit-only, and compositional agents. Our analyses reveal a regime shift from dense head reuse to long-tail, near one-off supervision, where popularity-based CF/GNN methods become fragile and content-aware capability matching is essential. We further show that Part~III synthesized compositional interactions are learnable, induce capability-sensitive behavior under controlled counterfactual edits, and improve coverage over realistic compositions; models trained on AgentSelect also transfer to a public agent marketplace (MuleRun), yielding consistent gains on an unseen catalog. Overall, AgentSelect provides the first unified data and evaluation infrastructure for agent recommendation, which establishes a reproducible foundation to study and accelerate the emerging agent ecosystem.
IRMay 11
ReCoVR: Closing the Loop in Interactive Composed Video RetrievalBingqing Zhang, Yi Zhang, Zhuo Cao et al.
Composed video retrieval (CoVR) searches for target videos using a reference video and a modification text, but existing methods are restricted to a single interaction round and cannot support the progressive nature of real-world visual search. To bridge this gap, we first formalize interactive composed video retrieval, a multi-turn extension of CoVR, where users progressively refine their search intent through natural-language feedback across turns. Adapting existing interactive retrieval methods to this setting reveals two structural weaknesses: reliance on a single retrieval channel and an open-loop retrieval design that consumes user feedback but does not diagnose whether its own retrieval trajectory is drifting or stagnating. To address these limitations, we propose ReCoVR (Reflexive Composed Video Retrieval), a dual-pathway architecture built on reflexive perception, where the system treats its retrieval history as diagnostic evidence alongside user feedback. Specifically, an Intent Pathway routes heterogeneous feedback to complementary retrieval channels, while a Reflection Pathway performs trajectory-level reflection to monitor result evolution and correct retrieval errors across turns. Experiments on multiple benchmarks show that ReCoVR consistently outperforms interactive baselines, notably achieving 74.30% R@1 after just one interactive round on the WebVid-CoVR-Test dataset.
CVOct 20, 2025Code
When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment InteractionsZhuo Cao, Heming Du, Bingqing Zhang et al.
Existing Moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, one query can correspond to multiple relevant moments in real-world applications. This makes the existing datasets and methods insufficient for video temporal grounding. By revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality datasets called QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a Multi-moment Post-verification module to refine the moment boundaries. We introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this sophisticated filtering pipeline, low-confidence proposals are pruned, and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$, it achieves improvements over prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.
CVJan 22, 2025Code
Enhancing Monocular Depth Estimation with Multi-Source Auxiliary TasksAlessio Quercia, Erenus Yildiz, Zhuo Cao et al.
Monocular depth estimation (MDE) is a challenging task in computer vision, often hindered by the cost and scarcity of high-quality labeled datasets. We tackle this challenge using auxiliary datasets from related vision tasks for an alternating training scheme with a shared decoder built on top of a pre-trained vision foundation model, while giving a higher weight to MDE. Through extensive experiments we demonstrate the benefits of incorporating various in-domain auxiliary datasets and tasks to improve MDE quality on average by ~11%. Our experimental analysis shows that auxiliary tasks have different impacts, confirming the importance of task selection, highlighting that quality gains are not achieved by merely adding data. Remarkably, our study reveals that using semantic segmentation datasets as Multi-Label Dense Classification (MLDC) often results in additional quality gains. Lastly, our method significantly improves the data efficiency for the considered MDE datasets, enhancing their quality while reducing their size by at least 80%. This paves the way for using auxiliary data from related tasks to improve MDE quality despite limited availability of high-quality labeled data. Code is available at https://jugit.fz-juelich.de/ias-8/mdeaux.
IVDec 25, 2024
MRI Reconstruction with Regularized 3D Diffusion Model (R3DM)Arya Bangun, Zhuo Cao, Alessio Quercia et al.
Magnetic Resonance Imaging (MRI) is a powerful imaging technique widely used for visualizing structures within the human body and in other fields such as plant sciences. However, there is a demand to develop fast 3D-MRI reconstruction algorithms to show the fine structure of objects from under-sampled acquisition data, i.e., k-space data. This emphasizes the need for efficient solutions that can handle limited input while maintaining high-quality imaging. In contrast to previous methods only using 2D, we propose a 3D MRI reconstruction method that leverages a regularized 3D diffusion model combined with optimization method. By incorporating diffusion based priors, our method improves image quality, reduces noise, and enhances the overall fidelity of 3D MRI reconstructions. We conduct comprehensive experiments analysis on clinical and plant science MRI datasets. To evaluate the algorithm effectiveness for under-sampled k-space data, we also demonstrate its reconstruction performance with several undersampling patterns, as well as with in- and out-of-distribution pre-trained data. In experiments, we show that our method improves upon tested competitors.
CVJul 21, 2025
Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty MinimizationBingqing Zhang, Zhuo Cao, Heming Du et al.
Despite recent advances, Text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties-text ambiguity, mapping uncertainty, and frame uncertainty-via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR's effectiveness, achieving notable gains in Recall@1 (69.2\% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
LGOct 16, 2025
Galaxy Morphology Classification with Counterfactual ExplanationZhuo Cao, Lena Krieger, Hanno Scharr et al.
Galaxy morphologies play an essential role in the study of the evolution of galaxies. The determination of morphologies is laborious for a large amount of data giving rise to machine learning-based approaches. Unfortunately, most of these approaches offer no insight into how the model works and make the results difficult to understand and explain. We here propose to extend a classical encoder-decoder architecture with invertible flow, allowing us to not only obtain a good predictive performance but also provide additional information about the decision process with counterfactual explanations.
CVMar 11, 2025
1LoRA: Summation Compression for Very Low-Rank AdaptationAlessio Quercia, Zhuo Cao, Arya Bangun et al.
Parameter-Efficient Fine-Tuning (PEFT) methods have transformed the approach to fine-tuning large models for downstream tasks by enabling the adjustment of significantly fewer parameters than those in the original model matrices. In this work, we study the "very low rank regime", where we fine-tune the lowest amount of parameters per linear layer for each considered PEFT method. We propose 1LoRA (Summation Low-Rank Adaptation), a compute, parameter and memory efficient fine-tuning method which uses the feature sum as fixed compression and a single trainable vector as decompression. Differently from state-of-the-art PEFT methods like LoRA, VeRA, and the recent MoRA, 1LoRA uses fewer parameters per layer, reducing the memory footprint and the computational cost. We extensively evaluate our method against state-of-the-art PEFT methods on multiple fine-tuning tasks, and show that our method not only outperforms them, but is also more parameter, memory and computationally efficient. Moreover, thanks to its memory efficiency, 1LoRA allows to fine-tune more evenly across layers, instead of focusing on specific ones (e.g. attention layers), improving performance further.
IRFeb 15
Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query ShiftsBingqing Zhang, Zhuo Cao, Heming Du et al.
Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios, and enhancing model reliability for real-world applications.
LGOct 16, 2025
LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow MatchingZhuo Cao, Xuan Zhao, Lena Krieger et al.
The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model's prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.
AISep 3, 2025
sam-llm: interpretable lane change trajectoryprediction via parametric finetuningZhuo Cao, Yunxiao Shi, Min Xu
This work introduces SAM-LLM, a novel hybrid architecture that bridges the gap between the contextual reasoning of Large Language Models (LLMs) and the physical precision of kinematic lane change models for autonomous driving. The system is designed for interpretable lane change trajectory prediction by finetuning an LLM to output the core physical parameters of a trajectory model instead of raw coordinates. For lane-keeping scenarios, the model predicts discrete coordinates, but for lane change maneuvers, it generates the parameters for an enhanced Sinusoidal Acceleration Model (SAM), including lateral displacement, maneuver duration, initial lateral velocity, and longitudinal velocity change. This parametric approach yields a complete, continuous, and physically plausible trajectory model that is inherently interpretable and computationally efficient, achieving an 80% reduction in output size compared to coordinate-based methods. The SAM-LLM achieves a state-of-the-art overall intention prediction accuracy of 98.73%, demonstrating performance equivalent to traditional LLM predictors while offering significant advantages in explainability and resource efficiency.