CV · AI · Dec 26, 2025

LVLM-Aided Alignment of Task-Specific Vision Models

arXiv:2512.21985v1 · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses alignment issues in high-stakes domains where small vision models are used. It offers an efficient solution that does not require fine-grained feedback, though the advance is incremental because it builds on existing LVLM capabilities.

The paper tackles the problem of small task-specific vision models relying on spurious correlations instead of human domain knowledge, which can lead to brittle real-world behavior. It introduces LVLM-VA, a method that uses a Large Vision Language Model to align these models with human specifications, and reports substantial improvements in reducing their dependence on spurious features and biases.

In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This can result in brittle behavior once the models are deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model's dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.
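To make "dependence on spurious features" concrete, here is a minimal sketch of one way such dependence could be quantified. It is not the LVLM-VA procedure, which the abstract does not specify. It assumes two hypothetical inputs: a per-pixel saliency map from some explanation method, and a binary mask marking the region an image-level critique flags as spurious. The function name `spurious_attribution_share` and the toy values are illustrative only.

```python
from typing import List


def spurious_attribution_share(
    saliency: List[List[float]],
    spurious_mask: List[List[int]],
) -> float:
    """Return the fraction of total absolute attribution inside the masked region.

    Assumption: `saliency` and `spurious_mask` have the same shape, and a
    nonzero mask entry marks a pixel that a critique flagged as spurious.
    """
    total = 0.0
    inside = 0.0
    for sal_row, mask_row in zip(saliency, spurious_mask):
        for value, flagged in zip(sal_row, mask_row):
            total += abs(value)
            if flagged:
                inside += abs(value)
    # A map with no attribution at all has no dependence to measure.
    return inside / total if total > 0 else 0.0


# Toy 2x2 example: the bottom-right pixel is flagged as spurious.
toy_saliency = [[0.0, 1.0], [1.0, 2.0]]
toy_mask = [[0, 0], [0, 1]]
toy_share = spurious_attribution_share(toy_saliency, toy_mask)
```

In this toy case half of the attribution mass lies on the flagged pixel. A share near zero would suggest the model mostly ignores the flagged region, while a share near one would suggest it relies on that region heavily.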

