80.6CVApr 20Code
Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language ModelsZiyao Tang, Pengkun Jiao, Bin Zhu et al.
Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting. Rather than merely changing their answers, the models often fabricate unsupported temporal or spatial explanations to justify incorrect revisions. To systematically investigate this phenomenon, we propose a negation-based gaslighting evaluation framework and introduce GasVideo-1000, a curated benchmark designed to probe spatiotemporal sycophancy with clear visual grounding and temporal reasoning requirements. We evaluate a broad range of state-of-the-art open-source and proprietary Vid-LLMs across diverse video understanding tasks. Extensive experiments reveal that vulnerability to negation-based gaslighting is pervasive and severe, even among models with strong baseline performance. While prompt-level grounding constraints can partially mitigate this behavior, they do not reliably prevent hallucinated justifications or belief reversal. Our results indicate that current Vid-LLMs lack robust mechanisms for maintaining grounded spatiotemporal beliefs under adversarial conversational feedback.
CVSep 1, 2022
Combating Noisy Labels in Long-Tailed Image ClassificationChaowei Fang, Lechao Cheng, Huiyan Qi et al.
Most existing methods that cope with noisy labels usually assume that the class distributions are well balanced, which has insufficient capacity to deal with the practical scenarios where training samples have imbalanced distributions. To this end, this paper makes an early effort to tackle the image classification task with both long-tailed distribution and label noise. Existing noise-robust learning methods cannot work in this scenario as it is challenging to differentiate noisy samples from clean samples of tail classes. To deal with this problem, we propose a new learning paradigm based on matching between inferences on weak and strong data augmentations to screen out noisy samples and introduce a leave-noise-out regularization to eliminate the effect of the recognized noisy samples. Furthermore, we incorporate a novel prediction penalty based on online prior distribution to avoid bias towards head classes. This mechanism has superiority in capturing the class fitting degree in realtime compared to the existing long-tail classification methods. Exhaustive experiments demonstrate that the proposed method outperforms state-of-the-art algorithms that address the distribution imbalance problem in long-tailed classification under noisy labels.
CVNov 29, 2022
Transferability Estimation Based On Principal Gradient ExpectationHuiyan Qi, Lechao Cheng, Jingjing Chen et al.
Transfer learning aims to improve the performance of target tasks by transferring knowledge acquired in source tasks. The standard approach is pre-training followed by fine-tuning or linear probing. Especially, selecting a proper source domain for a specific target domain under predefined tasks is crucial for improving efficiency and effectiveness. It is conventional to solve this problem via estimating transferability. However, existing methods can not reach a trade-off between performance and cost. To comprehensively evaluate estimation methods, we summarize three properties: stability, reliability and efficiency. Building upon them, we propose Principal Gradient Expectation(PGE), a simple yet effective method for assessing transferability. Specifically, we calculate the gradient over each weight unit multiple times with a restart scheme, and then we compute the expectation of all gradients. Finally, the transferability between the source and target is estimated by computing the gap of normalized principal gradients. Extensive experiments show that the proposed metric is superior to state-of-the-art methods on all properties.
CVDec 22, 2023Code
FoodLMM: A Versatile Food Assistant using Large Multi-modal ModelYuehao Yin, Huiyan Qi, Bin Zhu et al.
Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation and multi-round conversation. To facilitate FoodLMM to deal with tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning by leveraging the instruct-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.
CLJan 31, 2025Code
Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language ModelsBin Zhu, Yinxuan Gui, Huiyan Qi et al.
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. In this paper, we systematically study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs, often fabricating justifications. We conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks and observe substantial performance drops when negation is introduced. Notably, we introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments. GaslightingBench consists of multiple-choice questions curated from existing datasets, along with generated negation prompts across 20 diverse categories. Throughout extensive evaluation, we find that proprietary models such as Gemini-1.5-flash and GPT-4o demonstrate better resilience compared to open-source counterparts like Qwen2-VL and LLaVA, though even advanced reasoning-oriented models like Gemini-2.5-Pro remain susceptible. Our category-level analysis further shows that subjective or socially nuanced domains (e.g., Social Relation, Image Emotion) are especially fragile, while more objective domains (e.g., Geography) exhibit relatively smaller but still notable drops. Overall, all evaluated MLLMs struggle to maintain logical consistency under gaslighting negation attack. These findings highlight a fundamental robustness gap and provide insights for developing more reliable and trustworthy multimodal AI systems. Project website: https://yxg1005.github.io/GaslightingNegationAttacks/.
CVMay 13, 2025
Advancing Food Nutrition Estimation via Visual-Ingredient Feature FusionHuiyan Qi, Bin Zhu, Chong-Wah Ngo et al.
Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF$^2$) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models by data augmentation and majority voting. Our experiments on both FastFood and Nutrition5k datasets validate the effectiveness of our proposed method built in different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the importance of ingredient information in nutrition estimation. https://huiyanqi.github.io/fastfood-nutrition-estimation/.