CV · Nov 29, 2023 · Code
Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning? · Shuo Chen, Zhen Han, Bailan He et al.
Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, it remains understudied on MLLMs. One essential question is whether these MLLMs can truly conduct multimodal ICL, or whether only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influence of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments reveal that multimodal ICL is predominantly driven by the textual content, whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method. Code is available at https://chenxshuo.github.io/m-icl/.
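The abstract says only that MMICES considers both visual and language modalities when selecting demos. One way this could look is a two-stage selection: pre-filter candidates by visual similarity, then re-rank the survivors by textual similarity. The sketch below follows that assumption; the function names, the dictionary keys (`id`, `img`, `txt`), the parameters `k_visual` and `n_demos`, and the use of cosine similarity over precomputed embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_demos(query_img, query_txt, pool, k_visual, n_demos):
    """Assumed two-stage mixed-modality selection.

    pool: list of dicts with hypothetical keys 'id', 'img', 'txt',
          where 'img' and 'txt' are precomputed embedding vectors.
    """
    # Stage 1: keep the k_visual candidates whose images are closest to the query image.
    by_visual = sorted(
        pool, key=lambda d: cosine(query_img, d["img"]), reverse=True
    )[:k_visual]
    # Stage 2: among those, keep the n_demos candidates whose text is closest to the query text.
    by_text = sorted(
        by_visual, key=lambda d: cosine(query_txt, d["txt"]), reverse=True
    )[:n_demos]
    return [d["id"] for d in by_text]


# Toy pool with 2-dimensional embeddings, purely for illustration.
pool = [
    {"id": "a", "img": [1.0, 0.0], "txt": [1.0, 0.0]},
    {"id": "b", "img": [0.9, 0.1], "txt": [0.0, 1.0]},
    {"id": "c", "img": [0.0, 1.0], "txt": [0.0, 1.0]},
    {"id": "d", "img": [0.8, 0.2], "txt": [0.5, 0.5]},
]
chosen = select_demos([1.0, 0.0], [0.0, 1.0], pool, k_visual=3, n_demos=2)
print(chosen)
```

In this toy pool, candidate `c` matches the query text perfectly but is dropped in the visual stage, which illustrates how visual content can matter for selection even if the model relies mostly on demo text.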
CL · Sep 28, 2024
Visual Question Decomposition on Multimodal Large Language Models · Haowei Zhang, Jianzhe Liu, Zhen Han et al. · deepmind, oxford
Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model's question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The finetuning pipeline consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and the policy of selective question decomposition. Additionally, the models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.
CV · Jul 21, 2025
True Multimodal In-Context Learning Needs Attention to the Visual Context · Shuo Chen, Jianzhe Liu, Zhen Han et al. · deepmind, oxford
Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL), i.e., adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm.
15.4 · SY · Apr 8
Decision-focused Conservation Voltage Reduction to Consider the Cascading Impact of Forecast Errors · Qintao Du, Ran Li, Weiyi Lv et al.
Conservation Voltage Reduction (CVR) relies on the effective coordination of slow-acting devices, such as OLTCs and CBs, and fast-acting devices, such as SVGs and PV inverters, typically implemented through a hierarchical multi-stage Volt-Var Control (VVC) spanning day-ahead scheduling, intra-day dispatch, and real-time control. However, existing sequential methods fail to account for the cascading impact of forecast errors on multi-stage decision-making. This oversight results in suboptimal day-ahead schedules for OLTCs and CBs that hinder the effective coordination with fast-acting SVGs and inverters, inevitably driving a trade-off between real-time voltage security and CVR efficiency. To improve the Pareto front of this trade-off, this paper proposes a novel bi-level multi-timescale forecasting (Bi-MTF) framework for multi-stage VVC optimization. By integrating the downstream multi-stage VVC optimization into the training of the upstream forecasting models, the decision-focused forecasting models are able to learn the trade-offs across temporal horizons. To solve the computationally challenging bi-level formulation, a modified sensitivity-driven integer L-shaped method is developed. It utilizes a hybrid gradient feedback mechanism that integrates numerical sensitivity analysis for discrete variables with analytical dual information for continuous forecast parameters to ensure tractability. Numerical results on a modified IEEE 33-bus system demonstrate that the proposed approach yields superior energy savings and operational safety compared to conventional MSE-based sequential paradigms. Specifically, as the capacity of fast-acting devices increases, the energy savings of the proposed method rise from 2.74% to 3.41%, which is far superior to the 1.50% to 1.76% achieved by conventional MSE-based sequential paradigms.
SY · Feb 18, 2021
Encoding Frequency Constraints in Preventive Unit Commitment Using Deep Learning with Region-of-Interest Active Sampling · Yichen Zhang, Hantao Cui, Jianzhe Liu et al.
With the increasing penetration of renewable energy, frequency response and its security are of significant concern for reliable power system operations. Frequency-constrained unit commitment (FCUC) is proposed to address this challenge. Despite existing efforts in modeling frequency characteristics in unit commitment (UC), current strategies can only handle oversimplified low-order frequency response models and do not consider wide-range operating conditions. This paper presents a generic data-driven framework for FCUC under high renewable penetration. Deep neural networks (DNNs) are trained to predict the frequency response using real data or high-fidelity simulation data. Next, the DNN is reformulated as a set of mixed-integer linear constraints to be incorporated into the ordinary UC formulation. In the data generation phase, all possible power injections are considered, and a region-of-interest active sampling method is proposed to include power injection samples with frequency nadirs closer to the UFLC threshold, which significantly enhances the accuracy of frequency constraints in FCUC. The proposed FCUC is verified on the IEEE 39-bus system. Then, a full-order dynamic model simulation using PSS/E verifies the effectiveness of FCUC in frequency-secure generator commitments.
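The abstract does not state which activation function the DNN uses or how the reformulation is written. As an illustration only, assuming ReLU activations, a single neuron $z = \max(0,\, w^\top x + b)$ with known pre-activation bounds $L \le w^\top x + b \le U$ and $L < 0 < U$ can be expressed exactly with the standard big-M encoding below; the symbols $w$, $b$, $L$, $U$, $z$, and the binary indicator $a$ are introduced here and are not from the paper.

```latex
\begin{aligned}
z &\ge w^\top x + b, \\
z &\ge 0, \\
z &\le w^\top x + b - L\,(1 - a), \\
z &\le U\,a, \\
a &\in \{0, 1\}.
\end{aligned}
```

When $a = 1$ the constraints force $z = w^\top x + b \ge 0$, and when $a = 0$ they force $z = 0$ with $w^\top x + b \le 0$, so stacking these constraints layer by layer yields a mixed-integer linear description of the trained network that a UC solver can handle.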
LG · Jul 27, 2020
Deep Active Learning for Solvability Prediction in Power Systems · Yichen Zhang, Jianzhe Liu, Feng Qiu et al.
Traditional methods for solvability region analysis can only provide inner approximations with inconclusive conservatism. Machine learning methods have been proposed to approach the real region. In this letter, we propose a deep active learning framework for power system solvability prediction. Compared with passive learning methods, where training is performed after all instances are labeled, active learning selects the most informative instances to be labeled and therefore significantly reduces the size of the labeled dataset needed for training. In the active learning framework, the acquisition functions, which correspond to different sampling strategies, are defined in terms of the on-the-fly posterior probability from the classifier. The IEEE 39-bus system is employed to validate the proposed framework, where a two-dimensional case is illustrated to visualize the effectiveness of the sampling method, followed by full-dimensional numerical experiments.
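The abstract does not name the acquisition functions used. As a rough sketch, three common uncertainty-based choices that depend only on the classifier's posterior probabilities are least confidence, negative margin, and entropy. The function names and the `select_batch` helper below are hypothetical and are not from the paper.

```python
import math


def least_confidence(p):
    """Higher when the most likely class has low posterior probability."""
    return 1.0 - max(p)


def negative_margin(p):
    """Higher (closer to 0) when the top two classes are nearly tied."""
    top = sorted(p, reverse=True)
    return -(top[0] - top[1])


def entropy(p):
    """Shannon entropy of the posterior; higher means more uncertain."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def select_batch(posteriors, batch_size, score=entropy):
    """Return indices of the batch_size unlabeled instances with the highest score."""
    ranked = sorted(
        range(len(posteriors)), key=lambda i: score(posteriors[i]), reverse=True
    )
    return ranked[:batch_size]


# Toy posteriors for three unlabeled instances of a binary (solvable / unsolvable) classifier.
posteriors = [[0.99, 0.01], [0.55, 0.45], [0.80, 0.20]]
picked = select_batch(posteriors, batch_size=2)
print(picked)
```

Instances near the decision boundary score highest under all three functions, so labeling effort concentrates where the classifier is least certain about solvability.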