MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
This work addresses the need for transparency and trust in VLMs, which are crucial for applications in AI safety and explainability, though it appears incremental because it builds on existing model inversion techniques.
The paper tackles the problem of interpreting complex Vision Language Models (VLMs). It proposes the MIMIC framework, which visualizes internal representations through synthesized visual concepts, and evaluates the results with standard visual quality metrics and semantic text-based metrics.
Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
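The abstract names the components of the objective (inversion, feature alignment, and three regularizers) but does not give their exact form. As a rough illustration only, the sketch below assumes the terms are combined as a weighted sum and uses anisotropic total variation as a stand-in for the natural image smoothness regularizer. The function names, the term names, and the weights are all hypothetical and are not taken from the paper.

```python
import numpy as np


def total_variation(image: np.ndarray) -> float:
    """Anisotropic total variation of an (H, W, C) image.

    Assumed stand-in for a smoothness regularizer: it sums the absolute
    differences between vertically and horizontally adjacent pixels, so
    smoother images score lower.
    """
    vertical = np.abs(image[1:, :, :] - image[:-1, :, :]).sum()
    horizontal = np.abs(image[:, 1:, :] - image[:, :-1, :]).sum()
    return float(vertical + horizontal)


def combined_objective(terms: dict, weights: dict) -> float:
    """Weighted sum of named loss terms.

    `terms` maps a term name to its scalar value; `weights` maps the same
    names to hypothetical hyperparameters. Every term must have a weight.
    """
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight given for terms: {sorted(missing)}")
    return float(sum(weights[name] * value for name, value in terms.items()))
```

In an actual inversion loop, the synthesized image would be updated by gradient descent on such a combined value, with the inversion and feature alignment terms computed from the VLM itself. Those terms are left as plain scalars here because their definitions are not available in this text.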