CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing
The work provides mechanistic insight into how LVLMs process spatiotemporal semantics; the contribution is incremental, but it lays groundwork for designing more robust and interpretable models.
The paper addresses how large vision-language models (LVLMs) process spatiotemporal visual semantics. It finds that visual semantics are highly localized to specific object tokens, whose removal degrades performance by up to 92.6%, and that interpretable concepts emerge in the middle-to-late layers.
The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within LVLMs. Specifically, our framework comprises three circuits: a visual auditing circuit, a semantic tracing circuit, and an attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens: removing these tokens can degrade model performance by up to 92.6%. Furthermore, we find that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrast to existing work, which focuses solely on objects in a single image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into spatiotemporal semantics analysis of LVLMs, laying a foundation for designing more robust and interpretable models.
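
The token-removal finding can be illustrated with a small ablation sketch. This is not the paper's implementation: the fill strategies, the function names (`ablate_tokens`, `relative_drop`, `toy_score`), and the toy scorer that stands in for a real LVLM evaluation are all assumptions made for illustration. The abstract also does not say whether the 92.6% figure is a relative or an absolute drop; the sketch computes a relative one.

```python
import numpy as np


def ablate_tokens(tokens, mask, fill="zero"):
    """Return a copy of `tokens` (shape [N, D]) with the masked rows replaced.

    fill="zero" overwrites masked rows with zeros; fill="mean" overwrites
    them with the mean of the unmasked rows. Both are assumed strategies.
    """
    tokens = np.asarray(tokens, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    out = tokens.copy()
    if fill == "zero":
        out[mask] = 0.0
    elif fill == "mean":
        if mask.all():
            raise ValueError("cannot mean-fill when every token is masked")
        out[mask] = tokens[~mask].mean(axis=0)
    else:
        raise ValueError(f"unknown fill strategy: {fill!r}")
    return out


def relative_drop(baseline, ablated):
    """Fraction of the baseline score lost after ablation."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (baseline - ablated) / baseline


def toy_score(tokens, probe):
    """Hypothetical stand-in for a model metric: strongest probe response."""
    return float((np.asarray(tokens, dtype=float) @ probe).max())


# Toy data: six visual tokens, two of which (rows 1 and 2) carry the object.
tokens = np.zeros((6, 4))
tokens[:, 1] = 1.0            # background content shared by all tokens
tokens[[1, 2], 0] = 5.0       # object signal along dimension 0
probe = np.array([1.0, 0.0, 0.0, 0.0])

object_mask = np.array([False, True, True, False, False, False])
control_mask = np.array([False, False, False, False, True, True])

baseline = toy_score(tokens, probe)
object_drop = relative_drop(baseline, toy_score(ablate_tokens(tokens, object_mask), probe))
control_drop = relative_drop(baseline, toy_score(ablate_tokens(tokens, control_mask), probe))
print(f"object-token ablation drop: {object_drop:.1%}")
print(f"control ablation drop:      {control_drop:.1%}")
```

Comparing against a control mask of the same size separates the effect of removing object tokens from the effect of removing any two tokens; in a real experiment, `toy_score` would be replaced by the model's task metric on a benchmark.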