CV AI LG | Nov 15, 2024

ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation

Hesam Hosseini, Ghazal Hosseini Mighan, Amirabbas Afzali, Sajjad Amini, Amir Houmansadr

arXiv:2411.12589v23.71 citations | h-index: 3 | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses the interpretability challenge in Transformer-based models for computer vision and natural language processing, offering a tool for explaining semantic structures without fine-tuning, though it is incremental in building on existing pre-trained models.

The paper tackles the difficulty of interpreting latent token representations in Transformers by introducing ULTra, a framework that enables unsupervised semantic segmentation and reports state-of-the-art performance on that task.

Transformers have revolutionized Computer Vision (CV) through self-attention mechanisms. However, their complexity makes latent token representations difficult to interpret. We introduce ULTra, a framework for interpreting Transformer embeddings and uncovering meaningful semantic patterns within them. ULTra enables unsupervised semantic segmentation using pre-trained models without requiring fine-tuning. Additionally, we propose a self-supervised training approach that refines segmentation performance by learning an external transformation matrix without modifying the underlying model. Our method achieves state-of-the-art performance in unsupervised semantic segmentation, outperforming existing segmentation methods. Furthermore, we validate ULTra for model interpretation on both synthetic and real-world scenarios, including Object Selection and interpretable text summarization using LLMs, demonstrating its broad applicability in explaining the semantic structure of latent token representations.
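As a rough illustration of the idea in the abstract, and not the authors' implementation, the sketch below groups token embeddings from a frozen model into segments after passing them through an external matrix `W`, leaving the backbone untouched. The function name `segment_tokens`, the cosine k-means step, and the farthest-point initialization are assumptions made for this sketch; this page does not describe the paper's actual clustering procedure or the self-supervised objective used to learn the matrix.

```python
import numpy as np


def segment_tokens(tokens, W, k, iters=20):
    """Cluster frozen token embeddings after an external linear map W.

    tokens: (n_tokens, d) array taken from a frozen pre-trained model (assumed given).
    W:      (d, d_out) external matrix; the backbone itself is never updated.
    Returns an (n_tokens,) array of cluster labels in [0, k).
    """
    # Project the embeddings and normalize so dot products are cosine similarities.
    z = tokens @ W
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)

    # Deterministic farthest-point initialization (an assumption of this sketch):
    # start from the first token, then repeatedly add the least similar token.
    centers = [z[0]]
    for _ in range(k - 1):
        sim = np.max(z @ np.stack(centers).T, axis=1)
        centers.append(z[np.argmin(sim)])
    centers = np.stack(centers)

    # Plain cosine k-means: assign each token, then re-estimate each center.
    for _ in range(iters):
        labels = np.argmax(z @ centers.T, axis=1)
        for c in range(k):
            members = z[labels == c]
            if len(members):
                mean = members.mean(axis=0)
                centers[c] = mean / (np.linalg.norm(mean) + 1e-8)
    return np.argmax(z @ centers.T, axis=1)
```

For a Vision Transformer, reshaping the returned labels to the patch grid (for example 14 x 14) would give a coarse segmentation map. With `W` set to the identity this reduces to clustering the raw frozen embeddings; a learned `W` would stand in for the refinement step the abstract mentions.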
