Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
This work addresses the trade-off between accuracy and efficiency in open-vocabulary semantic segmentation, which matters for real-world applications such as autonomous driving, though the contribution appears incremental because it builds on existing vision-language models.
The paper tackles the trade-off between accuracy and efficiency in open-vocabulary semantic segmentation by introducing ERR-Seg, a framework that reduces redundancy through a Channel Reduction Module (CRM) and Efficient Semantic Context Fusion (ESCF). Under the ADE20K-847 setting, it reports a +5.6% mIoU improvement and a 67.3% latency reduction compared to previous state-of-the-art methods.
Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities, significantly facilitating the development of OVSS. However, most existing methods suffer from either suboptimal performance or long latency. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency. ERR-Seg incorporates a training-free Channel Reduction Module (CRM) that leverages prior knowledge from vision-language models like CLIP to identify the most relevant classes while discarding others. It also employs Efficient Semantic Context Fusion (ESCF) with spatial-level and class-level sequence reduction strategies. Together, CRM and ESCF yield substantial memory and computational savings without compromising accuracy. Additionally, recognizing the significance of hierarchical semantics extracted from middle-layer features for closed-set semantic segmentation, ERR-Seg introduces the Hierarchical Semantic Module (HSM) to exploit hierarchical semantics in the context of OVSS. Compared to previous state-of-the-art methods under the ADE20K-847 setting, ERR-Seg achieves a $+5.6\%$ mIoU improvement and reduces latency by $67.3\%$.
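The training-free class selection attributed to CRM can be illustrated with a minimal sketch. The abstract does not specify how relevance is scored, so everything below is an assumption made for illustration: pixel and class-name embeddings are taken to live in a shared CLIP-like space, relevance is the maximum pixel-to-text cosine similarity, and the function name `select_top_k_classes` and parameter `k` are hypothetical. It is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def select_top_k_classes(pixel_feats, text_feats, k):
    """Keep the k classes with the highest pixel-text similarity.

    pixel_feats: list of P pixel embeddings, each of length D
    text_feats:  list of N class-name embeddings, each of length D
    Returns (kept, reduced), where kept holds the retained class indices
    and reduced[p][j] is the similarity of pixel p to the j-th kept class.
    """
    pixels = [l2_normalize(p) for p in pixel_feats]
    texts = [l2_normalize(t) for t in text_feats]

    # Cost volume: cosine similarity for every (pixel, class) pair.
    cost = [[sum(a * b for a, b in zip(p, t)) for t in texts] for p in pixels]

    # Assumed relevance score: best similarity of each class over all pixels.
    relevance = [max(row[c] for row in cost) for c in range(len(texts))]

    # Rank classes by relevance, keep the top k, restore original class order.
    ranked = sorted(range(len(texts)), key=lambda c: relevance[c], reverse=True)
    kept = sorted(ranked[:k])

    # Drop the channels of discarded classes: N channels shrink to k.
    reduced = [[row[c] for c in kept] for row in cost]
    return kept, reduced
```

With 847 candidate classes, as in the ADE20K-847 setting, a reduction of this kind would shrink the class dimension of the cost volume from 847 to `k` before any learned fusion layers run, which is where the memory and compute savings described above would come from.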