Mitigating Spurious Correlations in Weakly Supervised Semantic Segmentation via Cross-architecture Consistency Regularization
This addresses incomplete coverage and inaccurate boundaries in WSSS for specific domains like industrial monitoring, but it is incremental as it builds on teacher-student frameworks and consistency methods.
The paper tackles spurious correlations in weakly supervised semantic segmentation, especially in domains like industrial smoke where emissions are coupled with chimneys, by proposing a cross-architecture consistency regularization framework that improves pseudo mask quality without external supervision.
Scarcity of pixel-level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and inherent bias in models trained with only image level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations, especially in our domain, where emissions are always spatially coupled with chimneys. Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model's inherent bias toward co-occurring context. To address this, we propose a novel WSSS framework that directly targets the co-occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher-student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross-architecture consistency by aligning internal representations. Additionally, we incorporate post-processing techniques to address partial coverage and further improve pseudo mask quality.