VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps
This work addresses jigsaw puzzle solving in computer vision and introduces a multimodal vision-language paradigm that could benefit other applications requiring semantic alignment.
The paper tackles jigsaw puzzle solving, particularly puzzles with eroded gaps, by proposing a vision-language framework that uses textual descriptions for semantic guidance. It reports a 14.2 percentage point gain in piece accuracy over state-of-the-art models.
Jigsaw puzzle solving remains challenging in computer vision because it requires an understanding of both local fragment details and global spatial relationships. Most traditional approaches rely only on visual cues such as edge matching and visual coherence, and few methods explore natural language descriptions as semantic guidance in difficult scenarios, especially eroded-gap puzzles. We propose a vision-language framework that leverages textual context to improve puzzle assembly. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module, which aligns visual patches with textual descriptions through multi-level semantic matching, from local tokens to global context. In addition, the module integrates a multimodal architecture that combines dual visual encoders with language features for cross-modal reasoning. Experiments show that our method significantly outperforms state-of-the-art models across various datasets, including a 14.2 percentage point gain in piece accuracy. Ablation studies confirm the critical role of the VLHSA module in driving improvements over vision-only approaches. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
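To make the idea of multi-level matching "from local tokens to global context" more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that patch and token embeddings already live in a shared space of dimension D, that the global text context can be approximated by mean-pooling the token embeddings, and that fusion is a simple residual addition. The function name `hierarchical_alignment` and the `temperature` parameter are hypothetical and do not come from the abstract.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hierarchical_alignment(patches, tokens, temperature=0.1):
    """Illustrative two-level patch/text alignment (an assumption, not VLHSA itself).

    patches: (P, D) array of visual patch embeddings.
    tokens:  (T, D) array of text token embeddings.
    Returns (local_sim, global_sim, fused):
      local_sim  (P, T): cosine similarity of each patch to each token.
      global_sim (P,):   cosine similarity of each patch to the pooled caption.
      fused      (P, D): patch features enriched with attended text context.
    """
    p = l2_normalize(patches)
    t = l2_normalize(tokens)

    # Local level: patch-to-token cosine similarities.
    local_sim = p @ t.T

    # Global level: compare each patch with a mean-pooled sentence vector.
    global_text = l2_normalize(t.mean(axis=0))
    global_sim = p @ global_text

    # Cross-modal step: each patch attends over tokens, then adds the
    # attended text context back to its own embedding (residual fusion).
    attn = softmax(local_sim / temperature, axis=1)
    fused = l2_normalize(p + attn @ t)
    return local_sim, global_sim, fused
```

In a full system the fused patch features would feed a placement head that predicts each piece's position. That stage, the dual visual encoders, and the training objective are outside what the abstract specifies, so they are omitted here.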