DeiSAM: Segment Anything with Deictic Prompting
This work addresses the challenge of interpreting contextual language prompts in computer vision and offers a novel approach to more intuitive human-AI interaction in segmentation tasks, although its integration of logic with neural networks is incremental.
The paper tackles the problem of segmenting objects in complex scenes from deictic natural language prompts, which current deep learning methods struggle to interpret. It proposes DeiSAM, a method that combines pre-trained networks with differentiable logic reasoners and achieves a substantial improvement over data-driven baselines on the new DeiVG dataset.
Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., referring to something depending on the context, such as "The object that is on the desk and behind the cup." However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM -- a combination of large pre-trained neural networks with differentiable logic reasoners -- for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.
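
To make the reasoning step concrete, the following is a minimal sketch of how a rule such as target(X) :- on(X, desk), behind(X, cup) could be scored against a scene graph with soft truth values. It is an illustration under stated assumptions, not DeiSAM's actual reasoner: the scene graph, its confidence values, the function name score_targets, and the use of a product as the soft conjunction are all hypothetical choices made here for clarity.

```python
# Hypothetical scene graph: (subject, relation, object) -> confidence in [0, 1].
# The objects and scores are invented for illustration.
scene_graph = {
    ("book", "on", "desk"): 0.9,
    ("book", "behind", "cup"): 0.8,
    ("lamp", "on", "desk"): 0.95,
    ("lamp", "behind", "cup"): 0.1,
    ("cup", "on", "desk"): 0.9,
}

# Body of the assumed rule: target(X) :- on(X, desk), behind(X, cup).
rule_body = [("on", "desk"), ("behind", "cup")]


def score_targets(graph, body):
    """Score each candidate X as the product of its body-atom confidences.

    The product acts as a soft AND (an assumption of this sketch), so the
    score stays differentiable with respect to the input confidences.
    """
    candidates = {subj for (subj, _, _) in graph}
    scores = {}
    for x in candidates:
        s = 1.0
        for rel, obj in body:
            # A fact missing from the graph is treated as false (0.0).
            s *= graph.get((x, rel, obj), 0.0)
        scores[x] = s
    return scores


scores = score_targets(scene_graph, rule_body)
best = max(scores, key=scores.get)
print(best)  # book
```

In this toy graph the lamp is on the desk with high confidence but is unlikely to be behind the cup, so its score stays low, while the book satisfies both conditions. In a full pipeline, the region of the highest-scoring object would then be passed to a segmentation model as a prompt.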