Visual Natural Language Query Auto-Completion for Estimating Instance Probabilities
This work addresses a domain-specific problem at the intersection of computer vision and natural language processing, with applications such as image retrieval and interactive systems. It is incremental in that it builds on existing methods such as BERT.
The paper tackles query auto-completion for estimating instance probabilities: it completes a user's query prefix conditioned on an image, then fine-tunes a BERT embedding to rank instances, and shows that combining language and vision outperforms a language-only approach.
We present a new task of query auto-completion for estimating instance probabilities. We complete a user query prefix conditioned on an image. Given the completed query, we fine-tune a BERT embedding to estimate probabilities over a broad set of instances. The resulting instance probabilities are used for selection and are agnostic to the segmentation or attention mechanism. Our results demonstrate that auto-completion using both language and vision performs better than using language alone, and that fine-tuning a BERT embedding allows instances in the image to be ranked efficiently. In the spirit of reproducible research, we make our data, models, and code available.
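The ranking step described above can be illustrated with a minimal sketch. It is not the released implementation. It assumes that the completed query has already been encoded as a fixed-size vector (a random vector stands in for the BERT embedding here), that a single linear layer with independent sigmoid outputs maps that vector to per-class probabilities, and that a separate detector or segmenter supplies the list of instances present in the image. The function names, the class list, and the dimensions are all hypothetical.

```python
import numpy as np


def instance_probabilities(query_embedding, weights, bias):
    """Map a query embedding to independent per-class probabilities via a sigmoid."""
    logits = query_embedding @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))


def rank_instances(probs, class_names, detected):
    """Rank the classes found in the image by their query-conditioned probability."""
    index = {name: i for i, name in enumerate(class_names)}
    scored = [(name, float(probs[index[name]])) for name in detected if name in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


rng = np.random.default_rng(0)
class_names = ["dog", "ball", "tree", "person"]   # hypothetical instance vocabulary
embedding = rng.normal(size=8)                    # stand-in for a BERT query embedding
weights = rng.normal(size=(8, len(class_names)))  # hypothetical, untrained linear head
bias = np.zeros(len(class_names))

probs = instance_probabilities(embedding, weights, bias)
# `detected` would come from any segmentation or attention mechanism;
# the ranking itself does not depend on which one is used.
ranking = rank_instances(probs, class_names, detected=["ball", "person", "dog"])
for name, p in ranking:
    print(f"{name}: {p:.3f}")
```

Because the probabilities are computed from the query alone and only afterwards matched to detected instances, the sketch reflects the property stated in the abstract that selection is agnostic to how instances are localized.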