An implementation of the "Guess who?" game using CLIP
This work demonstrates an incremental application of CLIP for a specific game, highlighting its zero-shot capabilities and limitations in a controlled setting.
The authors implement the 'Guess who?' game on top of CLIP's zero-shot capabilities: players ask questions as natural language prompts, and CLIP automatically evaluates each image on the board. They benchmark several prompting methods and identify the limitations of the approach.
CLIP (Contrastive Language-Image Pretraining) is an efficient method for learning computer vision tasks from natural language supervision that has powered a recent breakthrough in deep learning due to its zero-shot transfer capabilities. By training on image-text pairs available on the internet, the CLIP model transfers non-trivially to most tasks without the need for any dataset-specific training. In this work, we use CLIP to implement the engine of the popular game "Guess who?", so that the player interacts with the game using natural language prompts and CLIP automatically decides whether an image on the game board fulfills that prompt or not. We study the performance of this approach by benchmarking different ways of phrasing the questions as prompts to CLIP, and show the limitations of its zero-shot capabilities.
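
The decision step described in the abstract, where CLIP judges whether a board image fulfills a prompt, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes the question is posed as a contrast between a positive and a negative prompt (one possible prompting method; the abstract does not specify which were benchmarked), and it uses hand-made toy vectors in place of real CLIP image and text embeddings. The function names `fulfills_prompt` and `filter_board`, the logit scale of 100, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fulfills_prompt(image_emb, positive_emb, negative_emb, scale=100.0):
    """Decide whether an image matches the positive prompt better than the negative one.

    `scale` is an assumed temperature applied to the similarities before the
    softmax; the 0.5 threshold is likewise an assumption of this sketch.
    """
    logits = [
        scale * cosine(image_emb, positive_emb),
        scale * cosine(image_emb, negative_emb),
    ]
    p_yes, _ = softmax(logits)
    return p_yes > 0.5, p_yes


def filter_board(board, positive_emb, negative_emb):
    """Return the names of the board images judged to fulfill the prompt."""
    return [
        name
        for name, emb in board.items()
        if fulfills_prompt(emb, positive_emb, negative_emb)[0]
    ]


# Toy stand-ins for embeddings (hypothetical values, NOT produced by CLIP).
board = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.3],
}
glasses = [1.0, 0.0, 0.1]     # e.g. "a photo of a person with glasses"
no_glasses = [0.0, 1.0, 0.1]  # e.g. "a photo of a person without glasses"

print(filter_board(board, glasses, no_glasses))  # ['alice']
```

In a real system the toy vectors would be replaced by embeddings from CLIP's image and text encoders. Contrasting two prompts rather than thresholding a single similarity avoids having to pick an absolute cutoff, though whether that works well in practice is exactly the kind of question the benchmark in this work addresses.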