CV, AI, LG | Jan 24, 2025

PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction

arXiv:2501.14210v1 | 12 citations | h-index: 20 | NAACL
Originality: Incremental advance
AI Analysis

This work addresses time and location prediction from images, a task that matters for applications such as geolocation and historical analysis. The advance is incremental: it builds on existing modular and reasoning-based approaches.

The paper tackles time and location prediction from images by formalizing human-like puzzle-solving abilities into a modular expert pipeline called PuzzleGPT. It achieves state-of-the-art performance on the TARA and WikiTilo datasets, outperforming large VLMs by at least 32% and automated reasoning pipelines by at least 38%.

The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combinatorially combine information from different clues, a web retriever to get external knowledge if the task can't be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets -- TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
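
To make the module roles concrete, the flow named in the abstract (perceiver, noise filter, reasoner, combiner, web retriever) can be sketched as below. This is only an illustrative toy, not the authors' implementation: the control flow, the confidence threshold, the function names (`solve`, `combine`), and the stub modules are all assumptions, and the real modules would presumably be backed by vision-language and language models rather than hand-written rules.

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence, Tuple

Guess = Tuple[str, float]  # (prediction, confidence in [0, 1]); hypothetical format


def combine(clues: Sequence[str],
            reasoner: Callable[[Sequence[str]], Optional[Guess]]) -> Optional[Guess]:
    """Combiner (assumed behavior): try every non-empty subset of clues,
    smallest first, and keep the most confident guess the reasoner returns."""
    best: Optional[Guess] = None
    for size in range(1, len(clues) + 1):
        for subset in combinations(clues, size):
            guess = reasoner(subset)
            if guess is not None and (best is None or guess[1] > best[1]):
                best = guess
    return best


def solve(image: object,
          perceiver: Callable[[object], List[str]],
          noise_filter: Callable[[List[str]], List[str]],
          reasoner: Callable[[Sequence[str]], Optional[Guess]],
          web_retriever: Callable[[Sequence[str]], List[str]],
          threshold: float = 0.7) -> Optional[Guess]:
    """Assumed control flow: solve locally first, fall back to retrieval
    only when the local answer is missing or below the threshold."""
    clues = noise_filter(perceiver(image))
    best = combine(clues, reasoner)
    if best is None or best[1] < threshold:
        extra = noise_filter(web_retriever(clues))
        retry = combine(list(clues) + extra, reasoner)
        if retry is not None and (best is None or retry[1] > best[1]):
            best = retry
    return best


# Toy stand-ins for the real modules (all hypothetical).
def toy_perceiver(image: object) -> List[str]:
    return ["red double-decker bus", "blurry sky", "Union Jack flag"]


def toy_filter(clues: List[str]) -> List[str]:
    return [c for c in clues if "blurry" not in c]


def toy_reasoner(subset: Sequence[str]) -> Optional[Guess]:
    seen = set(subset)
    if {"red double-decker bus", "Union Jack flag"} <= seen:
        return ("London, UK", 0.9)
    if "Union Jack flag" in seen:
        return ("United Kingdom", 0.6)
    return None


def toy_retriever(clues: Sequence[str]) -> List[str]:
    return []


result = solve("photo.jpg", toy_perceiver, toy_filter, toy_reasoner, toy_retriever)
print(result)
```

In this toy run no single clue is decisive, but the pair of clues yields a more specific and more confident guess, which is the kind of clue combination the abstract attributes to the combiner. Exhaustive subset enumeration is exponential in the number of clues; it is used here only to keep the sketch short.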
