LGAug 6, 2024
ClassiFIM: An Unsupervised Method To Detect Phase TransitionsVictor Kasatkin, Evgeny Mozgunov, Nicholas Ezzell et al.
Estimation of the Fisher Information Metric (FIM-estimation) is an important task that arises in unsupervised learning of phase transitions, a problem proposed by physicists. This work completes the definition of the task by defining rigorous evaluation metrics distMSE, distMSEPS, and distRE and introduces ClassiFIM, a novel machine learning method designed to solve the FIM-estimation task. Unlike existing methods for unsupervised learning of phase transitions, ClassiFIM directly estimates a well-defined quantity (the FIM), allowing it to be rigorously compared to any present and future other methods that estimate the same. ClassiFIM transforms a dataset for the FIM-estimation task into a dataset for an auxiliary binary classification task and involves selecting and training a model for the latter. We prove that the output of ClassiFIM approaches the exact FIM in the limit of infinite dataset size and under certain regularity conditions. We implement ClassiFIM on multiple datasets, including datasets describing classical and quantum phase transitions, and find that it achieves a good ground truth approximation with modest computational resources. Furthermore, we independently implement two alternative state-of-the-art methods for unsupervised estimation of phase transition locations on the same datasets and find that ClassiFIM predicts such locations at least as well as these other methods. To emphasize the generality of our method, we also propose and generate the MNIST-CNN dataset, which consists of the output of CNNs trained on MNIST for different hyperparameter choices. Using ClassiFIM on this dataset suggests there is a phase transition in the distribution of image-prediction pairs for CNNs trained on MNIST, demonstrating the broad scope of FIM-estimation beyond physics.
CVOct 10, 2025
Constructive Distortion: Improving MLLMs with Attention-Guided Image WarpingDwip Dalal, Gautam Vashishtha, Utkarsh Mishra et al.
Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM's cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
CVDec 17, 2025
City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMsDwip Dalal, Utkarsh Mishra, Narendra Ahuja et al.
Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environment. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs, reasoning techniques (e.g., GEPA, chain-of-thought, reflection) and competitive baseline PReP significantly underperform in this challenging setting. To address this, we propose Verbalization of Path(VoP), which explicitly grounds the agent's internal reasoning by probing city-scale cognitive maps (key landmarks and directions toward the destination) from the MLLM, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/