Tian Cao

CV
h-index20
3papers
40citations
Novelty57%
AI Score37

3 Papers

CVOct 14, 2025
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Kartik Narayan, Yang Xu, Tian Cao et al.

Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.

CVNov 18, 2019
Large Scale Open-Set Deep Logo Detection

Muhammet Bastan, Hao-Yu Wu, Tian Cao et al.

We present an open-set logo detection (OSLD) system, which can detect (localize and recognize) any number of unseen logo classes without re-training; it only requires a small set of canonical logo images for each logo class. We achieve this using a two-stage approach: (1) Generic logo detection to detect candidate logo regions in an image. (2) Logo matching for matching the detected logo regions to a set of canonical logo images to recognize them. We constructed an open-set logo detection dataset with 12.1k logo classes and released it for research purposes.We demonstrate the effectiveness of OSLD on our dataset and on the standard Flickr-32 logo dataset, outperforming the state-of-the-art open-set and closed-set logo detection methods by a large margin. OSLD is scalable to millions of logo classes.

CVNov 12, 2014
Multi-modal Image Registration for Correlative Microscopy

Tian Cao, Christopher Zach, Shannon Modla et al.

Correlative microscopy is a methodology combining the functionality of light microscopy with the high resolution of electron microscopy and other microscopy technologies. Image registration for correlative microscopy is quite challenging because it is a multi-modal, multi-scale and multi-dimensional registration problem. In this report, I introduce two methods of image registration for correlative microscopy. The first method is based on fiducials (beads). I generate landmarks from the fiducials and compute the similarity transformation matrix based on three pairs of nearest corresponding landmarks. A least-squares matching process is applied afterwards to further refine the registration. The second method is inspired by the image analogies approach. I introduce the sparse representation model into image analogies. I first train representative image patches (dictionaries) for pre-registered datasets from two different modalities, and then I use the sparse coding technique to transfer a given image to a predicted image from one modality to another based on the learned dictionaries. The final image registration is between the predicted image and the original image corresponding to the given image in the different modality. The method transforms a multi-modal registration problem to a mono-modal one. I test my approaches on Transmission Electron Microscopy (TEM) and confocal microscopy images. Experimental results of the methods are also shown in this report.