Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
This work addresses the need for scalable and adaptable vehicle recognition in smart-city applications, though its contribution is incremental because it builds on existing vision language models and RAG techniques.
The paper tackles vehicle make and model recognition (VMMR) in intelligent transportation systems, where existing methods struggle with newly released models. It proposes a zero-shot pipeline that combines vision language models with Retrieval-Augmented Generation (RAG) and reports a recognition improvement of nearly 20% over the CLIP baseline.
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
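The retrieve-then-prompt stage described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the database entries, the function names (`retrieve`, `build_prompt`), and the prompt wording are hypothetical, and bag-of-words cosine similarity stands in for whatever text-matching method the paper actually uses. The VLM and LM calls are not implemented; a hand-written description string plays the role of the VLM output, and the resulting prompt is what would be passed to the LM.

```python
import math
import re
from collections import Counter

# Hypothetical text database: make/model -> textual features (illustrative only).
VEHICLE_DB = {
    "Toyota Corolla": "compact sedan with slim headlights, narrow upper grille, "
                      "and a large trapezoidal lower intake",
    "Ford F-150": "full-size pickup truck with tall boxy hood, c-shaped headlights, "
                  "and wide rectangular grille",
    "Volkswagen Golf": "compact hatchback with horizontal thin grille, rounded roofline, "
                       "and short rear overhang",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(description, db, k=2):
    """Return the k database entries whose text best matches the description."""
    query = Counter(tokenize(description))
    scored = [(cosine(query, Counter(tokenize(text))), name) for name, text in db.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def build_prompt(description, candidates, db):
    """Combine the image description with the retrieved entries into an LM prompt."""
    lines = ["Observed vehicle description:", description, "", "Candidate vehicles:"]
    lines += [f"- {name}: {db[name]}" for name in candidates]
    lines += ["", "Which candidate best matches the description? Answer with make and model."]
    return "\n".join(lines)


# Stand-in for the VLM output; a real pipeline would generate this from an image.
description = "boxy pickup truck with wide rectangular grille and c-shaped headlights"
candidates = retrieve(description, VEHICLE_DB, k=2)
prompt = build_prompt(description, candidates, VEHICLE_DB)
```

Under these assumptions, supporting a newly released vehicle only requires adding one more text entry to `VEHICLE_DB`, which mirrors the abstract's point that updates need no retraining.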