What Matters for Grocery Product Retrieval with Open Source Vision Language Models
For practitioners deploying zero-shot product retrieval, this work provides actionable guidelines on model selection and data quality, though its findings are incremental.
This work evaluates 190 open-source vision-language models on grocery product retrieval, finding that data quality (up to 16.6% gain) outweighs scale, efficient models like MobileCLIP-B can outperform larger ones, and a precision gap persists with 94.5% Recall@5 but a 17.5% drop at Recall@1.
Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. \textbf{(1) Data quality trumps scale.} Switching from raw web-scrapes to filtered datasets delivers up to 16.6\% accuracy gains, exceeding the benefit of doubling model parameters. \textbf{(2) Efficient models can win.} MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce \textit{semantic power density} ($ϕ$), an efficiency metric that penalizes sub-threshold accuracy. \textbf{(3) A precision gap persists.} State-of-the-art models achieve 94.5\% Recall@5 but suffer a 17.5\% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at \url{https://github.com/upeee/openmpr}.
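The precision gap in finding (3) can be illustrated with a minimal Recall@K sketch. It assumes retrieval by cosine similarity between query and gallery embeddings, which is the usual setup for contrastive models but is not confirmed here as the challenge's exact protocol. The 2-D vectors, the SKU labels, and the `recall_at_k` helper are hypothetical toy values chosen for illustration; they are not taken from the openmpr repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k):
    """Fraction of queries whose true label appears among the top-k gallery items."""
    hits = 0
    for q, label in zip(query_embs, query_labels):
        # Rank gallery indices by descending similarity to the query.
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda i: cosine(q, gallery_embs[i]),
            reverse=True,
        )
        top_k_labels = [gallery_labels[i] for i in ranked[:k]]
        hits += label in top_k_labels
    return hits / len(query_embs)


# Hypothetical toy gallery: two visually similar SKUs and one distinct product.
gallery_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
gallery_labels = ["cola_330ml", "cola_500ml", "oat_milk"]

# Hypothetical queries; the second one lands closer to the wrong cola SKU.
query_embs = [[0.9, 0.1], [0.95, 0.3], [0.1, 0.9]]
query_labels = ["cola_330ml", "cola_500ml", "oat_milk"]

r1 = recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k=1)
r2 = recall_at_k(query_embs, query_labels, gallery_embs, gallery_labels, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy case the second query retrieves the right category (cola) but ranks the wrong size first, so it counts as a miss at K=1 and a hit at K=2. This is the same pattern the abstract describes at larger scale: embeddings that separate categories well can still misorder near-identical SKUs.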