CV · AI · LG · Oct 13, 2025

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

arXiv:2510.11302v2
Originality: Synthesis-oriented
AI Analysis

This paper gives computer vision practitioners a decision framework for choosing an architecture based on inference volume and budget. The contribution is incremental: it applies existing methods to a new economic analysis.

The paper compares the cost-effectiveness of supervised object detection with that of zero-shot vision-language models. Supervised YOLO reaches 91.2% accuracy, but its accuracy advantage pays off only beyond 55 million inferences. Zero-shot Gemini and GPT-4 are less accurate (68.5% and 71.3%) yet far cheaper per correct detection at 100,000 inferences ($0.00050 and $0.00067, versus $0.143 for YOLO).

Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
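The arithmetic behind two of the abstract's figures can be checked with a minimal sketch. It assumes that cost-per-correct-detection is total cost divided by the number of correct detections (inferences × accuracy). That definition is an assumption here, as the abstract does not spell out the formula, and the function names are hypothetical.

```python
# Minimal sketch; the formula below is an assumed definition, not the paper's stated one.

def daily_volume(total_inferences: float, days: int = 365) -> float:
    """Average inferences per day needed to reach a total over `days` days."""
    return total_inferences / days


def cost_per_correct_detection(total_cost: float, inferences: int, accuracy: float) -> float:
    """Assumed definition: total cost / (inferences * accuracy)."""
    return total_cost / (inferences * accuracy)


def implied_total_cost(cpcd: float, inferences: int, accuracy: float) -> float:
    """Invert the assumed definition to recover the total cost it implies."""
    return cpcd * inferences * accuracy


if __name__ == "__main__":
    # Break-even volume from the abstract: 55 million inferences over one year.
    print(f"images/day: {daily_volume(55_000_000):,.0f}")

    # Total cost implied by YOLO's reported $0.143 at 100,000 inferences
    # and 91.2% accuracy, under the assumed definition.
    print(f"implied YOLO total cost: ${implied_total_cost(0.143, 100_000, 0.912):,.2f}")
```

Under this assumed definition, the implied YOLO total at 100,000 inferences exceeds the $10,800 annotation expense, which would be consistent with the fixed annotation cost dominating at low volume. The paper's own Total Cost of Ownership model may include terms this sketch omits.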

