When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
The paper gives computer vision practitioners a decision framework for choosing a detection architecture based on inference volume and budget. Its contribution is incremental: it applies existing methods to a new economic analysis.
The paper compares the cost-effectiveness of supervised object detection with that of zero-shot vision-language models. Supervised YOLO achieves 91.2% accuracy, but its accuracy advantage only pays off beyond 55 million inferences. The zero-shot models Gemini and GPT-4 are less accurate (68.5% and 71.3%) but have a far lower cost per correct detection at 100,000 inferences ($0.00050 and $0.00067, versus $0.143 for YOLO).
Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
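The arithmetic behind two of the quoted figures can be checked with a minimal sketch. It assumes that cost per correct detection is total spend divided by the expected number of correct detections; the abstract does not spell out the full Total Cost of Ownership model, so the function names and this formula are illustrative assumptions, not the authors' implementation.

```python
def cost_per_correct_detection(total_cost_usd, n_inferences, accuracy):
    """Assumed definition: total spend / expected number of correct detections."""
    return total_cost_usd / (n_inferences * accuracy)


def daily_volume(total_inferences, days=365):
    """Average number of images per day needed to reach a total over `days`."""
    return total_inferences / days


# Figures quoted in the abstract.
ANNOTATION_COST_USD = 10_800        # annotation expense, 100-category system
YOLO_ACCURACY = 0.912               # supervised YOLO on standard categories
BREAK_EVEN_INFERENCES = 55_000_000  # reported break-even threshold

# Break-even threshold expressed as a daily rate over one year.
images_per_day = daily_volume(BREAK_EVEN_INFERENCES)

# Annotation cost alone, spread over the correct detections in 100,000 inferences.
annotation_only = cost_per_correct_detection(
    ANNOTATION_COST_USD, 100_000, YOLO_ACCURACY
)

print(f"{images_per_day:,.0f} images/day")
print(f"${annotation_only:.4f} per correct detection (annotation only)")
```

The daily rate comes out just under 151,000 images/day, matching the abstract. Annotation alone accounts for roughly $0.118 of the reported $0.143 per correct detection; the remainder presumably reflects other cost components that the abstract does not itemize.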