CV | Nov 11, 2024

Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs

arXiv:2411.06702v1 | 2 citations | h-index: 10
Originality: Synthesis-oriented
AI Analysis

This work addresses efficient object tracking for agricultural applications. It is an incremental contribution: it adapts existing methods (Grounding DINO, YOLOv8, MASA, and BoT-SORT) to a specific domain rather than proposing a new tracking algorithm.

The paper tackles sweet pepper tracking in agricultural videos by developing Track Any Peppers (TAP), a weakly supervised ensemble method that uses vision-language models for pseudo-labeling and achieves a HOTA score of 80.4% and MOTA of 66.1%.
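For readers unfamiliar with the metrics quoted above, MOTA, precision, and recall are all derived from the same error counts (misses, false positives, identity switches). The sketch below uses the standard CLEAR MOT definitions; the function name and the counts are hypothetical, chosen only for illustration, and are not taken from the paper. HOTA is omitted because it requires per-threshold association scores rather than simple counts.

```python
def clear_mot_metrics(num_gt, false_negatives, false_positives, id_switches):
    """Return MOTA, precision, and recall from CLEAR MOT error counts.

    num_gt is the total number of ground-truth boxes over all frames.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    true_positives = num_gt - false_negatives
    # MOTA penalises misses, false alarms, and identity switches together.
    mota = 1.0 - (false_negatives + false_positives + id_switches) / num_gt
    detected = true_positives + false_positives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / num_gt
    return {"mota": mota, "precision": precision, "recall": recall}


# Hypothetical counts for illustration only (not from the paper).
example = clear_mot_metrics(
    num_gt=200, false_negatives=50, false_positives=10, id_switches=2
)
```

Because MOTA subtracts every error type from 1, it can never exceed recall, which is why a tracker with high precision but moderate recall still reports a noticeably lower MOTA.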

For the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP), a weakly supervised ensemble technique for sweet pepper tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models such as Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined when necessary, are used to train a YOLOv8 segmentation network. To enhance detection accuracy under challenging conditions, we incorporate pre-processing techniques such as relighting adjustments and apply depth-based filtering after inference. For object tracking, we integrate the Matching by Segment Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, demonstrating effective tracking of sweet peppers without extensive manual effort. This work highlights the potential of foundation models for efficient and accurate object detection and tracking in agricultural settings.
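The abstract does not say how the depth-based filtering step works. One plausible reading, assumed here, is that detections lying too far from the camera (for example, peppers in rows behind the target row) are discarded after inference. The sketch below illustrates that idea in plain Python; the function name, the detection dictionary layout, the `max_depth_m` threshold, and the convention that a depth of zero means "no reading" are all assumptions for illustration, not the authors' implementation.

```python
from statistics import median


def filter_by_depth(detections, depth_map, max_depth_m):
    """Keep detections whose median depth inside the box is within max_depth_m.

    detections: list of dicts with a "box" entry (x1, y1, x2, y2), in pixels,
                with x2 and y2 exclusive.
    depth_map:  2D list of depths in metres, indexed as depth_map[y][x].
    """
    kept = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        values = [
            depth_map[y][x]
            for y in range(y1, y2)
            for x in range(x1, x2)
            if depth_map[y][x] > 0  # assumed: zero marks a missing reading
        ]
        # The median is robust to a few background pixels inside the box.
        if values and median(values) <= max_depth_m:
            kept.append(det)
    return kept


# Toy 3x4 depth map: left half is a near plant, right half is background.
depth_map = [
    [0.6, 0.6, 2.5, 2.5],
    [0.6, 0.7, 2.5, 2.6],
    [0.6, 0.6, 2.4, 2.5],
]
detections = [
    {"id": "near", "box": (0, 0, 2, 3)},
    {"id": "far", "box": (2, 0, 4, 3)},
]
kept = filter_by_depth(detections, depth_map, max_depth_m=1.0)
```

In this toy example only the "near" detection survives. A real pipeline would more likely sample depth inside the predicted segmentation mask than inside the bounding box, since the YOLOv8 segmentation network provides one.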

