CV | Mar 4

CLIP-Guided Multi-Task Regression for Multi-View Plant Phenotyping

arXiv:2603.04091v1 | 0.35 | h-index: 99 | Has Code
AI Analysis | 80

This work offers a plant phenotyping method that is more accurate than the GroMo baseline and more robust to missing views, which matters for agricultural researchers studying plant growth dynamics.

This paper addresses the challenge of predicting plant age and leaf count from multi-view plant imagery, which suffers from viewpoint redundancy and appearance changes. The authors propose a CLIP-guided multi-task regression model that jointly predicts both metrics, achieving a 49.5% reduction in mean age MAE (from 7.74 to 3.91) and a 44.2% reduction in mean leaf-count MAE (from 5.52 to 3.08) on the GroMo25 benchmark.
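The reported percentages follow directly from the MAE values quoted above. As a quick check, the relative reduction is (baseline - new) / baseline; the function name below is illustrative and not taken from the paper's code.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of an error metric relative to its baseline."""
    return 100.0 * (baseline - improved) / baseline


# MAE values as quoted for the GroMo25 benchmark.
age_reduction = relative_reduction(7.74, 3.91)
leaf_reduction = relative_reduction(5.52, 3.08)

print(f"age MAE reduction: {age_reduction:.1f}%")         # 49.5%
print(f"leaf-count MAE reduction: {leaf_reduction:.1f}%")  # 44.2%
```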

Modeling plant growth dynamics plays a central role in modern agricultural research. However, learning robust predictors from multi-view plant imagery remains challenging due to strong viewpoint redundancy and viewpoint-dependent appearance changes. We propose a level-aware vision-language framework that jointly predicts plant age and leaf count using a single multi-task model built on CLIP embeddings. Our method aggregates rotational views into angle-invariant representations and conditions visual features on lightweight text priors encoding viewpoint level, giving stable predictions under incomplete or unordered inputs. On the GroMo25 benchmark, our approach reduces mean age MAE from 7.74 to 3.91 and mean leaf-count MAE from 5.52 to 3.08 compared to the GroMo baseline, corresponding to improvements of 49.5% and 44.2%, respectively. The unified formulation simplifies the pipeline by replacing the conventional dual-model setup while improving robustness to missing views. The models and code are available at: https://github.com/SimonWarmers/CLIP-MVP
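To make the idea of angle-invariant aggregation with a shared multi-task head concrete, here is a minimal sketch. It is not the authors' implementation: mean pooling, additive conditioning on a level embedding, and linear heads are all assumptions chosen for illustration, and every name (`pool_views`, `condition_on_level`, `predict`) is hypothetical. The embeddings are toy vectors standing in for CLIP features.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def pool_views(view_embeddings: Sequence[Optional[Vector]]) -> Vector:
    """Average the available view embeddings; None marks a missing view.

    Mean pooling (an assumption here) ignores view order, so the result
    does not depend on which rotation angle comes first.
    """
    present = [v for v in view_embeddings if v is not None]
    if not present:
        raise ValueError("at least one view is required")
    dim = len(present[0])
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]


def condition_on_level(pooled: Vector, level_embedding: Vector) -> Vector:
    """Add a text-prior embedding for the viewpoint level (assumed additive)."""
    return [p + t for p, t in zip(pooled, level_embedding)]


def predict(features: Vector, age_w: Vector, leaf_w: Vector) -> Tuple[float, float]:
    """Two linear heads on shared features: one model, two regression outputs."""
    age = sum(f * w for f, w in zip(features, age_w))
    leaf = sum(f * w for f, w in zip(features, leaf_w))
    return age, leaf


# Toy stand-ins for per-view image embeddings.
views = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
shuffled = [views[2], views[0], views[1]]
with_missing = [views[0], None, views[2]]

pooled = pool_views(views)
features = condition_on_level(pooled, [0.5, -0.5])
age, leaf = predict(features, age_w=[1.0, 0.0], leaf_w=[0.0, 1.0])
print(pooled, pool_views(shuffled), pool_views(with_missing))
print(age, leaf)
```

In this toy setup, reordering the views leaves the pooled vector unchanged, and a missing view simply drops out of the average, which is the kind of behavior the abstract describes for incomplete or unordered inputs.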

