Tirtho Roy

AI
3papers
24citations
Novelty53%
AI Score47

3 Papers

LGJul 29, 2024Code
Leveraging Vision Language Models for Specialized Agricultural Tasks

Muhammad Arbab Arshad, Talukder Zaki Jubery, Tirtho Roy et al.

As Vision Language Models (VLMs) become increasingly accessible to farmers and agricultural experts, there is a growing need to evaluate their potential in specialized tasks. We present AgEval, a comprehensive benchmark for assessing VLMs' capabilities in plant stress phenotyping, offering a solution to the challenge of limited annotated data in agriculture. Our study explores how general-purpose VLMs can be leveraged for domain-specific tasks with only a few annotated examples, providing insights into their behavior and adaptability. AgEval encompasses 12 diverse plant stress phenotyping tasks, evaluating zero-shot and few-shot in-context learning performance of state-of-the-art models including Claude, GPT, Gemini, and LLaVA. Our results demonstrate VLMs' rapid adaptability to specialized tasks, with the best-performing model showing an increase in F1 scores from 46.24% to 73.37% in 8-shot identification. To quantify performance disparities across classes, we introduce metrics such as the coefficient of variation (CV), revealing that VLMs' training impacts classes differently, with CV ranging from 26.02% to 58.03%. We also find that strategic example selection enhances model reliability, with exact category examples improving F1 scores by 15.38% on average. AgEval establishes a framework for assessing VLMs in agricultural applications, offering valuable benchmarks for future evaluations. Our findings suggest that VLMs, with minimal few-shot examples, show promise as a viable alternative to traditional specialized models in plant stress phenotyping, while also highlighting areas for further refinement. Results and benchmark details are available at: https://github.com/arbab-ml/AgEval

MAMay 10
SAGE: Scalable Agentic Grounded Evaluation for Crop Disease Diagnosis

Muhammad Arbab Arshad, Tirtho Roy, Yanben Shen et al.

Plant disease diagnosis is critical for food security, yet training disease-recognition models that generalize across crops, pathogens, and field conditions remains challenging because labeled disease images are far less abundant and standardized than data for other biotic stresses such as insects or weeds. Frontier vision-language models offer new opportunities through improved visual reasoning, but they still struggle with fine-grained disease identification due to the lack of structured, crop-specific symptom knowledge. To address this gap, we curate the largest plant disease image--symptom dataset to date, covering 335 crops, 1{,}251 disease classes, and approximately 839K images, designed to support training-free, agentic disease prediction. A scalable automated pipeline generates source-grounded symptom descriptions in which each claim is linked to a verbatim web quote; domain experts validate sampled crops and reconcile disease-name variants across sources. As a baseline, we introduce an autonomous visual reasoning agent that identifies anatomical context, narrows candidate diseases using symptom knowledge, sequentially compares reference images, and produces a fully explainable reasoning trace. Incorporating symptom knowledge improves accuracy by 16.2 percentage points on average at the full reference budget, with consistent gains across all four evaluation crops. Because the framework only requires crop-specific reference images and symptom knowledge, it can be extended to new crops without retraining, while the agentic baseline can directly benefit from future improvements in foundation model capabilities. Dataset and code are available at:https://sage-dataset.github.io/.

AIJan 19
Improving the Safety and Trustworthiness of Medical AI via Multi-Agent Evaluation Loops

Zainab Ghafoor, Md Shafiqul Islam, Koushik Howlader et al.

Large Language Models (LLMs) are increasingly applied in healthcare, yet ensuring their ethical integrity and safety compliance remains a major barrier to clinical deployment. This work introduces a multi-agent refinement framework designed to enhance the safety and reliability of medical LLMs through structured, iterative alignment. Our system combines two generative models - DeepSeek R1 and Med-PaLM - with two evaluation agents, LLaMA 3.1 and Phi-4, which assess responses using the American Medical Association's (AMA) Principles of Medical Ethics and a five-tier Safety Risk Assessment (SRA-5) protocol. We evaluate performance across 900 clinically diverse queries spanning nine ethical domains, measuring convergence efficiency, ethical violation reduction, and domain-specific risk behavior. Results demonstrate that DeepSeek R1 achieves faster convergence (mean 2.34 vs. 2.67 iterations), while Med-PaLM shows superior handling of privacy-sensitive scenarios. The iterative multi-agent loop achieved an 89% reduction in ethical violations and a 92% risk downgrade rate, underscoring the effectiveness of our approach. This study presents a scalable, regulator-aligned, and cost-efficient paradigm for governing medical AI safety.