CVAug 2, 2024Code
MIS-ME: A Multi-modal Framework for Soil Moisture EstimationMohammed Rakib, Adil Aman Mohammed, D. Cole Diggins et al.
Soil moisture estimation is an important task to enable precision agriculture in creating optimal plans for irrigation, fertilization, and harvest. It is common to utilize statistical and machine learning models to estimate soil moisture from traditional data sources such as weather forecasts, soil properties, and crop properties. However, there is a growing interest in utilizing aerial and geospatial imagery to estimate soil moisture. Although these images capture high-resolution crop details, they are expensive to curate and challenging to interpret. Imagine, an AI-enhanced software tool that predicts soil moisture using visual cues captured by smartphones and statistical data given by weather forecasts. This work is a first step towards that goal of developing a multi-modal approach for soil moisture estimation. In particular, we curate a dataset consisting of real-world images taken from ground stations and their corresponding weather data. We also propose MIS-ME - Meteorological & Image based Soil Moisture Estimator, a multi-modal framework for soil moisture estimation. Our extensive analysis shows that MIS-ME achieves a MAPE of 10.14%, outperforming traditional unimodal approaches with a reduction of 3.25% in MAPE for meteorological data and 2.15% in MAPE for image data, highlighting the effectiveness of tailored multi-modal approaches. Our code and dataset will be available at https://github.com/OSU-Complex-Systems/MIS-ME.git.
96.3CYMar 22Code
Deliberative multi-agent large language models improve clinical reasoning in ophthalmologyEhsan Misaghi, Sean T Berkowitz, Bing Yu Chen et al.
Large language models (LLMs) show potential for ophthalmic clinical reasoning, yet individual models risk introducing harm. We evaluated whether multi-agent LLM deliberative councils improve diagnostic performance and mitigate harm compared to individual LLMs. In a comparative cross-sectional study, we assessed 12 individual LLMs and three multi-agent councils on 100 ophthalmology clinical vignettes. Each council comprised four models assembled by type: proprietary flagship, proprietary fast, and open-source. Models independently answered a vignette, anonymously ranked one another's responses, and a designated chair synthesized all responses and peer reviews into a final answer. Councils consistently outperformed pooled individual models across all three tiers. Accuracy improved for proprietary flagship (95.0% vs 90.8%; risk difference [RD]: 4.25 [95% CI: 0.45, 8.05]), proprietary fast (96.0% vs 86.5%; RD: 9.50 [5.31, 13.59]), and open-source councils (91.0% vs 83.2%; RD: 7.75 [4.17, 11.33]). Harm rates declined for proprietary flagship (10.0% vs 22.5%; RD: -12.50 [-16.86, -8.14]), proprietary fast (16.0% vs 31.8%; RD: -15.75 [-21.49, -10.01]), and open-source councils (22.0% vs 38.5%; RD: -16.50 [-22.27, -10.73]). Coverage analysis revealed net positive gains for accuracy (ÎCoverage: 4.4-9.8 percentage points) and safety (ÎCoverage: 13.6-20.6), indicating councils recovered correct diagnoses and averted harm. Councils elevated correct diagnoses to higher rank positions; and produced more complete differentials and management plans (all P<.05). Harmful council responses showed reduced combined commission-and-omission errors and tended to be less severe. Structured deliberation via multi-agent LLM councils may enhance the reliability of LLM-assisted ophthalmic clinical reasoning.
18.8CVMar 18
Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regimeHaiyu Yang, Sumit Sharma, Enhong Liu et al.
Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
LGMar 14, 2023
Systematic design space exploration by learning the explored space using Machine LearningAvinash Kumar, Anish Kumar, Sumit Sharma et al.
Current practice in parameter space exploration in euclidean space is dominated by randomized sampling or design of experiment methods. The biggest issue with these methods is not keeping track of what part of parameter space has been explored and what has not. In this context, we utilize the geometric learning of explored data space using modern machine learning methods to keep track of already explored regions and samples from the regions that are unexplored. For this purpose, we use a modified version of a robust random-cut forest along with other heuristic-based approaches. We demonstrate our method and its progression in two-dimensional Euclidean space but it can be extended to any dimension since the underlying method is generic.
CVSep 15, 2025Code
A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig DatasetHaiyu Yang, Enhong Liu, Jennifer Sun et al.
Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware tracking and segmentation, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with 93.3% identity preservation score and 89.3% object detection precision. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.
CLAug 13, 2025
Performance of GPT-5 Frontier Models in Ophthalmology Question AnsweringFares Antaki, David Mikhail, Daniel Milad et al.
Large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may improve performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. We evaluated 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the American Academy of Ophthalmology Basic Clinical Science Course (BCSC) dataset. The primary outcome was multiple-choice accuracy; secondary outcomes included head-to-head ranking via a Bradley-Terry model, rationale quality assessment using a reference-anchored, pairwise LLM-as-a-judge framework, and analysis of accuracy-cost trade-offs using token-based cost estimates. GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985), outperforming all GPT-5-nano variants (P < .001), o1-high (P = .04), and GPT-4o (P < .001), but not o3-high (0.958; 95% CI, 0.931-0.981). GPT-5-high ranked first in both accuracy (1.66x stronger than o3-high) and rationale quality (1.11x stronger than o3-high). Cost-accuracy analysis identified several GPT-5 configurations on the Pareto frontier, with GPT-5-mini-low offering the most favorable low-cost, high-performance balance. These results benchmark GPT-5 on a high-quality ophthalmology dataset, demonstrate the influence of reasoning effort on accuracy, and introduce an autograder framework for scalable evaluation of LLM-generated answers against reference standards in ophthalmology.
CVDec 11, 2024
GN-FR:Generalizable Neural Radiance Fields for Flare RemovalGopi Raju Matta, Rahul Siddartha, Rongali Simhachala Venkata Girish et al.
Flare, an optical phenomenon resulting from unwanted scattering and reflections within a lens system, presents a significant challenge in imaging. The diverse patterns of flares, such as halos, streaks, color bleeding, and haze, complicate the flare removal process. Existing traditional and learning-based methods have exhibited limited efficacy due to their reliance on single-image approaches, where flare removal is highly ill-posed. We address this by framing flare removal as a multi-view image problem, taking advantage of the view-dependent nature of flare artifacts. This approach leverages information from neighboring views to recover details obscured by flare in individual images. Our proposed framework, GN-FR (Generalizable Neural Radiance Fields for Flare Removal), can render flare-free views from a sparse set of input images affected by lens flare and generalizes across different scenes in an unsupervised manner. GN-FR incorporates several modules within the Generalizable NeRF Transformer (GNT) framework: Flare-occupancy Mask Generation (FMG), View Sampler (VS), and Point Sampler (PS). To overcome the impracticality of capturing both flare-corrupted and flare-free data, we introduce a masking loss function that utilizes mask information in an unsupervised setting. Additionally, we present a 3D multi-view flare dataset, comprising 17 real flare scenes with 782 images, 80 real flare patterns, and their corresponding annotated flare-occupancy masks. To our knowledge, this is the first work to address flare removal within a Neural Radiance Fields (NeRF) framework.
IVNov 1, 2021
Simulating Realistic MRI variations to Improve Deep Learning model and visual explanations using GradCAMMuhammad Ilyas Patel, Shrey Singla, Razeem Ahmad Ali Mattathodi et al.
In the medical field, landmark detection in MRI plays an important role in reducing medical technician efforts in tasks like scan planning, image registration, etc. First, 88 landmarks spread across the brain anatomy in the three respective views -- sagittal, coronal, and axial are manually annotated, later guidelines from the expert clinical technicians are taken sub-anatomy-wise, for better localization of the existing landmarks, in order to identify and locate the important atlas landmarks even in oblique scans. To overcome limited data availability, we implement realistic data augmentation to generate synthetic 3D volumetric data. We use a modified HighRes3DNet model for solving brain MRI volumetric landmark detection problem. In order to visually explain our trained model on unseen data, and discern a stronger model from a weaker model, we implement Gradient-weighted Class Activation Mapping (Grad-CAM) which produces a coarse localization map highlighting the regions the model is focusing. Our experiments show that the proposed method shows favorable results, and the overall pipeline can be extended to a variable number of landmarks and other anatomies.