Subin Varghese

CV
h-index20
4papers
6citations
Novelty50%
AI Score39

4 Papers

CLNov 6, 2025
RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Joshua Gao, Quoc Huy Pham, Subin Varghese et al.

Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances and other works utilize LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics-Answer Correctness and Answerability-using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis on the most common low Answer Correctness reasons in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our Github.

CVDec 10, 2024Code
ViewDelta: Scaling Scene Change Detection through Text-Conditioning

Subin Varghese, Joshua Gao, Vedhus Hoskere

We introduce a generalized framework for Scene Change Detection (SCD) that addresses the core ambiguity of distinguishing "relevant" from "nuisance" changes, enabling effective joint training of a single model across diverse domains and applications. Existing methods struggle to generalize due to differences in dataset labeling, where changes such as vegetation growth or lane marking alterations may be labeled as relevant in one dataset and irrelevant in another. To resolve this ambiguity, we propose ViewDelta, a text conditioned change detection framework that uses natural language prompts to define relevant changes precisely, such as a single attribute, a specific set of classes, or all observable differences. To facilitate training in this paradigm, we release the Conditional Change Segmentation dataset (CSeg), the first large-scale synthetic dataset for text conditioned SCD, consisting of over 500,000 image pairs with more than 300,000 unique textual prompts describing relevant changes. Experiments demonstrate that a single ViewDelta model trained jointly on CSeg, SYSU-CD, PSCD, VL-CMU-CD, and their unaligned variants achieves performance competitive with or superior to dataset specific models, highlighting text conditioning as a powerful approach for generalizable SCD. Our code and dataset are available at https://joshuakgao.github.io/viewdelta/.

CVNov 16, 2025
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Subin Varghese, Joshua Gao, Asad Ur Rahman et al.

Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.

CVJun 26, 2024
View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis

Subin Varghese, Vedhus Hoskere

The built environment, encompassing critical infrastructure such as bridges and buildings, requires diligent monitoring of unexpected anomalies or deviations from a normal state in captured imagery. Anomaly detection methods could aid in automating this task; however, deploying anomaly detection effectively in such environments presents significant challenges that have not been evaluated before. These challenges include camera viewpoints that vary, the presence of multiple objects within a scene, and the absence of labeled anomaly data for training. To address these comprehensively, we introduce and formalize Scene Anomaly Detection (Scene AD) as the task of unsupervised, pixel-wise anomaly localization under these specific real-world conditions. Evaluating progress in Scene AD required the development of ToyCity, the first multi-object, multi-view real-image dataset, for unsupervised anomaly detection. Our initial evaluations using ToyCity revealed that established anomaly detection baselines struggle to achieve robust pixel-level localization. To address this, two data augmentation strategies were created to generate additional synthetic images of non-anomalous regions to enhance generalizability. However, the addition of these synthetic images alone only provided minor improvements. Thus, OmniAD, a refinement of the Reverse Distillation methodology, was created to establish a stronger baseline. Our experiments demonstrate that OmniAD, when used with augmented views, yields a 64.33\% increase in pixel-wise \(F_1\) score over Reverse Distillation with no augmentation. Collectively, this work offers the Scene AD task definition, the ToyCity benchmark, the view synthesis augmentation approaches, and the OmniAD method. Project Page: https://drags99.github.io/OmniAD/