Arnav Gupta

CV · Apr 3, 2023

Model Explainability in Physiological and Healthcare-based Neural Networks

Rohit Sharma, Abhinav Gupta, Arnav Gupta et al.

The estimation and monitoring of SpO2 are crucial for assessing lung function and treating chronic pulmonary diseases. The COVID-19 pandemic has highlighted the importance of early detection of changes in SpO2, particularly in asymptomatic patients with clinical deterioration. However, conventional SpO2 measurement methods rely on contact-based sensing, which carries a risk of cross-contamination and of complications in patients with impaired limb perfusion. Additionally, pulse oximeters may not be available in marginalized communities and underdeveloped countries. To address these limitations and provide a more comfortable and unobtrusive way to monitor SpO2, recent studies have investigated SpO2 measurement using videos. However, contactless SpO2 measurement with cameras, particularly smartphone cameras, is challenging because the physiological signals are weaker and smartphone camera sensors have lower optical selectivity. Our proposed system includes three main steps: 1) extraction of the region of interest (ROI), which includes the palm and back of the hand, from the smartphone-captured videos; 2) spatial averaging of the ROI to produce R, G, and B time series; and 3) feeding the time series into an optophysiology-inspired CNN for SpO2 estimation. The method can provide a more efficient and accurate way to monitor SpO2 using videos captured from consumer-grade smartphones, which can be especially useful in telehealth and health screening settings.
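Step 2 of the pipeline above (spatial averaging of the ROI into R, G, and B time series) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `roi_rgb_time_series`, the array layout (frames, height, width, channels), and the use of a single static boolean ROI mask for every frame are all assumptions made for illustration. The abstract does not say how the ROI is represented or whether it changes from frame to frame.

```python
import numpy as np


def roi_rgb_time_series(frames, roi_mask):
    """Average the pixels inside an ROI for each frame and color channel.

    Hypothetical helper, assuming:
      frames   -- array of shape (T, H, W, 3), channels ordered R, G, B
      roi_mask -- boolean array of shape (H, W), True inside the ROI,
                  reused unchanged for every frame
    Returns an array of shape (T, 3): one R, G, B mean per frame.
    """
    frames = np.asarray(frames, dtype=np.float64)
    roi_mask = np.asarray(roi_mask, dtype=bool)

    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("frames must have shape (T, H, W, 3)")
    if roi_mask.shape != frames.shape[1:3]:
        raise ValueError("roi_mask must have shape (H, W)")
    if not roi_mask.any():
        raise ValueError("roi_mask selects no pixels")

    # Boolean indexing over the two spatial axes gives (T, n_pixels, 3);
    # averaging over the pixel axis leaves one value per frame and channel.
    return frames[:, roi_mask].mean(axis=1)
```

The resulting `(T, 3)` array is the kind of input that step 3 would pass to the CNN; any normalization or windowing applied before that stage is not described in the abstract and is left out here.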

CV · Dec 24, 2025

Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation

Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi et al.

Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
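The last two stages described above (grouping features into factors, then fitting a regression-based evaluator and checking correlation with engagement) can be sketched as follows. This is only an illustration under stated assumptions: the cluster labels are taken as given rather than learned, factors are formed by simple averaging, the regressor is ordinary least squares, and agreement is measured with Pearson correlation. The function names are hypothetical, and the abstract does not specify which clustering method, regression model, or correlation measure the authors use.

```python
import numpy as np


def factor_scores(features, labels, n_factors):
    """Collapse feature columns into factor scores by averaging each cluster.

    features -- array (n_videos, n_features) of VLM-derived feature values
    labels   -- array (n_features,), cluster index of each feature (assumed given)
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    columns = []
    for k in range(n_factors):
        members = labels == k
        if not members.any():
            raise ValueError(f"cluster {k} has no features")
        columns.append(features[:, members].mean(axis=1))
    return np.stack(columns, axis=1)


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [weights..., bias]."""
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(X, coef):
    """Apply the fitted weights and bias to factor scores."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def pearson(a, b):
    """Pearson correlation between predicted and observed engagement."""
    return float(np.corrcoef(a, b)[0, 1])
```

With a linear model, the fitted weights can be read directly as per-factor contributions to the predicted engagement, which is one simple way a feature-based evaluator can stay interpretable.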

Arnav Gupta

2 Papers