Vatsal Gupta

CL
h-index13
6papers
124citations
Novelty28%
AI Score40

6 Papers

CVJul 15, 2024Code
NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Pranshu Pandya, Vatsal Gupta, Agney S Talwarr et al.

Cognitive textual and visual reasoning tasks, including puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. Due to extensive training on vast amounts of human-curated data, LLMs and VLMs excel in common-sense reasoning tasks, however still struggle with more complex reasoning that demands deeper cognitive understanding. We introduce NTSEBench, a new dataset designed to evaluate cognitive multi-modal reasoning and problem-solving skills of large models. The dataset contains 2728 multiple-choice questions, accompanied by a total of 4,642 images, categorized into 26 different types. These questions are drawn from the nationwide NTSE examination in India and feature a mix of visual and textual general aptitude challenges, designed to assess intelligence and critical thinking skills beyond mere rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open source and propriety models, we propose four distinct modeling strategies to handle different modalities -- text and images -- in the dataset instances.

CLNov 15, 2023
Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Vatsal Gupta, Pranshu Pandya, Tushar Kataria et al.

Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations, causing concerns about trust. To enhance trust, it is imperative to gain a comprehensive understanding of the model's failure modes and develop effective strategies to improve their performance. In this study, we introduce a methodology designed to examine how input perturbations affect language models across various scales, including pre-trained models and large language models (LLMs). Utilizing fine-tuning, we enhance the model's robustness to input perturbations. Additionally, we investigate whether exposure to one perturbation enhances or diminishes the model's performance with respect to other perturbations. To address robustness against multiple perturbations, we present three distinct fine-tuning strategies. Furthermore, we broaden the scope of our methodology to encompass large language models (LLMs) by leveraging a chain of thought (CoT) prompting approach augmented with exemplars. We employ the Tabular-NLI task to showcase how our proposed strategies adeptly train a robust model, enabling it to address diverse perturbations while maintaining accuracy on the original dataset. https://msin-infotabs.github.io/

CLJun 27, 2024Code
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

Shubhankar Singh, Purvi Chaurasia, Yerram Varun et al.

Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.

SEJan 31, 2024
ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation

Bhabesh Mali, Karthik Maddala, Vatsal Gupta et al.

System Verilog Assertion (SVA) formulation -- a critical yet complex task is a prerequisite in the Assertion Based Verification (ABV) process. Traditionally, SVA formulation involves expert-driven interpretation of specifications, which is time-consuming and prone to human error. Recently, LLM-informed automatic assertion generation is gaining interest. We designed a novel framework called ChIRAAG, based on OpenAI GPT4, to generate SVA from natural language specifications of a design. ChIRAAG constitutes the systematic breakdown of design specifications into a standardized format, further generating assertions from formatted specifications using LLM. Furthermore, we used few test cases to validate the LLM-generated assertions. Automatic feedback of log messages from the simulation tool to the LLM ensures that the framework can generate correct SVAs. In our experiments, only 27% of LLM-generated raw assertions had errors, which was rectified in few iterations based on the simulation log. Our results on OpenTitan designs show that LLMs can streamline and assist engineers in the assertion generation process, reshaping verification workflows.

AIMar 16
Prose2Policy (P2P): A Practical LLM Pipeline for Translating Natural-Language Access Policies into Executable Rego

Vatsal Gupta, Darshan Sreenivasamurthy

Prose2Policy (P2P) is a LLM-based practical tool that translates natural-language access control policies (NLACPs) into executable Rego code (the policy language of Open Policy Agent, OPA). It provides a modular, end-to-end pipeline that performs policy detection, component extraction, schema validation, linting, compilation, automatic test generation and execution. Prose2Policy is designed to bridge the gap between human-readable access requirements and machine-enforceable policy-as-code (PaC) while emphasizing deployment reliability and auditability. We evaluated Prose2Policy on the ACRE dataset and demonstrated a 95.3\% compile rate for accepted policies, with automated testing achieving a 82.2\% positive-test pass rate and a 98.9\% negative-test pass rate. These results indicate that Prose2Policy produces syntactically robust and behaviorally consistent Rego policies suitable for Zero Trust and compliance-driven environments.

LGAug 19, 2025
Formal Algorithms for Model Efficiency

Naman Tyagi, Srishti Das, Kunal et al.

We introduce the Knob-Meter-Rule (KMR) framework, a unified formalism for representing and reasoning about model efficiency techniques in deep learning. By abstracting diverse methods, including pruning, quantization, knowledge distillation, and parameter-efficient architectures, into a consistent set of controllable knobs, deterministic rules, and measurable meters, KMR provides a mathematically precise and modular perspective on efficiency optimization. The framework enables systematic composition of multiple techniques, flexible policy-driven application, and iterative budgeted optimization through the Budgeted-KMR algorithm. We demonstrate how well-known efficiency methods can be instantiated as KMR triples and present concise algorithmic templates for each. The framework highlights underlying relationships between methods, facilitates hybrid pipelines, and lays the foundation for future research in automated policy learning, dynamic adaptation, and theoretical analysis of cost-quality trade-offs. Overall, KMR offers both a conceptual and practical tool for unifying and advancing model efficiency research.