CLJul 10, 2024Code
Benchmarking LLMs for Environmental Review and PermittingRounak Meyur, Hung Phan, Koby Hayashi et al.
The National Environment Policy Act (NEPA) stands as a foundational piece of environmental legislation in the United States, requiring federal agencies to consider the environmental impacts of their proposed actions. The primary mechanism for achieving this is through the preparation of Environmental Assessments (EAs) and, for significant impacts, comprehensive Environmental Impact Statements (EIS). Large Language Model (LLM)s' effectiveness in specialized domains like NEPA remains untested for adoption in federal decision-making processes. To address this gap, we present NEPA Question and Answering Dataset (NEPAQuAD), the first comprehensive benchmark derived from EIS documents, along with a modular and transparent evaluation pipeline, MAPLE, to assess LLM performance on NEPA-focused regulatory reasoning tasks. Our benchmark leverages actual EIS documents to create diverse question types, ranging from factual to complex problem-solving ones. We built a modular and transparent evaluation pipeline to test both closed- and open-source models in zero-shot or context-driven QA benchmarks. We evaluate five state-of-the-art LLMs using our framework to assess both their prior knowledge and their ability to process NEPA-specific information. The experimental results reveal that all the models consistently achieve their highest performance when provided with the gold passage as context. While comparing the other context-driven approaches for each model, Retrieval Augmented Generation (RAG)-based approaches substantially outperform PDF document contexts, indicating that neither model is well suited for long-context question-answering tasks. Our analysis suggests that NEPA-focused regulatory reasoning tasks pose a significant challenge for LLMs, particularly in terms of understanding the complex semantics and effectively processing the lengthy regulatory documents.
CLAug 21, 2024
WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy DomainRounak Meyur, Hung Phan, Sridevi Wagle et al.
Wind energy project assessments present significant challenges for decision-makers, who must navigate and synthesize hundreds of pages of environmental and scientific documentation. These documents often span different regions and project scales, covering multiple domains of expertise. This process traditionally demands immense time and specialized knowledge from decision-makers. The advent of Large Language Models (LLM) and Retrieval Augmented Generation (RAG) approaches offer a transformative solution, enabling rapid, accurate cross-document information retrieval and synthesis. As the landscape of Natural Language Processing (NLP) and text generation continues to evolve, benchmarking becomes essential to evaluate and compare the performance of different RAG-based LLMs. In this paper, we present a comprehensive framework to generate a domain relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI (LLM) teaming. As a case study, we demonstrate the framework by introducing WeQA, a first-of-its-kind benchmark on the wind energy domain which comprises of multiple scientific documents/reports related to environmental aspects of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity level, providing a foundation for rigorous assessment of RAG-based systems in complex scientific domains and enabling researchers to identify areas for improvement in domain-specific applications.
CRSep 24, 2024
Cyber Knowledge Completion Using Large Language ModelsBraden K Webb, Sumit Purohit, Rounak Meyur
The integration of the Internet of Things (IoT) into Cyber-Physical Systems (CPSs) has expanded their cyber-attack surface, introducing new and sophisticated threats with potential to exploit emerging vulnerabilities. Assessing the risks of CPSs is increasingly difficult due to incomplete and outdated cybersecurity knowledge. This highlights the urgent need for better-informed risk assessments and mitigation strategies. While previous efforts have relied on rule-based natural language processing (NLP) tools to map vulnerabilities, weaknesses, and attack patterns, recent advancements in Large Language Models (LLMs) present a unique opportunity to enhance cyber-attack knowledge completion through improved reasoning, inference, and summarization capabilities. We apply embedding models to encapsulate information on attack patterns and adversarial techniques, generating mappings between them using vector embeddings. Additionally, we propose a Retrieval-Augmented Generation (RAG)-based approach that leverages pre-trained models to create structured mappings between different taxonomies of threat patterns. Further, we use a small hand-labeled dataset to compare the proposed RAG-based approach to a baseline standard binary classification model. Thus, the proposed approach provides a comprehensive framework to address the challenge of cyber-attack knowledge graph completion.
IRJun 16, 2025
Evaluating the Robustness of Dense Retrievers in Interdisciplinary DomainsSarthak Chaturvedi, Anurag Acharya, Rounak Meyur et al.
Evaluation benchmark characteristics may distort the true benefits of domain adaptation in retrieval models. This creates misleading assessments that influence deployment decisions in specialized domains. We show that two benchmarks with drastically different features such as topic diversity, boundary overlap, and semantic complexity can influence the perceived benefits of fine-tuning. Using environmental regulatory document retrieval as a case study, we fine-tune ColBERTv2 model on Environmental Impact Statements (EIS) from federal agencies. We evaluate these models across two benchmarks with different semantic structures. Our findings reveal that identical domain adaptation approaches show very different perceived benefits depending on evaluation methodology. On one benchmark, with clearly separated topic boundaries, domain adaptation shows small improvements (maximum 0.61% NDCG gain). However, on the other benchmark with overlapping semantic structures, the same models demonstrate large improvements (up to 2.22% NDCG gain), a 3.6-fold difference in the performance benefit. We compare these benchmarks through topic diversity metrics, finding that the higher-performing benchmark shows 11% higher average cosine distances between contexts and 23% lower silhouette scores, directly contributing to the observed performance difference. These results demonstrate that benchmark selection strongly determines assessments of retrieval system effectiveness in specialized domains. Evaluation frameworks with well-separated topics regularly underestimate domain adaptation benefits, while those with overlapping semantic boundaries reveal improvements that better reflect real-world regulatory document complexity. Our findings have important implications for developing and deploying AI systems for interdisciplinary domains that integrate multiple topics.
CRDec 20, 2023
Fortify Your Defenses: Strategic Budget Allocation to Enhance Power Grid CybersecurityRounak Meyur, Sumit Purohit, Braden K. Webb
The abundance of cyber-physical components in modern day power grid with their diverse hardware and software vulnerabilities has made it difficult to protect them from advanced persistent threats (APTs). An attack graph depicting the propagation of potential cyber-attack sequences from the initial access point to the end objective is vital to identify critical weaknesses of any cyber-physical system. A cyber security personnel can accordingly plan preventive mitigation measures for the identified weaknesses addressing the cyber-attack sequences. However, limitations on available cybersecurity budget restrict the choice of mitigation measures. We address this aspect through our framework, which solves the following problem: given potential cyber-attack sequences for a cyber-physical component in the power grid, find the optimal manner to allocate an available budget to implement necessary preventive mitigation measures. We formulate the problem as a mixed integer linear program (MILP) to identify the optimal budget partition and set of mitigation measures which minimize the vulnerability of cyber-physical components to potential attack sequences. We assume that the allocation of budget affects the efficacy of the mitigation measures. We show how altering the budget allocation for tasks such as asset management, cybersecurity infrastructure improvement, incident response planning and employee training affects the choice of the optimal set of preventive mitigation measures and modifies the associated cybersecurity risk. The proposed framework can be used by cyber policymakers and system owners to allocate optimal budgets for various tasks required to improve the overall security of a cyber-physical system.
SYNov 9, 2019
DeVLearn: A Deep Visual Learning Framework for Localizing Temporary Faults in Power SystemsShuchismita Biswas, Rounak Meyur, Virgilio Centeno
Frequently recurring transient faults in a transmission network may be indicative of impending permanent failures. Hence, determining their location is a critical task. This paper proposes a novel image embedding aided deep learning framework called DeVLearn for faulted line location using PMU measurements at generator buses. Inspired by breakthroughs in computer vision, DeVLearn represents measurements (one-dimensional time series data) as two-dimensional unthresholded Recurrent Plot (RP) images. These RP images preserve the temporal relationships present in the original time series and are used to train a deep Variational Auto-Encoder (VAE). The VAE learns the distribution of latent features in the images. Our results show that for faults on two different lines in the IEEE 68-bus network, DeVLearn is able to project PMU measurements into a two-dimensional space such that data for faults at different locations separate into well-defined clusters. This compressed representation may then be used with off-the-shelf classifiers for determining fault location. The efficacy of the proposed framework is demonstrated using local voltage magnitude measurements at two generator buses.