Noble Saji Mathews

h-index: 6

Papers: 8

Citations: 82

Novelty: 34%

AI Score: 26

Ranked #159,544 of 194,257 authors (top 82%) · #1,877 in SE (top 62%)

8 Papers

18.8 · SE · Jul 8

What Makes a Good Bug Report for an AI Agent?

Lara Khatib, Noble Saji Mathews, Meiyappan Nagappan et al.

Automated program repair (APR) agents are transitioning from research benchmarks to developer workflows, yet they still begin with bug reports written for human developers. While decades of research have established what makes a good bug report for humans (e.g., steps to reproduce, stack traces), it remains unclear whether these features transfer to LLM-based agents. We study this question in two analyses. First, we use statistical modeling to examine associations between 27 bug-report features and repair success across 433 SWE-bench Verified issues attempted by 87 repair agents. We find that fix suggestions, reproduction scripts, repository source code, and localization info are associated with higher resolution likelihood, while longer reports are associated with lower odds. Second, we conduct controlled ablations across 2 models and 17 problem-statement mutations on SWE-bench Pro, varying the information available to an agent while holding the underlying task fixed. We remove or isolate selected bug-report content, delete fault-localization cues, and test structural changes that flatten lists or remove section headers. We find that both models depend on localization cues and expected behavior, and that structural changes alone can reduce solve rates, even without removing any content. The two models diverge in how they handle missing information: Qwen searches more widely and can exhaust its turn budget, while Gemma commits to a plausible interpretation early and patches on it. Our findings indicate that a good bug report for an agent overlaps with, but is not identical to, a good report for a human: agents benefit most from concrete, executable, and well-localized information, whereas some qualities long emphasized for human readers, such as natural language steps to reproduce and readable descriptions, contribute little or even correlate with lower success.

20.7 · SE · Feb 21, 2024

Test-Driven Development for Code Generation

Noble Saji Mathews, Meiyappan Nagappan

Recent Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code is often written in response to a requirement. Historically, Test-Driven Development (TDD) has proven its merit, requiring developers to write tests before the functional code, ensuring alignment with the initial problem statements. Applying TDD principles to LLM-based code generation offers one distinct benefit: it enables developers to verify the correctness of generated code against predefined tests. This paper investigates if and how TDD can be incorporated into AI-assisted code-generation processes. We experimentally evaluate our hypothesis that providing LLMs like GPT-4 and Llama 3 with tests in addition to the problem statements enhances code generation outcomes. We experimented with established function-level code generation benchmarks such as MBPP and HumanEval. Our results consistently demonstrate that including test cases leads to higher success in solving programming challenges. We assert that TDD is a promising paradigm for helping ensure that the code generated by LLMs effectively captures the requirements.
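The idea of supplying tests alongside the problem statement, then checking generated code against them, can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `build_prompt`, `passes_tests`, the prompt wording, and the toy `add` problem are all hypothetical, and no model is called.

```python
from typing import List


def build_prompt(problem: str, tests: List[str]) -> str:
    """Combine the problem statement with the tests the solution must pass.

    The wording of the prompt is an assumption made for illustration.
    """
    test_block = "\n".join(tests)
    return f"{problem}\n\nYour solution must pass these tests:\n{test_block}\n"


def passes_tests(candidate_src: str, tests: List[str]) -> bool:
    """Return True if the candidate source satisfies every assert-style test.

    Note: exec() runs arbitrary code; real use would need a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        for test in tests:
            exec(test, namespace)  # each test is a single assert statement
    except Exception:
        return False
    return True


problem = "Write a function add(a, b) that returns the sum of two integers."
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

# Two hand-written stand-ins for model output: one correct, one faulty.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(build_prompt(problem, tests))
print(passes_tests(good, tests), passes_tests(bad, tests))
```

The same predefined tests serve two roles here: they extend the prompt, and they act as the acceptance check for whatever code comes back.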

16.0 · CR · Jan 2, 2024

LLbezpeky: Leveraging Large Language Models for Vulnerability Detection

Noble Saji Mathews, Yelizaveta Brus, Yousra Aafer et al.

Despite continued research and progress in building secure systems, Android applications continue to be ridden with vulnerabilities, necessitating effective detection methods. Current strategies involving static and dynamic analysis tools come with limitations, such as an overwhelming number of false positives and a limited scope of analysis, which make either difficult to adopt. Over the past years, machine learning based approaches have been extensively explored for vulnerability detection, but their real-world applicability is constrained by data requirements and feature engineering challenges. Large Language Models (LLMs), with their vast parameters, have shown tremendous potential in understanding semantics in human as well as programming languages. We dive into the efficacy of LLMs for detecting vulnerabilities in the context of Android security. We focus on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities. Our experiments show that LLMs outperform our expectations in finding issues within applications, correctly flagging insecure apps in 91.67% of cases in the Ghera benchmark. We use inferences from our experiments towards building a robust and actionable vulnerability detection system and demonstrate its effectiveness. Our experiments also shed light on how various simple configurations can affect the True Positive (TP) and False Positive (FP) rates.

3.3 · SE · Dec 18, 2024

Design choices made by LLM-based test generators prevent them from finding bugs

Noble Saji Mathews, Meiyappan Nagappan

There is an increasing amount of research and commercial tools for automated test case generation using Large Language Models (LLMs). This paper critically examines whether recent LLM-based test generation tools, such as Codium CoverAgent and CoverUp, can effectively find bugs or unintentionally validate faulty code. Considering bugs are only exposed by failing test cases, we explore the question: can these tools truly achieve the intended objectives of software testing when their test oracles are designed to pass? Using real human-written buggy code as input, we evaluate these tools, showing how LLM-generated tests can fail to detect bugs and, more alarmingly, how their design can worsen the situation by validating bugs in the generated test suite and rejecting bug-revealing tests. These findings raise important questions about the validity of the design behind LLM-based test generation tools and their impact on software quality and test suite reliability.

3.3 · SE · Nov 21, 2024

CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs

Alex Mathai, Kranthi Sedamaki, Debeshee Das et al.

Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to infuse arbitrary code-views into existing transformer-based models effectively. Therefore, in this work, we propose CodeSAM, a novel scalable framework to infuse multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that by using this technique, we improve downstream performance when compared to SLMs like GraphCodeBERT and CodeBERT on all three tasks by utilizing individual code-views or a combination of code-views during fine-tuning. We believe that these results are indicative that techniques like CodeSAM can help create compact yet performant code SLMs that fit in resource constrained settings.
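A self-attention mask derived from code-view graphs can be sketched in a few lines. The abstract does not specify how CodeSAM constructs or combines its masks, so everything here is an assumption: `build_attention_mask` is a hypothetical helper, edges are treated as undirected, multiple views are combined by union, and the token indices and edges are made up.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]  # (source token index, destination token index)


def build_attention_mask(num_tokens: int, views: Iterable[Iterable[Edge]]) -> List[List[bool]]:
    """Return a num_tokens x num_tokens boolean mask.

    mask[i][j] is True when token i may attend to token j. Each token can
    always attend to itself; other pairs are allowed only if some code-view
    (e.g. AST, DFG, CFG) connects them. Union and symmetry are assumptions.
    """
    mask = [[i == j for j in range(num_tokens)] for i in range(num_tokens)]
    for edges in views:
        for src, dst in edges:
            mask[src][dst] = True
            mask[dst][src] = True
    return mask


# Toy edges over four tokens; not extracted from real code.
ast_edges = [(0, 1), (1, 2)]
dfg_edges = [(0, 3)]

mask = build_attention_mask(4, [ast_edges, dfg_edges])
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In a transformer, a mask like this would typically be applied by setting disallowed attention scores to a large negative value before the softmax.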

8.6 · SE · Jun 21, 2021 · Code

On the Impact of Multiple Source Code Representations on Software Engineering Tasks -- An Empirical Study

Karthik Chandra Swarna, Noble Saji Mathews, Dheeraj Vagavolu et al.

Efficiently representing source code is crucial for various software engineering tasks such as code classification and clone detection. Existing approaches primarily use Abstract Syntax Tree (AST), and only a few focus on semantic graphs such as Control Flow Graph (CFG) and Program Dependency Graph (PDG), which contain information about source code that AST does not. Even though some works tried to utilize multiple representations, they do not provide any insights about the costs and benefits of using multiple representations. The primary goal of this paper is to discuss the implications of utilizing multiple code representations, specifically AST, CFG, and PDG. We modify an AST path-based approach to accept multiple representations as input to an attention-based model. We do this to measure the impact of additional representations (such as CFG and PDG) over AST. We evaluate our approach on three tasks: Method Naming, Program Classification, and Clone Detection. Our approach increases the performance on these tasks by 11% (F1), 15.7% (Accuracy), and 9.3% (F1), respectively, over the baseline. In addition to the effect on performance, we discuss timing overheads incurred with multiple representations. We envision this work providing researchers with a lens to evaluate combinations of code representations for various tasks.

1.2 · CY · Jun 18, 2021

Detox Browser -- Towards Filtering Sensitive Content On the Web

Noble Saji Mathews, Sridhar Chimalakonda

The annual consumption of web-based resources is increasing at a very fast rate, mainly due to an increase in affordability and accessibility of the internet. Many are relying on the web to get diverse perspectives, but at the same time, it can expose them to content that is harmful to their mental well-being. Catchy headlines and emotionally charged articles increase the number of readers, which in turn increases ad revenue for websites. When a user consumes a large quantity of negative content, it adversely impacts the user's happiness and has a significant impact on their mood and state of mind. Many studies carried out during the COVID-19 pandemic have shown that people across the globe, irrespective of their country of origin, have experienced higher levels of anxiety and depression. Web filters can help in constructing a digital environment that is more suitable for people prone to depression, anxiety and stress. A significant amount of work has been done in the field of web filtering, but there has been limited focus on helping Highly Sensitive Persons (HSPs) or those with stress disorders induced by trauma. Through this paper, we propose Detox Browser, a simple tool that enables end-users to tune out of or control their exposure to topics that can affect their mental well-being. The extension makes use of sentiment analysis and keywords to filter out flagged content from Google search results and warns users if any blacklisted topics are detected when navigating across websites.

1.2 · CY · Jun 3, 2020

AiR -- An Augmented Reality Application for Visualizing Air Pollution

Noble Saji Mathews, Sridhar Chimalakonda, Suresh Jain

Air quality is a term used to describe the concentration levels of various pollutants in the air we breathe. Air quality, which is degrading rapidly across the globe, has been a source of great concern. Across the globe, governments are taking various measures to reduce air pollution. Bringing awareness about environmental pollution among the public plays a major role in controlling air pollution, as the programs proposed by governments require the support of the public. Information on air quality is available on multiple portals, such as that of the Central Pollution Control Board (CPCB), which provides an Air Quality Index that can be accessed by the public. However, such portals are scarcely visited by the general public. Visualizing air quality in the location where an individual resides could help in bringing awareness among the public. This visualization could be rendered using Augmented Reality techniques. Considering the widespread usage of Android-based mobile devices in India, and the importance of air quality visualization, we present AiR, an Android-based mobile application. AiR considers the air quality measured by CPCB in a locality that is detected by the user's GPS or in a locality of the user's choice, visualizes various air pollutants present in the locality $(PM_{10}, PM_{2.5}, NO_2, SO_2, CO, O_3 \& NH_3)$, and displays them in the user's surroundings. AiR also creates awareness in an interactive manner about the different pollutants, sources, and their impacts on health.