HC Apr 17
Investigating Conversational Agents to Support Secondary School Students Learning CSP
Matthew Frazier, Kostadin Damevski, Lori Pollock
Secondary school students enrolled in the AP Computer Science Principles (CSP) course commonly utilize web resources (e.g., tutorials, Q&A sites) to better understand key concepts in the curriculum. The primary obstacle to using these resources is finding information appropriate for the learning task and student's background. In addition to web search, conversational agents are increasingly a viable alternative for CSP students. In this paper, we study the potential of conversational agents to aid secondary school students as they acquire knowledge on CSP concepts. We explore general purpose, generative conversational agents (e.g., ChatGPT) and custom, fixed-response conversational agents built specifically to aid CSP students. We present results from classroom use by 45 high school students in grades 9-11 (ages 14-17) across six CSP sections. Our main contributions are in better understanding how conversational agents can help CSP students and an evaluation of the effectiveness and engagement of different approaches for CSP exploratory search.
CR Apr 15
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities
Matthew Frazier, Kostadin Damevski
According to constructivist theory, students learn software security more effectively when examples are grounded in their own code. Generic examples often fail to connect with students' prior work, limiting engagement and understanding. Advances in LLMs are now making it possible to automatically generate personalized examples by embedding security vulnerabilities directly into student-authored code. This paper introduces a method that uses LLMs to inject instances of specific Common Weakness Enumerations (CWEs) into students' own assignment code, creating individualized instructional materials. We present an agentic AI framework, using autonomous LLM-based agents equipped with task-specific tools to orchestrate injection, evaluation, ranking, and learning outcome generation. We report the experience of deploying this system in two undergraduate computer science courses (N=71), where students reviewed code samples containing LLM-injected vulnerabilities and completed a post-project survey. We compared responses with a baseline using a widely adopted set of generic security instructional materials. Students qualitatively reported finding CWE injections into their own code more relevant, clearer, and more engaging than the textbook-style examples. However, our quantitative findings revealed limited statistically significant differences, suggesting that while students valued the personalization, further studies and refinement of the approach are needed to establish stronger empirical support.
SE Feb 20, 2025
Do LLMs Consider Security? An Empirical Study on Responses to Programming Questions
Amirali Sajadi, Binh Le, Anh Nguyen et al.
The widespread adoption of conversational LLMs for software development has raised new security concerns regarding the safety of LLM-generated content. Our motivational study outlines ChatGPT's potential in volunteering context-specific information to the developers, promoting safe coding practices. Motivated by this finding, we conduct a study to evaluate the degree of security awareness exhibited by three prominent LLMs: Claude 3, GPT-4, and Llama 3. We prompt these LLMs with Stack Overflow questions that contain vulnerable code to evaluate whether they merely provide answers to the questions or if they also warn users about the insecure code, thereby demonstrating a degree of security awareness. Further, we assess whether LLM responses provide information about the causes, exploits, and the potential fixes of the vulnerability, to help raise users' awareness. Our findings show that all three models struggle to accurately detect and warn users about vulnerabilities, achieving a detection rate of only 12.6% to 40% across our datasets. We also observe that the LLMs tend to identify certain types of vulnerabilities related to sensitive information exposure and improper input neutralization much more frequently than other types, such as those involving external control of file names or paths. Furthermore, when LLMs do issue security warnings, they often provide more information on the causes, exploits, and fixes of vulnerabilities compared to Stack Overflow responses. Finally, we provide an in-depth discussion on the implications of our findings and present a CLI-based prompting tool that can be used to generate significantly more secure LLM responses.
CR Jun 30, 2025
Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench
Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee
Large Language Models (LLMs) and their agentic frameworks are increasingly adopted to automate software development tasks such as issue resolution and program repair. While prior work has identified security risks in LLM-generated code, most evaluations have focused on synthetic or isolated settings, leaving open questions about the security of these systems in real-world development contexts. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data. Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which LLMs and agents are most likely to generate insecure code. Our findings reveal that the standalone LLM introduces nearly 9x more new vulnerabilities than developers, with many of these exhibiting unique patterns not found in developers' code. Agentic workflows also generate a significant number of vulnerabilities, particularly when granting LLMs more autonomy, potentially increasing the likelihood of misinterpreting project context or task requirements. We find that vulnerabilities are more likely to occur in LLM patches associated with a higher number of files, more lines of generated code, and GitHub issues that lack specific code snippets or information about the expected code behavior and steps to reproduce. These results suggest that contextual factors play a critical role in the security of the generated code and point toward the need for proactive risk assessment methods that account for both code and issue-level information to complement existing vulnerability detection tools.
CR Feb 15
AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports
Amirali Sajadi, Tu Nguyen, Kostadin Damevski et al.
Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source-code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey-box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi-agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE-Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state-of-the-art black-box baselines. Even in a single-agent configuration, grey-box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific reasoning gaps, including misinterpreted vulnerability semantics and unmet execution preconditions. For successful exploits, AXE produces actionable, reproducible proof-of-concept artifacts, demonstrating its utility in streamlining Web vulnerability triage and remediation. We further evaluate AXE's generalizability through a case study on a recent real-world vulnerability not included in CVE-Bench.
SE Mar 8, 2025
Psycholinguistic Analyses in Software Engineering Text: A Systematic Literature Review
Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee
Context: A deeper understanding of human factors in software engineering (SE) is essential for improving team collaboration, decision-making, and productivity. Communication channels like code reviews and chats provide insights into developers' psychological and emotional states. While large language models excel at text analysis, they often lack transparency and precision. Psycholinguistic tools like Linguistic Inquiry and Word Count (LIWC) offer clearer, interpretable insights into cognitive and emotional processes exhibited in text. Despite its wide use in SE research, no comprehensive review of LIWC's use has been conducted. Objective: We examine the importance of psycholinguistic tools, particularly LIWC, and provide a thorough analysis of its current and potential future applications in SE research. Methods: We conducted a systematic review of six prominent databases, identifying 43 SE-related papers using LIWC. Our analysis focuses on five research questions. Results: Our findings reveal a wide range of applications, including analyzing team communication to detect developer emotions and personality, developing ML models to predict deleted Stack Overflow posts, and more recently comparing AI-generated and human-written text. LIWC has been primarily used with data from project management platforms (e.g., GitHub) and Q&A forums (e.g., Stack Overflow). Key BSE concepts include Communication, Organizational Climate, and Positive Psychology. 26 of 43 papers did not formally evaluate LIWC. Concerns were raised about some limitations, including difficulty handling SE-specific vocabulary. Conclusion: We highlight the potential of psycholinguistic tools and their limitations, and present new use cases for advancing the research of human factors in SE (e.g., bias in human-LLM conversations).
SE Dec 28, 2021
Fast Changeset-based Bug Localization with BERT
Agnieszka Ciborowska, Kostadin Damevski
Automatically localizing software bugs to the changesets that induced them has the potential to improve software developer efficiency and to positively affect software quality. To facilitate this automation, a bug report has to be effectively matched with source code changes, even when a significant lexical gap exists between natural language used to describe the bug and identifier naming practices used by developers. To bridge this gap, we need techniques that are able to capture software engineering-specific and project-specific semantics in order to detect relatedness between the two types of documents that goes beyond exact term matching. Popular transformer-based deep learning architectures, such as BERT, excel at leveraging contextual information, hence appear to be a suitable candidate for the task. However, BERT-like models are computationally expensive, which precludes them from being used in an environment where response time is important. In this paper, we describe how BERT can be made fast enough to be applicable to changeset-based bug localization. We also explore several design decisions in using BERT for this purpose, including how best to encode changesets and how to match bug reports to individual changes for improved accuracy. We compare the accuracy and performance of our model to a non-contextual baseline (i.e., vector space model) and BERT-based architectures previously used in software engineering. Our evaluation results demonstrate advantages in using the proposed BERT model compared to the baselines, especially for bug reports that lack any hints about related code elements.
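The non-contextual baseline mentioned in the abstract above, a vector space model, can be illustrated with a minimal sketch. This is not the authors' implementation: the smoothed TF-IDF weighting, the tokenizer, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_changesets` are all assumptions made for illustration. It ranks changesets by exact-term overlap with a bug report, which is the kind of matching that a lexical gap defeats.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_changesets(bug_report, changesets):
    """Return (changeset id, score) pairs sorted by similarity to the bug report."""
    vectors = tfidf_vectors([bug_report] + list(changesets.values()))
    query, rest = vectors[0], vectors[1:]
    scores = {cid: cosine(query, vec) for cid, vec in zip(changesets, rest)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch a bug report that says "saving" gets no credit for a changeset that says "save", since only identical tokens contribute to the score. That limitation is what motivates contextual models for this task.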
SE Apr 15, 2019
Modeling Hierarchical Usage Context for Software Exceptions based on Interaction Data
Hui Chen, Kostadin Damevski, David Shepherd et al.
Traces of user interactions with a software system, captured in production, are commonly used as an input source for user experience testing. In this paper, we present an alternative use, introducing a novel approach of modeling user interaction traces enriched with another type of data gathered in production - software fault reports consisting of software exceptions and stack traces. The model described in this paper aims to improve developers' comprehension of the circumstances surrounding a specific software exception and can highlight specific user behaviors that lead to a high frequency of software faults. Modeling the combination of interaction traces and software crash reports to form an interpretable and useful model is challenging due to the complexity and variance in the combined data source. Therefore, we propose a probabilistic unsupervised learning approach, adapting the Nested Hierarchical Dirichlet Process, which is a Bayesian non-parametric topic model commonly applied to natural language data. This model infers a tree of topics, each of which describes a set of commonly co-occurring commands and exceptions. The topic tree can be interpreted hierarchically to aid in categorizing the numerous types of exceptions and interactions. We apply the proposed approach to large scale datasets collected from the ABB RobotStudio software application, and evaluate it both numerically and with a small survey of the RobotStudio developers.
OH Dec 10, 2016
Detecting Plagiarism based on the Creation Process
Johannes Schneider, Avi Bernstein, Jan Vom Brocke et al.
All methodologies for detecting plagiarism to date have focused on the final digital "outcome", such as a document or source code. Our novel approach takes the creation process into account using logged events collected by special software or by the macro recorders found in most office applications. We look at an author's interaction logs with the software used to create the work. Detection relies on comparing the histograms of multiple logs' command use. A work is classified as plagiarism if its log deviates too much from logs of "honestly created" works or if its log is too similar to another log. The technique supports the detection of plagiarism for digital outcomes that stem from unique tasks, such as theses, and equal tasks, such as assignments for which the same problem sets are solved by multiple students. Focusing on the latter case, we evaluate this approach using logs collected by an interactive development environment (IDE) from more than sixty students who completed three programming assignments.
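The two classification rules in the abstract above (a log that deviates too much from honest logs, or a log that is too similar to another log) can be sketched as follows. This is only an illustration under stated assumptions: the distance measure (half the L1 distance between normalized histograms), the averaged honest reference, the threshold values, and the names `histogram`, `distance`, and `flag_suspicious` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def histogram(log):
    """Normalized command-frequency histogram of one interaction log."""
    if not log:
        return {}
    counts = Counter(log)
    total = sum(counts.values())
    return {cmd: n / total for cmd, n in counts.items()}


def distance(h1, h2):
    """Half the L1 distance between two histograms (0 = identical, 1 = disjoint)."""
    commands = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in commands)


def flag_suspicious(logs, honest_logs, too_similar=0.05, too_different=0.6):
    """Map each log name to a list of reasons it looks suspicious.

    Both thresholds are assumed values chosen for illustration only.
    """
    hists = {name: histogram(log) for name, log in logs.items()}
    honest = [histogram(log) for log in honest_logs]
    # Reference profile: the mean histogram over all honestly created logs.
    commands = set().union(*honest) if honest else set()
    reference = {c: sum(h.get(c, 0.0) for h in honest) / len(honest) for c in commands}

    flags = {name: [] for name in logs}
    # Rule 1: the log deviates too much from honestly created works.
    for name, hist in hists.items():
        if honest and distance(hist, reference) > too_different:
            flags[name].append("deviates from honest logs")
    # Rule 2: the log is too similar to another submitted log.
    for a, b in combinations(hists, 2):
        if distance(hists[a], hists[b]) < too_similar:
            flags[a].append(f"too similar to {b}")
            flags[b].append(f"too similar to {a}")
    return flags
```

Calling `flag_suspicious` with a dictionary of student logs (each a list of command names) and a list of known honest logs returns, per student, the rules that fired. A log dominated by paste commands would trip the first rule, while two logs with identical command profiles would trip the second.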
SE Aug 17, 2015
Supporting Developers in Porting Software via Combined Textual and Structural Analysis of Software Artifacts
Kostadin Damevski, David Shepherd, Nicholas Kraft et al.
This is a position paper accepted to the Computational Science & Engineering Software Sustainability and Productivity Challenges (CSESSP Challenges) Workshop, sponsored by the Networking and Information Technology Research and Development (NITRD) Software Design and Productivity (SDP) Coordinating Group, held October 15th-16th, 2015 in Washington DC, USA. It discusses the role that recommendation systems, based on textual and structural information in source code and further enhanced by mining related applications, can have in improving the portability of scientific and engineering software.
SE Jan 27, 2014
How the Sando Search Tool Recommends Queries
Xi Ge, David Shepherd, Kostadin Damevski et al.
Developers spend a significant amount of time searching their local codebase. To help them search efficiently, researchers have proposed novel tools that apply state-of-the-art information retrieval algorithms to retrieve relevant code snippets from the local codebase. However, these tools still rely on the developer to craft an effective query, which requires that the developer is familiar with the terms contained in the related code snippets. Our empirical data from a state-of-the-art local code search tool, called Sando, suggests that developers are sometimes unacquainted with their local codebase. In order to bridge the gap between developers and their ever-increasing local codebase, in this paper we demonstrate the recommendation techniques integrated in Sando.