Chris Brown

SE
h-index: 6
9 papers
49 citations
Novelty: 32%
AI Score: 49

9 Papers

SE | Jun 15, 2022 | Code
FixEval: Execution-based Evaluation of Program Fixes for Programming Problems

Md Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou et al.

The complexity of modern software has led to a drastic increase in the time and cost associated with detecting and rectifying software bugs. In response, researchers have explored various methods to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible fixes for any given bug, few tools and datasets are available to evaluate model-generated fixes effectively. To address this issue, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes and assess further information regarding time, memory constraints, and acceptance based on a verdict. We consider two Transformer language models pretrained on programming languages as our baseline and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately. At the same time, execution-based methods evaluate programs through all cases and scenarios designed explicitly for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.

62.0 | SE | Mar 27 | Code
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

Yoseph Berhanu Alebachew, Hunter Leary, Swanand Vaishampayan et al.

Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering (QA). However, most studies and benchmarks focus on isolated functions or single-file snippets, overlooking the challenges of real-world program comprehension, which often spans multiple files and system-level dependencies. In this work, we introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects. Using this dataset, we systematically evaluate two widely used LLMs (Claude 3.5 Sonnet and GPT-4o) under both direct prompting and agentic configurations. We compare baseline performance with retrieval-augmented generation methods that leverage file-level retrieval and graph-based representations of structural dependencies. Our results show that LLMs achieve moderate accuracy at baseline, with performance improving when structural signals are incorporated. Nonetheless, overall accuracy remains limited for repository-scale comprehension. The analysis reveals that high scores often result from verbatim reproduction of Stack Overflow answers rather than genuine reasoning. To our knowledge, this is the first empirical study to provide such evidence in repository-level QA. We release StackRepoQA to encourage further research into benchmarks, evaluation protocols, and augmentation strategies that disentangle memorization from reasoning, advancing LLMs as reliable tools for repository-scale program comprehension.

SE | Jan 25
Political and Ideological Pressure in Software Engineering Research: The Case of DEI Backlash

Sonja M. Hyrynsalmi, Chris Brown, Alexander Serebrenik et al.

Political and ideological pressures shape global research. Recently, these pressures have become particularly visible in research related to diversity, equity, and inclusion (DEI). Drastic changes in national funding and governmental guidance, especially in the US, have affected the global software engineering research ecosystem. The impacts of these pressures on research are not always direct, as they operate at multiple levels. However, what is clear is that these pressures affect every field, including software engineering (SE), despite the belief that our field is politically and ideologically neutral. In this position paper, we examine cases of political and ideological pressures on the SE research ecosystem. We investigate the community's perceptions of political and ideological pressures by analyzing community survey responses and outlining case examples of DEI backlash in SE research across three levels: macro, meso, and micro. Our research shows how recent political and ideological pressures have affected SE research across these levels, and, as a result, we propose actionable steps for the community to address these issues at different levels.

40.0 | SE | May 1
Integrating Log-Based Security Analytics in Agile Workflows: A Real-World Experience Report

Arpit Thool, Chris Brown

Modern organizations increasingly rely on log data and monitoring signals to protect products against account takeovers and abuse, yet integrating security analytics into fast-moving Agile workflows remains challenging. While it is important to understand how security practices are developed and sustained within Agile, real-world case studies of such integrations remain scarce. This experience report provides insights on developer perceptions of an effort to integrate log-based fraud detection within an organization, known as the "Red Flag Project". A cross-functional team of eight members (including one author) iterated weekly to implement a proof-of-concept log-based system that alerts stakeholders when accounts exhibit suspicious activity patterns. Through semi-structured interviews, we investigate developer perceptions of log-based fraud detection integration, exploring their willingness to adopt the system, challenges encountered, and the overall impact on day-to-day development activities and security perceptions. Our findings highlight key lessons, mitigation techniques, and best practices for embedding security analytics into Agile workflows. We provide insights for practitioners and researchers seeking to incorporate security practices into modern development processes while maintaining both speed and resilience in software delivery.

2.1 | SE | Apr 5
Integrating DAST in Kanban and CI/CD: A Real World Security Case Study

Arpit Thool, Chris Brown

Modern development methodologies, such as Kanban and continuous integration and continuous deployment (CI/CD), are critical for web application development, as software products must adapt to changing requirements and deploy to users quickly. As web application attacks and exploited vulnerabilities are rising, it is increasingly crucial to integrate security into modern development practices. Yet, the iterative and incremental nature of these processes can clash with the sequential nature of security engineering. Thus, it is challenging to adopt security practices and activities in modern development practices. Dynamic Application Security Testing (DAST) is a security practice within software development frameworks that bolsters system security. This study delves into the intersection of Agile development and DAST, exploring how a software organization attempted to integrate DAST into their Kanban workflows and CI/CD pipelines to identify and mitigate security vulnerabilities within the development process. Through an action research case study incorporating interviews among team members, this research elucidates the challenges, mitigation techniques, and best practices associated with incorporating DAST into Agile methodologies from developers' perspectives. We provide insights into integrating security practices with modern development, ensuring both speed and security in software delivery.

7.0 | SE | Apr 17
From Papers to Progress: Rethinking Knowledge Accumulation in Software Engineering

Jason Cusati, Chris Brown

Software engineering research has experienced rapid growth in both output and participation over the past decades. Yet concerns persist about the field's ability to accumulate, integrate, and reuse knowledge in ways that support long-term progress. To better understand how the community itself perceives these challenges, we analyze responses from the ICSE 2026 Future of Software Engineering pre-survey, which captures perspectives from 280 globally distributed and highly experienced researchers. Our analysis reveals a tension between increasing research productivity and the limited mechanisms available for synthesizing results, tracking evolving claims, and supporting cumulative understanding over time. Building on these observations, we diagnose four interrelated structural breakdowns: papers function as isolated knowledge units with claims embedded in prose; context and provenance are lost as knowledge moves through the publication pipeline; claims evolve without systematic tracking; and incentive structures favor novelty over consolidation. We argue that addressing these barriers requires rethinking the fundamental properties of research artifacts. We articulate four technology-agnostic principles for future research artifacts: structured and interpretable representations of claims and evidence; inspectable and provenance-aware documentation of methodological decisions; long-lived and reusable substrates that evolve beyond publication; and governance mechanisms that align individual incentives with collective knowledge-building goals. We discuss implications for research practice, publication norms, and community infrastructure, positioning FOSE as a venue for experimenting with alternative artifact designs that support cumulative scientific progress.

HC | Jul 19, 2025
Designing Conversational AI to Support Think-Aloud Practice in Technical Interview Preparation for CS Students

Taufiq Daryanto, Sophia Stil, Xiaohan Ding et al.

One challenge in technical interviews is the think-aloud process, where candidates verbalize their thought processes while solving coding tasks. Despite its importance, opportunities for structured practice remain limited. Conversational AI offers potential assistance, but limited research explores user perceptions of its role in think-aloud practice. To address this gap, we conducted a study with 17 participants using an LLM-based technical interview practice tool. Participants valued AI's role in simulation, feedback, and learning from generated examples. Key design recommendations include promoting social presence in conversational AI for technical interview simulation, providing feedback beyond verbal content analysis, and enabling crowdsourced think-aloud examples through human-AI collaboration. Beyond feature design, we examined broader considerations, including intersectional challenges and potential strategies to address them, how AI-driven interview preparation could promote equitable learning in computing careers, and the need to rethink AI's role in interview practice by suggesting a research direction that integrates human-AI collaboration.

SE | Apr 19, 2021
Demystifying Regular Expression Bugs: A comprehensive study on regular expression bug causes, fixes, and testing

Peipei Wang, Chris Brown, Jamie A. Jennings et al.

Regular expressions cause string-related bugs and open security vulnerabilities for DoS attacks. However, beyond ReDoS (Regular expression Denial of Service), little is known about the extent to which regular expression issues affect software development and how these issues are addressed in practice. We conduct an empirical study of 356 merged regex-related pull request bugs from Apache, Mozilla, Facebook, and Google GitHub repositories. We identify and classify the nature of the regular expression problems, the fixes, and the related changes in the test code. The most important findings in this paper are as follows: 1) incorrect regular expression behavior is the dominant root cause of regular expression bugs (165/356, 46.3%). The remaining root causes are incorrect API usage (9.3%) and other code issues that require regular expression changes in the fix (29.5%), 2) fixing regular expression bugs is nontrivial, as it takes more time and more lines of code to fix them compared to general pull requests, 3) most (51%) of the regex-related pull requests do not contain test code changes. Certain regex bug types (e.g., compile error, performance issues, regex representation) are less likely to include test code changes than others, and 4) the dominant type of test code changes in regex-related pull requests is test case addition (75%). The results of this study contribute to a broader understanding of the practical problems faced by developers when using, fixing, and testing regular expressions.

SE | Mar 17, 2021
Nudging Students Toward Better Software Engineering Behaviors

Chris Brown, Chris Parnin

Student experiences in large undergraduate Computer Science courses are increasingly impacted by automated systems. Bots, or agents of software automation, are useful for efficiently grading and generating feedback. Current efforts at automation in CS education focus on supporting instructional tasks, but do not address student struggles due to poor behaviors, such as procrastination. In this paper, we explore using bots to improve the software engineering behaviors of students using developer recommendation choice architectures, a framework incorporating behavioral science concepts in recommendations to improve the actions of programmers. We implemented this framework in class-bot, a novel system designed to nudge students to make better choices while working on programming assignments. This work presents a preliminary evaluation integrating this tool in an introductory programming course. Our results show that class-bot is beneficial for improving student development behaviors, increasing code quality and productivity.