SEOct 2, 2023
CAT-LM: Training Language Models on Aligned Code And TestsNikitha Rao, Kush Jain, Uri Alon et al.
Testing is an integral part of the software development process. Yet, writing tests is time-consuming and therefore often neglected. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. Language models trained on code can generate code that is highly similar to that written by humans, but current models are trained to generate each file separately, as is standard practice in natural language processing, and thus fail to consider the code-under-test context when producing a test file. In this work, we propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 Billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. We analyze its usefulness for realistic applications, showing that sampling with filtering (e.g., by compilability, coverage) allows it to efficiently produce tests that achieve coverage similar to ones written by developers while resembling their writing style. By utilizing the code context, CAT-LM generates more valid tests than even much larger language models trained with more data (CodeGen 16B and StarCoder) and substantially outperforms a recent test-specific model (TeCo) at test completion. Overall, our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way to more powerful automated test generation.
SESep 19, 2024
Prompts Are Programs Too! Understanding How Developers Build Software Containing PromptsJenny T. Liang, Melissa Lin, Nikitha Rao et al.
Generative pre-trained models power intelligent software features used by millions of users controlled by developer-written natural language prompts. Despite the impact of prompt-powered software, little is known about its development process and its relationship to programming. In this work, we argue that some prompts are programs and that the development of prompts is a distinct phenomenon in programming known as "prompt programming". We develop an understanding of prompt programming using Straussian grounded theory through interviews with 20 developers engaged in prompt development across a variety of contexts, models, domains, and prompt structures. We contribute 15 observations to form a preliminary understanding of current prompt programming practices. For example, rather than building mental models of code, prompt programmers develop mental models of the foundation model (FM)'s behavior on the prompt by interacting with the FM. While prior research shows that experts have well-formed mental models, we find that prompt programmers who have developed dozens of prompts still struggle to develop reliable mental models. Our observations show that prompt programming differs from traditional software development, motivating the creation of prompt programming tools and providing implications for software engineering stakeholders.
CVFeb 19, 2025
SegSub: Evaluating Robustness to Knowledge Conflicts and Hallucinations in Vision-Language ModelsPeter Carragher, Nikitha Rao, Abhinand Jha et al.
Vision language models (VLM) demonstrate sophisticated multimodal reasoning yet are prone to hallucination when confronted with knowledge conflicts, impeding their deployment in information-sensitive contexts. While existing research addresses robustness in unimodal models, the multimodal domain lacks systematic investigation of cross-modal knowledge conflicts. This research introduces \segsub, a framework for applying targeted image perturbations to investigate VLM resilience against knowledge conflicts. Our analysis reveals distinct vulnerability patterns: while VLMs are robust to parametric conflicts (20% adherence rates), they exhibit significant weaknesses in identifying counterfactual conditions (<30% accuracy) and resolving source conflicts (<1% accuracy). Correlations between contextual richness and hallucination rate (r = -0.368, p = 0.003) reveal the kinds of images that are likely to cause hallucinations. Through targeted fine-tuning on our benchmark dataset, we demonstrate improvements in VLM knowledge conflict detection, establishing a foundation for developing hallucination-resilient multimodal systems in information-sensitive environments.
SEMay 31, 2023
AI for Low-Code for AINikitha Rao, Jason Tsay, Kiran Kate et al.
Low-code programming allows citizen developers to create programs with minimal coding effort, typically via visual (e.g. drag-and-drop) interfaces. In parallel, recent AI-powered tools such as Copilot and ChatGPT generate programs from natural language instructions. We argue that these modalities are complementary: tools like ChatGPT greatly reduce the need to memorize large APIs but still require their users to read (and modify) programs, whereas visual tools abstract away most or all programming but struggle to provide easy access to large APIs. At their intersection, we propose LowCoder, the first low-code tool for developing AI pipelines that supports both a visual programming interface (LowCoder_VP) and an AI-powered natural language interface (LowCoder_NL). We leverage this tool to provide some of the first insights into whether and how these two modalities help programmers by conducting a user study. We task 20 developers with varying levels of AI expertise with implementing four ML pipelines using LowCoder, replacing the LowCoder_NL component with a simple keyword search in half the tasks. Overall, we find that LowCoder is especially useful for (i) Discoverability: using LowCoder_NL, participants discovered new operators in 75% of the tasks, compared to just 32.5% and 27.5% using web search or scrolling through options respectively in the keyword-search condition, and (ii) Iterative Composition: 82.5% of tasks were successfully completed and many initial pipelines were further successfully improved. Qualitative analysis shows that AI helps users discover how to implement constructs when they know what to do, but still fails to support novices when they lack clarity on what they want to accomplish. Overall, our work highlights the benefits of combining the power of AI with low-code programming.
SEJan 15, 2021
SoftNER: Mining Knowledge Graphs From Cloud IncidentsManish Shetty, Chetan Bansal, Sumit Kumar et al.
The move from boxed products to services and the widespread adoption of cloud computing has had a huge impact on the software development life cycle and DevOps processes. Particularly, incident management has become critical for developing and operating large-scale services. Prior work on incident management has heavily focused on the challenges with incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for mining Knowledge Graphs from incident reports. First, we build a novel multi-task learning based BiLSTM-CRF model which leverages not just the semantic context but also the data-types for extracting factual information in the form of named entities. Next, we present an approach to mine relations between the named entities for automatically constructing knowledge graphs. We have deployed SoftNER at Microsoft, a major cloud service provider and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning pipeline has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms the state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER, we are able to build accurate models for applications such as incident triaging and recommending entities based on their relevance to incident titles.
SENov 24, 2020
Search4Code: Code Search Intent Classification Using Weak SupervisionNikitha Rao, Chetan Bansal, Joe Guan
Developers use search for various tasks such as finding code, documentation, debugging information, etc. In particular, web search is heavily used by developers for finding code examples and snippets during the coding process. Recently, natural language based code search has been an active area of research. However, the lack of real-world large-scale datasets is a significant bottleneck. In this work, we propose a weak supervision based approach for detecting code search intent in search queries for C# and Java programming languages. We evaluate the approach against several baselines on a real-world dataset comprised of over 1 million queries mined from Bing web search engine and show that the CNN based model can achieve an accuracy of 77% and 76% for C# and Java respectively. Furthermore, we are also releasing Search4Code, the first large-scale real-world dataset of code search queries mined from Bing web search engine. We hope that the dataset will aid future research on code search.
SEJul 10, 2020
Neural Knowledge Extraction From Cloud Service IncidentsManish Shetty, Chetan Bansal, Sumit Kumar et al.
In the last decade, two paradigm shifts have reshaped the software industry - the move from boxed products to services and the widespread adoption of cloud computing. This has had a huge impact on the software development life cycle and the DevOps processes. Particularly, incident management has become critical for developing and operating large-scale services. Incidents are created to ensure timely communication of service issues and, also, their resolution. Prior work on incident management has been heavily focused on the challenges with incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction from service incidents. We frame the knowledge extraction problem as a Named-entity Recognition task for extracting factual information. SoftNER leverages structural patterns like key,value pairs and tables for bootstrapping the training data. Further, we build a novel multi-task learning based BiLSTM-CRF model which leverages not just the semantic context but also the data-types for named-entity extraction. We have deployed SoftNER at Microsoft, a major cloud service provider and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning based approach has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms the state of the art NER models. Lastly, using the knowledge extracted by SoftNER we are able to build significantly more accurate models for important downstream tasks like incident triaging.
IRMay 18, 2020
Product Insights: Analyzing Product Intents in Web SearchNikitha Rao, Chetan Bansal, Subhabrata Mukherjee et al.
Web search engines are frequently used to access information about products. This has increased in recent times with the rising popularity of e-commerce. However, there is limited understanding of what users search for and their intents when it comes to product search on the web. In this work, we study search logs from Bing web search engine to characterize user intents and study user behavior for product search. We propose a taxonomy of product intents by analyzing product search queries. This is a challenging task given that only 15%-17% of web search queries are about products. We train machine learning classifiers with query log features to classify queries based on intent with an overall F1-score of 78%. We further analyze various characteristics of product search queries in terms of search metrics like dwell time, success, popularity and session-specific information.
CRMay 1, 2020
Studying Ransomware Attacks Using Web Search LogsChetan Bansal, Pantazis Deligiannis, Chandra Maddila et al.
Cyber attacks are increasingly becoming prevalent and causing significant damage to individuals, businesses and even countries. In particular, ransomware attacks have grown significantly over the last decade. We do the first study on mining insights about ransomware attacks by analyzing query logs from Bing web search engine. We first extract ransomware related queries and then build a machine learning model to identify queries where users are seeking support for ransomware attacks. We show that user search behavior and characteristics are correlated with ransomware attacks. We also analyse trends in the temporal and geographical space and validate our findings against publicly available information. Lastly, we do a case study on 'Nemty', a popular ransomware, to show that it is possible to derive accurate insights about cyber attacks by query log analysis.
SEDec 19, 2019
Analyzing Web Search Behavior for Software Engineering TasksNikitha Rao, Chetan Bansal, Thomas Zimmermann et al.
Web search plays an integral role in software engineering (SE) to help with various tasks such as finding documentation, debugging, installation, etc. In this work, we present the first large-scale analysis of web search behavior for SE tasks using the search query logs from Bing, a commercial web search engine. First, we use distant supervision techniques to build a machine learning classifier to extract the SE search queries with an F1 score of 93%. We then perform an analysis on one million search sessions to understand how software engineering related queries and sessions differ from other queries and sessions. Subsequently, we propose a taxonomy of intents to identify the various contexts in which web search is used in software engineering. Lastly, we analyze millions of SE queries to understand the distribution, search metrics and trends across these SE search intents. Our analysis shows that SE related queries form a significant portion of the overall web search traffic. Additionally, we found that there are six major intent categories for which web search is used in software engineering. The techniques and insights can not only help improve existing tools but can also inspire the development of new tools that aid in finding information for SE related tasks.