CL · Jun 19, 2023 · Code
Guiding Language Models of Code with Global Context using Monitors · Lakshya A Agrawal, Aditya Kanade, Navin Goyal et al.
Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen .
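The core idea of monitor-guided decoding can be illustrated with a small sketch. It assumes a toy vocabulary of string tokens and a set `allowed` of identifiers that a static analysis reports as type-consistent at the dereference site; the function names `mask_logits` and `pick_next_token` are hypothetical and are not taken from the monitors4codegen implementation.

```python
import math

def mask_logits(logits, vocab, partial, allowed):
    """Suppress tokens that cannot extend `partial` toward an allowed identifier.

    `allowed` stands in for the monitor's answer (an assumption of this sketch).
    """
    masked = []
    for token, score in zip(vocab, logits):
        candidate = partial + token
        consistent = any(name.startswith(candidate) for name in allowed)
        masked.append(score if consistent else -math.inf)
    return masked

def pick_next_token(logits, vocab, partial, allowed):
    """Greedy choice over the masked logits; returns None if nothing is consistent."""
    masked = mask_logits(logits, vocab, partial, allowed)
    best = max(range(len(vocab)), key=masked.__getitem__)
    return None if masked[best] == -math.inf else vocab[best]

vocab = ["get", "Name", "Nme", "set", "foo"]
logits = [0.1, 0.2, 0.9, 0.5, 2.0]
allowed = {"getName", "setAge"}
first = pick_next_token(logits, vocab, "", allowed)      # unmasked argmax would be "foo"
second = pick_next_token(logits, vocab, "get", allowed)  # only "Name" completes "getName"
```

The model's own preferences still decide among the consistent tokens; the monitor only removes continuations that cannot lead to a valid member name.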
PL · Jan 31, 2023
FLAME: A small language model for spreadsheet formulas · Harshit Joshi, Abishai Ebenezer, José Cambronero et al. · stanford
Spreadsheets are a vital tool for end-user data management. Using large language models for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to their size (up to billions of parameters). We present FLAME, a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and training on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pre-training objectives. We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. FLAME can outperform much larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and GraphCodeBERT.
SE · Sep 17, 2022
CodeQueries: A Dataset of Semantic Queries over Code · Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj et al. · deepmind
Developers often have questions about semantic aspects of code they are working on, e.g., "Is there a class whose parent classes declare a conflicting attribute?". Answering them requires understanding code semantics such as attributes and inheritance relation of classes. An answer to such a question should identify code spans constituting the answer (e.g., the declaration of the subclass) as well as supporting facts (e.g., the definitions of the conflicting attributes). The existing work on question-answering over code has considered yes/no questions or method-level context. We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code. Compared to the existing datasets, in CodeQueries, the queries are about code semantics, the context is file level and the answers are code spans. We curate the dataset based on queries supported by a widely-used static analysis tool, CodeQL, and include both positive and negative examples, and queries requiring single-hop and multi-hop reasoning. To assess the value of our dataset, we evaluate baseline neural approaches. We study a large language model (GPT3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries. We also evaluate a BERT style model (CuBERT) with fine-tuning. We find that these models achieve limited success on CodeQueries. CodeQueries is thus a challenging dataset to test the ability of neural models, to understand code semantics, in the extractive question-answering setting.
CL · Jul 15, 2024
An Empirical Study of Validating Synthetic Data for Formula Generation · Usneek Singh, José Cambronero, Sumit Gulwani et al. · microsoft-research
Large language models (LLMs) can be leveraged to help with writing formulas in spreadsheets, but resources on these formulas are scarce, impacting both the base performance of pre-trained models and limiting the ability to fine-tune them. Given a corpus of formulas, we can use a(nother) model to generate synthetic natural language utterances for fine-tuning. However, it is important to validate whether the NL generated by the LLM is indeed accurate to be beneficial for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (2 open and 2 closed weight). Interestingly, we show that although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.
AI · Dec 15, 2022 · Code
BNSynth: Bounded Boolean Functional Synthesis · Ravi Raja, Stanly Samuel, Chiranjib Bhattacharyya et al.
The automated synthesis of correct-by-construction Boolean functions from logical specifications is known as the Boolean Functional Synthesis (BFS) problem. BFS has many application areas that range from software engineering to circuit design. In this paper, we introduce a tool BNSynth, that is the first to solve the BFS problem under a given bound on the solution space. Bounding the solution space induces the synthesis of smaller functions that benefit resource constrained areas such as circuit design. BNSynth uses a counter-example guided, neural approach to solve the bounded BFS problem. Initial results show promise in synthesizing smaller solutions; we observe at least 3.2X (and up to 24X) improvement in the reduction of solution size on average, as compared to state of the art tools on our benchmarks. BNSynth is available on GitHub under an open source license.
CV · Apr 17 · Code
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs · Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti et al.
Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
AI · Sep 22, 2023
Frustrated with Code Quality Issues? LLMs can Help! · Nalin Wadhwa, Jui Pradhan, Atharv Sonwane et al.
As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo comprised of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates which pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes which may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM is able to reduce false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to 78.3% for a specialized program repair tool, with significantly less engineering effort.
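The proposer, filter, and ranker stages described above can be sketched as a small pipeline. The callables `propose`, `passes_static_check`, and `rank_score` are hypothetical stand-ins for the proposer LLM, the static analysis tool, and the ranker LLM; the `min_score` cut-off is likewise an assumption of this sketch and not a documented CORE parameter.

```python
def rank_revisions(source, recommendation, propose, passes_static_check,
                   rank_score, min_score=0.0):
    """Return candidate revisions that pass the static check, best-ranked first."""
    # Stage 1: the proposer generates candidate revisions from the recommendation.
    candidates = propose(source, recommendation)
    # Stage 2: keep only candidates that the static analysis accepts.
    surviving = [c for c in candidates if passes_static_check(c)]
    # Stage 3: the ranker scores each surviving candidate against the original.
    scored = [(rank_score(source, c), c) for c in surviving]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored if score >= min_score]

# Toy stand-ins so the sketch runs without any model or analysis tool.
toy_propose = lambda src, rec: ["a", "bb", "ccc"]
toy_check = lambda cand: len(cand) != 2
toy_rank = lambda src, cand: len(cand)

ranked = rank_revisions("orig", "fix warning", toy_propose, toy_check, toy_rank)
strict = rank_revisions("orig", "fix warning", toy_propose, toy_check, toy_rank,
                        min_score=2)
```

Filtering before ranking keeps the ranker from spending calls on candidates the tool would reject anyway.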
MA · Nov 14, 2025 · Code
Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting · Nirmit Arora, Sathvik Joel, Ishan Kavathekar et al.
LLM-based agents are increasingly deployed in multi-agent systems (MAS). As these systems move toward real-world applications, their security becomes paramount. Existing research largely evaluates single-agent security, leaving a critical gap in understanding the vulnerabilities introduced by multi-agent design. However, existing systems fall short due to lack of unified frameworks and metrics focusing on unique rejection modes in MAS. We present SafeAgents, a unified and extensible framework for fine-grained security assessment of MAS. SafeAgents systematically exposes how design choices such as plan construction strategies, inter-agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting. We introduce Dharma, a diagnostic measure that helps identify weak links within multi-agent pipelines. Using SafeAgents, we conduct a comprehensive study across five widely adopted multi-agent architectures (centralized, decentralized, and hybrid variants) on four datasets spanning web tasks, tool use, and code generation. Our findings reveal that common design patterns carry significant vulnerabilities. For example, centralized systems that delegate only atomic instructions to sub-agents obscure harmful objectives, reducing robustness. Our results highlight the need for security-aware design in MAS. Link to code is https://github.com/microsoft/SafeAgents
SE · May 27, 2025 · Code
Code Researcher: Deep Research Agent for Large Systems Code and Commit History · Ramneet Singh, Sathvik Joel, Abhav Mehrotra et al.
Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexity of systems code, making changes to a systems codebase is a daunting task, even for humans. It requires researching many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches for mitigating crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to gather sufficient context. The context is stored in a structured memory which is used for synthesizing a patch. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate of 58%, compared to 37.5% by SWE-agent. On average, Code Researcher explores 10 files in each trajectory whereas SWE-agent explores only 1.33 files, highlighting Code Researcher's ability to deeply explore the codebase. Through another experiment on open-source multimedia software, we show the generalizability of Code Researcher. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.
CV · Apr 16, 2022
A Robust and Scalable Attention Guided Deep Learning Framework for Movement Quality Assessment · Aditya Kanade, Mansi Sharma, Manivannan Muniyandi
Physical rehabilitation programs frequently begin with a brief stay in the hospital and continue with home-based rehabilitation. Lack of feedback on exercise correctness is a significant issue in home-based rehabilitation. Automated movement quality assessment (MQA) using skeletal movement data (hereafter referred to as skeletal data) collected via depth imaging devices can assist with home-based rehabilitation by providing the necessary quantitative feedback. This paper aims to use recent advances in deep learning to address the problem of MQA. Movement quality score generation is an essential component of MQA. We propose three novel skeletal data augmentation schemes. We show that using the proposed augmentations for generating movement quality scores result in significant performance boosts over existing methods. Finally, we propose a novel transformer based architecture for MQA. Four novel feature extractors are proposed and studied that allow the transformer network to operate on skeletal data. We show that adding the attention mechanism in the design of the proposed feature extractor allows the transformer network to pay attention to specific body parts that make a significant contribution towards executing a movement. We report an improvement in movement quality score prediction of 12% on UI-PRMD dataset and 21% on KIMORE dataset compared to the existing methods.
CV · May 28, 2025 · Code
Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs · Aditya Kanade, Tanuja Ganu
Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these underlying failures. Our preliminary study on a joint perception-reasoning dataset revealed that for one leading MLLM, 29% of its correct answers to reasoning questions still exhibited visual perception errors. To systematically address this, we introduce "Do You See Me", a scalable benchmark with 1,758 images and 2,612 questions. It spans seven human-psychology inspired subtasks in 2D and 3D, featuring controllable complexity to rigorously evaluate MLLM visual skills. Our findings on 3 leading closed-source and 5 major open-source models reveal a stark deficit: humans achieve 96.49% accuracy, while top MLLMs average below 50%. This performance gap widens rapidly with increased task complexity (e.g., from 12% to 45% in the visual form constancy subtask). Further analysis into the root causes suggests that failures stem from challenges like misallocated visual attention and the instability of internal representations for fine-grained details, especially at or below encoder patch resolution. This underscores an urgent need for MLLMs with truly robust visual perception. The benchmark dataset, source code and evaluation scripts are available at https://github.com/microsoft/Do-You-See-Me.
SE · Mar 5, 2025 · Code
Robust Learning of Diverse Code Edits · Tushar Aggarwal, Swayam Singh, Abhijeet Awasthi et al.
Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new series of models NextCoder (adapted from QwenCoder-2.5) that achieves strong results on five code-editing benchmarks, outperforming comparable size models and even several larger ones. We show the generality of our approach on two model families (DeepSeekCoder and QwenCoder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation and general problem-solving abilities post adaptation. We opensource the models, synthetic dataset, and implementation at https://aka.ms/nextcoder.
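One way to read the "sparse projection onto the base model" is sketched below. This is an interpretation under stated assumptions, not the SeleKT algorithm as published: weights are flat Python lists, importance is approximated by the magnitude of the change from the base weights after a dense update, and `keep_fraction` is a hypothetical parameter name.

```python
def sparse_project(base, tuned, keep_fraction):
    """Keep the largest-magnitude weight changes; reset all others to the base values."""
    deltas = [t - b for b, t in zip(base, tuned)]
    k = max(1, int(len(deltas) * keep_fraction))
    # Indices of the k weights that moved the most during the dense step.
    by_magnitude = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]),
                          reverse=True)
    keep = set(by_magnitude[:k])
    return [b + d if i in keep else b
            for i, (b, d) in enumerate(zip(base, deltas))]

base = [1.0, 2.0, 3.0, 4.0]
tuned = [1.5, 2.01, 2.0, 4.1]   # weights after a dense fine-tuning step
projected = sparse_project(base, tuned, keep_fraction=0.5)
```

Because most weights are returned to their base values, the adapted model stays close to the base model, which is the intuition behind retaining pre-existing abilities.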
SE · Dec 21, 2019 · Code
Learning and Evaluating Contextual Embedding of Source Code · Aditya Kanade, Petros Maniatis, Gogul Balakrishnan et al.
Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and training budget, while achieving better accuracies. However, there is no attempt yet to obtain a high-quality contextual embedding of source code, and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap that this paper aims to mitigate. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks, and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training, and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark, and from comparing against CuBERT models as a strong baseline.
SE · Jan 29, 2024
NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness · Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi et al.
Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of such requirements. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of 27 code LMs. Our finding is that LMs generally falter when tested on our benchmark, hinting at fundamental blindspots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally-correct code in the first place. We release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.
SE · Apr 22, 2024
Class-Level Code Generation from Natural Language Using Iterative, Tool-Enhanced Reasoning over Repository · Ajinkya Deshpande, Anmol Agarwal, Shashank Shet et al.
LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level across various benchmarks. However, the complexities associated with creating code artifacts like classes, particularly within the context of real-world software repositories, remain underexplored. Prior research treats class-level generation as an isolated task, neglecting the intricate dependencies & interactions that characterize real-world software environments. To address this gap, we introduce RepoClassBench, a comprehensive benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world repositories. RepoClassBench includes "Natural Language to Class generation" tasks across Java, Python & C# from a selection of repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages & under various settings. Our findings emphasize the critical need for code-generation benchmarks to incorporate repo-level dependencies to more accurately reflect the complexities of software development. Our work shows the benefits of leveraging specialized tools to enhance LLMs' understanding of repository context. We plan to make our dataset & evaluation harness public.
CV · Apr 9
Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization · Sai Srinivas Kancheti, Aditya Kanade, Rohit Sinha et al.
Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
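The combination of group-relative advantages with constraints enforced by Lagrangian dual ascent can be sketched as follows. This is a generic illustration of the technique, not FGRPO's exact formulation: the weighting scheme, the constraint names, the targets, and the learning rate are all assumptions made for the example.

```python
import statistics

def normalize(values):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def combined_advantages(task_rewards, constraint_scores, lambdas):
    """Add multiplier-weighted constraint advantages to the task advantage."""
    advantages = normalize(task_rewards)
    for name, scores in constraint_scores.items():
        for i, a in enumerate(normalize(scores)):
            advantages[i] += lambdas[name] * a
    return advantages

def dual_ascent_step(lambdas, batch_means, targets, lr):
    """Raise a multiplier while its constraint is violated; never let it go below zero."""
    return {name: max(0.0, lambdas[name] + lr * (targets[name] - batch_means[name]))
            for name in lambdas}

adv = combined_advantages(
    task_rewards=[1.0, 0.0, 1.0, 0.0],
    constraint_scores={"consistency": [1.0, 1.0, 0.0, 0.0]},
    lambdas={"consistency": 0.5},
)
raised = dual_ascent_step({"consistency": 0.5}, {"consistency": 0.6},
                          {"consistency": 0.9}, lr=1.0)
relaxed = dual_ascent_step({"consistency": 0.1}, {"consistency": 1.0},
                           {"consistency": 0.5}, lr=1.0)
```

The multiplier grows while the batch falls short of the target and shrinks toward zero once the constraint is met, which is how the relative importance of each constraint adapts during optimization.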
AI · Jun 17, 2024
MASAI: Modular Architecture for Software-engineering AI Agents · Daman Arora, Atharv Sonwane, Nalin Wadhwa et al.
A common method to solve complex problems in software engineering is to divide the problem into multiple sub-problems. Inspired by this, we propose a Modular Architecture for Software-engineering AI (MASAI) agents, where different LLM-powered sub-agents are instantiated with well-defined objectives and strategies tuned to achieve those objectives. Our modular architecture offers several advantages: (1) employing and tuning different problem-solving strategies across sub-agents, (2) enabling sub-agents to gather information from different sources scattered throughout a repository, and (3) avoiding unnecessarily long trajectories which inflate costs and add extraneous context. MASAI enabled us to achieve the highest performance (28.33% resolution rate) on the popular and highly challenging SWE-bench Lite dataset consisting of 300 GitHub issues from 11 Python repositories. We conduct a comprehensive evaluation of MASAI relative to other agentic methods and analyze the effects of our design decisions and their contribution to the success of MASAI.
SE · May 23, 2023
GrACE: Generation using Associated Code Edits · Priyanshu Gupta, Avishree Khare, Yasharth Bajpai et al.
Developers expend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) of code with the knowledge of prior, relevant edits. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, the knowledge of prior edits boosts the performance of the LLMs significantly and enables them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.
AI · Dec 6, 2021
Tele-EvalNet: A Low-cost, Teleconsultation System for Home based Rehabilitation of Stroke Survivors using Multiscale CNN-LSTM Architecture · Aditya Kanade, Mansi Sharma, M. Manivannan
Technology has an important role to play in the field of Rehabilitation, improving patient outcomes and reducing healthcare costs. However, existing approaches lack clinical validation, robustness and ease of use. We propose Tele-EvalNet, a novel system consisting of two components: a live feedback model and an overall performance evaluation model. The live feedback model demonstrates feedback on exercise correctness with easy to understand instructions highlighted using color markers. The overall performance evaluation model learns a mapping of joint data to scores, given to the performance by clinicians. The model does this by extracting clinically approved features from joint data. Further, these features are encoded to a lower dimensional space with an autoencoder. A novel multi-scale CNN-LSTM network is proposed to learn a mapping of performance data to the scores by leveraging features extracted at multiple scales. The proposed system shows a high degree of improvement in score predictions and outperforms the state-of-the-art rehabilitation models.
LG · Jul 12, 2021
Stateful Detection of Model Extraction Attacks · Soham Pal, Yash Gupta, Aditya Kanade et al.
Machine-Learning-as-a-Service providers expose machine learning (ML) models through application programming interfaces (APIs) to developers. Recent work has shown that attackers can exploit these APIs to extract good approximations of such ML models, by querying them with samples of their choosing. We propose VarDetect, a stateful monitor that tracks the distribution of queries made by users of such a service, to detect model extraction attacks. Harnessing the latent distributions learned by a modified variational autoencoder, VarDetect robustly separates three types of attacker samples from benign samples, and successfully raises an alarm for each. Further, with VarDetect deployed as an automated defense mechanism, the extracted substitute models are found to exhibit poor performance and transferability, as intended. Finally, we demonstrate that even adaptive attackers with prior knowledge of the deployment of VarDetect, are detected by it.
SE · May 28, 2019
Deep Learning for Bug-Localization in Student Programs · Rahul Gupta, Aditya Kanade, Shirish Shevade
Providing feedback is an integral part of teaching. Most open online courses on programming make use of automated grading systems to support programming assignments and give real-time feedback. These systems usually rely on test results to quantify the programs' functional correctness. They return failing tests to the students as feedback. However, students may find it difficult to debug their programs if they receive no hints about where the bug is and how to fix it. In this work, we present the first deep learning based technique that can localize bugs in a faulty program w.r.t. a failing test, without even running the program. At the heart of our technique is a novel tree convolutional neural network which is trained to predict whether a program passes or fails a given test. To localize the bugs, we analyze the trained network using a state-of-the-art neural prediction attribution technique and see which lines of the programs make it predict the test outcomes. Our experiments show that the proposed technique is generally more accurate than two state-of-the-art program-spectrum based and one syntactic difference based bug-localization baselines.
LG · May 22, 2019
A framework for the extraction of Deep Neural Networks by leveraging public data · Soham Pal, Yash Gupta, Aditya Shukla et al.
Machine learning models trained on confidential datasets are increasingly being deployed for profit. Machine Learning as a Service (MLaaS) has made such models easily accessible to end-users. Prior work has developed model extraction attacks, in which an adversary extracts an approximation of MLaaS models by making black-box queries to it. However, none of these works is able to satisfy all the three essential criteria for practical model extraction: (1) the ability to work on deep learning models, (2) the non-requirement of domain knowledge and (3) the ability to work with a limited query budget. We design a model extraction framework that makes use of active learning and large public datasets to satisfy them. We demonstrate that it is possible to use this framework to steal deep classifiers trained on a variety of datasets from image and text domains. By querying a model via black-box access for its top prediction, our framework improves performance on an average over a uniform noise baseline by 4.70x for image tasks and 2.11x for text tasks respectively, while using only 30% (30,000 samples) of the public dataset at its disposal.
LG · Apr 3, 2019
Neural Program Repair by Jointly Learning to Localize and Repair
Marko Vasic, Aditya Kanade, Petros Maniatis et al.
Due to its potential to improve programmer productivity and software quality, automated program repair has been an active topic of research. Newer techniques harness neural networks to learn directly from examples of buggy programs and their fixes. In this work, we consider a recently identified class of bugs called variable-misuse bugs. The state-of-the-art solution for variable misuse enumerates potential fixes for all possible bug locations in a program, before selecting the best prediction. We show that it is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs. We present multi-headed pointer networks for this purpose, with one head each for localization and repair. The experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer based model for repair alone.
AI · Jun 11, 2018
Greybox fuzzing as a contextual bandits problem
Ketan Patil, Aditya Kanade
Greybox fuzzing is one of the most useful and effective techniques for detecting bugs in large-scale application programs. It uses a minimal amount of instrumentation. American Fuzzy Lop (AFL) is a popular coverage-based evolutionary greybox fuzzing tool. AFL performs extremely well in fuzz testing large applications and finding critical vulnerabilities, but it relies on many heuristics when deciding the favored test case(s), skipping test cases during fuzzing, and assigning fuzzing iterations to test case(s). In this work, we aim to replace the heuristics that AFL uses when assigning fuzzing iterations to a test case during random fuzzing. We formalize this as a 'contextual bandit problem' and propose an algorithm to solve it. We have implemented our approach on top of AFL. We modify AFL's heuristics with our learned model through the policy gradient method. Our learning algorithm selects the multiplier of the number of fuzzing iterations to be assigned to a test case during random fuzzing, given a fixed-length substring of the test case to be fuzzed. We fuzz the substring with this new energy value and continuously update the policy based on the interesting test cases it produces.
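As a rough illustration of the formulation in this abstract, the sketch below treats the choice of an energy multiplier as a contextual bandit trained with a REINFORCE-style policy gradient. The multiplier values, the context length, the byte-scaling features, the linear softmax policy, and every name here are assumptions made for illustration; none of them is taken from the paper or from AFL's code.

```python
import math
import random

MULTIPLIERS = [0.5, 1.0, 2.0, 4.0]  # hypothetical energy multipliers (the arms)
CONTEXT_LEN = 8                     # hypothetical fixed substring length


def features(substring):
    """Scale the first CONTEXT_LEN bytes to [0, 1] and append a bias term."""
    padded = substring[:CONTEXT_LEN].ljust(CONTEXT_LEN, b"\x00")
    return [b / 255.0 for b in padded] + [1.0]


class SoftmaxPolicy:
    """Linear softmax policy over arms, updated with REINFORCE."""

    def __init__(self, n_actions, n_features, lr=0.1, seed=0):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr
        self.rng = random.Random(seed)

    def probs(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, x):
        p = self.probs(x)
        return self.rng.choices(range(len(p)), weights=p)[0]

    def update(self, x, action, reward):
        # Gradient of log pi(action | x) w.r.t. row a is (1[a == action] - p[a]) * x.
        p = self.probs(x)
        for a, row in enumerate(self.w):
            grad = (1.0 if a == action else 0.0) - p[a]
            for i, xi in enumerate(x):
                row[i] += self.lr * reward * grad * xi


# Toy loop: a stand-in reward that pays off only for arm 2. In a fuzzer the
# reward would instead reflect the interesting test cases found.
policy = SoftmaxPolicy(len(MULTIPLIERS), CONTEXT_LEN + 1)
context = features(b"GIF89a")
initial = policy.probs(context)
for _ in range(300):
    arm = policy.sample(context)
    policy.update(context, arm, 1.0 if arm == 2 else 0.0)
final = policy.probs(context)
```

The chosen arm would scale the number of fuzzing iterations given to the test case, and the policy shifts probability toward multipliers that earned reward in similar contexts.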
SE · Apr 13, 2018
Active Learning for Efficient Testing of Student Programs
Ishan Rastogi, Aditya Kanade, Shirish Shevade
In this work, we propose an automated method to identify semantic bugs in student programs, called ATAS, which builds upon the recent advances in both symbolic execution and active learning. Symbolic execution is a program analysis technique which can generate test cases through symbolic constraint solving. Our method makes use of a reference implementation of the task as its sole input. We compare our method with a symbolic execution-based baseline on 6 programming tasks retrieved from CodeForces comprising a total of 23K student submissions. We show an average improvement of over 2.5x over the baseline in terms of runtime (thus making it more suitable for online evaluation), without a significant degradation in evaluation accuracy.
AI · Jan 31, 2018
Deep Reinforcement Learning for Programming Language Correction
Rahul Gupta, Aditya Kanade, Shirish Shevade
Novice programmers often struggle with the formal syntax of programming languages. To assist them, we design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We demonstrate that the agent can be trained through self-exploration directly from the raw input, that is, program text itself, without any knowledge of the formal syntax of the programming language. We leverage expert demonstrations for one tenth of the training data to accelerate training. The proposed technique is evaluated on 6975 erroneous C programs with typographic errors, written by students during an introductory programming course. Our technique fixes 14% more programs and 29% more compiler error messages relative to those fixed by a state-of-the-art tool, DeepFix, which uses a fully supervised neural machine translation approach.
SE · Mar 15, 2016
Semi-Supervised Verified Feedback Generation
Shalini Kaleeswaran, Anirudh Santhiar, Aditya Kanade et al.
Students have enthusiastically taken to online programming lessons and contests. Unfortunately, they tend to struggle due to a lack of personalized feedback when they make mistakes. The overwhelming number of submissions precludes manual evaluation. There is an urgent need for program analysis and repair techniques capable of handling both the scale and variations in student submissions, while ensuring quality of feedback. Towards this goal, we present a novel methodology called semi-supervised verified feedback generation. We cluster submissions by solution strategy and ask the instructor to identify or add a correct submission in each cluster. We then verify every submission in a cluster against the instructor-validated submission in the same cluster. If faults are detected in a submission, feedback suggesting fixes for them is generated. Clustering reduces the burden on the instructor and also the variations that have to be handled during feedback generation. The verified feedback generation ensures that only correct feedback is generated. We have applied this methodology to iterative dynamic programming (DP) assignments. Our clustering technique uses features of DP solutions. We have designed a novel counter-example guided feedback generation algorithm capable of suggesting fixes to all faults in a submission. In an evaluation on 2226 submissions to 4 problems, we could generate verified feedback for 1911 (85%) submissions in 1.6s each on average. Our technique substantially reduces the burden on the instructor: only one submission had to be manually validated or added for every 16 submissions.
SE · Jun 6, 2013
MintHint: Automated Synthesis of Repair Hints
Shalini Kaleeswaran, Varun Tulsian, Aditya Kanade et al.
Being able to automatically repair programs is an extremely challenging task. In this paper, we present MintHint, a novel technique for program repair that is a departure from most of today's approaches. Instead of trying to fully automate program repair, which is often an unachievable goal, MintHint performs statistical correlation analysis to identify expressions that are likely to occur in the repaired code and generates, using pattern-matching based synthesis, repair hints from these expressions. Intuitively, these hints suggest how to rectify a faulty statement and help developers find a complete, actual repair. MintHint can address a variety of common faults, including incorrect, spurious, and missing expressions. We present a user study that shows that developers' productivity can improve manyfold with the use of repair hints generated by MintHint -- compared to having only traditional fault localization information. We also apply MintHint to several faults of a widely used Unix utility program to further assess the effectiveness of the approach. Our results show that MintHint performs well even in situations where (1) the repair space searched does not contain the exact repair, and (2) the operational specification obtained from the test cases for repair is incomplete or even imprecise.