SEMar 31
VeriAct: Beyond Verifiability -- Agentic Synthesis of Correct and Complete Formal SpecificationsMd Rakib Hossain Misu, Iris Ma, Cristina V. Lopes
Formal specifications play a central role in ensuring software reliability and correctness. However, automatically synthesizing high-quality formal specifications remains a challenging task, often requiring domain expertise. Recent work has applied large language models to generate specifications in Java Modeling Language (JML), reporting high verification pass rates. But does passing a verifier mean that the specification is actually correct and complete? In this work, we first conduct a comprehensive evaluation comparing classical and prompt-based approaches for automated JML specification synthesis. We then investigate whether prompt optimization can push synthesis quality further by evolving prompts through structured verification feedback. While optimization improves verifier pass rates, we find a clear performance ceiling. More critically, we propose Spec-Harness, an evaluation framework that measures specification correctness and completeness through symbolic verification, revealing that a large fraction of verifier-accepted specifications, including optimized ones, are in fact incorrect or incomplete, over- or under-constraining both inputs and outputs in ways invisible to the verifier. To push beyond this ceiling, we propose VeriAct, a verification-guided agentic framework that iteratively synthesizes and repairs specifications through a closed loop of LLM-driven planning, code execution, verification, and Spec-Harness feedback. Our experiments on two benchmark datasets show that VeriAct outperforms both prompt-based and prompt-optimized baselines, producing specifications that are not only verifiable but also correct and complete.
SEApr 12, 2018Code
The Java Build Framework: Large Scale CompilationPedro Martins, Rohan Achar, Cristina V. Lopes
Large repositories of source code for research tend to limit their utility to static analysis of the code, as they give no guarantees on whether the projects are compilable, much less runnable in any way. The immediate consequence of the lack of large compilable and runnable datasets is that research that requires such properties does not generalize beyond small benchmarks. We present the Java Build Framework, a method and tool capable of automatically compiling a large percentage of Java projects available in open source repositories like GitHub. Two elements are at the core: a very large repository of JAR files, and techniques of resolution of compilation faults and dependencies.
SEAug 31, 2016Code
Collective Intelligence for Smarter API Recommendations in PythonAndrea Renika D'Souza, Di Yang, Cristina V. Lopes
Software developers use Application Programming Interfaces (APIs) of libraries and frameworks extensively while writing programs. In this context, the recommendations provided in code completion pop-ups help developers choose the desired methods. The candidate lists recommended by these tools, however, tend to be large, ordered alphabetically and sometimes even incomplete. A fair amount of work has been done recently to improve the relevance of these code completion results, especially for statically typed languages like Java. However, these proposed techniques rely on the static type of the object and are therefore inapplicable for a dynamically typed language like Python. In this paper, we present PyReco, an intelligent code completion system for Python which uses the mined API usages from open source repositories to order the results based on relevance rather than the conventional alphabetic order. To recommend suggestions that are relevant for a working context, a nearest neighbor classifier is used to identify the best matching usage among all the extracted usage patterns. To evaluate the effectiveness of our system, the code completion queries are automatically extracted from projects and tested quantitatively using a ten-fold cross validation technique. The evaluation shows that our approach outperforms the alphabetically ordered API recommendation systems in recommending APIs for standard, as well as, third-party libraries.
CLApr 17, 2025
Memorization: A Close Look at BooksIris Ma, Ian Domingo, Alberto Krone-Martins et al.
To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the "prefix-prompting" extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice's Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
DCSep 7, 2019
GoTcha: An Interactive Debugger for GoT-Based Distributed SystemsRohan Achar, Pritha Dawn, Cristina V. Lopes
Debugging distributed systems is hard. Most of the techniques that have been developed for debugging such systems use either extensive model checking, or postmortem analysis of logs and traces. Interactive debugging is typically a tool that is only effective in single threaded and single process applications, and is rarely applied to distributed systems. While the live observation of state changes using interactive debuggers is effective, it comes with a host of problems in distributed scenarios. In this paper, we discuss the requirements an interactive debugger for distributed systems should meet, the role the underlying distributed model plays in facilitating the debugger, and the implementation of our interactive debugger: GoTcha. GoTcha is a browser based interactive debugger for distributed systems built on the Global Object Tracker (GoT) programming model. We show how the GoT model facilitates the debugger, and the features that the debugger can offer. We also demonstrate a typical debugging workflow.
AISep 12, 2017
Information Design in Crowdfunding under Thresholding PoliciesWen Shen, Jacob W. Crandall, Ke Yan et al.
Crowdfunding has emerged as a prominent way for entrepreneurs to secure funding without sophisticated intermediation. In crowdfunding, an entrepreneur often has to decide how to disclose the campaign status in order to collect as many contributions as possible. Such decisions are difficult to make primarily due to incomplete information. We propose information design as a tool to help the entrepreneur to improve revenue by influencing backers' beliefs. We introduce a heuristic algorithm to dynamically compute information-disclosure policies for the entrepreneur, followed by an empirical evaluation to demonstrate its competitiveness over the widely-adopted immediate-disclosure policy. Our results demonstrate that the immediate-disclosure policy is not optimal when backers follow thresholding policies despite its ease of implementation. With appropriate heuristics, an entrepreneur can benefit from dynamic information disclosure. Our work sheds light on information design in a dynamic setting where agents make decisions using thresholding policies.
AIMar 7, 2016
An Online Mechanism for Ridesharing in Autonomous Mobility-on-Demand SystemsWen Shen, Cristina V. Lopes, Jacob W. Crandall
With proper management, Autonomous Mobility-on-Demand (AMoD) systems have great potential to satisfy the transport demands of urban populations by providing safe, convenient, and affordable ridesharing services. Meanwhile, such systems can substantially decrease private car ownership and use, and thus significantly reduce traffic congestion, energy consumption, and carbon emissions. To achieve this objective, an AMoD system requires private information about the demand from passengers. However, due to self-interestedness, passengers are unlikely to cooperate with the service providers in this regard. Therefore, an online mechanism is desirable if it incentivizes passengers to truthfully report their actual demand. For the purpose of promoting ridesharing, we hereby introduce a posted-price, integrated online ridesharing mechanism (IORS) that satisfies desirable properties such as ex-post incentive compatibility, individual rationality, and budget-balance. Numerical results indicate the competitiveness of IORS compared with two benchmarks, namely the optimal assignment and an offline, auction-based mechanism.
SEDec 20, 2015
SourcererCC: Scaling Code Clone Detection to Big CodeHitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko et al.
Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.
SEAug 18, 2015
On Designing and Testing Distributed Virtual EnvironmentsArthur Valadares, Eugenia Gabrielova, Cristina V. Lopes
Distributed Real-Time (DRT) systems are among the most complex software systems to design, test, maintain and evolve. The existence of components distributed over a network often conflicts with real-time requirements, leading to design strategies that depend on domain- and even application-specific knowledge. Distributed Virtual Environment (DVE) systems are DRT systems that connect multiple users instantly with each other and with a shared virtual space over a network. DVE systems deviate from traditional DRT systems in the importance of the quality of the end user experience. We present an analysis of important, but challenging, issues in the design, testing and evaluation of DVE systems through the lens of experiments with a concrete DVE, OpenSimulator. We frame our observations within six dimensions of well-known design concerns: correctness, fault tolerance/prevention, scalability, time sensitivity, consistency, and overhead of distribution. Furthermore, we place our experimental work in a broader historical context, showing that these challenges are intrinsic to DVEs and suggesting lines of future research.
SEAug 4, 2015
How Scale Affects Structure in Java ProgramsCristina V. Lopes, Joel Ossher
Many internal software metrics and external quality attributes of Java programs correlate strongly with program size. This knowledge has been used pervasively in quantitative studies of software through practices such as normalization on size metrics. This paper reports size-related super- and sublinear effects that have not been known before. Findings obtained on a very large collection of Java programs -- 30,911 projects hosted at Google Code as of Summer 2011 -- unveils how certain characteristics of programs vary disproportionately with program size, sometimes even non-monotonically. Many of the specific parameters of nonlinear relations are reported. This result gives further insights for the differences of "programming in the small" vs. "programming in the large." The reported findings carry important consequences for OO software metrics, and software research in general: metrics that have been known to correlate with size can now be properly normalized so that all the information that is left in them is size-independent.