72.9AIMar 16Code
Agent Lifecycle Toolkit (ALTK): Reusable Middleware Components for Robust AI AgentsZidane Wright, Jason Tsay, Anupama Murthi et al.
As AI agents move from demos into enterprise deployments, their failure modes become consequential: a misinterpreted tool argument can corrupt production data, a silent reasoning error can go undetected until damage is done, and outputs that violate organizational policy can create legal or compliance risk. Yet, most agent frameworks leave builders to handle these failure modes ad hoc, resulting in brittle, one-off safeguards that are hard to reuse or maintain. We present the Agent Lifecycle Toolkit (ALTK), an open-source collection of modular middleware components that systematically address these gaps across the full agent lifecycle. Across the agent lifecycle, we identify opportunities to intervene and improve, namely, post-user-request, pre-LLM prompt conditioning, post-LLM output processing, pre-tool validation, post-tool result checking, and pre-response assembly. ALTK provides modular middleware that detects, repairs, and mitigates common failure modes. It offers consistent interfaces that fit naturally into existing pipelines. It is compatible with low-code and no-code tools such as the ContextForge MCP Gateway and Langflow. Finally, it significantly reduces the effort of building reliable, production-grade agents.
HCJun 30, 2014Code
Transparency and Coordination in Peer ProductionLaura Dabbish, Colleen Stuart, Jason Tsay et al.
This paper examines coordination in transparent work environments - environments where the content of work artifacts, and the actions taken on these artifacts, are fully visible to organizational members. Our qualitative study of a community of open source software developers revealed a coordination system characterized by interest-based, asynchronous interaction and knowledge transfer. At the core of asynchronous knowledge transfer, lies the concept of quasi-codification, which occurs when rich process knowledge is implicitly encoded in work artifacts. Our findings suggest that members are able to more selectively form dependencies, monitor the trajectory of projects, and make their work understandable to others which facilitates coordination. We discuss two important characteristics that enable coordination activities in a transparent environment: the presence of an imagined audience that dictates the way artifacts are crafted, and experience within the environment, that allows individuals to derive knowledge from these artifacts. By showing how transparency influences coordination, this research challenges previous conceptions of coordination for complex, collaborative work.
CLApr 9, 2025
Towards LLMs Robustness to Changes in Prompt Format StylesLilian Ngweta, Kiran Kate, Jason Tsay et al.
Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats, where small changes in the prompt format can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles used in the prompt few-shot examples. MOF was inspired by computer vision techniques that utilize diverse style datasets to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
AISep 2, 2025
When Agents go Astray: Course-Correcting SWE Agents with PRMsShubham Gandhi, Jason Tsay, Jatin Ganhotra et al.
Large Language Model (LLM) agents are increasingly deployed for complex, multi-step software engineering (SWE) tasks. However, their trajectories often contain costly inefficiencies, such as redundant exploration, looping, and failure to terminate once a solution is reached. Prior work has largely treated these errors in a post-hoc manner, diagnosing failures only after execution. In this paper, we introduce SWE-PRM, an inference-time Process Reward Model (PRM) that intervenes during execution to detect and course-correct trajectory-level errors. Our PRM design leverages a taxonomy of common inefficiencies and delivers lightweight, interpretable feedback without modifying the underlying policy. On SWE-bench Verified, closed-source PRMs improve resolution from 40.0% to 50.6% (+10.6 p.p.), with the largest gains on medium and hard tasks. Among feedback strategies, taxonomy-guided PRMs outperform unguided or explicit action-prescriptive variants, increasing success rate while reducing trajectory length. These benefits come at an acceptable added inference cost of as low as $0.2, making PRMs a practical and scalable mechanism for improving SWE agents' reliability and efficiency.
SEOct 17, 2025
Repairing Tool Calls Using Post-tool Execution Reflection and RAGJason Tsay, Zidane Wright, Gaodan Fang et al.
Agentic systems interact with external systems by calling tools such as Python functions, REST API endpoints, or command line tools such as kubectl in Kubernetes. These tool calls often fail for various syntactic and semantic reasons. Some less obvious semantic errors can only be identified and resolved after analyzing the tool's response. To repair these errors, we develop a post-tool execution reflection component that combines large language model (LLM)-based reflection with domain-specific retrieval-augmented generation (RAG) using documents describing both the specific tool being called and troubleshooting documents related to the tool. For this paper, we focus on the use case of the kubectl command line tool to manage Kubernetes, a platform for orchestrating cluster applications. Through a larger empirical study and a smaller manual evaluation, we find that our RAG-based reflection will repair kubectl commands such that they are both more likely to successfully execute (pass rate) for 55% of our models evaluated and 36% more likely to correctly answer the user query on average. We find that troubleshooting documents improve pass rate compared to official documentation by an average of 10%.
SEJul 7, 2025
OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language ModelsKoren Lazar, Matan Vetzler, Kiran Kate et al.
AI agents and business automation tools interacting with external web services require standardized, machine-readable information about their APIs in the form of API specifications. However, the information about APIs available online is often presented as unstructured, free-form HTML documentation, requiring external users to spend significant time manually converting it into a structured format. To address this, we introduce OASBuilder, a novel framework that transforms long and diverse API documentation pages into consistent, machine-readable API specifications. This is achieved through a carefully crafted pipeline that integrates large language models and rule-based algorithms which are guided by domain knowledge of the structure of documentation webpages. Our experiments demonstrate that OASBuilder generalizes well across hundreds of APIs, and produces valid OpenAPI specifications that encapsulate most of the information from the original documentation. OASBuilder has been successfully implemented in an enterprise environment, saving thousands of hours of manual effort and making hundreds of complex enterprise APIs accessible as tools for LLMs.
SEMay 31, 2023
AI for Low-Code for AINikitha Rao, Jason Tsay, Kiran Kate et al.
Low-code programming allows citizen developers to create programs with minimal coding effort, typically via visual (e.g. drag-and-drop) interfaces. In parallel, recent AI-powered tools such as Copilot and ChatGPT generate programs from natural language instructions. We argue that these modalities are complementary: tools like ChatGPT greatly reduce the need to memorize large APIs but still require their users to read (and modify) programs, whereas visual tools abstract away most or all programming but struggle to provide easy access to large APIs. At their intersection, we propose LowCoder, the first low-code tool for developing AI pipelines that supports both a visual programming interface (LowCoder_VP) and an AI-powered natural language interface (LowCoder_NL). We leverage this tool to provide some of the first insights into whether and how these two modalities help programmers by conducting a user study. We task 20 developers with varying levels of AI expertise with implementing four ML pipelines using LowCoder, replacing the LowCoder_NL component with a simple keyword search in half the tasks. Overall, we find that LowCoder is especially useful for (i) Discoverability: using LowCoder_NL, participants discovered new operators in 75% of the tasks, compared to just 32.5% and 27.5% using web search or scrolling through options respectively in the keyword-search condition, and (ii) Iterative Composition: 82.5% of tasks were successfully completed and many initial pipelines were further successfully improved. Qualitative analysis shows that AI helps users discover how to implement constructs when they know what to do, but still fails to support novices when they lack clarity on what they want to accomplish. Overall, our work highlights the benefits of combining the power of AI with low-code programming.
LGFeb 22, 2022
Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE)Jason Tsay, Andrea Bartezzaghi, Aleke Nolte et al.
Artificial Intelligence (AI) development is inherently iterative and experimental. Over the course of normal development, especially with the advent of automated AI, hundreds or thousands of experiments are generated and are often lost or never examined again. There is a lost opportunity to document these experiments and learn from them at scale, but the complexity of tracking and reproducing these experiments is often prohibitive to data scientists. We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts and provides features to reproduce these artifacts and perform meta-learning across them. We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment. The standardized nature of the stored metadata allows for querying and aggregation, especially in terms of ranking artifacts by performance metrics. We exhibit the capabilities of the LDE by reproducing an existing meta-learning study and storing the reproduced metadata in our system. Then, we perform two experiments on this metadata: 1) examining the reproducibility and variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data and examining how variability in experimental results impacts recommendation performance. The experimental results suggest significant variation in performance, especially depending on dataset configurations; this variation carries over when meta-learning is built on top of the results, with performance improving when using aggregated results. This suggests that a system that automatically collects and aggregates results such as the LDE not only assists in implementing meta-learning but may also improve its performance.
CYAug 22, 2018
FactSheets: Increasing Trust in AI Services through Supplier's Declarations of ConformityMatthew Arnold, Rachel K. E. Bellamy, Michael Hind et al.
Accuracy is an important concern for suppliers of artificial intelligence (AI) services, but considerations beyond accuracy, such as safety (which includes fairness and explainability), security, and provenance, are also critical elements to engender consumers' trust in a service. Many industries use transparent, standardized, but often not legally required documents called supplier's declarations of conformity (SDoCs) to describe the lineage of a product along with the safety and performance testing it has undergone. SDoCs may be considered multi-dimensional fact sheets that capture and quantify various aspects of the product and its development to make it worthy of consumers' trust. Inspired by this practice, we propose FactSheets to help increase trust in AI services. We envision such documents to contain purpose, performance, safety, security, and provenance information to be completed by AI service providers for examination by consumers. We suggest a comprehensive set of declaration items tailored to AI and provide examples for two fictitious AI services in the appendix of the paper.