SE · Mar 13, 2023 · Code
Challenges and Practices of Deep Learning Model Reengineering: A Case Study on Computer Vision
Wenxin Jiang, Vishnu Banna, Naveen Vivek et al.
Many engineering organizations are reimplementing and extending deep neural networks from the research community. We describe this process as deep learning model reengineering. Deep learning model reengineering - reusing, reproducing, adapting, and enhancing state-of-the-art deep learning approaches - is challenging for reasons including under-documented reference models, changing requirements, and the cost of implementation and testing. In addition, individual engineers may lack expertise in software engineering, yet teams must apply knowledge of software engineering and deep learning to succeed. Prior work has examined DL systems from a "product" view, examining defects from projects regardless of the engineers' purpose. Our study takes a "process" view, focusing on engineers specifically engaged in the reengineering process. Our goal is to understand the characteristics and challenges of deep learning model reengineering. We conducted a case study of this phenomenon, focusing on the context of computer vision. Our results draw from two data sources: defects reported in open-source reengineering projects, and interviews conducted with open-source project contributors and the leaders of a reengineering team. Our results describe how deep learning-based computer vision techniques are reengineered, analyze the distribution of defects in this process, and discuss challenges and practices. Integrating our quantitative and qualitative data, we propose a novel reengineering workflow. Our findings inform several future directions, including: measuring additional unknown aspects of model reengineering; standardizing engineering practices to facilitate reengineering; and developing tools to support model reengineering and model reuse.
SE · Mar 5, 2023
An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry
Wenxin Jiang, Nicholas Synovic, Matt Hyatt et al.
Deep Neural Networks (DNNs) are being adopted as components in software systems. Creating and specializing DNNs from scratch has grown increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, machine learning engineers have begun to reuse large-scale pre-trained models (PTMs) and fine-tune these models for downstream tasks. Prior works have studied reuse practices for traditional software packages to guide software engineers towards better package maintenance and dependency management. We lack a similar foundation of knowledge to guide behaviors in pre-trained model ecosystems. In this work, we present the first empirical investigation of PTM reuse. We interviewed 12 practitioners from the most popular PTM ecosystem, Hugging Face, to learn the practices and challenges of PTM reuse. From this data, we model the decision-making process for PTM reuse. Based on the identified practices, we describe useful attributes for model reuse, including provenance, reproducibility, and portability. Three challenges for PTM reuse are missing attributes, discrepancies between claimed and actual performance, and model risks. We substantiate these identified challenges with systematic measurements in the Hugging Face ecosystem. Our work informs future directions on optimizing deep learning ecosystems by automatically measuring useful attributes and potential attacks, and we envision future research on infrastructure and standardization for model registries.
SE · Oct 5, 2023 · Code
PeaTMOSS: Mining Pre-Trained Models in Open-Source Software
Wenxin Jiang, Jason Jones, Jerin Yasmin et al.
Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
SE · May 29
Measuring Delivery Consistency in Practice: A DORA Extension from a Multi-Platform Release Setting
Luiz Parente, James C. Davis
The DevOps Research and Assessment (DORA) framework is the most widely adopted system for measuring performance across engineering teams. However, every DORA metric is a first-moment statistic or a simple ratio, which limits the potential insight into the engineering process. For example, metrics like Deployment Frequency do not capture the distributional shape of deployment timing, so teams with identical measures can deploy on a metronomic cadence or in undesirably erratic bursts. We have been developing and piloting Delivery Consistency (DC), a bounded second-moment measure of cadence regularity derived from the coefficient of variation of inter-release intervals. In conjunction with other DORA concepts, we integrated DC into the Delivery Health Matrix, an eight-archetype diagnostic that maps joint readings to differentiated interventions. We report an experience evaluation on a four-platform software delivery group using 120 weeks of data extracted from our Jira, GitHub, and Firebase records. DC allowed us to distinguish platforms with identical DORA tier placements but different cadence regularity, and the Matrix summarized the readings into an archetype that pointed at a shared organizational or procedural constraint.
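As an illustration of the idea only: the abstract says DC is bounded and derived from the coefficient of variation (CV) of inter-release intervals, but does not give the formula. The sketch below assumes one possible bounding, 1 / (1 + CV), which maps a perfectly regular cadence to 1.0 and increasingly erratic cadences toward 0. The function name `delivery_consistency`, the bounding, and the sample release days are all assumptions for illustration, not the authors' definition or data.

```python
from statistics import mean, pstdev


def delivery_consistency(release_days):
    """Score cadence regularity in (0, 1]; 1.0 means perfectly even spacing.

    ASSUMPTION: the bounding 1 / (1 + CV) is a stand-in chosen for this
    sketch; the paper's actual DC formula is not stated in the abstract.
    """
    # Inter-release intervals between consecutive release timestamps.
    intervals = [b - a for a, b in zip(release_days, release_days[1:])]
    if len(intervals) < 2:
        raise ValueError("need at least three releases to measure variation")
    avg = mean(intervals)
    if avg <= 0:
        raise ValueError("release days must be strictly increasing on average")
    # Coefficient of variation: standard deviation relative to the mean.
    cv = pstdev(intervals) / avg
    return 1.0 / (1.0 + cv)


# Two hypothetical teams with the same deployment frequency
# (five releases over 28 days) but different cadence shapes.
steady = [0, 7, 14, 21, 28]   # metronomic: one release per week
bursty = [0, 1, 2, 27, 28]    # erratic: bursts separated by a long gap

steady_dc = delivery_consistency(steady)
bursty_dc = delivery_consistency(bursty)
print(f"steady: {steady_dc:.2f}, bursty: {bursty_dc:.2f}")
```

Both hypothetical teams would report the same Deployment Frequency, yet the second-moment score separates them, which is the gap the abstract describes.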
SE · May 5
SysLLMatic: Large Language Models are Software System Optimizers
Huiyun Peng, Arjun Gupte, Ryan Hasler et al.
Automatic software system optimization can improve software speed, reduce operating costs, and save energy. Traditional approaches to optimization rely on manual tuning and compiler heuristics, limiting their ability to generalize across diverse codebases and system contexts. Recent methods using Large Language Models (LLMs) introduce automation on simple programs, but they do not scale effectively to the complexity and size of real-world software systems. We present SysLLMatic, a system that integrates LLMs with performance diagnostics and a curated catalog of 43 optimization patterns to automatically optimize software systems. By leveraging profiling to identify performance hotspots, our approach enables LLMs to optimize real-world software beyond isolated code snippets. We evaluate it on three benchmark suites: HumanEval_CPP (competitive programming in C++), SciMark2 (scientific kernels in Java), and DaCapo (large-scale software systems in Java). Results show that SysLLMatic can improve software system performance, including latency, throughput, energy efficiency, memory usage, and CPU utilization. It consistently outperforms state-of-the-art LLM baselines on microbenchmarks. On large-scale application codes, to which prior LLM approaches have not scaled, it surpasses compiler optimizations, achieving average relative improvements of 1.54x in latency (vs. 1.01x for the compiler) and 1.24x in energy (vs. 1.08x for the compiler). Our findings demonstrate that LLMs, guided by performance knowledge through the optimization pattern catalog and appropriate performance diagnostics, can serve as viable software system optimizers. We further identify limitations of our approach and the challenges involved in handling complex applications. This work provides a foundation for generating optimized code across various languages, benchmarks, and program sizes in a principled manner.
SE · Mar 17 · Code
A Longitudinal Study of Usability in Identity-Based Software Signing
Kelechi G. Kalu, Hieu Tran, Santiago Torres-Arias et al.
Identity-based software signing tools aim to make software artifact provenance verifiable while reducing the operational burden of long-lived key management. However, there is limited cross-tool longitudinal evidence about which usability problems arise in practice and how those problems evolve as tools mature. This gap matters because unusable signing and verification workflows can lead to incomplete adoption, misconfiguration, or skipped verification, undermining intended integrity guarantees. We conducted the first mining-software-repositories study of five open-source identity-based signing ecosystems: Sigstore, OpenPubKey, HashiCorp Vault, Keyfactor, and Notary v2. We analyzed approximately 3,900 GitHub issues from Nov. 2021 to Nov. 2025. We coded each issue for the reported usability concern and the implicated architectural component, and compared patterns across tools and over time. Across ecosystems, reported concerns concentrate in verification workflows, policy and configuration surfaces, and integration boundaries. Longitudinal Poisson trend analysis shows substantial declines in reported issues for most ecosystems. However, across usability themes, workflow- and documentation-related concerns decline unevenly across tools and concern types, and verification workflows and configuration surfaces remain persistent friction points. These results indicate that identity-based signing reduces some usability burdens while relocating complexity to verification semantics, policy configuration, and deployment integration. Designing future signing ecosystems therefore requires treating verification semantics and release workflows as first-class usability targets rather than peripheral integration concerns.
SE · Oct 2, 2023
"I see models being a whole other thing": An Empirical Study of Pre-Trained Model Naming Conventions and A Tool for Enhancing Naming Consistency
Wenxin Jiang, Mingyu Kim, Chingwo Cheung et al.
As innovation in deep learning continues, many engineers are incorporating Pre-Trained Models (PTMs) as components in computer systems. Some PTMs are foundation models, and others are fine-tuned variations adapted to different needs. When these PTMs are named well, it facilitates model discovery and reuse. However, prior research has shown that model names are not always well chosen and can sometimes be inaccurate and misleading. The naming practices for PTM packages have not been systematically studied, which hampers engineers' ability to efficiently search for and reliably reuse these models. In this paper, we conduct the first empirical investigation of PTM naming practices in the Hugging Face PTM registry. We begin by reporting on a survey of 108 Hugging Face users, highlighting differences from traditional software package naming and presenting findings on PTM naming practices. The survey results indicate a mismatch between engineers' preferences and current practices in PTM naming. We then introduce DARA, the first automated DNN ARchitecture Assessment technique designed to detect PTM naming inconsistencies. Our results demonstrate that architectural information alone is sufficient to detect these inconsistencies, achieving an accuracy of 94% in identifying model types and promising performance (over 70%) in other architectural metadata as well. We also highlight potential use cases for automated naming tools, such as model validation, PTM metadata generation and verification, and plagiarism detection. Our study provides a foundation for automating naming inconsistency detection. Finally, we envision future work focusing on automated tools for standardizing package naming, improving model selection and reuse, and strengthening the security of the PTM supply chain.
SE · Mar 5, 2023
Discrepancies among Pre-trained Deep Neural Networks: A New Threat to Model Zoo Reliability
Diego Montes, Pongpatapee Peerapatanapokin, Jeff Schultz et al.
Training deep neural networks (DNNs) takes significant time and resources. A practice for expedited deployment is to use pre-trained deep neural networks (PTNNs), often from model zoos -- collections of PTNNs; yet, the reliability of model zoos remains unexamined. In the absence of an industry standard for the implementation and performance of PTNNs, engineers cannot confidently incorporate them into production systems. As a first step, discovering potential discrepancies between PTNNs across model zoos would reveal a threat to model zoo reliability. Prior works indicated existing variances in deep learning systems in terms of accuracy. However, broader measures of reliability for PTNNs from model zoos are unexplored. This work measures notable discrepancies between accuracy, latency, and architecture of 36 PTNNs across four model zoos. Among the top 10 discrepancies, we find differences of 1.23%-2.62% in accuracy and 9%-131% in latency. We also find mismatches in architecture for well-known DNN architectures (e.g., ResNet and AlexNet). Our findings call for future works on empirical validation, automated tools for measurement, and best practices for implementation.
CR · Aug 9, 2023
An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures
Tanmay Singla, Dharun Anandayuvaraj, Kelechi G. Kalu et al.
As we increasingly depend on software systems, the consequences of breaches in the software supply chain become more severe. High-profile cyber attacks like those on SolarWinds and ShadowHammer have resulted in significant financial and data losses, underlining the need for stronger cybersecurity. One way to prevent future breaches is by studying past failures. However, traditional methods of analyzing these failures require manually reading and summarizing reports about them. Automated support could reduce costs and allow analysis of more failures. Natural Language Processing (NLP) techniques such as Large Language Models (LLMs) could be leveraged to assist the analysis of failures. In this study, we assessed the ability of LLMs to analyze historical software supply chain breaches. We used LLMs to replicate the manual analysis of 69 software supply chain security failures performed by members of the Cloud Native Computing Foundation (CNCF). We developed prompts for LLMs to categorize these by four dimensions: type of compromise, intent, nature, and impact. GPT 3.5's categorizations had an average accuracy of 68% and Bard had an accuracy of 58% over these dimensions. We report that LLMs effectively characterize software supply chain failures when the source articles are detailed enough for consensus among manual analysts, but cannot yet replace human analysts. Future work can improve LLM performance in this context, and study a broader range of articles and failures.
SE · Mar 16
Beyond Local Code Optimization: Multi-Agent Reasoning for Software System Optimization
Huiyun Peng, Parth Vinod Patil, Antonio Zhong Qiu et al.
Large language models and AI agents have recently shown promise in automating software performance optimization, but existing approaches predominantly rely on local, syntax-driven code transformations. This limits their ability to reason about program behavior and capture whole system performance interactions. As modern software increasingly comprises interacting components - such as microservices, databases, and shared infrastructure - effective code optimization requires reasoning about program structure and system architecture beyond individual functions or files. This paper explores the feasibility of whole system optimization for microservices. We introduce a multi-agent framework that integrates control-flow and data-flow representations with architectural and cross-component dependency signals to support system-level performance reasoning. The proposed system is decomposed into coordinated agent roles - summarization, analysis, optimization, and verification - that collaboratively identify cross-cutting bottlenecks and construct multi-step optimization strategies spanning the software stack. We present a proof-of-concept on a microservice-based system that illustrates the effectiveness of our proposed framework, achieving a 36.58% improvement in throughput and a 27.81% reduction in average response time.
SE · Mar 13
An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process
Nicholas M. Synovic, Karolina Ryzka, Alessandra V. Vellucci Solari et al.
Deep learning has achieved recognition for its impact within the natural sciences; however, scientists are inhibited by the prohibitive technical cost and computational complexity of training project-specific models from scratch. Following software engineering community guidance, natural scientists are reusing pre-trained deep learning models (PTMs) to amortize these costs. While prior works recommend PTM reuse patterns, to our knowledge, little work has been done to empirically evaluate their usage and impact within the natural sciences. We present the first empirical study of PTM reuse patterns in the natural sciences, quantifying the utilization and impact of conceptual, adaptation, and deployment reuse within the scientific process. Leveraging an automated large language model driven pipeline, we analyze 17,511 peer reviewed, open access papers to identify PTM reuse by scientific field, associated reuse patterns, and the impact of PTM integration into the scientific process from January 1st, 2000 to December 10th, 2025. Our results show that "Biochemistry, Genetics and Molecular Biology" has outpaced other natural scientific fields in PTM reuse, "adaptation" reuse is the most prevalent PTM reuse pattern identified across all natural science fields, and the "Test" stage of the scientific process has been most impacted by PTM integration. This aligns with the growing interest in leveraging computational methods to conduct high throughput, data driven scientific research. Our work characterizes and identifies current PTM reuse practices within the natural sciences, evaluates their impact on the scientific process, and establishes a foundation for future work into the implementation and broader scientific implications of PTM reuse.
SE · Jan 28
Operationalizing Research Software for Supply Chain Security
Kelechi G. Kalu, Soham Rattan, Taylor R. Schorlemmer et al.
Empirical studies of research software are hard to compare because the literature operationalizes "research software" inconsistently. Motivated by the research software supply chain (RSSC) and its security risks, we introduce an RSSC-oriented taxonomy that makes scope and operational boundaries explicit for empirical research software security studies. We conduct a targeted scoping review of recent repository mining and dataset construction studies, extracting each work's definition, inclusion criteria, unit of analysis, and identification heuristics. We synthesize these into a harmonized taxonomy and a mapping that translates prior approaches into shared taxonomy dimensions. We operationalize the taxonomy on a large community-curated corpus from the Research Software Encyclopedia (RSE), producing an annotated dataset, a labeling codebook, and a reproducible labeling pipeline. Finally, we apply OpenSSF Scorecard as a preliminary security analysis to show how repository-centric security signals differ across taxonomy-defined clusters and why taxonomy-aware stratification is necessary for interpreting RSSC security measurements.
SE · Apr 14
Why Johnny Adopts Identity-Based Software Signing: A Usability Case Study of Sigstore
Kelechi G. Kalu, Sofia Okorafor, Tanmay Singla et al.
Software signing is the most robust method for ensuring the integrity and authenticity of components in a software supply chain. Legacy key-managed signing tools (e.g., OpenPGP) burdened practitioners with key management and signer identification, creating both usability challenges and security risks. A new class of identity-based signing tools automates many of these concerns, but little is known about their usability and its effect on their adoption and effectiveness in practice. A usability evaluation can clarify the extent to which identity-based designs succeed and highlight priorities for improvement. To fill this gap, we conducted the first usability study of Sigstore, a pioneering and widely adopted exemplar of identity-based signing. Through interviews with 17 industry experts, we examined (1) the problems and advantages associated with practitioners' tooling choices, (2) how and why their signing-tool usage has evolved over time, and (3) the contexts that cause usability concerns. Our findings illuminate the usability factors of identity-based signing tools and yield recommendations for toolmakers, adopting organizations, and the research community. Notably, components of identity-based tooling exhibit different levels of maturity and readiness for adoption, and integration flexibility is a common pain point but potentially mitigable through plugins and APIs. Our results will help identity-based signing toolmakers further strengthen software supply chain security.
SE · Mar 30, 2023
Analysis of Failures and Risks in Deep Learning Model Converters: A Case Study in the ONNX Ecosystem
Purvish Jajal, Wenxin Jiang, Arav Tewari et al.
Software engineers develop, fine-tune, and deploy deep learning (DL) models using a variety of development frameworks and runtime environments. DL model converters move models between frameworks and to runtime environments. Conversion errors compromise model quality and disrupt deployment. However, the failure characteristics of DL model converters are unknown, adding risk when using DL interoperability technologies. This paper analyzes failures in DL model converters. We survey software engineers about DL interoperability tools, use cases, and pain points (N=92). Then, we characterize failures in model converters associated with the main interoperability tool, ONNX (N=200 issues in PyTorch and TensorFlow). Finally, we formulate and test two hypotheses about structural causes for the failures we studied. We find that the node conversion stage of a model converter accounts for ~75% of the defects and that 33% of reported failures are related to semantically incorrect models. The cause of semantically incorrect models is elusive, but models with behaviour inconsistencies share operator sequences. Our results motivate future research on making DL interoperability software simpler to maintain, extend, and validate. Research into behavioural tolerances and architectural coverage metrics could be fruitful.
SE · Feb 1, 2024 · Code
PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software
Wenxin Jiang, Jerin Yasmin, Jason Jones et al.
The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse. This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To enhance the dataset's comprehensiveness, we developed prompts for a large language model to automatically extract model metadata, including the model's training datasets, parameters, and evaluation metrics. Our analysis of this dataset provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation. Our example application reveals inconsistencies in software licenses across PTMs and their dependent projects. PeaTMOSS lays the foundation for future research, offering rich opportunities to investigate the PTM supply chain. We outline mining opportunities on PTMs, their downstream usage, and cross-cutting questions.
LG · Jul 1, 2024
Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
Nick John Eliopoulos, Purvish Jajal, James C. Davis et al.
This paper investigates how to efficiently deploy vision transformers on edge devices for small workloads. Recent methods reduce the latency of transformer neural networks by removing or merging tokens, with small accuracy degradation. However, these methods are not designed with edge device deployment in mind: they do not leverage information about the latency-workload trends to improve efficiency. We address this shortcoming in our work. First, we identify factors that affect ViT latency-workload relationships. Second, we determine a token pruning schedule by leveraging non-linear latency-workload relationships. Third, we demonstrate a training-free token pruning method utilizing this schedule. We show other methods may increase latency by 2-30%, while we reduce latency by 9-26%. For similar latency (within 5.2% or 7ms) across devices we achieve 78.6%-84.5% ImageNet1K accuracy, while the state-of-the-art, Token Merging, achieves 45.8%-85.4%.
CV · Sep 11, 2024
Token Turing Machines are Efficient Vision Models
Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou et al.
We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens; process tokens pass through encoder blocks and read-write from memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has median latency of 529.5ms and 81.0% accuracy, while our ViTTM-B is 56% faster (234.1ms), with 2.4 times fewer FLOPs, with an accuracy of 82.9%. On ADE20K semantic segmentation, ViT-B achieves 45.65 mIoU at 13.8 frames-per-second (FPS) whereas our ViTTM-B model achieves a 45.17 mIoU with 26.8 FPS (+94%).
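The two relative figures quoted above can be recomputed from the absolute numbers in the abstract. The short check below assumes that "56% faster" means the relative reduction in median latency and that "+94%" means the relative gain in frames per second; both readings are interpretations, not statements from the paper.

```python
# Figures quoted in the abstract.
vit_latency_ms, vittm_latency_ms = 529.5, 234.1   # ImageNet-1K median latency
vit_fps, vittm_fps = 13.8, 26.8                   # ADE20K segmentation throughput

# ASSUMPTION: "faster" is read as the fraction of baseline latency removed.
latency_reduction = (vit_latency_ms - vittm_latency_ms) / vit_latency_ms

# ASSUMPTION: "+94%" is read as the relative increase in FPS over the baseline.
fps_gain = vittm_fps / vit_fps - 1

print(f"latency reduction: {latency_reduction:.1%}")  # 55.8%, i.e. about 56%
print(f"FPS gain: {fps_gain:.1%}")                    # 94.2%, i.e. about +94%
```

Under these readings both percentages are consistent with the reported absolute values.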
SD · Jan 3, 2025 · Code
Detecting Music Performance Errors with Transformers
Benjamin Shiue-Hal Chou, Purvish Jajal, Nicholas John Eliopoulos et al.
Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. There are two limitations in existing tools for music error detection: (1) existing approaches rely on automatic alignment and are therefore prone to errors caused by small deviations between alignment targets; (2) there is a lack of sufficient data to train music error detection models, resulting in over-reliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, compared with existing transcription methods repurposed for music error detection, our model can handle multiple instruments. Our source code and datasets are available at https://github.com/ben2002chou/Polytune.
CV · Apr 29, 2024 · Code
A Partial Replication of MaskFormer in TensorFlow on TPUs for the TensorFlow Model Garden
Vishal Purohit, Wenxin Jiang, Akshath R. Ravikiran et al.
This paper undertakes the task of replicating the MaskFormer model, a universal image segmentation model originally developed using the PyTorch framework, within the TensorFlow ecosystem, specifically optimized for execution on Tensor Processing Units (TPUs). Our implementation exploits the modular constructs available within the TensorFlow Model Garden (TFMG), encompassing elements such as the data loader, training orchestrator, and various architectural components, tailored and adapted to meet the specifications of the MaskFormer model. We address key challenges encountered during the replication, including non-convergence issues, slow training, adaptation of loss functions, and the integration of TPU-specific functionalities. We verify our reproduced implementation and present qualitative results on the COCO dataset. Although our implementation meets some of the objectives for end-to-end reproducibility, we encountered challenges in replicating the PyTorch version of MaskFormer in TensorFlow. This replication process is not straightforward and requires substantial engineering efforts. Specifically, it necessitates the customization of various components within the TFMG, alongside thorough verification and hyper-parameter tuning. The replication is available at: https://github.com/PurdueDualityLab/tf-maskformer/tree/main/official/projects/maskformer
SE · May 11
AutoSOUP: Safety-Oriented Unit Proof Generation for Component-level Memory-Safety Verification
Paschal C. Amusuo, Ricardo Calvo, Dharun Anandayuvaraj et al.
Memory-safety errors remain a persistent source of zero-day vulnerabilities in low-level software. The problem is especially acute in embedded systems, where hardware protections are often limited and dynamic analysis is difficult to apply effectively. Memory-safety verification can provide stronger assurance by proving the absence of such errors or exposing violations when they exist. However, current verification workflows remain largely manual and require substantial specialized expertise, limiting their adoption in practice. We present AutoSOUP, a system for automating component-level memory-safety verification through Safety-Oriented Unit Proofs. We formalize these unit proofs as artifacts that encode verification choices (scope, loop bounds, and environment models) for verifying safety properties, and introduce three techniques for deriving them automatically. To overcome the limitations of existing automation approaches, we further introduce LLM-As-Function-Call, a hybrid architecture that combines deterministic program synthesis with LLMs to automate these techniques and produce justifiable unit proofs. We evaluate AutoSOUP by assessing its ability to automate memory-safety verification and expose vulnerabilities in verified components, and we characterize the assumptions and guarantees of the resulting proofs.
SE · Dec 25, 2025
How Do Agents Perform Code Optimization? An Empirical Study
Huiyun Peng, Antonio Zhong, Ricardo Andrés Calvo Méndez et al.
Performance optimization is a critical yet challenging aspect of software development, often requiring a deep understanding of system behavior, algorithmic tradeoffs, and careful code modifications. Although recent advances in AI coding agents have accelerated code generation and bug fixing, little is known about how these agents perform on real-world performance optimization tasks. We present the first empirical study comparing agent- and human-authored performance optimization commits, analyzing 324 agent-generated and 83 human-authored PRs from the AIDev dataset across adoption, maintainability, optimization patterns, and validation practices. We find that AI-authored performance PRs are less likely to include explicit performance validation than human-authored PRs (45.7% vs. 63.6%, p=0.007). In addition, AI-authored PRs largely use the same optimization patterns as humans. We further discuss limitations and opportunities for advancing agentic code optimization.
SEOct 3, 2025Code
AgentHub: A Research Agenda for Agent Sharing InfrastructureErik Pautsch, Tanmay Singla, Wenxin Jiang et al.
LLM-based agents are rapidly proliferating, yet the infrastructure for discovering, evaluating, and governing them remains fragmented compared to mature ecosystems like software package registries (e.g., npm) and model hubs (e.g., Hugging Face). Recent research and engineering works have begun to consider the requisite infrastructure, but so far they focus narrowly -- on distribution, naming, or protocol negotiation. However, considering broader software engineering requirements would improve open-source distribution and ease reuse. We therefore propose AgentHub, a research agenda for agent sharing. By framing the key challenges of capability clarity, lifecycle transparency, interoperability, governance, security, and workflow integration, AgentHub charts a community-wide agenda for building reliable and scalable agent ecosystems. Our vision is a future where agents can be shared, trusted, and composed as seamlessly as today's software libraries.
SESep 7, 2025Code
Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source ProjectsJerin Yasmin, Wenxin Jiang, James C. Davis et al.
Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks, thereby reducing the need for costly training from scratch. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0, extending beyond conventional libraries to learned behaviors embodied in trained models and their associated artifacts. The integration of PTMs as software dependencies in real projects remains unclear, potentially threatening maintainability and reliability of modern software systems that increasingly rely on them. Objective: In this study, we investigate Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models. Specifically, we seek to understand: (1) how OSS projects structure and document their PTM dependencies; (2) what stages and organizational patterns emerge in the reuse pipelines of PTMs within these projects; and (3) the interactions among PTMs and other learned components across pipeline stages. We conduct a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories from the PeaTMOSS dataset (28,575 repositories reusing PTMs from Hugging Face and PyTorch Hub). We quantitatively examine PTM reuse by identifying patterns and qualitatively investigate how developers integrate and manage these models in practice.
LGMay 6, 2025
Improving the Reproducibility of Deep Learning Software: An Initial Investigation through a Case Study AnalysisNikita Ravi, Abhinav Goel, James C. Davis et al.
The field of deep learning has witnessed significant breakthroughs, spanning various applications, and fundamentally transforming current software capabilities. However, alongside these advancements, there have been increasing concerns about reproducing the results of these deep learning methods. This is significant because reproducibility is the foundation of reliability and validity in software development, particularly in the rapidly evolving domain of deep learning. The difficulty of reproducibility may arise for several reasons, including differences from the original execution environment, incompatible software libraries, proprietary data and source code, lack of transparency, and the stochastic nature of some software. A study conducted by the Nature journal reveals that more than 70% of researchers failed to reproduce other researchers' experiments and over 50% failed to reproduce their own experiments. Irreproducibility of deep learning poses significant challenges for researchers and practitioners. To address these concerns, this paper presents a systematic approach to analyzing and improving the reproducibility of deep learning models, demonstrating a set of guidelines through a case study. We illustrate the patterns and anti-patterns involved with these guidelines for improving the reproducibility of deep learning models. These guidelines encompass establishing a methodology to replicate the original software environment, implementing end-to-end training and testing algorithms, disclosing architectural designs, and enhancing transparency in data processing and training pipelines. We also conduct a sensitivity analysis to understand the model performance across diverse conditions. By implementing these strategies, we aim to bridge the gap between research and practice, so that innovations in deep learning can be effectively reproduced and deployed within software.
CRAug 21, 2025
PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)Andreas D. Kellas, Neophytos Christou, Wenxin Jiang et al.
Machine learning model repositories such as the Hugging Face Model Hub facilitate model exchanges. However, bad actors can deliver malware through compromised models. Existing defenses such as safer model formats, restrictive (but inflexible) loading policies, and model scanners have shortcomings: 44.9% of popular models on Hugging Face still use the insecure pickle format, 15% of these cannot be loaded by restrictive loading policies, and model scanners have both false positives and false negatives. Pickle remains the de facto standard for model exchange, and the ML community lacks a tool that offers transparent safe loading. We present PickleBall to help machine learning engineers load pickle-based models safely. PickleBall statically analyzes the source code of a given machine learning library and computes a custom policy that specifies a safe load-time behavior for benign models. PickleBall then dynamically enforces the policy during load time as a drop-in replacement for the pickle module. PickleBall generates policies that correctly load 79.8% of benign pickle-based models in our dataset, while rejecting all (100%) malicious examples in our dataset. In comparison, evaluated model scanners fail to identify known malicious models, and the state-of-the-art loader loads 22% fewer benign models than PickleBall. PickleBall removes the threat of arbitrary function invocation from malicious pickle-based models, raising the bar for attackers to depend on code reuse techniques.
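The load-time enforcement idea in the abstract above can be illustrated with Python's standard `pickle.Unpickler.find_class` hook. This is only a minimal sketch under stated assumptions, not PickleBall's implementation: the allowlist `ALLOWED_GLOBALS` is written by hand here (PickleBall derives its policy by static analysis), and the names `PolicyUnpickler` and `safe_loads` are hypothetical.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical hand-written policy: the only (module, name) globals that a
# benign model file is allowed to reference while being loaded.
ALLOWED_GLOBALS = {
    ("collections", "OrderedDict"),
}


class PolicyUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the policy."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Policy-enforcing stand-in for pickle.loads."""
    return PolicyUnpickler(io.BytesIO(data)).load()


# A benign "model": only references collections.OrderedDict.
benign = pickle.dumps(OrderedDict(weights=[0.1, 0.2]))
print(safe_loads(benign))


class Malicious:
    """Payload that would run a shell command under plain pickle.loads."""

    def __reduce__(self):
        import os
        return (os.system, ("echo pwned",))


try:
    safe_loads(pickle.dumps(Malicious()))
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

The benign payload loads because its only global is on the allowlist, while the malicious payload is rejected before `os.system` is ever resolved. A fixed allowlist like this one is the kind of restrictive but inflexible policy the abstract contrasts with per-library generated policies.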
CLJun 27, 2025
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern DefensesMohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim et al.
The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame, maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.
LGMay 30, 2025
Inference-Time Alignment of Diffusion Models with Evolutionary AlgorithmsPurvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou et al.
Diffusion models are state-of-the-art generative models in various domains, yet their samples often fail to satisfy downstream objectives such as safety constraints or domain-specific validity. Existing techniques for alignment require gradients, internal model access, or large computational budgets. We introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black boxes and search their latent space to maximize alignment objectives. Our method enables efficient inference-time alignment for both differentiable and non-differentiable alignment objectives across a range of diffusion models. On the DrawBench and Open Image Preferences benchmarks, our EA methods outperform state-of-the-art gradient-based and gradient-free inference-time methods. In terms of memory consumption, we require 55% to 76% lower GPU memory than gradient-based methods. In terms of running time, we are 72% to 80% faster than gradient-based methods. We achieve higher alignment scores over 50 optimization steps on Open Image Preferences than gradient-based and gradient-free methods.
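The black-box latent search described above can be sketched as a simple elitist evolutionary loop. This is an illustrative toy under loud assumptions, not the paper's method: `generate` stands in for a diffusion sampler, `reward` stands in for an alignment objective, and the population size, elite count, mutation scale, and step count are arbitrary placeholder values.

```python
import random


def generate(latent):
    """Toy stand-in for a black-box sampler: maps a latent to a 'sample'."""
    return [2.0 * z for z in latent]


def reward(sample):
    """Toy stand-in alignment objective; it is never differentiated."""
    target = [1.0, -1.0, 0.5, 0.0]
    return -sum(abs(s - t) for s, t in zip(sample, target))


def evolve(dim=4, population=16, elites=4, steps=50, sigma=0.3, seed=0):
    """Search the latent space using only reward evaluations (no gradients)."""
    rng = random.Random(seed)
    pop = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(population)]
    history = []  # best reward seen in the population at each step
    for _ in range(steps):
        # Rank candidates by the reward of the sample they generate.
        ranked = sorted(pop, key=lambda z: reward(generate(z)), reverse=True)
        parents = ranked[:elites]
        history.append(reward(generate(parents[0])))
        # Keep the elites unchanged and refill with mutated copies of them.
        children = [
            [z + rng.gauss(0, sigma) for z in rng.choice(parents)]
            for _ in range(population - elites)
        ]
        pop = parents + children
    best = max(pop, key=lambda z: reward(generate(z)))
    return best, history


best_latent, history = evolve()
print("first-step best reward:", history[0])
print("final best reward:", reward(generate(best_latent)))
```

Because the elites are carried over unchanged, the best reward in the population never decreases from one step to the next. The loop only calls `generate` and `reward`, which is what lets the same search apply to non-differentiable objectives.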
CVNov 22, 2025
AdaPerceiver: Transformers with Adaptive Width, Depth, and TokensPurvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou et al.
Modern transformer architectures achieve remarkable performance across tasks and domains but remain rigid in how they allocate computation at inference time. Real-world deployment often requires models to adapt to diverse hardware and latency constraints, yet most approaches to dynamic computation focus on a single axis -- such as reducing the number of tokens. We present a novel capability: AdaPerceiver, the first transformer architecture with unified adaptivity across depth, width, and tokens within a single model. We propose an architecture that supports adaptivity along these axes. We couple this with an efficient joint training regime that ensures the model maintains performance across its various configurations. We evaluate AdaPerceiver on image classification, semantic segmentation, and depth estimation tasks. On image classification, AdaPerceiver expands the accuracy-throughput Pareto front. It achieves 85.4% accuracy while yielding 36% higher throughput than FlexiViT-L. On dense prediction, AdaPerceiver matches ViT-H/14 while having ~26x fewer encoder FLOPs (floating-point operations) on semantic segmentation and depth estimation. Finally, we show how AdaPerceiver equipped with a policy can maintain ImageNet1K accuracy (±0.1 percentage points) while reducing FLOPs by 24-33%.
SDSep 16, 2025
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error DetectionBenjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos et al.
Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, LadderSym introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, LadderSym more than doubles F1 for missed notes on MAESTRO-E (26.8% → 56.3%) and improves extra note detection by 14.4 points (72.0% → 86.4%). Similar gains are observed on CocoChorales-E. This work introduces general insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation.
LGDec 25, 2024
Recommending Pre-Trained Models for IoT DevicesParth V. Patil, Wenxin Jiang, Huiyun Peng et al.
The availability of pre-trained models (PTMs) has enabled faster deployment of machine learning across applications by reducing the need for extensive training. Techniques like quantization and distillation have further expanded PTM applicability to resource-constrained IoT hardware. Given the many PTM options for any given task, engineers often find it too costly to evaluate each model's suitability. Approaches such as LogME, LEEP, and ModelSpider help streamline model selection by estimating task relevance without exhaustive tuning. However, these methods largely leave hardware constraints as future work, a significant limitation in IoT settings. In this paper, we identify the limitations of current model recommendation approaches regarding hardware constraints and introduce a novel, hardware-aware method for PTM selection. We also propose a research agenda to guide the development of effective, hardware-conscious model recommendation systems for IoT applications.
SEJun 12, 2024
What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claimsJason Jones, Wenxin Jiang, Nicholas Synovic et al.
Background: Collaborative Software Package Registries (SPRs) are an integral part of the software supply chain. Much engineering work synthesizes SPR packages into applications. Prior research has examined SPRs for traditional software, such as NPM (JavaScript) and PyPI (Python). Pre-Trained Model (PTM) Registries are an emerging class of SPR of increasing importance, because they support the deep learning supply chain. Aims: Recent empirical research has examined PTM registries in ways such as vulnerabilities, reuse processes, and evolution. However, no existing research synthesizes them to provide a systematic understanding of the current knowledge. Some of the existing research includes qualitative claims lacking quantitative analysis. Our research fills these gaps by providing a knowledge synthesis and quantitative analyses. Methods: We first conduct a systematic literature review (SLR). We then observe that some of the claims are qualitative. We identify quantifiable metrics associated with those claims, and measure them in order to substantiate these claims. Results: From our SLR, we identify 12 claims about PTM reuse on the HuggingFace platform, 4 of which lack quantitative validation. We successfully test 3 of these claims through a quantitative analysis, and directly compare one with traditional software. Our findings corroborate qualitative claims with quantitative measurements. Our findings are: (1) PTMs have a much higher turnover rate than traditional software, indicating a dynamic and rapidly evolving reuse environment within the PTM ecosystem; and (2) There is a strong correlation between documentation quality and PTM popularity. Conclusions: We confirm qualitative research claims with concrete metrics, supporting prior qualitative and case study research. Our measures show further dynamics of PTM reuse, inspiring research infrastructure and new measures.
CVSep 27, 2021
Efficient Computer Vision on Edge Devices with Pipeline-Parallel Hierarchical Neural NetworksAbhinav Goel, Caleb Tung, Xiao Hu et al.
Computer vision on low-power edge devices enables applications including search-and-rescue and security. State-of-the-art computer vision algorithms, such as Deep Neural Networks (DNNs), are too large for inference on low-power edge devices. To improve efficiency, some existing approaches parallelize DNN inference across multiple edge devices. However, these techniques introduce significant communication and synchronization overheads or are unable to balance workloads across devices. This paper demonstrates that the hierarchical DNN architecture is well suited for parallel processing on multiple edge devices. We design a novel method that creates a parallel inference pipeline for computer vision problems that use hierarchical DNNs. The method balances loads across the collaborating devices and reduces communication costs to facilitate the processing of multiple video frames simultaneously with higher throughput. Our experiments consider a representative computer vision problem where image recognition is performed on each video frame, running on multiple Raspberry Pi 4Bs. With four collaborating low-power edge devices, our approach achieves 3.21X higher throughput, 68% less energy consumption per device per frame, and a 58% decrease in memory when compared with existing single-device hierarchical DNNs.
SEJul 2, 2021
An Experience Report on Machine Learning Reproducibility: Guidance for Practitioners and TensorFlow Model Garden ContributorsVishnu Banna, Akhil Chinnakotla, Zhengxin Yan et al.
Machine learning techniques are becoming a fundamental tool for scientific and engineering progress. These techniques are applied in contexts as diverse as astronomy and spam filtering. However, correctly applying these techniques requires careful engineering. Much attention has been paid to the technical potential; relatively little attention has been paid to the software engineering process required to bring research-based machine learning techniques into practical utility. Technology companies have supported the engineering community through machine learning frameworks such as TensorFlow and PyTorch, but the details of how to engineer complex machine learning models in these frameworks have remained hidden. To promote best practices within the engineering community, academic institutions and Google have partnered to launch a Special Interest Group on Machine Learning Models (SIGMODELS) whose goal is to develop exemplary implementations of prominent machine learning models in community locations such as the TensorFlow Model Garden (TFMG). The purpose of this report is to define a process for reproducing a state-of-the-art machine learning model at a level of quality suitable for inclusion in the TFMG. We define the engineering process and elaborate on each step, from paper analysis to model release. We report on our experiences implementing the YOLO model family with a team of 26 student researchers, share the tools we developed, and describe the lessons we learned along the way.
CVJun 19, 2021
Low-Power Multi-Camera Object Re-Identification using Hierarchical Neural NetworksAbhinav Goel, Caleb Tung, Xiao Hu et al.
Low-power computer vision on embedded devices has many applications. This paper describes a low-power technique for the object re-identification (reID) problem: matching a query image against a gallery of previously seen images. State-of-the-art techniques rely on large, computationally-intensive Deep Neural Networks (DNNs). We propose a novel hierarchical DNN architecture that uses attribute labels in the training dataset to perform efficient object reID. At each node in the hierarchy, a small DNN identifies a different attribute of the query image. The small DNN at each leaf node is specialized to re-identify a subset of the gallery: only the images with the attributes identified along the path from the root to a leaf. Thus, a query image is re-identified accurately after processing with a few small DNNs. We compare our method with state-of-the-art object reID techniques. With a 4% loss in accuracy, our approach realizes significant resource savings: 74% less memory, 72% fewer operations, and 67% lower query latency, yielding 65% less energy consumption.
SEMay 10, 2021
Why Aren't Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular ExpressionsJames C. Davis, Louis G. Michael, Christy A. Coghlan et al.
This paper explores the extent to which regular expressions (regexes) are portable across programming languages. Many languages offer similar regex syntaxes, and it would be natural to assume that regexes can be ported across language boundaries. But can regexes be copy/pasted across language boundaries while retaining their semantic and performance characteristics? In our survey of 158 professional software developers, most indicated that they re-use regexes across language boundaries and about half reported that they believe regexes are a universal language. We experimentally evaluated the riskiness of this practice using a novel regex corpus -- 537,806 regexes from 193,524 projects written in JavaScript, Java, PHP, Python, Ruby, Go, Perl, and Rust. Using our polyglot regex corpus, we explored the hitherto-unstudied regex portability problems: logic errors due to semantic differences, and security vulnerabilities due to performance differences. We report that developers' belief in a regex lingua franca is understandable but unfounded. Though most regexes compile across language boundaries, 15% exhibit semantic differences across languages and 10% exhibit performance differences across languages. We explained these differences using regex documentation, and further illuminated our findings by investigating regex engine implementations. Along the way we found bugs in the regex engines of JavaScript-V8, Python, Ruby, and Rust, and potential semantic and performance regex bugs in thousands of modules.
SESep 22, 2020
An Empirical Study on the Impact of Deep Parameters on Mobile App Energy UsageQiang Xu, James C. Davis, Y. Charlie Hu et al.
Improving software performance through configuration parameter tuning is a common activity during software maintenance. Beyond traditional performance metrics like latency, mobile app developers are interested in reducing app energy usage. Some mobile apps have centralized locations for parameter tuning, similar to databases and operating systems, but it is common for mobile apps to have hundreds of parameters scattered around the source code. The correlation between these "deep" parameters and app energy usage is unclear. Researchers have studied the energy effects of deep parameters in specific modules, but we lack a systematic understanding of the energy impact of mobile deep parameters. In this paper we empirically investigate this topic, combining a developer survey with systematic energy measurements. Our motivational survey of 25 Android developers suggests that developers do not understand, and largely ignore, the energy impact of deep parameters. To assess the potential implications of this practice, we propose a deep parameter energy profiling framework that can analyze the energy impact of deep parameters in an app. Our framework identifies deep parameters, mutates them based on our parameter value selection scheme, and performs reliable energy impact analysis. Applying the framework to 16 popular Android apps, we discovered that deep parameter-induced energy inefficiency is rare. We found only 2 out of 1644 deep parameters for which a different value would significantly improve its app's energy efficiency. A detailed analysis found that most deep parameters have either no energy impact, limited energy impact, or an energy impact only under extreme values. Our study suggests that it is generally safe for developers to ignore the energy impact when choosing deep parameter values in mobile apps.
SESep 11, 2020
A Principled Approach to GraphQL Query Cost AnalysisAlan Cha, Erik Wittern, Guillaume Baudart et al.
The landscape of web APIs is evolving to meet new client requirements and to facilitate how providers fulfill them. A recent web API model is GraphQL, which is both a query language and a runtime. Using GraphQL, client queries express the data they want to retrieve or mutate, and servers respond with exactly those data or changes. GraphQL's expressiveness is risky for service providers because clients can succinctly request stupendous amounts of data, and responding to overly complex queries can be costly or disrupt service availability. Recent empirical work has shown that many service providers are at risk. Using traditional API management methods is not sufficient, and practitioners lack principled means of estimating and measuring the cost of the GraphQL queries they receive. In this work, we present a linear-time GraphQL query analysis that can measure the cost of a query without executing it. Our approach can be applied in a separate API management layer and used with arbitrary GraphQL backends. In contrast to existing static approaches, our analysis supports common GraphQL conventions that affect query cost, and our analysis is provably correct based on our formal specification of GraphQL semantics. We demonstrate the potential of our approach using a novel GraphQL query-response corpus for two commercial GraphQL APIs. Our query analysis consistently obtains upper cost bounds, tight enough relative to the true response sizes to be actionable for service providers. In contrast, existing static GraphQL query analyses exhibit over-estimates and under-estimates because they fail to support GraphQL conventions.
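To make the idea of bounding a query's cost without executing it concrete, here is a deliberately simplified sketch. It is not the paper's analysis and ignores schemas, fragments, and most GraphQL conventions. The `Field` class, the `limit` attribute (modelling a `first: N` style argument on a list field), and the cost measure (number of response nodes) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified model of one field selection in a query."""
    name: str
    limit: Optional[int] = None  # e.g. `first: N` on a list field; None = single value
    children: List["Field"] = field(default_factory=list)


def upper_bound(f: Field) -> int:
    """Upper bound on response nodes for `f`, visiting each field exactly once."""
    multiplicity = f.limit if f.limit is not None else 1
    # One node for the field itself plus whatever its sub-selections can return.
    per_item = 1 + sum(upper_bound(child) for child in f.children)
    return multiplicity * per_item


# Models: { repos(first: 10) { name issues(first: 5) { title } } }
query = Field("repos", limit=10, children=[
    Field("name"),
    Field("issues", limit=5, children=[Field("title")]),
])

print(upper_bound(query))  # → 120
```

Each field is visited once, so the traversal is linear in the size of the query, while the bound itself grows multiplicatively with nested list limits. That gap between a small query and a large possible response is the risk the abstract describes.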
SEJul 30, 2019
An Empirical Study of GraphQL SchemasErik Wittern, Alan Cha, James C. Davis et al.
GraphQL is a query language for APIs and a runtime to execute queries. Using GraphQL queries, clients define precisely what data they wish to retrieve or mutate on a server, leading to fewer round trips and reduced response sizes. Although interest in GraphQL is on the rise, with increasing adoption at major organizations, little is known about what GraphQL interfaces look like in practice. This lack of knowledge makes it hard for providers to understand what practices promote idiomatic, easy-to-use APIs, and what pitfalls to avoid. To address this gap, we study the design of GraphQL interfaces in practice by analyzing their schemas - the descriptions of their exposed data types and the possible operations on the underlying data. We base our study on two novel corpuses of GraphQL schemas, one of 16 commercial GraphQL schemas and the other of 8,399 GraphQL schemas mined from GitHub projects. We make both corpuses available to other researchers. Using these corpuses, we characterize the size of schemas and their use of GraphQL features and assess the use of both prescribed and organic naming conventions. We also report that a majority of APIs are susceptible to denial of service through complex queries, posing real security risks previously discussed only in theory. We also assess ways in which GraphQL APIs attempt to address these concerns.