Tao Dong

HC
6papers
67citations
Novelty32%
AI Score39

6 Papers

77.0SEMay 28
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

Ningzhi Tang, Chaoran Chen, Gelei Xu et al.

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50\% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49\% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.

SEDec 29, 2025
From Correctness to Collaboration: Toward a Human-Centered Framework for Evaluating AI Agent Behavior in Software Engineering

Tao Dong, Harini Sampath, Ja Young Lee et al.

As Large Language Models (LLMs) evolve from code generators into collaborative partners for software engineers, our methods for evaluation are lagging. Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human-AI partnership. To bridge this evaluation gap, this paper makes two core contributions. First, we present a foundational taxonomy of desirable agent behaviors for enterprise software engineering, derived from an analysis of 91 sets of user-defined agent rules. This taxonomy defines four key expectations of agent behavior: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solving Problems Effectively, and Collaborating with the User. Second, recognizing that these expectations are not static, we introduce the Context-Adaptive Behavior (CAB) Framework. This emerging framework reveals how behavioral expectations shift along two empirically-derived axes: the Time Horizon (from immediate needs to future ideals), established through interviews with 15 expert engineers, and the Type of Work (from enterprise production to rapid prototyping, for example), identified through a prompt analysis of a prototyping agent. Together, these contributions offer a human-centered foundation for designing and evaluating the next generation of AI agents, moving the field's focus from the correctness of generated code toward the dynamics of true collaborative intelligence.

NIFeb 3, 2022
Multi Objective Resource Optimization of Wireless Network Based on Cross Domain Virtual Network Embedding

Chao Wang, Tao Dong, Youxiang Duan et al.

The rapid development of virtual network architecture makes it possible for wireless network to be widely used. With the popularity of artificial intelligence (AI) industry in daily life, efficient resource allocation of wireless network has become a problem. Especially when network users request wireless network resources from different management domains, they still face many practical problems. From the perspective of virtual network embedding (VNE), this paper designs and implements a multi-objective optimization VNE algorithm for wireless network resource allocation. Resource allocation in virtual network is essentially a problem of allocating underlying resources for virtual network requests (VNRs). According to the proposed objective formula, we consider the optimization mapping cost, network delay and VNR acceptance rate. VNE is completed by node mapping and link mapping. In the experiment and simulation stage, it is compared with other VNE algorithms, the cross domain VNE algorithm proposed in this paper is optimal in the above three indicators. This shows the effectiveness of the algorithm in wireless network resource allocation.

HCOct 28, 2020
Towards Supporting Programming Education at Scale via Live Streaming

Yan Chen, Walter S. Lasecki, Tao Dong

Live streaming, which allows streamers to broadcast their work to live viewers, is an emerging practice for teaching and learning computer programming. Participation in live streaming is growing rapidly, despite several apparent challenges, such as a general lack of training in pedagogy among streamers and scarce signals about a stream's characteristics (e.g., difficulty, style, and usefulness) to help viewers decide what to watch. To understand why people choose to participate in live streaming for teaching or learning programming, and how they cope with both apparent and non-obvious challenges, we interviewed 14 streamers and 12 viewers about their experience with live streaming programming. Among other results, we found that the casual and impromptu nature of live streaming makes it easier to prepare than pre-recorded videos, and viewers have the opportunity to shape the content and learning experience via real-time communication with both the streamer and each other. Nonetheless, we identified several challenges that limit the potential of live streaming as a learning medium. For example, streamers voiced privacy and harassment concerns, and existing streaming platforms do not adequately support viewer-streamer interactions, adaptive learning, and discovery and selection of streaming content. Based on these findings, we suggest specialized tools to facilitate knowledge sharing among people teaching and learning computer programming online, and we offer design recommendations that promote a healthy, safe, and engaging learning environment.

HCJul 11, 2018
A Computational Method for Evaluating UI Patterns

Bardia Doosti, Tao Dong, Biplab Deka et al.

UI design languages, such as Google's Material Design, make applications both easier to develop and easier to learn by providing a set of standard UI components. Nonetheless, it is hard to assess the impact of design languages in the wild. Moreover, designers often get stranded by strong-opinionated debates around the merit of certain UI components, such as the Floating Action Button and the Navigation Drawer. To address these challenges, this short paper introduces a method for measuring the impact of design languages and informing design debates through analyzing a dataset consisting of view hierarchies, screenshots, and app metadata for more than 9,000 mobile apps. Our data analysis shows that use of Material Design is positively correlated to app ratings, and to some extent, also the number of installs. Furthermore, we show that use of UI components vary by app category, suggesting a more nuanced view needed in design debates.

PMJun 5, 2018
A Machine Learning Framework for Stock Selection

XingYu Fu, JinHong Du, YiFeng Guo et al.

This paper demonstrates how to apply machine learning algorithms to distinguish good stocks from the bad stocks. To this end, we construct 244 technical and fundamental features to characterize each stock, and label stocks according to their ranking with respect to the return-to-volatility ratio. Algorithms ranging from traditional statistical learning methods to recently popular deep learning method, e.g. Logistic Regression (LR), Random Forest (RF), Deep Neural Network (DNN), and the Stacking, are trained to solve the classification task. Genetic Algorithm (GA) is also used to implement feature selection. The effectiveness of the stock selection strategy is validated in Chinese stock market in both statistical and practical aspects, showing that: 1) Stacking outperforms other models reaching an AUC score of 0.972; 2) Genetic Algorithm picks a subset of 114 features and the prediction performances of all models remain almost unchanged after the selection procedure, which suggests some features are indeed redundant; 3) LR and DNN are radical models; RF is risk-neutral model; Stacking is somewhere between DNN and RF. 4) The portfolios constructed by our models outperform market average in back tests.