SY, Jun 2
Dynamics of the Thermomagnetic Pendulum
Ryan Thompson, Ethan Wang, Nilay Kant
A thermomagnetic pendulum is introduced as a coupled thermo-magnetic-mechanical system consisting of a ferromagnetic bob under gravity and an offset permanent magnet. Heating drives the bob temperature above and below the Curie point, causing magnetic attraction to vanish and recover as the bob moves and cools. A multiphysics model is developed in which the magnetic torque depends nonlinearly on the bob temperature field and pendulum configuration. The formulation couples transient three-dimensional heat transfer, a temperature-dependent magnetization law, and pendulum dynamics. Simulations show angular torque asymmetry, rapid force reduction near the Curie point, and sustained oscillations.
CR, May 2
FP-Agent: Fingerprinting AI Browsing Agents
Ethan Wang, Zubair Shafiq, Yash Vekaria
AI browsing agents are an emerging class of AI-powered bots capable of autonomously navigating websites. Unlike traditional web bots, AI browsing agents typically operate using real browsers and perform everyday tasks, making them difficult to detect. Yet little is known about whether existing AI browsing agents can be distinguished from humans and one another based on their browser or behavioral fingerprints. In this paper, we present the first controlled measurement study of seven AI browsing agents and human users. Using an instrumented honey website, we collect browser and behavioral fingerprint features while AI browsing agents and humans perform three tasks: flight booking, online shopping, and forum interaction. We then train FP-Agent, a multi-class classifier, to evaluate the discriminative power of these features. We find that browser fingerprints provide limited discriminative power when shared by multiple AI browsing agents. Behavioral fingerprints, however, are distinctive: differences in typing, scrolling, and mouse behavior separate AI browsing agents from humans and one another. In a case study evaluating Cloudflare's bot detection, FP-Agent detects all seven AI browsing agents, whereas Cloudflare detects only one. Our findings show that behavioral fingerprints are a critical component to reliably detect and control this emerging form of web traffic.
SE, May 19
Code Generation by Differential Test Time Scaling
Yifeng He, Ethan Wang, Jicheng Wang et al.
Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.
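The selection step described above (run every candidate on the same inputs, cluster by behavior, return the medoid of the largest cluster) can be sketched as follows. This is a minimal illustration and not the DiffCodeGen implementation: the inputs are supplied by hand rather than synthesized by coverage-guided fuzzing, and the names `behavior_signature`, `cluster`, `select_candidate`, and the `max_dist` threshold are hypothetical choices made for this example.

```python
from typing import Any, Callable, List, Sequence, Tuple

Signature = Tuple[Tuple[str, str], ...]


def behavior_signature(candidate: Callable[[Any], Any], inputs: Sequence[Any]) -> Signature:
    """Run one candidate on every input and record its output or exception type."""
    sig = []
    for x in inputs:
        try:
            sig.append(("ok", repr(candidate(x))))
        except Exception as exc:  # a crash is also observable behavior
            sig.append(("err", type(exc).__name__))
    return tuple(sig)


def distance(a: Signature, b: Signature) -> int:
    """Number of inputs on which two signatures disagree (Hamming distance)."""
    return sum(1 for u, v in zip(a, b) if u != v)


def cluster(sigs: Sequence[Signature], max_dist: int) -> List[List[int]]:
    """Greedy leader clustering (an assumption): join the first cluster whose
    leader is within max_dist, otherwise start a new cluster."""
    clusters: List[List[int]] = []
    for i, s in enumerate(sigs):
        for members in clusters:
            if distance(s, sigs[members[0]]) <= max_dist:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def select_candidate(candidates, inputs, max_dist: int = 0) -> int:
    """Return the index of the medoid of the largest behavioral cluster."""
    sigs = [behavior_signature(c, inputs) for c in candidates]
    largest = max(cluster(sigs, max_dist), key=len)
    # Medoid: the member with the smallest total distance to its cluster mates.
    return min(largest, key=lambda i: sum(distance(sigs[i], sigs[j]) for j in largest))


# Hypothetical candidates for "absolute value"; indices 2 and 4 are buggy,
# index 5 is wrong on a single input.
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,
    lambda x: max(x, -x),
    lambda x: -x,
    lambda x: 0 if x == 5 else abs(x),
]
inputs = [-3, -1, 0, 2, 5]
chosen = select_candidate(candidates, inputs, max_dist=1)
```

No model is called during selection: the only cost is executing the candidates, which matches the claim that selection adds little or no token consumption.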
CR, Apr 30
Tracking Conversations: Measuring Content and Identity Exposure on AI Chatbots
Muhammad Jazlan, Ethan Wang, Yash Vekaria et al.
AI chatbots are becoming a primary interface for seeking information. As their popularity grows, chatbot providers are starting to deploy advertising and analytics. Despite this, tracking on AI chatbots has not been systematically studied. We present a systematic measurement of web tracking on 20 popular AI chatbots. Under controlled settings using a sensitive prompt, we capture and compare network traffic in normal chats and, where supported, private chats. We search for exposure of two categories of information: content, including prompts, prompt-derived titles, chat URLs, and chat identifiers; and identity, including names, emails, account identifiers, first-party cookies, and explicit IP/User-Agent fields in payloads. We find that 17 of 20 chatbots share information with at least one third party. Three chatbots share plaintext conversation text, including both prompt and response snippets, with Microsoft Clarity through session replay. Fifteen chatbots share conversation URLs or chat identifiers with third-party advertising, analytics, or social endpoints. Several chatbots expose user identity through support widgets, analytics, advertising, and session replay tags; in some cases, hashed emails are shared.
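One way to search captured traffic for the content and identity fields listed above is to look for each known value in several encodings, since identifiers such as emails may be shared hashed rather than in plaintext. The sketch below assumes the captured requests are already available as `(host, payload)` string pairs; the function names, the set of encodings, and the example hosts are illustrative assumptions, not the study's actual pipeline.

```python
import base64
import hashlib
from typing import Dict, Iterable, List, Tuple
from urllib.parse import quote


def encodings(value: str) -> Dict[str, str]:
    """Forms in which a known value might appear in a request (assumed set)."""
    raw = value.encode("utf-8")
    forms = {
        "plaintext": value,
        "urlencoded": quote(value, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }
    # Drop encodings that are identical to an earlier one (e.g. nothing to URL-encode).
    seen, unique = set(), {}
    for name, needle in forms.items():
        if needle not in seen:
            seen.add(needle)
            unique[name] = needle
    return unique


def find_exposures(
    requests: Iterable[Tuple[str, str]], secrets: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (host, secret label, encoding) for every case-sensitive match."""
    hits = []
    for host, payload in requests:
        for label, value in secrets.items():
            for enc_name, needle in encodings(value).items():
                if needle in payload:
                    hits.append((host, label, enc_name))
    return hits


# Hypothetical capture: one third-party request carries a SHA-256 hashed email.
secrets = {"email": "alice@example.com", "prompt": "sensitive test prompt"}
hashed = hashlib.sha256(b"alice@example.com").hexdigest()
captured = [
    ("tracker.example", "uid=" + hashed + "&page=chat"),
    ("cdn.example", "v=1"),
]
hits = find_exposures(captured, secrets)
```

Matching is case-sensitive and exact, so a real measurement would likely need more encodings and normalization (for example lowercased emails before hashing).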
SE, Feb 4, 2024
UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing
Yifeng He, Jiabo Huang, Yuyang Rong et al.
The remarkable capability of large language models (LLMs) in generating high-quality code has drawn increasing attention in the software testing community. However, existing code LLMs often generate inaccurate or incomplete tests because they were trained on code snippets collected without differentiating between test code and other code. In this paper, we present UniTSyn, a large-scale dataset capable of enhancing the prowess of LLMs for Unit Test Synthesis. Associating tests with the tested functions is crucial for LLMs to infer the expected behavior and the logic paths to be verified. By leveraging the Language Server Protocol, UniTSyn achieves the challenging goal of collecting focal-test pairs without per-project execution setups or per-language heuristics, which tend to be fragile and difficult to scale. It contains 2.7 million focal-test pairs across five mainstream programming languages, so it can be used to enhance the test generation ability of LLMs. The details of UniTSyn can be found in Table 1. Our experiments demonstrate that an autoregressive model built on UniTSyn learns and understands unit test representations significantly better, resulting in improved generation accuracy and code coverage across all evaluated programming languages. Code and data will be publicly available.
CR, Jun 12, 2024
Security of AI Agents
Yifeng He, Ethan Wang, Yuyang Rong et al.
Large language models have greatly expanded the capabilities of AI agents. AI agents can function as intelligent assistants and complete tasks on behalf of their users with access to tools and the ability to execute commands in their environments. By studying and experiencing the workflow of typical AI agents, we raise several concerns regarding their security. These potential vulnerabilities are addressed neither by the frameworks used to build the agents nor by research aimed at improving the agents. In this paper, we identify and describe these vulnerabilities in detail from a system security perspective, emphasizing their causes and severe effects. Furthermore, we introduce a defense mechanism for each vulnerability, with designs and experiments to evaluate their viability. Altogether, this paper contextualizes the security issues in the current development of AI agents and delineates methods to make AI agents safer and more reliable.
LG, May 18, 2021
Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach
Yan Li, Lingxiao Wang, Jiachen Yang et al.
Multi-agent reinforcement learning (MARL) becomes more challenging in the presence of more agents, as the capacity of the joint state and action spaces grows exponentially in the number of agents. To address this challenge of scale, we identify a class of cooperative MARL problems with permutation invariance and formulate it as a mean-field Markov decision process (MDP). To exploit the permutation invariance therein, we propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture. We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence. Moreover, its sample complexity is independent of the number of agents. We validate the theoretical advantages of MF-PPO with numerical experiments in the multi-agent particle environment (MPE). In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors with a smaller number of model parameters, which is the key to its generalization performance.
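To make the notion of a permutation-invariant architecture concrete, the sketch below applies one shared feature map to every agent's state and averages the results, so the output does not depend on agent ordering and its size does not grow with the number of agents. This is a generic mean-pooling construction assumed for illustration; it is not the MF-PPO actor-critic network, and `embed`, `mean_field_features`, and the random weights are hypothetical.

```python
import math
import random
from typing import List, Sequence


def embed(state: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Shared per-agent feature map: one linear layer followed by tanh."""
    return [math.tanh(math.fsum(w * s for w, s in zip(row, state))) for row in weights]


def mean_field_features(
    states: Sequence[Sequence[float]], weights: Sequence[Sequence[float]]
) -> List[float]:
    """Average the shared embedding over all agents.

    math.fsum returns a correctly rounded sum, so the result is the same
    for any ordering of the agents.
    """
    embedded = [embed(s, weights) for s in states]
    n = len(embedded)
    return [math.fsum(column) / n for column in zip(*embedded)]


rng = random.Random(0)
state_dim, feature_dim, num_agents = 3, 4, 6
# The parameter count depends only on state_dim and feature_dim, not on num_agents.
weights = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(feature_dim)]
states = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(num_agents)]

shuffled = states[:]
rng.shuffle(shuffled)

original_out = mean_field_features(states, weights)
shuffled_out = mean_field_features(shuffled, weights)
```

Because the weights are shared across agents, adding agents changes only the number of terms in the average, which is one way an architecture can keep its parameter count independent of the population size.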