AIFeb 11
GameDevBench: Evaluating Agentic Capabilities Through Game DevelopmentWayne Chi, Yixiong Fang, Arnav Yayavaram et al.
Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
CYJun 12, 2022
"COVID-19 was a FIFA conspiracy #curropt": An Investigation into the Viral Spread of COVID-19 MisinformationAlexander Wang, Jerry Sun, Kaitlyn Chen et al.
The outbreak of the infectious and fatal disease COVID-19 has revealed that pandemics assail public health in two waves: first, from the contagion itself and second, from plagues of suspicion and stigma. Now, we have in our hands and on our phones an outbreak of moral controversy. Modern dependency on social medias has not only facilitated access to the locations of vaccine clinics and testing sites but also-and more frequently-to the convoluted explanations of how "COVID-19 was a FIFA conspiracy"[1]. The MIT Media Lab finds that false news "diffuses significantly farther, faster, deeper, and more broadly than truth, in all categories of information, and by an order of magnitude"[2]. The question is, how does the spread of misinformation interact with a physical epidemic disease? In this paper, we estimate the extent to which misinformation has influenced the course of the COVID-19 pandemic using natural language processing models and provide a strategy to combat social media posts that are likely to cause widespread harm.
LGApr 4
Online learning of smooth functions on $\mathbb{R}$Jesse Geneson, Kuldeep Singh, Alexander Wang
We study adversarial online learning of real-valued functions on $\mathbb{R}$. In each round the learner is queried at $x_t\in\mathbb{R}$, predicts $\hat y_t$, and then observes the true value $f(x_t)$; performance is measured by cumulative $p$-loss $\sum_{t\ge 1}|\hat y_t-f(x_t)|^p$. For the class \[ \mathcal{G}_q=\Bigl\{f:\mathbb{R}\to\mathbb{R}\ \text{absolutely continuous}:\ \int_{\mathbb{R}}|f'(x)|^q\,dx\le 1\Bigr\}, \] we show that the standard model becomes ill-posed on $\mathbb{R}$: for every $p\ge 1$ and $q>1$, an adversary can force infinite loss. Motivated by this obstruction, we analyze three modified learning scenarios that limit the influence of queries that are far from previously observed inputs. In Scenario 1 the adversary must choose each new query within distance $1$ of some past query. In Scenario 2 the adversary may query anywhere, but the learner is penalized only on rounds whose query lies within distance $1$ of a past query. In Scenario 3 the loss in round $t$ is multiplied by a weight $g(\min_{j<t}|x_t-x_j|)$. We obtain sharp characterizations for Scenarios 1-2 in several regimes. For Scenario 3 we identify a clean threshold phenomenon: if $g$ decays too slowly, then the adversary can force infinite weighted loss. In contrast, for rapidly decaying weights such as $g(z)=e^{-cz}$ we obtain finite and sharp guarantees in the quadratic case $p=q=2$. Finally, we study a natural multivariable slice generalization $\mathcal{G}_{q,d}$ of $\mathcal{G}_q$ on $\mathbb{R}^d$ and show a sharp dichotomy: while the one-dimensional case admits finite opt-values in certain regimes, for every $d\ge 2$ the slice class $\mathcal{G}_{q,d}$ is too permissive, and even under Scenarios 1-3 an adversary can force infinite loss.
CVJan 6, 2025Code
SAM-EM: Real-Time Segmentation for Automated Liquid Phase Transmission Electron MicroscopyAlexander Wang, Max Xu, Risha Goel et al.
The absence of robust segmentation frameworks for noisy liquid phase transmission electron microscopy (LPTEM) videos prevents reliable extraction of particle trajectories, creating a major barrier to quantitative analysis and to connecting observed dynamics with materials characterization and design. To address this challenge, we present Segment Anything Model for Electron Microscopy (SAM-EM), a domain-adapted foundation model that unifies segmentation, tracking, and statistical analysis for LPTEM data. Built on Segment Anything Model 2 (SAM~2), SAM-EM is derived through full-model fine-tuning on 46,600 curated LPTEM synthetic video frames, substantially improving mask quality and temporal identity stability compared to zero-shot SAM~2 and existing baselines. Beyond segmentation, SAM-EM integrates particle tracking with statistical tools, including mean-squared displacement and particle displacement distribution analysis, providing an end-to-end framework for extracting and interpreting nanoscale dynamics. Crucially, full fine-tuning allows SAM-EM to remain robust under low signal-to-noise conditions, such as those caused by increased liquid sample thickness in LPTEM experiments. By establishing a reliable analysis pipeline, SAM-EM transforms LPTEM into a quantitative single-particle tracking platform and accelerates its integration into data-driven materials discovery and design. Project page: \href{https://github.com/JamaliLab/SAM-EM}{github.com/JamaliLab/SAM-EM}.
CVAug 27, 2020
SketchEmbedNet: Learning Novel Concepts by Imitating DrawingsAlexander Wang, Mengye Ren, Richard S. Zemel
Sketch drawings capture the salient information of visual concepts. Previous work has shown that neural networks are capable of producing sketches of natural objects drawn from a small number of classes. While earlier approaches focus on generation quality or retrieval, we explore properties of image representations learned by training a model to produce sketches of images. We show that this generative, class-agnostic model produces informative embeddings of images from novel examples, classes, and even novel datasets in a few-shot setting. Additionally, we find that these learned representations exhibit interesting structure and compositionality.
CVJul 22, 2019
Covering up bias in CelebA-like datasets with Markov blankets: A post-hoc cure for attribute prior avoidanceVinay Uday Prabhu, Dian Ang Yap, Alexander Wang et al.
Attribute prior avoidance entails subconscious or willful non-modeling of (meta)attributes that datasets are oft born with, such as the 40 semantic facial attributes associated with the CelebA and CelebA-HQ datasets. The consequences of this infirmity, we discover, are especially stark in state-of-the-art deep generative models learned on these datasets that just model the pixel-space measurements, resulting in an inter-attribute bias-laden latent space. This viscerally manifests itself when we perform face manipulation experiments based on latent vector interpolations. In this paper, we address this and propose a post-hoc solution that utilizes an Ising attribute prior learned in the attribute space and showcase its efficacy via qualitative experiments.