CLFeb 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active ParametersAilin Huang, Ang Li, Aobo Kong et al.
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
LGAug 28, 2023
NAS-X: Neural Adaptive Smoothing via TwistingDieterich Lawson, Michael Li, Scott Linderman
Sequential latent variable models (SLVMs) are essential tools in statistics and machine learning, with applications ranging from healthcare to neuroscience. As their flexibility increases, analytic inference and model learning can become challenging, necessitating approximate methods. Here we introduce neural adaptive smoothing via twisting (NAS-X), a method that extends reweighted wake-sleep (RWS) to the sequential setting by using smoothing sequential Monte Carlo (SMC) to estimate intractable posterior expectations. Combining RWS and smoothing SMC allows NAS-X to provide low-bias and low-variance gradient estimates, and fit both discrete and continuous latent variable models. We illustrate the theoretical advantages of NAS-X over previous methods and explore these advantages empirically in a variety of tasks, including a challenging application to mechanistic models of neuronal dynamics. These experiments show that NAS-X substantially outperforms previous VI- and RWS-based methods in inference and model learning, achieving lower parameter error and tighter likelihood bounds.
HCMar 13
Towards Fluent Interaction with Cyber-Physical ArchitectureJesse T. Gonzalez, Neeta Khanuja, Michael Li et al.
What happens when your walls begin to move? This paper explores the design of human-robot interaction for architectural-scale, shape-changing environments. We present findings from two studies: (1) a series of speculative design workshops (N=20) that uncovered aspirational visions for these spaces, and (2) a task-based Wizard-of-Oz elicitation study (N=12) that grounded these visions in the challenges of practical interaction. Our workshop findings reveal a complex landscape of user desires, exposing critical tensions between proactive automation and the preservation of user autonomy, and between personalization and public ownership. Our elicitation study reveals a set of core interaction challenges related to multimodal collaboration; and, most critically: suggests the need for a modality-agnostic model of evolving user intent. We conclude with a set of grounded proposals for creating robotic environments that are collaborative and trusted partners in everyday life.
CLJul 1, 2024
View From Above: A Framework for Evaluating Distribution Shifts in Model BehaviorTanush Chopra, Michael Li, Jacob Haimes
When large language models (LLMs) are asked to perform certain tasks, how can we be sure that their learned representations align with reality? We propose a domain-agnostic framework for systematically evaluating distribution shifts in LLMs decision-making processes, where they are given control of mechanisms governed by pre-defined rules. While individual LLM actions may appear consistent with expected behavior, across a large number of trials, statistically significant distribution shifts can emerge. To test this, we construct a well-defined environment with known outcome logic: blackjack. In more than 1,000 trials, we uncover statistically significant evidence suggesting behavioral misalignment in the learned representations of LLM.
CLJun 2, 2025Code
Echoes of BERT: Do Modern Language Models Rediscover the Classical NLP Pipeline?Michael Li, Nishant Subramani · cmu
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information relies primarily on studies of early models like BERT and GPT-2. Building on classic BERTology work, we analyze 25 models spanning from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1), probing layer-by-layer representations across eight linguistic tasks in English. Consistent with earlier findings, we find that hierarchical organization persists in modern models: early layers capture syntax, middle layers handle semantics and entity-level information, and later layers encode discourse phenomena. We dive deeper, conducting an in-depth multilingual analysis of two specific linguistic properties - lexical identity and inflectional morphology - that help disentangle form from meaning. We find that lexical information concentrates linearly in early layers but becomes increasingly nonlinear deeper in the network, while inflectional information remains linearly accessible throughout all layers. Additional analyses of attention mechanisms, steering vectors, and pretraining checkpoints reveal where this information resides within layers, how it can be functionally manipulated, and how representations evolve during pretraining. Taken together, our findings suggest that, even with substantial advances in LLM technologies, transformer models learn to organize linguistic information in similar ways, regardless of model architecture, size, or training regime, indicating that these properties are important for next token prediction. Our code is available at https://github.com/ml5885/model_internal_sleuthing
AIMar 30
Beyond the Answer: Decoding the Behavior of LLMs as Scientific ReasonersRohan Pandey, Eric Ye, Michael Li
As Large Language Models (LLMs) achieve increasingly sophisticated performance on complex reasoning tasks, current architectures serve as critical proxies for the internal heuristics of frontier models. Characterizing emergent reasoning is vital for long-term interpretability and safety. Furthermore, understanding how prompting modulates these processes is essential, as natural language will likely be the primary interface for interacting with AGI systems. In this work, we use a custom variant of Genetic Pareto (GEPA) to systematically optimize prompts for scientific reasoning tasks, and analyze how prompting can affect reasoning behavior. We investigate the structural patterns and logical heuristics inherent in GEPA-optimized prompts, and evaluate their transferability and brittleness. Our findings reveal that gains in scientific reasoning often correspond to model-specific heuristics that fail to generalize across systems, which we call "local" logic. By framing prompt optimization as a tool for model interpretability, we argue that mapping these preferred reasoning structures for LLMs is an important prerequisite for effectively collaborating with superhuman intelligence.
CLMay 8
How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model CircuitsMichael Li, Nishant Subramani
The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and investigate two less-studied properties of this: consistency, the recurrence of components within a task, and specificity, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to $\sim$100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task's circuit damages another task's performance about as much as that task's own circuit does. We discover that this is due to substantial overlap between circuits across tasks, which are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.
LGAug 12, 2025
Generative Modeling for Robust Deep Reinforcement Learning on the Traveling Salesman ProblemMichael Li, Eric Bae, Christopher Haberland et al.
The Traveling Salesman Problem (TSP) is a classic NP-hard combinatorial optimization task with numerous practical applications. Classic heuristic solvers can attain near-optimal performance for small problem instances, but become computationally intractable for larger problems. Real-world logistics problems such as dynamically re-routing last-mile deliveries demand a solver with fast inference time, which has led researchers to investigate specialized neural network solvers. However, neural networks struggle to generalize beyond the synthetic data they were trained on. In particular, we show that there exist TSP distributions that are realistic in practice, which also consistently lead to poor worst-case performance for existing neural approaches. To address this issue of distribution robustness, we present Combinatorial Optimization with Generative Sampling (COGS), where training data is sampled from a generative TSP model. We show that COGS provides better data coverage and interpolation in the space of TSP training distributions. We also present TSPLib50, a dataset of realistically distributed TSP samples, which tests real-world generalization ability without conflating this issue with instance size. We evaluate our method on various synthetic datasets as well as TSPLib50, and compare to state-of-the-art neural baselines. We demonstrate that COGS improves distribution robustness, with most performance gains coming from worst-case scenarios.
CVSep 29, 2021
Generative Probabilistic Image ColorizationChie Furusawa, Shinya Kitaoka, Michael Li et al.
We propose Generative Probabilistic Image Colorization, a diffusion-based generative process that trains a sequence of probabilistic models to reverse each step of noise corruption. Given a line-drawing image as input, our method suggests multiple candidate colorized images. Therefore, our method accounts for the ill-posed nature of the colorization problem. We conducted comprehensive experiments investigating the colorization of line-drawing images, report the influence of a score-based MCMC approach that corrects the marginal distribution of estimated samples, and further compare different combinations of models and the similarity of their generated images. Despite using only a relatively small training dataset, we experimentally develop a method to generate multiple diverse colorization candidates which avoids mode collapse and does not require any additional constraints, losses, or re-training with alternative training conditions. Our proposed approach performed well not only on color-conditional image generation tasks using biased initial values, but also on some practical image completion and inpainting tasks.