Yuxuan Wan

SE
h-index30
32papers
634citations
Novelty54%
AI Score59

32 Papers

38.1SEMar 15Code
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Jingyu Xiao, Ming Wang, Man Ho Lam et al.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development becomes predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refining editing, and repairing issues. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors like task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs' capabilities in automated front-end engineering. DesignBench encompasses three widely-used UI frameworks (React, Vue, and Angular) alongside vanilla HTML/CSS, and evaluates on three essential front-end tasks (generation, edit, and repair) in real-world development workflows. DesignBench contains 900 webpage samples spanning over 11 topics, 9 edit types, and 6 issue categories, enabling detailed analysis of MLLM performance across multiple dimensions. Our systematic evaluation reveals critical insights into MLLMs' framework-specific limitations, task-related bottlenecks, and performance variations under different conditions, providing guidance for future research in automated front-end development. Our code and data are available at https://github.com/WebPAI/DesignBench.

CLMar 15, 2023
ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

Haoran Wu, Wenxuan Wang, Yuxuan Wan et al.

ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction(GEC) task, and compare it with commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs not as well as those baselines in terms of the automatic evaluation metrics (e.g., $F_{0.5}$ score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces less under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely under-estimated by the automatic evaluation metrics and could be a promising tool for GEC.

LGOct 18, 2022
Transferable Unlearnable Examples

Jie Ren, Han Xu, Yuxuan Wan et al.

With more people publishing their personal data online, unauthorized data usage has become a serious concern. The unlearnable strategies have been introduced to prevent third parties from training on the data without permission. They add perturbations to the users' data before publishing, which aims to make the models trained on the perturbed published dataset invalidated. These perturbations have been generated for a specific training setting and a target dataset. However, their unlearnable effects significantly decrease when used in other training settings and datasets. To tackle this issue, we propose a novel unlearnable strategy based on Classwise Separability Discriminant (CSD), which aims to better transfer the unlearnable effects to other training settings and datasets by enhancing the linear separability. Extensive experiments demonstrate the transferability of the proposed unlearnable examples across training settings and datasets.

AIJan 22Code
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

Yuxuan Wan, Tianqing Fang, Zaitang Li et al.

Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.

LGJun 1, 2022
Defense Against Gradient Leakage Attacks via Learning to Obscure Data

Yuxuan Wan, Han Xu, Xiaorui Liu et al.

Federated learning is considered as an effective privacy-preserving learning mechanism that separates the client's data and model training process. However, federated learning is still under the risk of privacy leakage because of the existence of attackers who deliberately conduct gradient leakage attacks to reconstruct the client data. Recently, popular strategies such as gradient perturbation methods and input encryption methods have been proposed to defend against gradient leakage attacks. Nevertheless, these defenses can either greatly sacrifice the model performance, or be evaded by more advanced attacks. In this paper, we propose a new defense method to protect the privacy of clients' data by learning to obscure data. Our defense method can generate synthetic samples that are totally distinct from the original samples, but they can also maximally preserve their predictive features and guarantee the model performance. Furthermore, our defense strategy makes the gradient leakage attack and its variants extremely difficult to reconstruct the client data. Through extensive experiments, we show that our proposed defense method obtains better privacy protection while preserving high accuracy compared with state-of-the-art methods.

LGOct 17, 2022
Probabilistic Categorical Adversarial Attack & Adversarial Training

Han Xu, Pengfei He, Jie Ren et al.

The existence of adversarial examples brings huge concern for people to apply Deep Neural Networks (DNNs) in safety-critical tasks. However, how to generate adversarial examples with categorical data is an important problem but lack of extensive exploration. Previously established methods leverage greedy search method, which can be very time-consuming to conduct successful attack. This also limits the development of adversarial training and potential defenses for categorical data. To tackle this problem, we propose Probabilistic Categorical Adversarial Attack (PCAA), which transfers the discrete optimization problem to a continuous problem that can be solved efficiently by Projected Gradient Descent. In our paper, we theoretically analyze its optimality and time complexity to demonstrate its significant advantage over current greedy based attacks. Moreover, based on our attack, we propose an efficient adversarial training framework. Through a comprehensive empirical study, we justify the effectiveness of our proposed attack and defense algorithms.

16.6CVMay 15Code
ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark

Haozhe Si, Yuxuan Wan, Yuqing Wang et al.

Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive up to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences ($\geq 3$ observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth and benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.

19.4SEMay 13Code
UIBenchKit: A unified toolkit for design-to-code model evaluation

Chinh T. Le, Trevor Ong Yee Siang, Jingyu Xiao et al.

Recent years have seen substantial progress in automated design-to-code generation, with many methods proposed for generating HTML and CSS from webpage screenshots. However, the absence of a standardized evaluation platform makes it difficult to compare these methods fairly, limiting both practical adoption and systematic research progress. To bridge this gap, we introduce UIBenchKit, an open-source, integrated toolkit designed to unify the evaluation of design-to-code tasks. UIBenchKit abstracts the complexities of environment setup, model inference, and code rendering, offering researchers a plug-and-play architecture to compare various methods under consistent settings. In addition, it offers an analytical interface for comparison across multiple metrics. Using UIBenchKit, we conduct a benchmarking study of existing tools and derive several findings that highlight directions for future improvement. By providing a streamlined environment for both experimentation and evaluation, UIBenchKit aims to accelerate future benchmarking and innovations in web engineering. The evaluation platform and toolkit are available at the project page https://www.uibenchkit.com/.

LGOct 18, 2022
Towards Fair Classification against Poisoning Attacks

Han Xu, Xiaorui Liu, Yuxuan Wan et al.

Fair classification aims to stress the classification models to achieve the equality (treatment or prediction quality) among different sensitive groups. However, fair classification can be under the risk of poisoning attacks that deliberately insert malicious training samples to manipulate the trained classifiers' performance. In this work, we study the poisoning scenario where the attacker can insert a small fraction of samples into training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that the fairly trained classifiers can be greatly vulnerable to such poisoning attacks, with much worse accuracy & fairness trade-off, even when we apply some of the most effective defenses (originally proposed to defend traditional classification tasks). As countermeasures to defend fair classification tasks, we propose a general and theoretically guaranteed framework which accommodates traditional defense methods to fair classification against poisoning attacks. Through extensive experiments, the results validate that the proposed defense framework obtains better robustness in terms of accuracy and fairness than representative baseline methods.

26.7ROMay 6
HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

Nandiraju Gireesh, Yuanliang Ju, Chaoyi Xu et al.

Recent advances in generative models have shown promise in generating behavior plans for long-horizon, sparse reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate HDFlow on four challenging furniture assembly tasks in both simulation and real-world, where it significantly outperforms state-of-the-art methods. Furthermore, we also showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/

SEJan 1, 2024Code
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

Yuxuan Wan, Wenxuan Wang, Yiliu Yang et al. · pku, tencent-ai

We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs' prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has been challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of reasoning processes. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning prowess of LLMs. Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29\% to 90\% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5\%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs' formal reasoning capabilities. We make our code, data, and results publicly available (https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.

AIAug 1, 2025Code
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Tianqing Fang, Zhisong Zhang, Xiaoyang Wang et al. · tencent-ai

General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present \textbf{Cognitive Kernel-Pro}, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro

CVJul 30, 2025Code
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Yilei Jiang, Yaozhi Zheng, Yuxuan Wan et al.

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.

16.8CVMay 18
LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift

Haozhe Si, Yuxuan Wan, Yuqing Wang et al.

Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from $O(N^2 C^2)$ to $O(rNC)$, where $N$ is the number of spatial tokens, $C$ is the number of spectral channels, and $r$ is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, and explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.

SESep 15, 2025Code
EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression

Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan et al.

Multimodal Large Language Models have demonstrated exceptional performance in UI2Code tasks, significantly enhancing website development efficiency. However, these tasks incur substantially higher computational overhead than traditional code generation due to the large number of input image tokens and extensive output code tokens required. Our comprehensive study identifies significant redundancies in both image and code tokens that exacerbate computational complexity and hinder focus on key UI elements, resulting in excessively lengthy and often invalid HTML files. We propose EfficientUICoder, a compression framework for efficient UI code generation with three key components. First, Element and Layout-aware Token Compression preserves essential UI information by detecting element regions and constructing UI element trees. Second, Region-aware Token Refinement leverages attention scores to discard low-attention tokens from selected regions while integrating high-attention tokens from unselected regions. Third, Adaptive Duplicate Token Suppression dynamically reduces repetitive generation by tracking HTML/CSS structure frequencies and applying exponential penalties. Extensive experiments show EfficientUICoderachieves a 55%-60% compression ratio without compromising webpage quality and delivers superior efficiency improvements: reducing computational cost by 44.9%, generated tokens by 41.4%, prefill time by 46.6%, and inference time by 48.8% on 34B-level MLLMs. Code is available at https://github.com/WebPAI/EfficientUICoder.

ROMar 27, 2024Code
SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network

Yuxuan Wan, Kaichen Zhou, jinhong Chen et al.

Autonomous assembly in robotics and 3D vision presents significant challenges, particularly in ensuring assembly correctness. Presently, predominant methods such as MEPNet focus on assembling components based on manually provided images. However, these approaches often fall short in achieving satisfactory results for tasks requiring long-term planning. Concurrently, we observe that integrating a self-correction module can partially alleviate such issues. Motivated by this concern, we introduce the Single-Step Assembly Error Correction Task, which involves identifying and rectifying misassembled components. To support research in this area, we present the LEGO Error Correction Assembly Dataset (LEGO-ECA), comprising manual images for assembly steps and instances of assembly failures. Additionally, we propose the Self-Correct Assembly Network (SCANet), a novel method to address this task. SCANet treats assembled components as queries, determining their correctness in manual images and providing corrections when necessary. Finally, we utilize SCANet to correct the assembly results of MEPNet. Experimental results demonstrate that SCANet can identify and correct MEPNet's misassembled results, significantly improving the correctness of assembly. Our code and dataset could be found at https://scanet-iros2024.github.io/.

16.6SEMay 17
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

Yuxuan Wan, Tingshuo Liang, Jiakai Xu et al.

Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals -- steps that current agents cannot perform without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high-level requirements into structured acceptance tests before any code is written, (2) deploying the application and validating it through browser-based interaction simulation, and (3) translating browser-observed failures into structured repair reports for the coding agent. Enabled by TDDev, we conduct the first controlled empirical study of Test-driven development (TDD) strategies for web application generation, comparing four development protocols across two coding agents, two backbone models, and two benchmarks. TDD infrastructure consistently improves generation quality by 34--48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement. Mismatching protocol to generation style eliminates the TDD benefit entirely while multiplying token cost up to 25-fold. A user study confirms that TDDev reduces manual developer intervention to zero, shifting the workload from continuous prompt engineering to autonomous, feedback-driven refinement.

AISep 30, 2025Code
90% Faster, 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development

Runxin Yang, Yuxuan Wan, Shuqing Li et al.

Developing 3D games requires specialized expertise across multiple domains, including programming, 3D modeling, and engine configuration, which limits access to millions of potential creators. Recently, researchers have begun to explore automated game development. However, existing approaches face three primary challenges: (1) limited scope to 2D content generation or isolated code snippets; (2) requirement for manual integration of generated components into game engines; and (3) poor performance on handling interactive game logic and state management. While Multimodal Large Language Models (MLLMs) demonstrate potential capabilities to ease the game generation task, a critical gap still remains in translating these outputs into production-ready, executable game projects based on game engines such as Unity and Unreal Engine. To bridge the gap, this paper introduces UniGen, the first end-to-end coordinated multi-agent framework that automates zero-coding development of runnable 3D games from natural language requirements. Specifically, UniGen uses a Planning Agent that interprets user requirements into structured blueprints and engineered logic descriptions; after which a Generation Agent produces executable C# scripts; then an Automation Agent handles engine-specific component binding and scene construction; and lastly a Debugging Agent provides real-time error correction through conversational interaction. We evaluated UniGen on three distinct game prototypes. Results demonstrate that UniGen not only democratizes game creation by requiring no coding from the user, but also reduces development time by 91.4%. We release UniGen at https://github.com/yxwan123/UniGen. A video demonstration is available at https://www.youtube.com/watch?v=xyJjFfnxUx0.

SEJan 1, 2024
New Job, New Gender? Measuring the Social Bias in Image Generation Models

Wenxuan Wang, Haonan Bai, Jen-tse Huang et al. · pku, tencent-ai

Image generation models can generate or edit images from a given text. Recent advancements in image generation technology, exemplified by DALL-E and Midjourney, have been groundbreaking. These advanced models, despite their impressive capabilities, are often trained on massive Internet datasets, making them susceptible to generating content that perpetuates social stereotypes and biases, which can lead to severe consequences. Prior research on assessing bias within image generation models suffers from several shortcomings, including limited accuracy, reliance on extensive human labor, and lack of comprehensive analysis. In this paper, we propose BiasPainter, a novel evaluation framework that can accurately, automatically and comprehensively trigger social bias in image generation models. BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender, race, and age-neutral queries. These queries span 62 professions, 39 activities, 57 types of objects, and 70 personality traits. The framework then compares the edited images to the original seed images, focusing on the significant changes related to gender, race, and age. BiasPainter adopts a key insight that these characteristics should not be modified when subjected to neutral prompts. Built upon this design, BiasPainter can trigger the social bias and evaluate the fairness of image generation models. We use BiasPainter to evaluate six widely-used image generation models, such as stable diffusion and Midjourney. Experimental results show that BiasPainter can successfully trigger social bias in image generation models. According to our human evaluation, BiasPainter can achieve 90.8% accuracy on automatic bias detection, which is significantly higher than the results reported in previous work.

23.2LGMay 6
Attribution-Guided Continual Learning for Large Language Models

Yazheng Liu, Yuxuan Wan, Rui Xu et al.

Large language models (LLMs) often suffer from catastrophic forgetting in continual learning: after learning new tasks sequentially, they perform worse on earlier tasks. Existing methods mitigate catastrophic forgetting by data replay, parameter freezing, or regularization. However, these methods lack semantic awareness of internal knowledge distribution in LLMs. As a result, they cannot distinguish parameters that should be preserved or updated. We propose an attribution-guided continual fine-tuning framework for LLMs. Our method estimates task-specific, element-wise parameter importance in each Transformer layer and uses these scores to modulate gradients. Parameters important to previous tasks receive smaller updates, while less relevant ones remain plastic for learning new tasks. Experiments on continual learning benchmarks show that our method consistently outperforms baselines, achieving better retention of old tasks while maintaining competitive performance on new tasks.

SEDec 19, 2024
MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs

Yuxuan Wan, Yi Dong, Jingyu Xiao et al.

Multi-page websites dominate modern web development. However, existing design-to-code methods rely on simplified assumptions, limiting to single-page, self-contained webpages without external resource connection. To address this gap, we introduce the Multi-Page Resource-Aware Webpage (MRWeb) generation task, which transforms UI designs into multi-page, functional web UIs with internal/external navigation, image loading, and backend routing. We propose a novel resource list data structure to track resources, links, and design components. Our study applies existing methods to the MRWeb problem using a newly curated dataset of 500 websites (300 synthetic, 200 real-world). Specifically, we identify the best metric to evaluate the similarity of the web UI, assess the impact of the resource list on MRWeb generation, analyze MLLM limitations, and evaluate the effectiveness of the MRWeb tool in real-world workflows. The results show that resource lists boost navigation functionality from 0% to 66%-80% while facilitating visual similarity. Our proposed metrics and evaluation framework provide new insights into MLLM performance on MRWeb tasks. We release the MRWeb tool, dataset, and evaluation framework to promote further research.

CVMar 25, 2024
DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization

Yunlong Tang, Yuxuan Wan, Lei Qi et al.

Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily bulids upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from features that contain both semantic and style information after the encoder, are directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, responsible for generating style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.

SENov 5, 2024
Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping

Jingyu Xiao, Yuxuan Wan, Yintong Huo et al.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task, i.e., generating UI code from UI mock-ups. However, existing benchmarks only contain static web pages for evaluation and ignore the dynamic interaction, limiting the practicality, usability and user engagement of the generated webpages. To bridge these gaps, we present the first systematic investigation of MLLMs in generating interactive webpages. Specifically, we formulate the Interaction-to-Code task and establish the Interaction2Code benchmark, encompassing 127 unique webpages and 374 distinct interactions across 15 webpage types and 31 interaction categories. Through comprehensive experiments utilizing state-of-the-art (SOTA) MLLMs, evaluated via both automatic metrics and human assessments, we identify four critical limitations of MLLM on Interaction-to-Code task: (1) inadequate generation of interaction compared with full page, (2) prone to ten types of failure, (3) poor performance on visually subtle interactions, and (4) insufficient undestanding on interaction when limited to single-modality visual descriptions. To address these limitations, we propose four enhancement strategies: Interactive Element Highlighting, Failureaware Prompting (FAP), Visual Saliency Enhancement, and Visual-Textual Descriptions Combination, all aiming at improving MLLMs' performance on the Interaction-toCode task. The Interaction2Code benchmark and code are available in https://github. com/WebPAI/Interaction2Code.

ROFeb 25, 2025
FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real

Weiheng Liu, Yuxuan Wan, Jilong Wang et al. · pku

Generalizable object fetching in cluttered scenes remains a fundamental and application-critical challenge in embodied AI. Closely packed objects cause inevitable occlusions, making safe action generation particularly difficult. Under such partial observability, effective policies must not only generalize across diverse objects and layouts but also reason about occlusion to avoid collisions. However, collecting large-scale real-world data for this task remains prohibitively expensive, leaving this problem largely unsolved. In this paper, we introduce FetchBot, a sim-to-real framework for this challenge. We first curate a large-scale synthetic dataset featuring 1M diverse scenes and 500k representative demonstrations. Based on this dataset, FetchBot employs a depth-conditioned method for action generation, which leverages structural cues to enable robust obstacle-aware action planning. However, depth is perfect in simulation but noisy in real-world environments. To address this sim-to-real gap, FetchBot predicts depth from RGB inputs using a foundation model and integrates local occupancy prediction as a pre-training task, providing a generalizable latent representation for sim-to-real transfer. Extensive experiments in simulation and real-world environments demonstrate the strong zero-shot sim-to-real transfer, effective clutter handling, and adaptability to novel scenarios. In cluttered environments, it achieves an average real-world success rate of 89.95%, significantly outperforming prior methods. Moreover, FetchBot demonstrates excellent robustness in challenging cases, such as fetching transparent, reflective, and irregular objects, highlighting its practical value.

SESep 29, 2025
Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development

Yuxuan Wan, Tingshuo Liang, Jiakai Xu et al.

Developing full-stack web applications is complex and time-intensive, demanding proficiency across diverse technologies and frameworks. Although recent advances in multimodal large language models (MLLMs) enable automated webpage generation from visual inputs, current solutions remain limited to front-end tasks and fail to deliver fully functional applications. In this work, we introduce TDDev, the first test-driven development (TDD)-enabled LLM-agent framework for end-to-end full-stack web application generation. Given a natural language description or design image, TDDev automatically derives executable test cases, generates front-end and back-end code, simulates user interactions, and iteratively refines the implementation until all requirements are satisfied. Our framework addresses key challenges in full-stack automation, including underspecified user requirements, complex interdependencies among multiple files, and the need for both functional correctness and visual fidelity. Through extensive experiments on diverse application scenarios, TDDev achieves a 14.4% improvement on overall accuracy compared to state-of-the-art baselines, demonstrating its effectiveness in producing reliable, high-quality web applications without requiring manual intervention.

CVMar 17, 2025
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

Haozhe Si, Yuxuan Wan, Minh Do et al.

Geospatial raster data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker's product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both the continuity and physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.

17.3ROApr 2
Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning

Yuhui Chen, Haoran Li, Zhennan Jiang et al.

Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks. Videos are available at our project website https://cccedric.github.io/poco/.

SEOct 20, 2025
TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework

Shuzheng Gao, Eric John Li, Man Ho Lam et al.

Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models. To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks. Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse software engineering activities rather than limited coding tasks; (2) Multi-Language and Multi-Modality Assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modality coding tasks; (3) Robustness Assessment that evaluates model reliability under semantically-preserving code transformations; and (4) Rigorous Evaluation Methodology that enhances the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction. Based on this evaluation framework, we assess 26 state-of-the-art models and uncover both their strengths and limitations, yielding several key insights:(1) Current models show substantial performance variation across programming tasks; (2) Multi-modal language models demonstrate specific performance limitations in UI code generation and edit;

LGAug 15, 2025
Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing

Ruicheng Xian, Yuxuan Wan, Han Zhao

Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness -- preventing disparate impacts across demographic groups -- is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair classifiers from closed-weight LLMs via prompting: the LLM is treated as a feature extractor, and features are elicited from its probabilistic predictions (e.g., token log probabilities) using prompts strategically designed for the specified fairness criterion to obtain sufficient statistics for fair classification; a fair algorithm is then applied to these features to train a lightweight fair classifier in a post-hoc manner. Experiments on five datasets, including three tabular ones, demonstrate strong accuracy-fairness tradeoffs for the classifiers derived by our framework from both open-weight and closed-weight LLMs; in particular, our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings (i.e., head-tuning) or from scratch on raw tabular features.

SEJun 24, 2024
Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Yuxuan Wan, Chaozheng Wang, Yi Dong et al.

Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating code for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 15% improvement in visual similarity and 8% in code similarity for large input images. Human evaluations show that DCGen can help developers implement webpages significantly faster and more similar to the UI designs. To the best of our knowledge, DCGen is the first segment-aware MLLM-based approach for generating UI code directly from screenshots.

LGMay 6, 2024
Deep Learning for Detecting and Early Predicting Chronic Obstructive Pulmonary Disease from Spirogram Time Series

Shuhao Mei, Xin Li, Yuxi Zhou et al.

Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung condition characterized by airflow obstruction. Current diagnostic methods primarily rely on identifying prominent features in spirometry (Volume-Flow time series) to detect COPD, but they are not adept at predicting future COPD risk based on subtle data patterns. In this study, we introduce a novel deep learning-based approach, DeepSpiro, aimed at the early prediction of future COPD risk. DeepSpiro consists of four key components: SpiroSmoother for stabilizing the Volume-Flow curve, SpiroEncoder for capturing volume variability-pattern through key patches of varying lengths, SpiroExplainer for integrating heterogeneous data and explaining predictions through volume attention, and SpiroPredictor for predicting the disease risk of undiagnosed high-risk patients based on key patch concavity, with prediction horizons of 1, 2, 3, 4, 5 years, or even longer. Evaluated on the UK Biobank dataset, DeepSpiro achieved an AUC of 0.8328 for COPD detection and demonstrated strong predictive performance for future COPD risk (p-value < 0.001). In summary, DeepSpiro can effectively predicts the long-term progression of the COPD disease.

CLMay 21, 2023
BiasAsker: Measuring the Bias in Conversational AI System

Yuxuan Wan, Wenxuan Wang, Pinjia He et al.

Powered by advanced Artificial Intelligence (AI) techniques, conversational AI systems, such as ChatGPT and digital assistants like Siri, have been widely deployed in daily life. However, such systems may still produce content containing biases and stereotypes, causing potential social problems. Due to the data-driven, black-box nature of modern AI techniques, comprehensively identifying and measuring biases in conversational systems remains a challenging task. Particularly, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups as well as biased properties. In addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods simply based on the sentiment and the toxicity hardly being adopted. In this paper, we propose BiasAsker, an automated framework to identify and measure social bias in conversational AI systems. To obtain social groups and biased properties, we construct a comprehensive social bias dataset, containing a total of 841 groups and 8,110 biased properties. Given the dataset, BiasAsker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems. Extensive experiments on 8 commercial systems and 2 famous research models, such as ChatGPT and GPT-3, show that 32.83% of the questions generated by BiasAsker can trigger biased behaviors in these widely deployed conversational systems. All the code, data, and experimental results have been released to facilitate future research.