HCSep 19, 2022Code
TimberTrek: Exploring and Curating Sparse Decision Trees with Interactive VisualizationZijie J. Wang, Chudi Zhong, Rui Xin et al. · gatech
Given thousands of equally accurate machine learning (ML) models, how can users choose among them? A recent ML technique enables domain experts and data scientists to generate a complete Rashomon set for sparse decision trees--a huge set of almost-optimal interpretable ML models. To help ML practitioners identify models with desirable properties from this Rashomon set, we develop TimberTrek, the first interactive visualization system that summarizes thousands of sparse decision trees at scale. Two usage scenarios highlight how TimberTrek can empower users to easily explore, compare, and curate models that align with their domain knowledge and values. Our open-source tool runs directly in users' computational notebooks and web browsers, lowering the barrier to creating more responsible ML models. TimberTrek is available at the following public demo link: https://poloclub.github.io/timbertrek.
CVJun 3Code
MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPUKun Cheng, Songshuo Lu, Sicong Liao et al.
Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.
CVJun 3Code
HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition ScenariosYinxiang Yu, Maoxiang Chu, Qi Niu et al.
Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at https://github.com/FX-CMX/HD-DinoMoE.
CVMar 28, 2022Code
SC^2-PCR: A Second Order Spatial Compatibility for Efficient and Robust Point Cloud RegistrationZhi Chen, Kun Sun, Fan Yang et al.
In this paper, we present a second order spatial compatibility (SC^2) measure based method for efficient and robust point cloud registration (PCR), called SC^2-PCR. Firstly, we propose a second order spatial compatibility (SC^2) measure to compute the similarity between correspondences. It considers the global compatibility instead of local consistency, allowing for more distinctive clustering between inliers and outliers at early stage. Based on this measure, our registration pipeline employs a global spectral technique to find some reliable seeds from the initial correspondences. Then we design a two-stage strategy to expand each seed to a consensus set based on the SC^2 measure matrix. Finally, we feed each consensus set to a weighted SVD algorithm to generate a candidate rigid transformation and select the best model as the final result. Our method can guarantee to find a certain number of outlier-free consensus sets using fewer samplings, making the model estimation more efficient and robust. In addition, the proposed SC^2 measure is general and can be easily plugged into deep learning based frameworks. Extensive experiments are carried out to investigate the performance of our method. Code will be available at \url{https://github.com/ZhiChen902/SC2-PCR}.
CLApr 10, 2025
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement LearningByteDance Seed, Jiaze Chen, Tiantian Fan et al. · bytedance
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
CVAug 1, 2023Code
Zero-Shot Learning by Harnessing Adversarial SamplesZhi Chen, Pengfei Zhang, Jingjing Li et al.
Zero-Shot Learning (ZSL) aims to recognize unseen classes by generalizing the knowledge, i.e., visual and semantic relationships, obtained from seen classes, where image augmentation techniques are commonly applied to improve the generalization ability of a model. However, this approach can also cause adverse effects on ZSL since the conventional augmentation techniques that solely depend on single-label supervision is not able to maintain semantic information and result in the semantic distortion issue consequently. In other words, image argumentation may falsify the semantic (e.g., attribute) information of an image. To take the advantage of image augmentations while mitigating the semantic distortion issue, we propose a novel ZSL approach by Harnessing Adversarial Samples (HAS). HAS advances ZSL through adversarial training which takes into account three crucial aspects: (1) robust generation by enforcing augmentations to be similar to negative classes, while maintaining correct labels, (2) reliable generation by introducing a latent space constraint to avert significant deviations from the original data manifold, and (3) diverse generation by incorporating attribute-based perturbation by adjusting images according to each semantic attribute's localization. Through comprehensive experiments on three prominent zero-shot benchmark datasets, we demonstrate the effectiveness of our adversarial samples approach in both ZSL and Generalized Zero-Shot Learning (GZSL) scenarios. Our source code is available at https://github.com/uqzhichen/HASZSL.
IRJul 14, 2022
Improving Multi-Interest Network with Stable LearningZhaocheng Liu, Yingtao Luo, Di Zeng et al. · cmu
Modeling users' dynamic preferences from historical behaviors lies at the core of modern recommender systems. Due to the diverse nature of user interests, recent advances propose the multi-interest networks to encode historical behaviors into multiple interest vectors. In real scenarios, the corresponding items of captured interests are usually retrieved together to get exposure and collected into training data, which produces dependencies among interests. Unfortunately, multi-interest networks may incorrectly concentrate on subtle dependencies among captured interests. Misled by these dependencies, the spurious correlations between irrelevant interests and targets are captured, resulting in the instability of prediction results when training and test distributions do not match. In this paper, we introduce the widely used Hilbert-Schmidt Independence Criterion (HSIC) to measure the degree of independence among captured interests and empirically show that the continuous increase of HSIC may harm model performance. Based on this, we propose a novel multi-interest network, named DEep Stable Multi-Interest Learning (DESMIL), which tries to eliminate the influence of subtle dependencies among captured interests via learning weights for training samples and make model concentrate more on underlying true causation. We conduct extensive experiments on public recommendation datasets, a large-scale industrial dataset and the synthetic datasets which simulate the out-of-distribution data. Experimental results demonstrate that our proposed DESMIL outperforms state-of-the-art models by a significant margin. Besides, we also conduct comprehensive model analysis to reveal the reason why DESMIL works to a certain extent.
LGSep 16, 2022
Exploring the Whole Rashomon Set of Sparse Decision TreesRui Xin, Chudi Zhong, Zhi Chen et al.
In any given machine learning problem, there may be many models that could explain the data almost equally well. However, most learning algorithms return only one of these models, leaving practitioners with no practical way to explore alternative models that might have desirable properties beyond what could be expressed within a loss function. The Rashomon set is the set of these all almost-optimal models. Rashomon sets can be extremely complicated, particularly for highly nonlinear function classes that allow complex interaction terms, such as decision trees. We provide the first technique for completely enumerating the Rashomon set for sparse decision trees; in fact, our work provides the first complete enumeration of any Rashomon set for a non-trivial problem with a highly nonlinear discrete function class. This allows the user an unprecedented level of control over model choice among all models that are approximately equally good. We represent the Rashomon set in a specialized data structure that supports efficient querying and sampling. We show three applications of the Rashomon set: 1) it can be used to study variable importance for the set of almost-optimal trees (as opposed to a single tree), 2) the Rashomon set for accuracy enables enumeration of the Rashomon sets for balanced accuracy and F1-score, and 3) the Rashomon set for a full dataset can be used to produce Rashomon sets constructed with only subsets of the data set. Thus, we are able to examine Rashomon sets across problems with a new lens, enabling users to choose models rather than be at the mercy of an algorithm that produces only a single model.
CLSep 3, 2024Code
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best PracticesZhi Chen, Qiguang Chen, Libo Qin et al.
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.
CLAug 9, 2023
CLEVA: Chinese Language Models EVAluation PlatformYanyang Li, Jianqiao Zhao, Duo Zheng et al.
With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 Chinese LLMs have validated CLEVA's efficacy.
LGMar 28, 2023
Exploring and Interacting with the Set of Good Sparse Generalized Additive ModelsChudi Zhong, Zhi Chen, Jiachang Liu et al.
In real applications, interaction between machine learning models and domain experts is critical; however, the classical machine learning paradigm that usually produces only a single model does not facilitate such interaction. Approximating and exploring the Rashomon set, i.e., the set of all near-optimal models, addresses this practical challenge by providing the user with a searchable space containing a diverse set of models from which domain experts can choose. We present algorithms to efficiently and accurately approximate the Rashomon set of sparse, generalized additive models with ellipsoids for fixed support sets and use these ellipsoids to approximate Rashomon sets for many different support sets. The approximated Rashomon set serves as a cornerstone to solve practical challenges such as (1) studying the variable importance for the model class; (2) finding models under user-specified constraints (monotonicity, direct editing); and (3) investigating sudden changes in the shape functions. Experiments demonstrate the fidelity of the approximated Rashomon set and its effectiveness in solving practical challenges.
ROMay 31
$τ_0$-WM: A Unified Video-Action World Model for Robotic ManipulationPengfei Zhou, Shengcong Chen, Di Chen et al.
Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present $τ_0$-World Model ($τ_0$-WM), a unified video-action world model that integrates policy learning, video prediction, and action evaluation within a single future-predictive framework. Built on a shared video diffusion backbone, $τ_0$-WM provides two complementary interfaces. First, a video action model jointly predicts future visual latents and continuous action chunks from multi-view observations, language instructions, and robot state. Second, an action-conditioned video simulator rolls out candidate action chunks into multi-view futures and predicts dense task-progress scores. The model is trained on approximately $27{,}300$ hours of real-robot teleoperation, UMI-style interaction, egocentric human videos, and rollout or failure trajectories using modality-specific supervision masks. At inference time, $τ_0$-WM uses test-time computation to sample action candidates, rank them with re-denoising consistency, and invoke simulator-based rectification for low-quality candidates. On challenging long-horizon and fine-grained robotic manipulation tasks, $τ_0$-WM shows superior performance over other relevant baselines.
CVApr 22, 2022
SegDiscover: Visual Concept Discovery via Unsupervised Semantic SegmentationHaiyang Huang, Zhi Chen, Cynthia Rudin
Visual concept discovery has long been deemed important to improve interpretability of neural networks, because a bank of semantically meaningful concepts would provide us with a starting point for building machine learning models that exhibit intelligible reasoning process. Previous methods have disadvantages: either they rely on labelled support sets that incorporate human biases for objects that are "useful," or they fail to identify multiple concepts that occur within a single image. We reframe the concept discovery task as an unsupervised semantic segmentation problem, and present SegDiscover, a novel framework that discovers semantically meaningful visual concepts from imagery datasets with complex scenes without supervision. Our method contains three important pieces: generating concept primitives from raw images, discovering concepts by clustering in the latent space of a self-supervised pretrained encoder, and concept refinement via neural network smoothing. Experimental results provide evidence that our method can discover multiple concepts within a single image and outperforms state-of-the-art unsupervised methods on complex datasets such as Cityscapes and COCO-Stuff. Our method can be further used as a neural network explanation tool by comparing results obtained by different encoders.
CLApr 10, 2022
UniDU: Towards A Unified Generative Dialogue Understanding FrameworkZhi Chen, Lu Chen, Bei Chen et al.
With the development of pre-trained language models, remarkable success has been witnessed in dialogue understanding (DU). However, current DU approaches usually employ independent models for each distinct DU task without considering shared knowledge across different DU tasks. In this paper, we propose a unified generative dialogue understanding framework, named {\em UniDU}, to achieve effective information exchange across diverse DU tasks. Here, we reformulate all DU tasks into a unified prompt-based generative model paradigm. More importantly, a novel model-agnostic multi-task training strategy (MATS) is introduced to dynamically adapt the weights of diverse tasks for best knowledge sharing during training, based on the nature and available data of each task. Experiments on ten DU datasets covering five fundamental DU tasks show that the proposed UniDU framework largely outperforms task-specific well-designed methods on all tasks. MATS also reveals the knowledge-sharing structure of these tasks. Finally, UniDU obtains promising performance in the unseen dialogue domain, showing the great potential for generalization.
CLSep 10, 2022
OPAL: Ontology-Aware Pretrained Language Model for End-to-End Task-Oriented DialogueZhi Chen, Yuncong Liu, Lu Chen et al.
This paper presents an ontology-aware pretrained language model (OPAL) for end-to-end task-oriented dialogue (TOD). Unlike chit-chat dialogue models, task-oriented dialogue models fulfill at least two task-specific modules: dialogue state tracker (DST) and response generator (RG). The dialogue state consists of the domain-slot-value triples, which are regarded as the user's constraints to search the domain-related databases. The large-scale task-oriented dialogue data with the annotated structured dialogue state usually are inaccessible. It prevents the development of the pretrained language model for the task-oriented dialogue. We propose a simple yet effective pretraining method to alleviate this problem, which consists of two pretraining phases. The first phase is to pretrain on large-scale contextual text data, where the structured information of the text is extracted by the information extracting tool. To bridge the gap between the pretraining method and downstream tasks, we design two pretraining tasks: ontology-like triple recovery and next-text generation, which simulates the DST and RG, respectively. The second phase is to fine-tune the pretrained model on the TOD data. The experimental results show that our proposed method achieves an exciting boost and get competitive performance even without any TOD data on CamRest676 and MultiWOZ benchmarks.
CPNov 23, 2023
FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character DesignYangyang Yu, Haohang Li, Zhi Chen et al.
Recent advancements in Large Language Models (LLMs) have exhibited notable efficacy in question-answering (QA) tasks across diverse domains. Their prowess in integrating extensive web knowledge has fueled interest in developing LLM-based autonomous agents. While LLMs are efficient in decoding human instructions and deriving solutions by holistically processing historical inputs, transitioning to purpose-driven agents requires a supplementary rational architecture to process multi-source information, establish reasoning chains, and prioritize critical tasks. Addressing this, we introduce \textsc{FinMem}, a novel LLM-based agent framework devised for financial decision-making. It encompasses three core modules: Profiling, to customize the agent's characteristics; Memory, with layered message processing, to aid the agent in assimilating hierarchical financial data; and Decision-making, to convert insights gained from memories into investment decisions. Notably, \textsc{FinMem}'s memory module aligns closely with the cognitive structure of human traders, offering robust interpretability and real-time tuning. Its adjustable cognitive span allows for the retention of critical information beyond human perceptual limits, thereby enhancing trading outcomes. This framework enables the agent to self-evolve its professional knowledge, react agilely to new investment cues, and continuously refine trading decisions in the volatile financial environment. We first compare \textsc{FinMem} with various algorithmic agents on a scalable real-world financial dataset, underscoring its leading trading performance in stocks. We then fine-tuned the agent's perceptual span and character setting to achieve a significantly enhanced trading performance. Collectively, \textsc{FinMem} presents a cutting-edge LLM agent framework for automated trading, boosting cumulative investment returns.
CLJul 9, 2024
FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decision MakingYangyang Yu, Zhiyuan Yao, Haohang Li et al.
Large language models (LLMs) have demonstrated notable potential in conducting complex tasks and are increasingly utilized in various financial applications. However, high-quality sequential financial investment decision-making remains challenging. These tasks require multiple interactions with a volatile environment for every decision, demanding sufficient intelligence to maximize returns and manage risks. Although LLMs have been used to develop agent systems that surpass human teams and yield impressive investment returns, opportunities to enhance multi-sourced information synthesis and optimize decision-making outcomes through timely experience refinement remain unexplored. Here, we introduce the FinCon, an LLM-based multi-agent framework with CONceptual verbal reinforcement tailored for diverse FINancial tasks. Inspired by effective real-world investment firm organizational structures, FinCon utilizes a manager-analyst communication hierarchy. This structure allows for synchronized cross-functional agent collaboration towards unified goals through natural language interactions and equips each agent with greater memory capacity than humans. Additionally, a risk-control component in FinCon enhances decision quality by episodically initiating a self-critiquing mechanism to update systematic investment beliefs. The conceptualized beliefs serve as verbal reinforcement for the future agent's behavior and can be selectively propagated to the appropriate node that requires knowledge updates. This feature significantly improves performance while reducing unnecessary peer-to-peer communication costs. Moreover, FinCon demonstrates strong generalization capabilities in various financial tasks, including single stock trading and portfolio management.
CLMay 25, 2022
DFM: Dialogue Foundation Model for Universal Large-Scale Dialogue-Oriented Task LearningZhi Chen, Jijia Bao, Lu Chen et al.
Building a universal conversational agent has been a long-standing goal of the dialogue research community. Most previous works only focus on a small set of dialogue tasks. In this work, we aim to build a unified dialogue foundation model (DFM) which can be used to solve massive diverse dialogue tasks. To achieve this goal, a large-scale well-annotated dialogue dataset with rich task diversity (DialogZoo) is collected. We introduce a framework to unify all dialogue tasks and propose novel auxiliary self-supervised tasks to achieve stable training of DFM on the highly diverse large scale DialogZoo corpus. Experiments show that, compared with models of the same size, DFM can achieve state-of-the-art or competitive performance on very rich cross-domain downstream dialogue tasks. This demonstrates that DFM largely extends the ability of unified dialogue pre-trained model.
CVSep 5, 2022
Federated Zero-Shot Learning for Visual RecognitionZhi Chen, Yadan Luo, Sen Wang et al.
Zero-shot learning is a learning regime that recognizes unseen classes by generalizing the visual-semantic relationship learned from the seen classes. To obtain an effective ZSL model, one may resort to curating training samples from multiple sources, which may inevitably raise the privacy concerns about data sharing across different organizations. In this paper, we propose a novel Federated Zero-Shot Learning FedZSL framework, which learns a central model from the decentralized data residing on edge devices. To better generalize to previously unseen classes, FedZSL allows the training data on each device sampled from the non-overlapping classes, which are far from the i.i.d. that traditional federated learning commonly assumes. We identify two key challenges in our FedZSL protocol: 1) the trained models are prone to be biased to the locally observed classes, thus failing to generalize to the unseen classes and/or seen classes appeared on other devices; 2) as each category in the training data comes from a single source, the central model is highly vulnerable to model replacement (backdoor) attacks. To address these issues, we propose three local objectives for visual-semantic alignment and cross-device alignment through relation distillation, which leverages the normalized class-wise covariance to regularize the consistency of the prediction logits across devices. To defend against the backdoor attacks, a feature magnitude defending technique is proposed. As malicious samples are less correlated to the given semantic attributes, the visual features of low magnitude will be discarded to stabilize model updates. The effectiveness and robustness of FedZSL are demonstrated by extensive experiments conducted on three zero-shot benchmark datasets.
CVJul 5, 2022
GSMFlow: Generation Shifts Mitigating Flow for Generalized Zero-Shot LearningZhi Chen, Yadan Luo, Sen Wang et al.
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both the seen and unseen classes by transferring semantic knowledge from seen to unseen classes. It is a promising solution to take the advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes. However, due to the generation shifts, the synthesized samples by most existing methods may drift from the real distribution of the unseen data. To address this issue, we propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation. Specifically, we discover and address three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance collapse, and structure disorder. First, to enhance the reflection of the semantic information in the generated samples, we explicitly embed the semantic information into the transformation in each conditional affine coupling layer. Second, to recover the intrinsic variance of the real unseen features, we introduce a boundary sample mining strategy with entropy maximization to discover more difficult visual variants of semantic prototypes and hereby adjust the decision boundary of the classifiers. Third, a relative positioning strategy is proposed to revise the attribute embeddings, guiding them to fully preserve the inter-class geometric structure and further avoid structure disorder in the semantic space. Extensive experimental results on four GZSL benchmark datasets demonstrate that GSMFlow achieves the state-of-the-art performance on GZSL.
CVAug 6, 2023
Cal-SFDA: Source-Free Domain-adaptive Semantic Segmentation with Differentiable Expected Calibration ErrorZixin Wang, Yadan Luo, Zhi Chen et al.
The prevalence of domain adaptive semantic segmentation has prompted concerns regarding source domain data leakage, where private information from the source domain could inadvertently be exposed in the target domain. To circumvent the requirement for source data, source-free domain adaptation has emerged as a viable solution that leverages self-training methods to pseudo-label high-confidence regions and adapt the model to the target data. However, the confidence scores obtained are often highly biased due to over-confidence and class-imbalance issues, which render both model selection and optimization problematic. In this paper, we propose a novel calibration-guided source-free domain adaptive semantic segmentation (Cal-SFDA) framework. The core idea is to estimate the expected calibration error (ECE) from the segmentation predictions, serving as a strong indicator of the model's generalization capability to the unlabeled target domain. The estimated ECE scores, in turn, assist the model training and fair selection in both source training and target adaptation stages. During model pre-training on the source domain, we ensure the differentiability of the ECE objective by leveraging the LogSumExp trick and using ECE scores to select the best source checkpoints for adaptation. To enable ECE estimation on the target domain without requiring labels, we train a value net for ECE estimation and apply statistic warm-up on its BatchNorm layers for stability. The estimated ECE scores assist in determining the reliability of prediction and enable class-balanced pseudo-labeling by positively guiding the adaptation progress and inhibiting potential error accumulation. Extensive experiments on two widely-used synthetic-to-real transfer tasks show that the proposed approach surpasses previous state-of-the-art by up to 5.25% of mIoU with fair model selection criteria.
CLMar 26, 2024Code
InternLM2 Technical ReportZheng Cai, Maosong Cao, Haojiong Chen et al. · pku
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k ``Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
LGApr 23, 2023
Missing Values and Imputation in Healthcare Data: Can Interpretable Machine Learning Help?Zhi Chen, Sarah Tan, Urszula Chajewska et al.
Missing values are a fundamental problem in data science. Many datasets have missing values that must be properly handled because the way missing values are treated can have large impact on the resulting machine learning model. In medical applications, the consequences may affect healthcare decisions. There are many methods in the literature for dealing with missing values, including state-of-the-art methods which often depend on black-box models for imputation. In this work, we show how recent advances in interpretable machine learning provide a new perspective for understanding and tackling the missing value problem. We propose methods based on high-accuracy glass-box Explainable Boosting Machines (EBMs) that can help users (1) gain new insights on missingness mechanisms and better understand the causes of missingness, and (2) detect -- or even alleviate -- potential risks introduced by imputation algorithms. Experiments on real-world medical datasets illustrate the effectiveness of the proposed methods.
CVAug 9, 2024
On the Element-Wise Representation and Reasoning in Zero-Shot Image Recognition: A Systematic SurveyJingcai Guo, Zhijie Rao, Zhi Chen et al.
Zero-shot image recognition (ZSIR) aims to recognize and reason in unseen domains by learning generalized knowledge from limited data in the seen domain. The gist of ZSIR is constructing a well-aligned mapping between the input visual space and the target semantic space, which is a bottom-up paradigm inspired by the process by which humans observe the world. In recent years, ZSIR has witnessed significant progress on a broad spectrum, from theory to algorithm design, as well as widespread applications. However, to the best of our knowledge, there remains a lack of a systematic review of ZSIR from an element-wise perspective, i.e., learning fine-grained elements of data and their inferential associations. To fill the gap, this paper thoroughly investigates recent advances in element-wise ZSIR and provides a sound basis for its future development. Concretely, we first integrate three basic ZSIR tasks, i.e., object recognition, compositional recognition, and foundation model-based open-world recognition, into a unified element-wise paradigm and provide a detailed taxonomy and analysis of the main approaches. Next, we summarize the benchmarks, covering technical implementations, standardized datasets, and some more details as a library. Last, we sketch out related applications, discuss vital challenges, and suggest potential future directions.
SEJan 29Code
AgentGuard: A Multi-Agent Framework for Robust Package Confusion Detection via Hybrid Search and Metadata-Content FusionYu Li, Wei Ma, Zhi Chen et al.
The proliferation of open-source software (OSS) has made software supply chains prime targets for attacks like Package Confusion, where adversaries publish malicious packages with names deceptively similar to legitimate ones. To protect against such attacks and safeguard the use of OSS, multiple confusion detection methods have been proposed. However, existing methods are limited to single-signal retrieval strategies (relying solely on lexical or semantic metrics), struggle with high false positive rates (FPR), and are vulnerable to adversarial evasion. Critically, as content-agnostic approaches, they fundamentally fail to distinguish benign packages with high naming similarity from malicious, code-dissimilar impersonations, leading to persistent high FPR. To address these limitations, we introduce AgentGuard, a novel multi-agents based framework for package confusion detection. Specifically, it first discovers potential confusion targets using fine-tuned word embedding models with hybrid similarity search. After that, It subsequently evaluates risk via a fused machine learning model that uniquely combines: (1) a multi-dimensional metadata group and (2) a novel package content analysis group, to reduce the FPR and mitigate the impact of adversarial evasion. To assess the effectiveness of AgentGuard, we evaluate it on challenging ConfuDB and NeupaneDB datasets. Our results demonstrate that AgentGuard significantly outperforms state-of-the-art baselines, ConfuGuard and Typomind, improving precision by 12\%-49\% while simultaneously reducing the FPR by 11\%-35\%, and effectively discovers the confused package.
AIJan 12, 2023
On the Structural Generalization in Text-to-SQLJieyu Li, Lu Chen, Ruisheng Cao et al.
Exploring the generalization of a text-to-SQL parser is essential for a system to automatically adapt the real-world databases. Previous works provided investigations focusing on lexical diversity, including the influence of the synonym and perturbations in both natural language questions and databases. However, research on the structure variety of database schema~(DS) is deficient. Specifically, confronted with the same input question, the target SQL is probably represented in different ways when the DS comes to a different structure. In this work, we provide in-deep discussions about the structural generalization of text-to-SQL tasks. We observe that current datasets are too templated to study structural generalization. To collect eligible test data, we propose a framework to generate novel text-to-SQL data via automatic and synchronous (DS, SQL) pair altering. In the experiments, significant performance reduction when evaluating well-trained text-to-SQL models on the synthetic samples demonstrates the limitation of current research regarding structural generalization. According to comprehensive analysis, we suggest the practical reason is the overfitting of (NL, SQL) patterns.
CLApr 7, 2024Code
Multilingual Large Language Model: A Survey of Resources, Taxonomy and FrontiersLibo Qin, Qiguang Chen, Yuhang Zhou et al.
Multilingual Large Language Models are capable of using powerful Large Language Models to handle and respond to queries in multiple languages, which achieves remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there still remains a lack of a comprehensive survey to summarize existing approaches and recent developments in this field. To this end, in this paper, we present a thorough review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature. The contributions of this paper can be summarized: (1) First survey: to our knowledge, we take the first step and present a thorough review in MLLMs research field according to multi-lingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
CVSep 6, 2024
PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease SegmentationTianqi Wei, Zhi Chen, Xin Yu et al.
Plant diseases pose significant threats to agriculture. It necessitates proper diagnosis and effective treatment to safeguard crop yields. To automate the diagnosis process, image segmentation is usually adopted for precisely identifying diseased regions, thereby advancing precision agriculture. Developing robust image segmentation models for plant diseases demands high-quality annotations across numerous images. However, existing plant disease datasets typically lack segmentation labels and are often confined to controlled laboratory settings, which do not adequately reflect the complexity of natural environments. Motivated by this fact, we established PlantSeg, a large-scale segmentation dataset for plant diseases. PlantSeg distinguishes itself from existing datasets in three key aspects. (1) Annotation type: Unlike the majority of existing datasets that only contain class labels or bounding boxes, each image in PlantSeg includes detailed and high-quality segmentation masks, associated with plant types and disease names. (2) Image source: Unlike typical datasets that contain images from laboratory settings, PlantSeg primarily comprises in-the-wild plant disease images. This choice enhances the practical applicability, as the trained models can be applied for integrated disease management. (3) Scale: PlantSeg is extensive, featuring 11,400 images with disease segmentation masks and an additional 8,000 healthy plant images categorized by plant type. Extensive technical experiments validate the high quality of PlantSeg's annotations. This dataset not only allows researchers to evaluate their image classification methods but also provides a critical foundation for developing and benchmarking advanced plant disease segmentation algorithms.
CVAug 6, 2024
Benchmarking In-the-wild Multimodal Disease Recognition and A Versatile BaselineTianqi Wei, Zhi Chen, Zi Huang et al.
Existing plant disease classification models have achieved remarkable performance in recognizing in-laboratory diseased images. However, their performance often significantly degrades in classifying in-the-wild images. Furthermore, we observed that in-the-wild plant images may exhibit similar appearances across various diseases (i.e., small inter-class discrepancy) while the same diseases may look quite different (i.e., large intra-class variance). Motivated by this observation, we propose an in-the-wild multimodal plant disease recognition dataset that contains the largest number of disease classes but also text-based descriptions for each disease. Particularly, the newly provided text descriptions are introduced to provide rich information in textual modality and facilitate in-the-wild disease classification with small inter-class discrepancy and large intra-class variance issues. Therefore, our proposed dataset can be regarded as an ideal testbed for evaluating disease recognition methods in the real world. In addition, we further present a strong yet versatile baseline that models text descriptions and visual data through multiple prototypes for a given class. By fusing the contributions of multimodal prototypes in classification, our baseline can effectively address the small inter-class discrepancy and large intra-class variance issues. Remarkably, our baseline model can not only classify diseases but also recognize diseases in few-shot or training-free scenarios. Extensive benchmarking results demonstrate that our proposed in-the-wild multimodal dataset sets many new challenges to the plant disease recognition task and there is a large space to improve for future works.
LGFeb 2, 2025Code
Sundial: A Family of Highly Capable Time Series Foundation ModelsYong Liu, Guo Qin, Zhiyuan Shi et al.
We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next-patch's distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on continuous-valued time series without discrete tokenization. Conditioned on arbitrary-length time series, our models are pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving more flexibility in representation learning than using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with one trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse via TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which achieve unprecedented model capacity and generalization performance. In addition to excellent scalability, Sundial achieves state-of-the-art results on both point and probabilistic forecasting benchmarks with a just-in-time inference speed, i.e., making zero-shot predictions within a few milliseconds. We believe that Sundial's pioneering generative forecasting capability can improve model reliability in real-world decision-making. Code is available at: https://github.com/thuml/Sundial.
CRMay 21
CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion ControllersZhi Chen, Shehab Sarar Ahmed, Chenkai Wang et al.
Congestion controllers (CCs) are critical to network performance, and yet their robustness under adverse conditions remains insufficiently understood. While recent learning-based CCs have demonstrated strong performance in controlled environments, it is unclear how they compare to traditional CCs when controllers' input signals are corrupted or when environmental conditions become systematically challenging. In this paper, we introduce CCLab, an adversarial testing framework for systematically evaluating the robustness of both learning-based and non-learning-based CCs. CCLab includes a reinforcement learning (RL)-based adversarial agent that operates in a closed loop with the congestion control policy, generating bounded perturbations either on input signals (feature-level) or on external network conditions (environment-level), while preserving realism through explicit constraints. Using this framework, we compare learning-based CCs with non-learning-based CCs under both feature-level and environment-level adversarial conditions. While both types of CCs suffer from performance degradation under adversarial testing, we find that learning-based CCs, in general, are more robust than traditional human-designed algorithms. Finally, we show that our adversarial traces can be used to train more robust CCs that outperform existing learning-based CCs under both challenging and normal conditions.
CVAug 6, 2024
FastEdit: Fast Text-Guided Single-Image Editing via Semantic-Aware Diffusion Fine-TuningZhi Chen, Zecheng Zhao, Yadan Luo et al.
Conventional Text-guided single-image editing approaches require a two-step process, including fine-tuning the target text embedding for over 1K iterations and the generative model for another 1.5K iterations. Although it ensures that the resulting image closely aligns with both the input image and the target text, this process often requires 7 minutes per image, posing a challenge for practical application due to its time-intensive nature. To address this bottleneck, we introduce FastEdit, a fast text-guided single-image editing method with semantic-aware diffusion fine-tuning, dramatically accelerating the editing process to only 17 seconds. FastEdit streamlines the generative model's fine-tuning phase, reducing it from 1.5K to a mere 50 iterations. For diffusion fine-tuning, we adopt certain time step values based on the semantic discrepancy between the input image and target text. Furthermore, FastEdit circumvents the initial fine-tuning step by utilizing an image-to-image model that conditions on the feature space, rather than the text embedding space. It can effectively align the target text prompt and input image within the same feature space and save substantial processing time. Additionally, we apply the parameter-efficient fine-tuning technique LoRA to U-net. With LoRA, FastEdit minimizes the model's trainable parameters to only 0.37\% of the original size. At the same time, we can achieve comparable editing outcomes with significantly reduced computational overhead. We conduct extensive experiments to validate the editing performance of our approach and show promising editing capabilities, including content addition, style transfer, background replacement, and posture manipulation, etc.
CVSep 13, 2024
CF-PRNet: Coarse-to-Fine Prototype Refining Network for Point Cloud Completion and ReconstructionZhi Chen, Tianqi Wei, Zecheng Zhao et al.
In modern agriculture, precise monitoring of plants and fruits is crucial for tasks such as high-throughput phenotyping and automated harvesting. This paper addresses the challenge of reconstructing accurate 3D shapes of fruits from partial views, which is common in agricultural settings. We introduce CF-PRNet, a coarse-to-fine prototype refining network, leverages high-resolution 3D data during the training phase but requires only a single RGB-D image for real-time inference. Our approach begins by extracting the incomplete point cloud data that constructed from a partial view of a fruit with a series of convolutional blocks. The extracted features inform the generation of scaling vectors that refine two sequentially constructed 3D mesh prototypes - one coarse and one fine-grained. This progressive refinement facilitates the detailed completion of the final point clouds, achieving detailed and accurate reconstructions. CF-PRNet demonstrates excellent performance metrics with a Chamfer Distance of 3.78, an F1 Score of 66.76%, a Precision of 56.56%, and a Recall of 85.31%, and win the first place in the Shape Completion and Reconstruction of Sweet Peppers Challenge.
IVSep 13, 2024
FiAt-Net: Detecting Fibroatheroma Plaque Cap in 3D Intravascular OCT ImagesYaopeng Peng, Zhi Chen, Andreas Wahle et al.
The key manifestation of coronary artery disease (CAD) is development of fibroatheromatous plaque, the cap of which may rupture and subsequently lead to coronary artery blocking and heart attack. As such, quantitative analysis of coronary plaque, its plaque cap, and consequently the cap's likelihood to rupture are of critical importance when assessing a risk of cardiovascular events. This paper reports a new deep learning based approach, called FiAt-Net, for detecting angular extent of fibroatheroma (FA) and segmenting its cap in 3D intravascular optical coherence tomography (IVOCT) images. IVOCT 2D image frames are first associated with distinct clusters and data from each cluster are used for model training. As plaque is typically focal and thus unevenly distributed, a binary partitioning method is employed to identify FA plaque areas to focus on to mitigate the data imbalance issue. Additional image representations (called auxiliary images) are generated to capture IVOCT intensity changes to help distinguish FA and non-FA areas on the coronary wall. Information in varying scales is derived from the original IVOCT and auxiliary images, and a multi-head self-attention mechanism is employed to fuse such information. Our FiAt-Net achieved high performance on a 3D IVOCT coronary image dataset, demonstrating its effectiveness in accurately detecting FA cap in IVOCT images.
CVAug 27, 2024
Snap and Diagnose: An Advanced Multimodal Retrieval System for Identifying Plant Diseases in the WildTianqi Wei, Zhi Chen, Xin Yu
Plant disease recognition is a critical task that ensures crop health and mitigates the damage caused by diseases. A handy tool that enables farmers to receive a diagnosis based on query pictures or the text description of suspicious plants is in high demand for initiating treatment before potential diseases spread further. In this paper, we develop a multimodal plant disease image retrieval system to support disease search based on either image or text prompts. Specifically, we utilize the largest in-the-wild plant disease dataset PlantWild, which includes over 18,000 images across 89 categories, to provide a comprehensive view of potential diseases relating to the query. Furthermore, cross-modal retrieval is achieved in the developed system, facilitated by a novel CLIP-based vision-language model that encodes both disease descriptions and disease images into the same latent space. Built on top of the retriever, our retrieval system allows users to upload either plant disease images or disease descriptions to retrieve the corresponding images with similar characteristics from the disease dataset to suggest candidate diseases for end users' consideration.
LGNov 14, 2025
Virtual Width NetworksSeed, Baisheng Li, Banggu Wu et al.
We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
ITMay 7
Energy-Efficient Movable Antennas: Mechanical Power Modeling and Performance OptimizationXin Wei, Weidong Mei, Xuan Huang et al.
Movable antennas (MAs) offer additional spatial degrees of freedom (DoFs) to enhance communication performance through local antenna movement. However, to achieve accurate and fast antenna movement, MA drivers entail non-negligible mechanical power consumption, rendering energy efficiency (EE) optimization more critical compared to conventional fixed-position antenna (FPA) systems. To address this issue, we develop a fundamental power consumption model for stepper motor-driven multi-MA systems based on electric motor theory. Based on this model, we formulate an EE maximization problem from a multi-MA base station (BS) to multiple single-FPA users. We aim to jointly optimize the MAs' positions, moving speeds, and the BS's transmit precoding matrix subject to collision-avoidance constraints during the multi-MA movements. However, this problem is difficult to solve. To tackle this challenge, we first reveal that the collision-avoidance constraints can always be relaxed without loss of optimality by properly renumbering the MA indices. For the resulting relaxed problem, we first consider a simplified single-user setup and uncover a hidden monotonicity of the EE performance with respect to the MAs' moving speeds. To solve the remaining optimization problem, we develop a two-layer optimization framework. In the inner layer, the Dinkelbach algorithm is employed to derive the optimal beamforming solution for any given MA positions. In the outer layer, a sequential update algorithm is proposed to iteratively refine the MA positions based on the optimal values obtained from the inner layer. Next, we proceed to the general multi-user case and propose an alternating optimization (AO) algorithm. Numerical results demonstrate that despite the additional mechanical power consumption, the proposed algorithms can outperform both conventional FPA systems and other existing EE maximization benchmarks
LGJan 3, 2023
Risk-Averse MDPs under Reward AmbiguityHaolin Ruan, Zhi Chen, Chin Pang Ho
We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive the tractable reformulation for our model. In particular, we show that that the return-risk model can also account for risk from uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.
SEApr 9
Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering AgentsZhi Chen, Zhensu Sun, Yuling Shi et al.
Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of this behavior remains unclear. For example, GPT-5.2 writes almost no new tests yet achieves performance comparable to top-ranking agents.This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget? To better understand the role of agent-written tests, we analyze trajectories produced by six strong LLMs on SWE-bench Verified. Our results show that test writing is common, but resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. When tests are written, they mainly serve as observational feedback channels, with value-revealing print statements appearing much more often than assertion-based checks. Based on these insights, we perform a prompt-intervention study by revising the prompts used with four models to either increase or reduce test writing. The results suggest that prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes in this setting. Taken together, these results suggest that current agent-written testing practices reshape process and cost more than final task outcomes.
ROMay 1
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot PoliciesYi Wang, Xinchen Li, Pengwei Xie et al.
Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3--5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
SEMay 18
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering AgentsWei Ma, Zhi Chen, Jingxu Gu et al.
Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.
CVNov 11, 2025
Distributed Zero-Shot Learning for Visual RecognitionZhi Chen, Yadan Luo, Zi Huang et al.
In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space would be stable during learning, and thus facilitate the establishment of visual-to-attribute(V2A) relationships. Then, we introduce the global attribute-tovisual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.
AIDec 24, 2024Code
GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing AgentKangjia Zhao, Jiahui Song, Leigang Sha et al.
Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this paper, we propose a formalized and comprehensive environment to evaluate the entire process of automated GUI Testing (GTArena), offering a fair, standardized environment for consistent operation of diverse multimodal large language models. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. It evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data, thoroughly assessing their capabilities in this relevant task. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal language large models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all sub-tasks of automated GUI Testing, highlighting a significant gap between the current capabilities of Autonomous GUI Testing and its practical, real-world applicability. This gap provides guidance for the future direction of GUI Agent development. Our code is available at https://github.com/ZJU-ACES-ISE/ChatUITest.
CVFeb 10, 2023
Deep Seam Prediction for Image Stitching Based on Selection Consistency LossSenmao Cheng, Fan Yang, Zhi Chen et al.
Image stitching is to construct panoramic images with wider field of vision (FOV) from some images captured from different viewing positions. To solve the problem of fusion ghosting in the stitched image, seam-driven methods avoid the misalignment area to fuse images by predicting the best seam. Currently, as standard tools of the OpenCV library, dynamic programming (DP) and GraphCut (GC) are still the only commonly used seam prediction methods despite the fact that they were both proposed two decades ago. However, GC can get excellent seam quality but poor real-time performance while DP method has good efficiency but poor seam quality. In this paper, we propose a deep learning based seam prediction method (DSeam) for the sake of high seam quality with high efficiency. To overcome the difficulty of the seam description in network and no GroundTruth for training we design a selective consistency loss combining the seam shape constraint and seam quality constraint to supervise the network learning. By the constraint of the selection of consistency loss, we implicitly defined the mask boundaries as seams and transform seam prediction into mask prediction. To our knowledge, the proposed DSeam is the first deep learning based seam prediction method for image stitching. Extensive experimental results well demonstrate the superior performance of our proposed Dseam method which is 15 times faster than the classic GC seam prediction method in OpenCV 2.4.9 with similar seam quality.
APP-PHAug 15, 2024
Phononic materials with effectively scale-separated hierarchical features using interpretable machine learningMary V. Bastawrous, Zhi Chen, Alexander C. Ogren et al.
Manipulating the dispersive characteristics of vibrational waves is beneficial for many applications, e.g., high-precision instruments. architected hierarchical phononic materials have sparked promise tunability of elastodynamic waves and vibrations over multiple frequency ranges. In this article, hierarchical unit-cells are obtained, where features at each length scale result in a band gap within a targeted frequency range. Our novel approach, the ``hierarchical unit-cell template method,'' is an interpretable machine-learning approach that uncovers global unit-cell shape/topology patterns corresponding to predefined band-gap objectives. A scale-separation effect is observed where the coarse-scale band-gap objective is mostly unaffected by the fine-scale features despite the closeness of their length scales, thus enabling an efficient hierarchical algorithm. Moreover, the hierarchical patterns revealed are not predefined or self-similar hierarchies as common in current hierarchical phononic materials. Thus, our approach offers a flexible and efficient method for the exploration of new regions in the hierarchical design space, extracting minimal effective patterns for inverse design in applications targeting multiple frequency ranges.
CEJul 2, 2024
CatMemo at the FinLLM Challenge Task: Fine-Tuning Large Language Models using Data Fusion in Financial ApplicationsYupeng Cao, Zhiyuan Yao, Zhi Chen et al.
The integration of Large Language Models (LLMs) into financial analysis has garnered significant attention in the NLP community. This paper presents our solution to IJCAI-2024 FinLLM challenge, investigating the capabilities of LLMs within three critical areas of financial tasks: financial classification, financial text summarization, and single stock trading. We adopted Llama3-8B and Mistral-7B as base models, fine-tuning them through Parameter Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA) approaches. To enhance model performance, we combine datasets from task 1 and task 2 for data fusion. Our approach aims to tackle these diverse tasks in a comprehensive and integrated manner, showcasing LLMs' capacity to address diverse and complex financial tasks with improved accuracy and decision-making capabilities.
CVMar 13, 2025Code
Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware RoutingZecheng Zhao, Zhi Chen, Zi Huang et al.
Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR.
SESep 18, 2024
Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and MemorizationZhi Chen, Lingxiao Jiang
In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We show that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating privacy or copyright. Our study further explores effectiveness and memorization patterns in incremental learning, emphasizing the sequence in which individual participant datasets are introduced. We also identify cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with recommendations for practitioners and researchers to optimize multisource datasets, propelling cross-organizational collaboration forward.
IRMar 12
SRSUPM: Sequential Recommender System Based on User Psychological MotivationYicheng Di, Yuan Liu, Zhi Chen et al.
Sequential recommender infers users' evolving psychological motivations from historical interactions to recommend the next preferred items. Most existing methods compress recent behaviors into a single vector and optimize it toward a single observed target item, but lack explicit modeling of psychological motivation shift. As a result, they struggle to uncover the distributional patterns across different shift degrees and to capture collaborative knowledge that is sensitive to psychological motivation shift. We propose a general framework, the Sequential Recommender System Based on User Psychological Motivation, to enhance sequential recommenders with psychological motivation shift-aware user modeling. Specifically, the Psychological Motivation Shift Assessment quantitatively measures psychological motivation shift; guided by PMSA, the Shift Information Construction models dynamically evolving multi-level shift states, and the Psychological Motivation Shift-driven Information Decomposition decomposes and regularizes representations across shift levels. Moreover, the Psychological Motivation Shift Information Matching strengthens collaborative patterns related to psychological motivation shift to learn more discriminative user representations. Extensive experiments on three public benchmarks show that SRSUPM consistently outperforms representative baselines on diverse sequential recommender tasks.
CLMar 14, 2025Code
DeskVision: Large Scale Desktop Region Captioning for Advanced GUI AgentsYibin Xu, Liang Yang, Hao Chen et al.
The limitation of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents today, especially for the desktop / computer use scenarios. To address this, we propose an automated GUI data generation pipeline, AutoCaptioner, which generates data with rich descriptions while minimizing human effort. Using AutoCaptioner, we created a novel large-scale desktop GUI dataset, DeskVision, along with the largest desktop test benchmark, DeskVision-Eval, which reflects daily usage and covers diverse systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding/grounding visual elements without the need for complex architectural designs. We further validated the effectiveness of the DeskVision dataset through ablation studies on various large visual language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and will open-source them for the community.