CL · Jun 4, 2025 · Code
Seed-Coder: Let the Code Model Curate Data for Itself
ByteDance Seed, Yuyu Zhang, Jing Su et al. · bytedance
Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
LG · May 28
Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Sharan Vaswani, Yifan Sun, Reza Babanezhad
Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
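As a toy illustration of the sign GD versus GD comparison on separable logistic regression, the sketch below runs both methods from the same starting point with the same step size. The data set, the step size, and the iteration count are arbitrary choices made for this example, not values from the paper, and a small experiment like this does not substitute for the paper's analysis.

```python
import math

# Hypothetical, linearly separable toy data: (features, label in {-1, +1}).
DATA = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.5, -1.5), -1)]

def margin(w, x, y):
    return y * sum(wi * xi for wi, xi in zip(w, x))

def logistic_loss(w, data=DATA):
    """Mean of log(1 + exp(-margin)), computed in a numerically stable way."""
    total = 0.0
    for x, y in data:
        m = margin(w, x, y)
        total += math.log1p(math.exp(-m)) if m > 0 else -m + math.log1p(math.exp(m))
    return total / len(data)

def gradient(w, data=DATA):
    """Gradient of the mean logistic loss with respect to w."""
    g = [0.0] * len(w)
    for x, y in data:
        m = margin(w, x, y)
        # s = sigmoid(-m), written to avoid overflow for large |m|
        if m >= 0:
            e = math.exp(-m)
            s = e / (1.0 + e)
        else:
            s = 1.0 / (1.0 + math.exp(m))
        for j, xj in enumerate(x):
            g[j] -= s * y * xj / len(data)
    return g

def sign(v):
    return (v > 0) - (v < 0)

def run(use_sign, steps=100, eta=0.1):
    """Run GD (use_sign=False) or sign GD (use_sign=True) from w = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(w)
        step = [sign(gj) for gj in g] if use_sign else g
        w = [wj - eta * sj for wj, sj in zip(w, step)]
    return logistic_loss(w)

loss_gd = run(use_sign=False)
loss_sign = run(use_sign=True)
print(f"GD loss: {loss_gd:.3e}, sign GD loss: {loss_sign:.3e}")
```

On this data the gradient shrinks as the margins grow, so plain GD slows down, while sign GD keeps moving a fixed distance per step.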
CV · Mar 17, 2023 · Code
A Unified Continual Learning Framework with General Parameter-Efficient Tuning
Qiankun Gao, Chen Zhao, Yifan Sun et al.
The "pre-training $\rightarrow$ downstream adaptation" paradigm presents both new opportunities and challenges for Continual Learning (CL). Although the recent state of the art in CL is achieved through the Parameter-Efficient-Tuning (PET) adaptation paradigm, only prompting has been explored, limiting its application to Transformers. In this paper, we position prompting as one instantiation of PET, and propose a unified CL framework with general PET, dubbed Learning-Accumulation-Ensemble (LAE). PET, e.g., using Adapter, LoRA, or Prefix, can adapt a pre-trained model to downstream tasks with fewer parameters and resources. Given a PET method, our LAE framework incorporates it for CL with three novel designs. 1) Learning: the pre-trained model adapts to the new task by tuning an online PET module, along with our adaptation speed calibration to align different PET modules; 2) Accumulation: the task-specific knowledge learned by the online PET module is accumulated into an offline PET module through momentum update; 3) Ensemble: during inference, we construct two experts with the online and offline PET modules (which are favored by the novel and historical tasks, respectively) for prediction ensemble. We show that LAE is compatible with a battery of PET methods and gains strong CL capability. For example, LAE with Adapter PET surpasses the prior state of the art by 1.3% and 3.6% in last-incremental accuracy on the CIFAR100 and ImageNet-R datasets, respectively. Code is available at https://github.com/gqk/LAE.
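The Accumulation and Ensemble steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the momentum update is written as a standard exponential moving average, the coefficient `alpha` is an assumed value, and combining the two experts by an elementwise maximum is an assumption made for this example, since the abstract does not specify the rule. Both function names are hypothetical.

```python
def momentum_accumulate(offline, online, alpha=0.99):
    """EMA-style update: offline <- alpha * offline + (1 - alpha) * online.

    `offline` and `online` are flat lists of PET-module parameters.
    """
    return [alpha * off + (1.0 - alpha) * on for off, on in zip(offline, online)]

def ensemble_predict(probs_online, probs_offline):
    """Combine two experts' class probabilities and return the predicted class.

    The elementwise-max rule is an assumption for illustration only.
    """
    combined = [max(a, b) for a, b in zip(probs_online, probs_offline)]
    return max(range(len(combined)), key=combined.__getitem__)

# Usage: one accumulation step, then an ensemble prediction over 3 classes.
offline_params = momentum_accumulate([0.0, 2.0], [1.0, 0.0], alpha=0.9)
predicted = ensemble_predict([0.1, 0.7, 0.2], [0.6, 0.3, 0.1])
print(offline_params, predicted)
```

With a coefficient close to 1, the offline module changes slowly, which is what lets it retain knowledge from earlier tasks while the online module adapts to the new one.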
CV · Apr 13, 2023 · Code
TransHP: Image Classification with Hierarchical Prompting
Wenhao Wang, Yifan Sun, Wei Li et al.
This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Different from prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits the descendant-class discrimination. We believe this imitates human visual recognition well, i.e., humans may use the ancestor class as a prompt to draw focus on the subtle differences among descendant classes. We model this prompting mechanism into a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) on-the-fly predicting the coarse class of the input image at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Though the parameters of TransHP remain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on relatively subtle differences among the descendant classes. Extensive experiments show that TransHP improves image classification in accuracy (e.g., improving ViT-B/16 by +2.83% ImageNet classification accuracy), training data efficiency (e.g., +12.69% improvement under 10% ImageNet training data), and model explainability. Moreover, TransHP also performs favorably against prior HIC methods, showing that TransHP exploits the hierarchical information well. The code is available at: https://github.com/WangWenhao0716/TransHP.
CL · Apr 10, 2025
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
ByteDance Seed, Jiaze Chen, Tiantian Fan et al. · bytedance
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
CV · Jun 29, 2022 · Code
SRCN3D: Sparse R-CNN 3D for Compact Convolutional Multi-View 3D Object Detection and Tracking
Yining Shi, Jingyan Shen, Yifan Sun et al. · tsinghua
Detection and tracking of moving objects is an essential component in environmental perception for autonomous driving. In the flourishing field of multi-view 3D camera-based detectors, different transformer-based pipelines are designed to learn queries in 3D space from 2D feature maps of perspective views, but the dominant dense BEV query mechanism is computationally inefficient. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage fully-sparse detector that incorporates sparse queries, sparse attention with box-wise sampling, and sparse prediction. SRCN3D adopts a cascade structure with the twin-track update of both a fixed number of query boxes and latent query features. Our novel sparse feature sampling module only utilizes local 2D region of interest (RoI) features calculated by the projection of 3D query boxes for further box refinement, leading to a fully-convolutional and deployment-friendly pipeline. For multi-object tracking, motion features, query features and RoI features are comprehensively utilized in multi-hypotheses data association. Extensive experiments on nuScenes dataset demonstrate that SRCN3D achieves competitive performance in both 3D object detection and multi-object tracking tasks, while also exhibiting superior efficiency compared to transformer-based methods. Code and models are available at https://github.com/synsin0/SRCN3D.
AI · Nov 30, 2024
FullStack Bench: Evaluating LLMs as Full Stack Coders
Bytedance-Seed-Foundation-Code-Team, Yao Cheng, Jianfeng Chen et al. · bytedance
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.
CV · Jul 7, 2024 · Code
Replication in Visual Diffusion Models: A Survey and Outlook
Wenhao Wang, Yifan Sun, Zongxin Yang et al.
Visual diffusion models have revolutionized the field of creative AI, producing high-quality and diverse content. However, they inevitably memorize training images or videos, subsequently replicating their concepts, content, or styles during inference. This phenomenon raises significant concerns about privacy, security, and copyright within generated outputs. In this survey, we provide the first comprehensive review of replication in visual diffusion models, marking a novel contribution to the field by systematically categorizing the existing studies into unveiling, understanding, and mitigating this phenomenon. Specifically, unveiling mainly refers to the methods used to detect replication instances. Understanding involves analyzing the underlying mechanisms and factors that contribute to this phenomenon. Mitigation focuses on developing strategies to reduce or eliminate replication. Beyond these aspects, we also review papers focusing on its real-world influence. For instance, in the context of healthcare, replication is critically worrying due to privacy concerns related to patient data. Finally, the paper concludes with a discussion of the ongoing challenges, such as the difficulty in detecting and benchmarking replication, and outlines future directions including the development of more robust mitigation techniques. By synthesizing insights from diverse studies, this paper aims to equip researchers and practitioners with a deeper understanding at the intersection between AI technology and social good. We release this project at https://github.com/WangWenhao0716/Awesome-Diffusion-Replication.
CV · Jul 26, 2022 · Code
V$^2$L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval
Wenhao Wang, Yifan Sun, Zongxin Yang et al.
Product retrieval is of great importance in the ecommerce domain. This paper introduces our 1st-place solution in the eBay eProduct Visual Search Challenge (FGVC9), which features an ensemble of about 20 models drawn from vision models and vision-language models. While model ensemble is common, we show that combining the vision models and vision-language models brings particular benefits from their complementarity and is a key factor in our superiority. Specifically, for the vision models, we use a two-stage training pipeline which first learns from the coarse labels provided in the training set and then conducts fine-grained self-supervised training, yielding a coarse-to-fine metric learning scheme. For the vision-language models, we use the textual description of the training image as the supervision signal for fine-tuning the image encoder (feature extractor). With these designs, our solution achieves 0.7623 MAR@10, ranking first among all the competitors. The code is available at: https://github.com/WangWenhao0716/V2L.
CV · Jul 17, 2023
Large-Scale Person Detection and Localization using Overhead Fisheye Cameras
Lu Yang, Liulei Li, Xueshi Xin et al.
Location determination finds wide applications in daily life. Instead of existing efforts devoted to localizing tourist photos captured by perspective cameras, in this article, we focus on devising person positioning solutions using overhead fisheye cameras. Such solutions are advantageous in their large field of view (FOV), low cost, anti-occlusion, and unaggressive work mode (without the necessity of cameras carried by persons). However, related studies are quite scarce, due to the paucity of data. To stimulate research in this exciting area, we present LOAF, the first large-scale overhead fisheye dataset for person detection and localization. LOAF is built with many essential features, e.g., i) the data cover abundant diversities in scenes, human pose, density, and location; ii) it currently contains the largest number of annotated pedestrians, i.e., 457K bounding boxes with groundtruth location information; iii) the body-boxes are labeled as radius-aligned so as to fully address the positioning challenge. To approach localization, we build a fisheye person detection network, which exploits the fisheye distortions by a rotation-equivariant training strategy and predicts radius-aligned human boxes end-to-end. Then, the actual locations of the detected persons are calculated by a numerical solution on the fisheye model and camera altitude data. Extensive experiments on LOAF validate the superiority of our fisheye detector w.r.t. previous methods, and show that our whole fisheye positioning solution is able to locate all persons in FOV with an accuracy of 0.5 m, within 0.1 s.
CV · May 24, 2022 · Code
A Benchmark and Asymmetrical-Similarity Learning for Practical Image Copy Detection
Wenhao Wang, Yifan Sun, Yi Yang
Image copy detection (ICD) aims to determine whether a query image is an edited copy of any image from a reference set. Currently, there are very few public benchmarks for ICD, and all of them overlook a critical challenge in real-world applications, i.e., the distraction from hard negative queries. Specifically, some queries are not edited copies but are inherently similar to some reference images. These hard negative queries are easily misrecognized as edited copies, significantly compromising the ICD accuracy. This observation motivates us to build the first ICD benchmark featuring this characteristic. Based on existing ICD datasets, this paper constructs a new dataset by adding 100,000 and 24,252 hard negative pairs into the training and test sets, respectively. Moreover, this paper further reveals a unique difficulty for solving the hard negative problem in ICD, i.e., there is a fundamental conflict between current metric learning and ICD. This conflict is: the metric learning adopts symmetric distance while the edited copy is an asymmetric (unidirectional) process, e.g., a partial crop is close to its holistic reference image and is an edited copy, while the latter cannot be the edited copy of the former (even though the distance is equally small). This insight results in an Asymmetrical-Similarity Learning (ASL) method, which allows the similarity in two directions (the query <-> the reference image) to be different from each other. Experimental results show that ASL outperforms state-of-the-art methods by a clear margin, confirming that solving the symmetric-asymmetric conflict is critical for ICD. The NDEC dataset and code are available at https://github.com/WangWenhao0716/ASL.
CV · Mar 18, 2023 · Code
Exploring Expression-related Self-supervised Learning for Affective Behaviour Analysis
Fanglei Xue, Yifan Sun, Yi Yang
This paper explores an expression-related self-supervised learning (SSL) method (ContraWarping) to perform expression classification in the 5th Affective Behavior Analysis in-the-wild (ABAW) competition. Affective datasets are expensive to annotate, and SSL methods can learn from large-scale unlabeled data, which makes them more suitable for this task. By evaluating on the Aff-Wild2 dataset, we demonstrate that ContraWarping outperforms most existing supervised methods and shows great application potential in the affective analysis area. Code will be released at: https://github.com/youqingxiaozhua/ABAW5.
CV · Apr 20, 2023 · Code
Feature-compatible Progressive Learning for Video Copy Detection
Wenhao Wang, Yifan Sun, Yi Yang
Video Copy Detection (VCD) has been developed to identify instances of unauthorized or duplicated video content. This paper presents our second-place solutions to the Meta AI Video Similarity Challenge (VSC22), CVPR 2023. In order to compete in this challenge, we propose Feature-Compatible Progressive Learning (FCPL) for VCD. FCPL trains various models that produce mutually-compatible features, meaning that the features derived from multiple distinct models can be directly compared with one another. We find this mutual compatibility enables feature ensemble. By implementing progressive learning and utilizing labeled ground truth pairs, we gradually and effectively enhance performance. Experimental results demonstrate the superiority of the proposed FCPL over other competitors. Our code is available at https://github.com/WangWenhao0716/VSC-DescriptorTrack-Submission and https://github.com/WangWenhao0716/VSC-MatchingTrack-Submission.
CV · Oct 13, 2022
Feature-Proxy Transformer for Few-Shot Segmentation
Jian-Wei Zhang, Yifan Sun, Yi Yang et al.
Few-shot segmentation (FSS) aims at performing semantic segmentation on novel classes given a few annotated support samples. With a rethink of recent advances, we find that the current FSS framework has deviated far from the supervised segmentation framework: Given the deep features, FSS methods typically use an intricate decoder to perform sophisticated pixel-wise matching, while the supervised segmentation methods use a simple linear classification head. Due to the intricacy of the decoder and its matching pipeline, it is not easy to follow such an FSS framework. This paper revives the straightforward framework of "feature extractor $+$ linear classification head" and proposes a novel Feature-Proxy Transformer (FPTrans) method, in which the "proxy" is the vector representing a semantic class in the linear classification head. FPTrans has two keypoints for learning discriminative features and representative proxies: 1) To better utilize the limited support samples, the feature extractor makes the query interact with the support features from the bottom to top layers using a novel prompting strategy. 2) FPTrans uses multiple local background proxies (instead of a single one) because the background is not homogeneous and may contain some novel foreground regions. These two keypoints are easily integrated into the vision transformer backbone with the prompting mechanism in the transformer. Given the learned features and proxies, FPTrans directly compares their cosine similarity for segmentation. Although the framework is straightforward, we show that FPTrans achieves competitive FSS accuracy on par with state-of-the-art decoder-based methods.
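The final step, comparing features and proxies by cosine similarity, can be sketched as below. This is a simplified illustration under stated assumptions rather than the FPTrans implementation: features and proxies are plain Python vectors, and scoring the background as the maximum similarity over the local background proxies is an assumption made for this example. The names `cosine` and `segment` are hypothetical.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def segment(pixel_features, fg_proxy, bg_proxies):
    """Label each pixel 1 (foreground) or 0 (background).

    A pixel is foreground when it is more similar to the foreground proxy
    than to its best-matching background proxy (assumed aggregation rule).
    """
    mask = []
    for feat in pixel_features:
        fg_score = cosine(feat, fg_proxy)
        bg_score = max(cosine(feat, p) for p in bg_proxies)
        mask.append(1 if fg_score > bg_score else 0)
    return mask

# Usage: three "pixels" with 2-D features, one foreground and two background proxies.
features = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
mask = segment(features, fg_proxy=(1.0, 0.1), bg_proxies=[(0.0, 1.0), (-1.0, 0.0)])
print(mask)
```

Using several background proxies lets differently looking background regions each match their own proxy, which a single background vector cannot do.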
CV · Jul 21, 2022
UFO: Unified Feature Optimization
Teng Xi, Yifan Sun, Deli Yu et al.
This paper proposes a novel Unified Feature Optimization (UFO) paradigm for training and deploying deep models under real-world and large-scale scenarios, which require a collection of multiple AI functions. UFO aims to benefit each single task with a large-scale pretraining on all tasks. Compared with the well-known foundation model, UFO has two different points of emphasis, i.e., relatively smaller model size and NO adaptation cost: 1) UFO squeezes a wide range of tasks into a moderate-sized unified model in a multi-task learning manner and further trims the model size when transferred to downstream tasks. 2) UFO does not emphasize transfer to novel tasks. Instead, it aims to make the trimmed model dedicated to one or more already-seen tasks. With these two characteristics, UFO provides great convenience for flexible deployment, while maintaining the benefits of large-scale pretraining. A key merit of UFO is that the trimming process not only reduces the model size and inference consumption, but also improves the accuracy on certain tasks. Specifically, UFO considers the multi-task training and brings two-fold impact on the unified model: some closely related tasks have mutual benefits, while some tasks have conflicts against each other. UFO manages to reduce the conflicts and to preserve the mutual benefits through a novel Network Architecture Search (NAS) method. Experiments on a wide range of deep representation learning tasks (i.e., face recognition, person re-identification, vehicle re-identification and product retrieval) show that the model trimmed from UFO achieves higher accuracy than its single-task-trained counterpart and yet has smaller model size, validating the concept of UFO. Besides, UFO also supported the release of a 17-billion-parameter computer vision (CV) foundation model, which is the largest CV model in the industry.
CV · Mar 3, 2022
Bridging the Source-to-target Gap for Cross-domain Person Re-Identification with Intermediate Domains
Yongxing Dai, Yifan Sun, Jun Liu et al.
Cross-domain person re-identification (re-ID), such as unsupervised domain adaptive (UDA) re-ID, aims to transfer the identity-discriminative knowledge from the source to the target domain. Existing methods commonly consider the source and target domains to be isolated from each other, i.e., no intermediate status is modeled between the two domains. Directly transferring the knowledge between two isolated domains can be very difficult, especially when the domain gap is large. From a novel perspective, we assume these two domains are not completely isolated, but can be connected through intermediate domains. Instead of directly aligning the source and target domains against each other, we propose to align the source and target domains against their intermediate domains for a smooth knowledge transfer. To discover and utilize these intermediate domains, we propose an Intermediate Domain Module (IDM) and a Mirrors Generation Module (MGM). IDM has two functions: 1) it generates multiple intermediate domains by mixing the hidden-layer features from source and target domains and 2) it dynamically reduces the domain gap between the source / target domain features and the intermediate domain features. While IDM achieves good domain alignment, it introduces a side effect, i.e., the mix-up operation may mix the identities into a new identity and lose the original identities. To compensate for this, MGM is introduced by mapping the features into the IDM-generated intermediate domains without changing their original identity. This allows the model to focus on minimizing domain variations to promote the alignment between the source / target domain and intermediate domains, which reinforces IDM into IDM++. We extensively evaluate our method under both the UDA and domain generalization (DG) scenarios and observe that IDM++ yields consistent performance improvement for cross-domain re-ID, achieving a new state of the art.
LG · Jul 29, 2024 · Code
AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs
Feiyang Kang, Yifan Sun, Bingbing Wen et al.
Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales, challenging the existing practice of determining competitive mixtures in small-scale experiments and directly applying them at much larger scales. To address this, we propose AutoScale, a two-stage, scale-aware data composition framework. First, AutoScale fits a parametric model that predicts the model's loss under different data compositions, then uses it to find an approximate best allocation at smaller, more manageable budgets. Next, leveraging a novel theoretical analysis of how optimal compositions evolve with scale, AutoScale extrapolates that composition to larger budgets without further retraining. Empirically, AutoScale accelerates convergence and improves downstream performance. For instance, when pre-training GPT-2 Large, it achieves a 28% faster perplexity reduction than baselines and up to a 38% speed-up over unweighted training, while yielding best-average results on various downstream tasks. Overall, our findings illustrate how domain importance shifts with training scale, underscoring the need for scale-dependent data curation in LLM training. Our code is open-sourced.
CV · Sep 24, 2024
MonoFormer: One Transformer for Both Diffusion and Autoregression
Chuyang Zhao, Yuxing Song, Wenhao Wang et al.
Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) Transformer is successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar, and the difference merely lies in that diffusion uses bidirectional attention mask and autoregression uses causal attention mask. Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods as well as maintains the text generation capability. The project is publicly available at https://monoformer.github.io/.
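The difference between the two attention masks mentioned above can be made concrete with a short sketch. It only builds the masks as nested lists, where 1 means "query position i may attend to key position j"; how MonoFormer applies them inside a shared transformer is not shown here, and the function names are hypothetical.

```python
def causal_mask(n):
    """Lower-triangular mask: position i attends to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Full mask: every position attends to every position."""
    return [[1] * n for _ in range(n)]

# Usage: masks for a sequence of length 3.
for row in causal_mask(3):
    print(row)
for row in bidirectional_mask(3):
    print(row)
```

The causal mask suits autoregressive text generation, where a token must not see later tokens, while the full mask suits diffusion, where all positions are denoised jointly.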
LG · Jun 21, 2023
State-wise Constrained Policy Optimization
Weiye Zhao, Rui Chen, Yifan Sun et al.
Reinforcement Learning (RL) algorithms have shown tremendous success in simulation environments, but their application to real-world problems faces significant challenges, with safety being a major concern. In particular, enforcing state-wise constraints is essential for many challenging tasks such as autonomous driving and robot manipulation. However, existing safe RL algorithms under the framework of Constrained Markov Decision Process (CMDP) do not consider state-wise constraints. To address this gap, we propose State-wise Constrained Policy Optimization (SCPO), the first general-purpose policy search algorithm for state-wise constrained reinforcement learning. SCPO provides guarantees for state-wise constraint satisfaction in expectation. In particular, we introduce the framework of Maximum Markov Decision Process, and prove that the worst-case safety violation is bounded under SCPO. We demonstrate the effectiveness of our approach on training neural network policies for extensive robot locomotion tasks, where the agent must satisfy a variety of state-wise safety constraints. Our results show that SCPO significantly outperforms existing methods and can handle state-wise constraints in high-dimensional robotics tasks.
CV · Nov 7, 2022
Generalizable Re-Identification from Videos with Cycle Association
Zhongdao Wang, Zhaopeng Dou, Jingwei Zhang et al.
In this paper, we are interested in learning a generalizable person re-identification (re-ID) representation from unlabeled videos. Compared with 1) the popular unsupervised re-ID setting where the training and test sets are typically under the same domain, and 2) the popular domain generalization (DG) re-ID setting where the training samples are labeled, our novel scenario combines their key challenges: the training samples are unlabeled, and collected from various domains which do not align with the test domain. In other words, we aim to learn a representation in an unsupervised manner and directly use the learned representation for re-ID in novel domains. To fulfill this goal, we make two main contributions: First, we propose Cycle Association (CycAs), a scalable self-supervised learning method for re-ID with low training complexity; and second, we construct a large-scale unlabeled re-ID dataset named LMP-video, tailored for the proposed method. Specifically, CycAs learns re-ID features by enforcing cycle consistency of instance association between temporally successive video frame pairs, and the training cost is merely linear in the data size, making large-scale training possible. On the other hand, the LMP-video dataset is extremely large, containing 50 million unlabeled person images cropped from over 10K YouTube videos, and is therefore sufficient to serve as fertile soil for self-supervised learning. Trained on LMP-video, we show that CycAs learns good generalization towards novel domains. The achieved results sometimes even outperform supervised domain generalizable models. Remarkably, CycAs achieves 82.2% Rank-1 on Market-1501 and 49.0% Rank-1 on MSMT17 with zero human annotation, surpassing state-of-the-art supervised DG re-ID methods. Moreover, we also demonstrate the superiority of CycAs under the canonical unsupervised re-ID and the pretrain-and-finetune scenarios.
AI · Mar 30 · Code
MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
Han Wang, Yifan Sun, Brian Ko et al.
Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.
CV · Sep 30, 2024 · Code
Image Copy Detection for Diffusion Models · Wenhao Wang, Yifan Sun, Zhentao Tan et al.
Images produced by diffusion models are increasingly popular in digital artwork and visual marketing. However, such generated images might replicate content from existing ones and pose the challenge of content originality. Existing Image Copy Detection (ICD) models, though accurate in detecting hand-crafted replicas, overlook the challenge from diffusion models. This motivates us to introduce ICDiff, the first ICD specialized for diffusion models. To this end, we construct a Diffusion-Replication (D-Rep) dataset and correspondingly propose a novel deep embedding method. D-Rep uses a state-of-the-art diffusion model (Stable Diffusion V1.5) to generate 40,000 image-replica pairs, which are manually annotated into 6 replication levels ranging from 0 (no replication) to 5 (total replication). Our method, PDF-Embedding, transforms the replication level of each image-replica pair into a probability density function (PDF) as the supervision signal. The intuition is that the probability of neighboring replication levels should be continuous and smooth. Experimental results show that PDF-Embedding surpasses protocol-driven methods and non-PDF choices on the D-Rep test set. Moreover, by utilizing PDF-Embedding, we find that the replication ratios of well-known diffusion models against an open-source gallery range from 10% to 20%. The project is publicly available at https://icdiff.github.io/.
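To make the idea of turning a replication level into a smooth distribution concrete, here is a minimal sketch. The Gaussian shape, the `sigma` value, and the name `level_to_pdf` are assumptions chosen for illustration; the abstract does not say which density the authors use.

```python
import math

def level_to_pdf(level, num_levels=6, sigma=1.0):
    """Spread a discrete level (0..num_levels-1) into a normalized distribution.

    Assumption: a Gaussian bump centered on the annotated level, so that
    neighboring levels receive smoothly decaying probability mass.
    """
    weights = [math.exp(-((k - level) ** 2) / (2.0 * sigma ** 2))
               for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]

# Soft supervision target for a pair annotated with replication level 3
pdf = level_to_pdf(3)
```

The resulting vector peaks at the annotated level and decays on both sides, which is one way to encode the stated intuition that neighboring levels should have similar probability.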
LG · May 26
APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction · Yifan Sun, Lei Cheng, Sijie Chen et al.
Learning-based surrogates have become increasingly effective for wave-field prediction, and neural operators in particular have shown strong performance within observed frequency regimes. However, higher-frequency prediction under scarce target supervision remains comparatively underexplored, especially in wave problems where higher-frequency data are substantially more expensive to simulate or measure than lower-frequency data. A central difficulty is that cross-frequency transfer is inherently asymmetric: coarse amplitude structure remains relatively stable across frequencies, whereas phase-sensitive oscillatory structure deteriorates much more rapidly as frequency increases. Motivated by this asymmetry, we propose APEX, Amplitude-anchored and Phase-prior-guided Enhancement from eXtrapolated coarse predictions, a framework for target-scarce higher-frequency wave-field prediction. A lower-frequency neural operator first provides a coarse prediction in the target-frequency regime, from which we retain only the amplitude as a transferable structural anchor. A conditional flow-matching enhancer then reconstructs the target higher-frequency field under the guidance of a Green's-function-inspired phase prior. Experiments on SimpleWave, Helmholtz, and Maxwell benchmarks show that APEX consistently outperforms direct lower-to-higher extrapolation, target-adapted operator, and joint generative baselines under limited target-frequency supervision. Our results suggest that reliable higher-frequency prediction of oscillatory wave fields should not rely on direct end-to-end transfer of the full complex field, but instead on explicitly reusing transferable coarse structure while separately recovering the missing oscillatory detail.
LG · Nov 8, 2022
Clustered Federated Learning based on Nonconvex Pairwise Fusion · Xue Yu, Ziyi Liu, Wu Wang et al.
This study investigates clustered federated learning (FL), one of the formulations of FL with non-i.i.d. data, where the devices are partitioned into clusters and each cluster optimally fits its data with a localized model. We propose a clustered FL framework that incorporates a nonconvex penalty to pairwise differences of parameters. Without a priori knowledge of the set of devices in each cluster and the number of clusters, this framework can autonomously estimate cluster structures. To implement the proposed framework, we introduce a novel clustered FL method called Fusion Penalized Federated Clustering (FPFC). Building upon the standard alternating direction method of multipliers (ADMM), FPFC can perform partial updates at each communication round and allows parallel computation with variable workload. These strategies significantly reduce the communication cost while ensuring privacy, making it practical for FL. We also propose a new warmup strategy for hyperparameter tuning in FL settings and explore the asynchronous variant of FPFC (asyncFPFC). Theoretical analysis provides convergence guarantees for FPFC with general losses and establishes the statistical convergence rate under a linear model with squared loss. Extensive experiments have demonstrated the superiority of FPFC compared to current methods, including robustness and generalization capability.
CV · Apr 14, 2023
DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection · Zongheng Tang, Yifan Sun, Si Liu et al.
This paper presents a DETR-based method for cross-domain weakly supervised object detection (CDWSOD), aiming at adapting the detector from source to target domain through weak supervision. We think DETR has strong potential for CDWSOD due to an insight: the encoder and the decoder in DETR are both based on the attention mechanism and are thus capable of aggregating semantics across the entire image. The aggregation results, i.e., image-level predictions, can naturally exploit the weak supervision for domain alignment. Motivated by this, we propose DETR with additional Global Aggregation (DETR-GA), a CDWSOD detector that simultaneously makes "instance-level + image-level" predictions and utilizes "strong + weak" supervisions. The key point of DETR-GA is very simple: for the encoder / decoder, we respectively add multiple class queries / a foreground query to aggregate the semantics into image-level predictions. Our query-based aggregation has two advantages. First, in the encoder, the weakly-supervised class queries are capable of roughly locating the corresponding positions and excluding the distraction from non-relevant regions. Second, through our design, the object queries and the foreground query in the decoder share consensus on the class semantics, therefore making the strong and weak supervision mutually benefit each other for domain alignment. Extensive experiments on four popular cross-domain benchmarks show that DETR-GA significantly improves CDWSOD and advances the state of the art (e.g., 29.0% --> 79.4% mAP on PASCAL VOC --> Clipart_all dataset).
AI · Nov 22, 2023
Data Acquisition: A New Frontier in Data-centric AI · Lingjiao Chen, Bilge Acun, Newsha Ardalani et al.
As Machine Learning (ML) systems continue to grow, the demand for relevant and comprehensive datasets becomes imperative. There is limited study on the challenges of data acquisition due to ad-hoc processes and lack of consistent methodologies. We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets, transparent pricing, and standardized data formats. With the objective of inciting participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. The benchmark was released as a part of DataPerf. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.
CV · Mar 16, 2023
Unsupervised Facial Expression Representation Learning with Contrastive Local Warping · Fanglei Xue, Yifan Sun, Yi Yang
This paper investigates unsupervised representation learning for facial expression analysis. We think Unsupervised Facial Expression Representation (UFER) deserves exploration and has the potential to address some key challenges in facial expression analysis, such as scaling, annotation bias, the discrepancy between discrete labels and continuous emotions, and model pre-training. Motivated by this, we propose a UFER method with contrastive local warping (ContraWarping), which leverages the insight that the emotional expression is robust to current global transformation (affine transformation, color jitter, etc.) but can be easily changed by random local warping. Therefore, given a facial image, ContraWarping employs some global transformations and local warping to generate its positive and negative samples and sets up a novel contrastive learning framework. Our in-depth investigation shows that: 1) the positive pairs from global transformations may be exploited with general self-supervised learning (e.g., BYOL) and already bring some informative features, and 2) the negative pairs from local warping explicitly introduce expression-related variation and further bring substantial improvement. Based on ContraWarping, we demonstrate the benefit of UFER under two facial expression analysis scenarios: facial expression recognition and image retrieval. For example, directly using ContraWarping features for linear probing achieves 79.14% accuracy on RAF-DB, significantly reducing the gap towards the fully supervised counterpart (88.92% / 84.81% with/without pre-training).
CV · Jul 16, 2024
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction · Penghui Du, Yu Wang, Yifan Sun et al.
Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge: (1) A deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge. (2) An overfitting tendency towards base categories, with the open vocabulary knowledge biased towards base categories during the transfer from VLMs to detectors. To address these challenges, we propose the Language Model Instruction (LaMI) strategy, which leverages the relationships between visual concepts and applies them within a simple yet effective DETR-like detector, termed LaMI-DETR. LaMI utilizes GPT to construct visual concepts and employs T5 to investigate visual similarities across categories. These inter-category relationships refine concept representation and avoid overfitting to base categories. Comprehensive experiments validate our approach's superior performance over existing methods in the same rigorous setting without reliance on external training resources. LaMI-DETR achieves a rare box AP of 43.4 on OV-LVIS, surpassing the previous best by 7.8 rare box AP.
SE · Mar 26 · Code
Self-Organizing Multi-Agent Systems for Continuous Software Development · Wenhan Lyu, Yue Xiao, Yixuan Zhang et al.
Large Language Model-based multi-agent systems have shown promise in automating software development tasks. However, most vibe code systems focus on completing small tasks and incremental code changes, leaving persistent, continuous software development largely unexplored. We present TheBotCompany, an open-source orchestration framework for continuous multi-agent software development. TheBotCompany introduces three key innovations: (1) a three-phase state machine (Strategy to Execution to Verification) for milestone-driven development, (2) self-organizing agent teams where manager agents dynamically hire, assign, and fire worker agents based on project needs, and (3) asynchronous human oversight. We evaluate TheBotCompany on real-world software projects over multiple days of continuous development, measuring team adaptation patterns, milestone completion rates, cost efficiency, and code quality. Our results demonstrate that the self-organizing approach enables effective long-term software development with measurable progress, while the verification phase catches defects that would otherwise persist.
CV · Apr 7 · Code
Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images · Xuanguang Liu, Lei Ding, Yujie Li et al.
Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, existing MMCD approaches exhibit limitations in cross-modal interaction and in exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundational models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: https://github.com/liuxuanguang/STSF-Net.
RO · Sep 5, 2023
A Lightweight and Transferable Design for Robust LEGO Manipulation · Ruixuan Liu, Yifan Sun, Changliu Liu
Lego is a well-known platform for prototyping pixelized objects. However, robotic Lego prototyping (i.e., manipulating Lego bricks) is challenging due to the tight connections and accuracy requirements. This paper investigates safe and efficient robotic Lego manipulation. In particular, this paper reduces the complexity of the manipulation by hardware-software co-design. An end-of-arm tool (EOAT) is designed, which reduces the problem dimension and allows large industrial robots to manipulate small Lego bricks. In addition, this paper uses evolution strategy to optimize the robot motion for Lego manipulation. Experiments demonstrate that the EOAT can reliably manipulate Lego bricks and the learning framework can effectively and safely improve the manipulation performance to a 100% success rate. The co-design is deployed to multiple robots (i.e., FANUC LR-mate 200id/7L and Yaskawa GP4) to demonstrate its generalizability and transferability. In the end, we show that the proposed solution enables sustainable robotic Lego prototyping, in which the robot can repeatedly assemble and disassemble different prototypes.
OC · May 24, 2022
Accelerating Frank-Wolfe via Averaging Step Directions · Zhaoyue Chen, Yifan Sun
The Frank-Wolfe method is a popular method in sparse constrained optimization, due to its fast per-iteration complexity. However, the tradeoff is that its worst case global convergence is comparatively slow, and importantly, is fundamentally slower than its flow rate--that is to say, the convergence rate is throttled by discretization error. In this work, we consider a modified Frank-Wolfe where the step direction is a simple weighted average of past oracle calls. This method requires very little memory and computational overhead, and provably decays this discretization error term. Numerically, we show that this method improves the convergence rate over several problems, especially after the sparse manifold has been detected. Theoretically, we show the method has an overall global convergence rate of $O(1/k^p)$, where $0< p < 1$; after manifold identification, this rate speeds to $O(1/k^{3p/2})$. We also observe that the method achieves this accelerated rate from a very early stage, suggesting a promising mode of acceleration for this family of methods.
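The averaging idea can be sketched on a toy problem. The block below is a minimal illustration, not the authors' algorithm: the averaging weight (reusing the step size `2/(k+2)`), the simplex constraint, the quadratic objective, and all function names are assumptions made for this example.

```python
def lmo_simplex(grad):
    """Linear minimization oracle over the probability simplex: best vertex."""
    i = min(range(len(grad)), key=lambda j: grad[j])
    return [1.0 if j == i else 0.0 for j in range(len(grad))]

def averaged_frank_wolfe(grad_f, x0, num_iters=200):
    """Frank-Wolfe where the step direction uses a running average of oracle calls.

    Assumption: the averaging weight equals the step size gamma_k = 2/(k+2).
    """
    x = list(x0)
    s_bar = list(x0)  # overwritten at k = 0, where gamma = 1
    for k in range(num_iters):
        s = lmo_simplex(grad_f(x))
        gamma = 2.0 / (k + 2.0)
        # Weighted average of past oracle outputs (a convex combination of vertices)
        s_bar = [(1.0 - gamma) * a + gamma * b for a, b in zip(s_bar, s)]
        # Move toward the averaged point instead of the latest vertex
        x = [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s_bar)]
    return x

# Toy objective: f(x) = 0.5 * ||x - b||^2 over the simplex
b = [0.2, 0.5, 0.3]
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]
x0 = [1.0, 0.0, 0.0]
x = averaged_frank_wolfe(grad_f, x0)
```

Because both the averaged point and the iterate are convex combinations of simplex vertices, the iterate stays feasible without any projection, and only one extra vector of memory is needed.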
LG · Oct 20, 2023
Absolute Policy Optimization · Weiye Zhao, Feihan Li, Yifan Sun et al.
In recent years, trust region on-policy reinforcement learning has achieved impressive results in addressing complex control tasks and gaming scenarios. However, contemporary state-of-the-art algorithms within this category primarily emphasize improvement in expected performance, lacking the ability to control worst-case performance outcomes. To address this limitation, we introduce a novel objective function, optimizing which leads to guaranteed monotonic improvement in the lower probability bound of performance with high confidence. Building upon this groundbreaking theoretical advancement, we further introduce a practical solution called Absolute Policy Optimization (APO). Our experiments demonstrate the effectiveness of our approach across challenging continuous control benchmark tasks and extend its applicability to mastering Atari games. Our findings reveal that APO as well as its efficient variation Proximal Absolute Policy Optimization (PAPO) significantly outperform state-of-the-art policy gradient algorithms, resulting in substantial improvements in worst-case performance, as well as expected performance.
AI · Feb 5, 2025 · Code
BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving · Ran Xin, Chenguang Xi, Jie Yang et al.
Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underlying large proof search spaces. While the existing approaches primarily rely on value functions and/or Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Tree Search (BFS) remains underexplored. In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM's policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a state-of-the-art score of $72.95\%$ on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled. To facilitate further research and development in this area, we have open-sourced our model at https://huggingface.co/ByteDance-Seed/BFS-Prover-V1-7B.
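A best-first search with a length-normalized score can be sketched as follows. This is an illustrative toy under stated assumptions: the scoring form (cumulative log-probability divided by `length ** alpha`), the `alpha` value, and the `expand` / `is_goal` callbacks are hypothetical and are not taken from the BFS-Prover code.

```python
import heapq
import itertools

def normalized_score(cum_logprob, length, alpha=0.5):
    """Length-normalized path score; larger alpha favors deeper paths more."""
    return cum_logprob / (max(length, 1) ** alpha)

def best_first_search(root, expand, is_goal, alpha=0.5, max_expansions=1000):
    """Pop the highest-scoring node first until a goal state is found.

    expand(state) -> iterable of (child_state, logprob); hypothetical callback.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(0.0, next(tie), root, 0.0, 0)]  # (-score, tie, state, cum_lp, depth)
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, cum_lp, depth = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for child, logprob in expand(state):
            new_lp = cum_lp + logprob
            score = normalized_score(new_lp, depth + 1, alpha)
            heapq.heappush(frontier, (-score, next(tie), child, new_lp, depth + 1))
    return None

# Toy search space: strings over {a, b} up to length 3
def toy_expand(state):
    if len(state) >= 3:
        return []
    return [(state + "a", -0.1), (state + "b", -2.0)]

result = best_first_search("", toy_expand, lambda s: s == "aab")
```

Without normalization, cumulative log-probabilities only decrease with depth, so shallow nodes tend to dominate the queue; dividing by a power of the path length is one common way to offset that bias.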
LG · Nov 28, 2022
An adaptive shortest-solution guided decimation approach to sparse high-dimensional linear regression · Xue Yu, Yifan Sun, Haijun Zhou
The high-dimensional linear regression model is the most popular statistical model for high-dimensional data, but it is quite a challenging task to achieve a sparse set of regression coefficients. In this paper, we propose a simple heuristic algorithm to construct sparse high-dimensional linear regression models, which is adapted from the shortest solution-guided decimation algorithm and is referred to as ASSD. This algorithm constructs the support of regression coefficients under the guidance of the least-squares solution of the recursively decimated linear equations, and it applies an early-stopping criterion and a second-stage thresholding procedure to refine this support. Our extensive numerical results demonstrate that ASSD outperforms LASSO, vector approximate message passing, and two other representative greedy algorithms in solution accuracy and robustness. ASSD is especially suitable for linear regression problems with highly correlated measurement matrices encountered in real-world applications.
MTRL-SCI · Nov 26, 2025
Lattice-to-total thermal conductivity ratio: a phonon-glass electron-crystal descriptor for data-driven thermoelectric design · Yifan Sun, Zhi Li, Tetsuya Imamura et al.
Thermoelectrics (TEs) are promising candidates for energy harvesting with performance quantified by figure of merit, $ZT$. To accelerate the discovery of high-$ZT$ materials, efforts have focused on identifying compounds with low thermal conductivity $κ$. Using a curated dataset of 71,913 entries, we show that high-$ZT$ materials reside not only in the low-$κ$ regime but also cluster near a lattice-to-total thermal conductivity ratio ($κ_\mathrm{L}/κ$) of approximately 0.5, consistent with the phonon-glass electron-crystal design concept. Building on this insight, we construct a framework consisting of two machine learning models for the lattice and electronic components of thermal conductivity that jointly provide both $κ$ and $κ_\mathrm{L}/κ$ for screening and guiding the optimization of TE materials. Among 104,567 compounds screened, our models identify 2,522 ultralow-$κ$ candidates. Follow-up case studies demonstrate that this framework can reliably provide optimization strategies by suggesting new dopants and alloys that shift pristine materials toward the $κ_\mathrm{L}/κ$ approaching 0.5 regime. Ultimately, by integrating rapid screening with PGEC-guided optimization, our data-driven framework effectively bridges the critical gap between materials discovery and performance enhancement.
CV · May 22, 2024 · Code
Dense Connector for MLLMs · Huanjin Yao, Wenhao Wu, Taojiannan Yang et al.
Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B->70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector .
LG · Feb 7, 2023
MMA-RNN: A Multi-level Multi-task Attention-based Recurrent Neural Network for Discrimination and Localization of Atrial Fibrillation · Yifan Sun, Jingyan Shen, Yunfan Jiang et al.
The automatic detection of atrial fibrillation based on electrocardiograph (ECG) signals has received wide attention both clinically and practically. It is challenging to process ECG signals with cyclical patterns, varying lengths and unstable quality due to noise and distortion. Besides, there has been insufficient research on separating persistent atrial fibrillation from paroxysmal atrial fibrillation, and little discussion on locating the onsets and end points of AF episodes. It is even more arduous to perform well on these two distinct but interrelated tasks, while avoiding the mistakes inherent in stage-by-stage approaches. This paper proposes the Multi-level Multi-task Attention-based Recurrent Neural Network (MMA-RNN) for three-class discrimination on patients and localization of the exact timing of AF episodes. Our model captures three-level sequential features based on a hierarchical architecture utilizing Bidirectional Long and Short-Term Memory Network (Bi-LSTM) and attention layers, and accomplishes the two tasks simultaneously with a multi-head classifier. The model is designed as an end-to-end framework to enhance information interaction and reduce error accumulation. Finally, we conduct experiments on the CPSC 2021 dataset, and the results demonstrate the superior performance of our method, indicating the potential application of MMA-RNN to wearable mobile devices for routine AF monitoring and early diagnosis.
LG · May 5, 2024 · Code
Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs · Feiyang Kang, Hoang Anh Just, Yifan Sun et al.
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/ ). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
CV · Nov 18, 2023
Hyperbolic Space with Hierarchical Margin Boosts Fine-Grained Learning from Coarse Labels · Shu-Lin Xu, Yifan Sun, Faen Zhang et al.
Learning fine-grained embeddings from coarse labels is a challenging task due to limited label granularity supervision, i.e., lacking the detailed distinctions required for fine-grained tasks. The task becomes even more demanding when attempting few-shot fine-grained recognition, which holds practical significance in various applications. To address these challenges, we propose a novel method that embeds visual embeddings into a hyperbolic space and enhances their discriminative ability with a hierarchical cosine margins manner. Specifically, the hyperbolic space offers distinct advantages, including the ability to capture hierarchical relationships and increased expressive power, which favors modeling fine-grained objects. Based on the hyperbolic space, we further enforce relatively large/small similarity margins between coarse/fine classes, respectively, yielding the so-called hierarchical cosine margins manner. While enforcing similarity margins in the regular Euclidean space has become popular for deep embedding learning, applying it to the hyperbolic space is non-trivial and validating the benefit for coarse-to-fine generalization is valuable. Extensive experiments conducted on five benchmark datasets showcase the effectiveness of our proposed method, yielding state-of-the-art results surpassing competing methods.
CL · May 18
Code as Agent Harness · Xuying Ning, Katherine Tieu, Dongqi Fu et al.
Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.
CV · May 18, 2024 · Code
Automated Multi-level Preference for MLLMs · Mengxi Zhang, Wenhao Wu, Yu Lu et al.
Current multimodal Large Language Models (MLLMs) suffer from "hallucination", occasionally generating responses that are not grounded in the input images. To tackle this challenge, one promising path is to utilize reinforcement learning from human feedback (RLHF), which steers MLLMs towards learning superior responses while avoiding inferior ones. We rethink the common practice of using binary preferences (i.e., superior, inferior), and find that adopting multi-level preferences (e.g., superior, medium, inferior) offers two benefits: 1) It narrows the gap between adjacent levels, thereby encouraging MLLMs to discern subtle differences. 2) It further integrates cross-level comparisons (beyond adjacent-level comparisons), thus providing a broader range of comparisons with hallucination examples. To verify our viewpoint, we present the Automated Multi-level Preference (AMP) framework for MLLMs. To facilitate this framework, we first develop an automated dataset generation pipeline that provides high-quality multi-level preference datasets without any human annotators. Furthermore, we design the Multi-level Direct Preference Optimization (MDPO) algorithm to robustly conduct complex multi-level preference learning. Additionally, we propose a new hallucination benchmark, MRHal-Bench. Extensive experiments across public hallucination and general benchmarks, as well as our MRHal-Bench, demonstrate the effectiveness of our proposed method. Code is available at https://github.com/takomc/amp.
LG · Nov 28, 2022
Semisoft Task Clustering for Multi-Task Learning · Yuzhao Zhang, Yifan Sun
Multi-task learning (MTL) aims to improve the performance of multiple related prediction tasks by leveraging useful information from them. Due to their flexibility and ability to reduce unknown coefficients substantially, the task-clustering-based MTL approaches have attracted considerable attention. Motivated by the idea of semisoft clustering of data, we propose a semisoft task clustering approach, which can simultaneously reveal the task cluster structure for both pure and mixed tasks as well as select the relevant features. The main assumption behind our approach is that each cluster has some pure tasks, and each mixed task can be represented by a linear combination of pure tasks in different clusters. To solve the resulting non-convex constrained optimization problem, we design an efficient three-step algorithm. The experimental results based on synthetic and real-world datasets validate the effectiveness and efficiency of the proposed approach. Finally, we extend the proposed approach to a robust task clustering problem.
LGJun 5, 2025Code
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout ReplayYifan Sun, Jingyan Shen, Yibin Wang et al.
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
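The three ideas in this abstract (similarity-based difficulty estimation, selection of moderately difficult questions, and rollout replay) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: difficulty is assumed to be a number in [0, 1], the attention weights are assumed to be a softmax over dot-product similarities, the moderate-difficulty target of 0.5 is a placeholder, and all names (`estimate_difficulty`, `select_moderate`, `RolloutReplay`) are hypothetical.

```python
import math
from collections import deque


def estimate_difficulty(query_emb, ref_embs, ref_difficulty, temperature=1.0):
    """Attention-style estimate: softmax-weighted average of reference difficulties."""
    scores = [sum(q * r for q, r in zip(query_emb, ref)) / temperature
              for ref in ref_embs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * d for w, d in zip(weights, ref_difficulty))


def select_moderate(difficulty, batch_size, target=0.5):
    """Pick the question ids whose estimated difficulty is closest to target."""
    ranked = sorted(difficulty, key=lambda q: (abs(difficulty[q] - target), q))
    return ranked[:batch_size]


class RolloutReplay:
    """Fixed-capacity buffer that keeps only the most recent rollouts."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, rollout):
        self.buffer.append(rollout)  # the oldest rollout is dropped when full

    def recent(self, n):
        return list(self.buffer)[-n:] if n > 0 else []


est = estimate_difficulty([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [0.2, 0.8])
chosen = select_moderate({"q1": 0.05, "q2": 0.45, "q3": 0.6, "q4": 0.98}, 2)
replay = RolloutReplay(capacity=3)
for step in range(1, 6):
    replay.add(step)
```

In this toy run, only rollouts for the two reference questions would be needed to score the query question, the selector skips the near-trivial `q1` and the near-impossible `q4`, and the buffer retains only the three most recent rollouts.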
CLMay 24, 2022
Continual Learning with Global AlignmentXueying Bai, Jinghuan Shang, Yifan Sun et al.
Continual learning aims to sequentially learn new tasks without forgetting knowledge from previous tasks (a failure mode known as catastrophic forgetting). One factor that can cause forgetting is the interference between the gradients on losses from different tasks. When the gradients on the current task's loss are in opposing directions to those on previous tasks' losses, updating the model for the current task may cause performance degradation on previous tasks. In this paper, we first identify causes of the above interference, and hypothesize that correlations between data representations are a key factor in interference. We then propose a method for promoting appropriate correlations between arbitrary tasks' data representations (i.e., global alignment) in individual task learning. Specifically, we learn the data representation as a task-specific composition of pre-trained token representations shared across all tasks. Then the correlations between different tasks' data representations are grounded by correlations between pre-trained token representations. We explore different ways to learn such compositions. Without experience replay, our model achieves SOTA performance in continual learning tasks. It also achieves advanced class-incremental performance through task-incremental training.
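The notion of gradient interference described above can be checked with a dot product: two gradients point in opposing directions when their inner product is negative. The sketch below is a minimal illustration with gradients as plain lists; the helper names are hypothetical, and it shows only the interference test, not the paper's global alignment method.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def interferes(grad_current, grad_previous):
    """True when a step along the current task's gradient opposes the previous task's."""
    return dot(grad_current, grad_previous) < 0


g_new = [1.0, 0.0]
conflict = interferes(g_new, [-1.0, 0.5])    # opposing directions
compatible = interferes(g_new, [0.5, 1.0])   # partially aligned directions
```

To first order, a gradient step of size eta on the current task changes the previous task's loss by about minus eta times this dot product, so a negative value means the previous task's loss goes up.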
CVFeb 25
CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image GenerationYuXin Song, Yu Lu, Haoyuan Sun et al.
Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.
CVSep 10, 2024
Knowledge Distillation via Query Selection for Detection TransformerYi Liu, Luting Wang, Zongheng Tang et al.
Transformers have revolutionized the object detection landscape by introducing DETRs, acclaimed for their simplicity and efficacy. Despite their advantages, the substantial size of these models poses significant challenges for practical deployment, particularly in resource-constrained environments. This paper addresses the challenge of compressing DETR by leveraging knowledge distillation, a technique that holds promise for maintaining model performance while reducing size. A critical aspect of DETRs' performance is their reliance on queries to interpret object representations accurately. Traditional distillation methods often focus exclusively on positive queries, identified through bipartite matching, neglecting the rich information present in hard-negative queries. Our visual analysis indicates that hard-negative queries, focusing on foreground elements, are crucial for enhancing distillation outcomes. To this end, we introduce a novel Group Query Selection strategy, which diverges from traditional query selection in DETR distillation by segmenting queries based on their Generalized Intersection over Union (GIoU) with ground truth objects, thereby uncovering valuable hard-negative queries for distillation. Furthermore, we present the Knowledge Distillation via Query Selection for DETR (QSKD) framework, which incorporates Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). These components optimize the distillation process by focusing on the most informative aspects of the teacher model's intermediate features and output. Our comprehensive experimental evaluation on the MS-COCO dataset demonstrates the effectiveness of our approach, significantly improving average precision (AP) across various DETR architectures without incurring substantial computational costs. Specifically, the AP of Conditional DETR ResNet-18 increased from 35.8 to 39.9.
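As a rough illustration of segmenting queries by GIoU, the sketch below computes the standard GIoU between axis-aligned boxes given as `(x1, y1, x2, y2)` and buckets each query by its best GIoU against the ground-truth boxes. The bucket names and the thresholds in `edges` are hypothetical placeholders, not values taken from the paper, and boxes are assumed to have positive area.

```python
def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def group_queries(query_boxes, gt_boxes, edges=(0.0, 0.5)):
    """Bucket query indices by their best GIoU with any ground-truth box."""
    low_edge, high_edge = edges  # hypothetical thresholds
    groups = {"high": [], "mid": [], "low": []}
    for idx, box in enumerate(query_boxes):
        best = max(giou(box, gt) for gt in gt_boxes)
        if best >= high_edge:
            groups["high"].append(idx)
        elif best >= low_edge:
            groups["mid"].append(idx)
        else:
            groups["low"].append(idx)
    return groups


queries = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 11, 11)]
groups = group_queries(queries, gt_boxes=[(0, 0, 2, 2)])
```

Here the second query overlaps the object without matching it well, which is the kind of foreground-focused query the abstract calls a hard negative; the third query lies far from any object.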
CRApr 11
Mask-Free Privacy Extraction and Rewriting: A Domain-Aware Approach via Prototype LearningXiaodong Li, Yuhua Wang, Qingchen Yu et al.
Client-side privacy rewriting is crucial for deploying LLMs in privacy-sensitive domains. However, existing approaches struggle to balance privacy and utility. Full-text methods often distort context, while span-level approaches rely on impractical manual masks or brittle static dictionaries. Attempts to automate localization via prompt-based LLMs prove unreliable, as they suffer from unstable instruction following that leads to privacy leakage and excessive context scrubbing. To address these limitations, we propose DAMPER (Domain-Aware Mask-free Privacy Extraction and Rewriting). DAMPER operationalizes latent privacy semantics into compact Domain Privacy Prototypes via contrastive learning, enabling precise, autonomous span localization. Furthermore, we introduce Prototype-Guided Preference Alignment, which leverages learned prototypes as semantic anchors to construct preference pairs, optimizing a domain-compliant rewriting policy without human annotations. At inference time, DAMPER integrates a sampling-based Exponential Mechanism to provide rigorous span-level Differential Privacy (DP) guarantees. Extensive experiments demonstrate that DAMPER significantly outperforms existing baselines, achieving a superior privacy-utility trade-off.
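For readers unfamiliar with the Exponential Mechanism, the sketch below shows its textbook form: a candidate is sampled with probability proportional to exp(epsilon * utility / (2 * sensitivity)). How DAMPER defines candidates, utilities, and sensitivity is not stated in the abstract, so the replacement strings and utility scores here are purely hypothetical.

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity=1.0):
    """Sampling probabilities proportional to exp(epsilon * u / (2 * sensitivity))."""
    scores = [epsilon * u / (2.0 * sensitivity) for u in utilities]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_candidate(candidates, utilities, epsilon, sensitivity=1.0, rng=None):
    """Draw one candidate according to the exponential mechanism."""
    if rng is None:
        rng = random.Random()
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical replacements for one sensitive span, with made-up utility scores.
candidates = ["a clinic", "a hospital", "a facility"]
utilities = [1.0, 0.8, 0.2]
probs = exponential_mechanism_probs(utilities, epsilon=2.0)
choice = sample_candidate(candidates, utilities, epsilon=2.0, rng=random.Random(0))
```

A smaller epsilon flattens the distribution toward uniform (stronger privacy, less utility), while a larger epsilon concentrates probability on the highest-utility candidate.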
CVMar 11
Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepDQinxin Wu, Fucheng Niu, Hengchuan Zhu et al.
Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.
AIMar 20, 2025Code
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data ContaminationYifan Sun, Han Wang, Dongbai Li et al.
Benchmark Data Contamination (BDC), the inclusion of benchmark testing samples in the training set, has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.
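The contrast between aggregate accuracy and question-level matching can be illustrated with a small sketch. The abstract does not give the exact metric definitions, so the code below only shows the general idea of comparing per-question correctness vectors; the helper names and the toy results are hypothetical.

```python
def accuracy(results):
    """Fraction of questions answered correctly."""
    return sum(results) / len(results)


def match_rate(results_a, results_b):
    """Fraction of questions on which two evaluation runs agree."""
    if len(results_a) != len(results_b) or not results_a:
        raise ValueError("need two non-empty result lists of equal length")
    agree = sum(a == b for a, b in zip(results_a, results_b))
    return agree / len(results_a)


# Hypothetical per-question correctness of one model on an original
# benchmark and on an updated version of the same questions.
original = [True, True, False, False]
updated = [False, False, True, True]

same_accuracy = accuracy(original) == accuracy(updated)
agreement = match_rate(original, updated)
```

In this toy case the two runs have identical accuracy yet agree on no individual question, which is the kind of discrepancy that accuracy-only comparisons cannot detect.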