h-index 117
166 papers
10,484 citations
Novelty 51%
AI Score 62

166 Papers

CV · Oct 23, 2022 · Code
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data

Yang Zhan, Zhitong Xiong, Yuan Yuan · mit

In this paper, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. To retrieve rich information from RS imagery using natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been extensively investigated. However, object-level visual grounding on RS images is still under-explored. Thus, in this work, we propose to construct a dataset and explore deep learning models for the RSVG task. Specifically, our contributions can be summarized as follows. 1) We build a new large-scale benchmark dataset for RSVG, termed RSVGD, to advance research on RSVG. This new dataset includes image/expression/box triplets for training and evaluating visual grounding models. 2) We benchmark a wide range of state-of-the-art (SOTA) natural image visual grounding methods on the constructed RSVGD dataset and provide analyses based on the results. 3) We propose a novel transformer-based Multi-Level Cross-Modal feature learning (MLCM) module. Remotely sensed images usually exhibit large scale variations and cluttered backgrounds. To deal with the scale-variation problem, the MLCM module takes advantage of multi-scale visual features and multi-granularity textual embeddings to learn more discriminative representations. To cope with the cluttered background problem, MLCM adaptively filters irrelevant noise and enhances salient features. In this way, our proposed model can incorporate more effective multi-level and multi-modal features to boost performance. Furthermore, this work also provides useful insights for developing better RSVG models. The dataset and code will be publicly available at https://github.com/ZhanYang-nwpu/RSVG-pytorch.

CV · Sep 1, 2022
TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut

Yangtao Wang, Xi Shen, Yuan Yuan et al. · mit

In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the image patches that compose an image or video are organised into a fully connected graph, where the edge between each pair of patches is labeled with a similarity score computed from features learned by the transformer. Detection and segmentation of salient objects is then formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite the simplicity of this approach, it achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, this approach outperforms the competing approaches by a margin of 6.1%, 5.7%, and 2.6% when tested with the VOC07, VOC12, and COCO20K datasets, respectively. For the unsupervised saliency detection task in images, this method improves the Intersection over Union (IoU) score by 4.4%, 5.6%, and 5.2% on the ECSSD, DUTS, and DUT-OMRON datasets, respectively, compared to current state-of-the-art techniques. This method also achieves competitive results for unsupervised video object segmentation tasks with the DAVIS, SegTV2, and FBMS datasets.
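
The graph construction and cut described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the threshold `tau`, the floor weight `eps`, and the iteration count are hypothetical, and a deflated power iteration stands in for a proper eigensolver so that the example needs only the standard library.

```python
import math

def normalized_cut_bipartition(features, tau=0.2, eps=1e-5, iters=500):
    """Split patches into two groups via a spectral relaxation of Normalized Cut.

    features: list of patch descriptors (lists of floats); hypothetical input.
    tau, eps, iters: illustrative values, not taken from the abstract.
    """
    n = len(features)
    # Unit-normalize descriptors so dot products are cosine similarities.
    unit = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        unit.append([x / norm for x in f])
    # Fully connected graph: strong edge above tau, tiny weight otherwise.
    w = [[1.0 if sum(a * b for a, b in zip(unit[i], unit[j])) > tau else eps
          for j in range(n)] for i in range(n)]
    d = [sum(row) for row in w]
    s = [1.0 / math.sqrt(x) for x in d]
    # Trivial eigenvector of the normalized Laplacian (eigenvalue 0): D^{1/2} 1.
    u0 = [math.sqrt(x) for x in d]
    u0_norm = math.sqrt(sum(x * x for x in u0))
    u0 = [x / u0_norm for x in u0]

    def step(v):
        # Apply M = I + D^{-1/2} W D^{-1/2}, i.e. 2I minus the normalized Laplacian.
        out = [v[i] + s[i] * sum(w[i][j] * s[j] * v[j] for j in range(n))
               for i in range(n)]
        # Remove the trivial direction, then renormalize.
        proj = sum(a * b for a, b in zip(out, u0))
        out = [a - proj * b for a, b in zip(out, u0)]
        norm = math.sqrt(sum(a * a for a in out))
        return [a / norm for a in out]

    # Deterministic, non-constant start vector.
    v = [math.sin(i + 1.0) for i in range(n)]
    for _ in range(iters):
        v = step(v)
    # Map back to the generalized problem (D - W) y = lambda * D y.
    y = [s[i] * v[i] for i in range(n)]
    mean = sum(y) / n
    # Bipartition at the mean of the second-smallest eigenvector.
    return [val > mean for val in y]
```

Which side of the partition comes out as `True` is arbitrary in this sketch; deciding which group is the salient object needs a further rule that the abstract does not specify.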

CL · Dec 19, 2025
OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman et al. · berkeley, mila

This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

LG · Feb 9, 2023 · Code
Learning to Simulate Daily Activities via Modeling Dynamic Human Needs

Yuan Yuan, Huandong Wang, Jingtao Ding et al.

Daily activity data, which record individuals' various types of activities in daily life, are widely used in many applications such as activity scheduling, activity recommendation, and policymaking. Despite their high value, their accessibility is limited due to high collection costs and potential privacy issues. Therefore, simulating human activities to produce massive high-quality data is of great importance to benefit practical applications. However, existing solutions, including rule-based methods with simplified assumptions of human behavior and data-driven methods that directly fit real-world data, cannot fully match reality. In this paper, motivated by a classic psychological theory, Maslow's need theory of human motivation, we propose a knowledge-driven simulation framework based on generative adversarial imitation learning. To enhance the fidelity and utility of the generated activity data, our core idea is to model the evolution of human needs as the underlying mechanism that drives activity generation in the simulation model. Specifically, this is achieved by a hierarchical model structure that disentangles different need levels, and the use of neural stochastic differential equations that successfully capture piecewise-continuous characteristics of need dynamics. Extensive experiments demonstrate that our framework outperforms the state-of-the-art baselines in terms of data fidelity and utility. We also analyze the interpretability of the need modeling. The code is available at https://github.com/tsinghua-fib-lab/SAND.

CV · Aug 3, 2023 · Code
Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution

Yeying Jin, Beibei Lin, Wending Yan et al.

Visibility in hazy nighttime scenes is frequently reduced by multiple factors, including low light, intense glow, light scattering, and the presence of multicolored light sources. Existing nighttime dehazing methods often struggle with handling glow or low-light conditions, resulting in either excessively dark visuals or unsuppressed glow outputs. In this paper, we enhance the visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. To handle glow effects, our framework learns from the rendered glow pairs. Specifically, a light source aware network is proposed to detect light sources of night images, followed by the APSF (Atmospheric Point Spread Function)-guided glow rendering. Our framework is then trained on the rendered images, resulting in glow suppression. Moreover, we utilize gradient-adaptive convolution to capture edges and textures in hazy scenes. By leveraging extracted edges and textures, we enhance the contrast of the scene without losing important structural details. To boost low-light intensity, our network learns an attention map, which is then adjusted by gamma correction. This attention has high values on low-light regions and low values on haze and glow regions. Extensive evaluation on real nighttime haze images demonstrates the effectiveness of our method. Our experiments demonstrate that our method achieves a PSNR of 30.38dB, outperforming state-of-the-art methods by 13% on the GTA5 nighttime haze dataset. Our data and code are available at https://github.com/jinyeying/nighttime_dehaze.

CV · Nov 15, 2022 · Code
DeS3: Adaptive Attention-driven Self and Soft Shadow Removal using ViT Similarity

Yeying Jin, Wei Ye, Wenhan Yang et al.

Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% of the RMSE of the whole image on the LRSS dataset. Our data and code are available at: https://github.com/jinyeying/DeS3_Deshadow

CV · Nov 24, 2022 · Code
Object Detection in Foggy Scenes by Embedding Depth and Reconstruction into Domain Adaptation

Xin Yang, Michael Bi Mi, Yuan Yuan et al.

Most existing domain adaptation (DA) methods align the features based on the domain feature distributions and ignore aspects related to fog, background and target objects, resulting in suboptimal performance. In our DA framework, we retain the depth and background information during the domain feature alignment. A consistency loss between the generated depth and fog transmission map is introduced to strengthen the retention of the depth information in the aligned features. To address false object features potentially generated during the DA process, we propose an encoder-decoder framework to reconstruct the fog-free background image. This reconstruction loss also reinforces the encoder, i.e., our DA backbone, to minimize false object features. Moreover, we involve our target data in training both our DA module and our detection module in a semi-supervised manner, so that our detection module is also exposed to the unlabeled target data, the type of data used in the testing stage. Using these ideas, our method significantly outperforms the state-of-the-art method (47.6 mAP against 44.3 mAP on the Foggy Cityscapes dataset), and obtains the best performance on multiple real-image public datasets. Code is available at: https://github.com/VIML-CVDL/Object-Detection-in-Foggy-Scenes

IV · Apr 9, 2022
Dual-Stage Approach Toward Hyperspectral Image Super-Resolution

Qiang Li, Yuan Yuan, Xiuping Jia et al.

Hyperspectral imaging achieves high spectral resolution at the expense of spatial resolution. Improving the spatial resolution without reducing the spectral resolution is a very challenging problem. Motivated by the observation that hyperspectral images exhibit high similarity between adjacent bands over a large spectral range, in this paper we explore a new structure for hyperspectral image super-resolution (DualSR), leading to a dual-stage design, i.e., a coarse stage and a fine stage. In the coarse stage, five bands with high similarity in a certain spectral range are divided into three groups, and the current band is guided to learn the latent knowledge. Through an alternative spectral fusion mechanism, the coarse SR image is super-resolved band by band. To build the model from a global perspective, an enhanced back-projection method with a spectral angle constraint is developed in the fine stage to learn spatial-spectral consistency, dramatically improving performance. Extensive experiments demonstrate the effectiveness of the proposed coarse and fine stages. In addition, our network produces state-of-the-art results against existing works in terms of spatial reconstruction and spectral fidelity.

CV · Jul 6, 2022
Unsupervised Learning for Human Sensing Using Radio Signals

Tianhong Li, Lijie Fan, Yuan Yuan et al.

There is a growing literature demonstrating the feasibility of using Radio Frequency (RF) signals to enable key computer vision tasks in the presence of occlusions and poor lighting. It leverages the fact that RF signals traverse walls and occlusions to deliver through-wall pose estimation, action recognition, scene captioning, and human re-identification. However, unlike RGB datasets, which can be labeled by human workers, labeling RF signals is a daunting task because such signals are not human interpretable. Yet, it is fairly easy to collect unlabeled RF signals. It would be highly beneficial to use such unlabeled RF data to learn useful representations in an unsupervised manner. Thus, in this paper, we explore the feasibility of adapting RGB-based unsupervised representation learning to RF signals. We show that while contrastive learning has emerged as the main technique for unsupervised representation learning from images and videos, such methods produce poor performance when applied to sensing humans using RF signals. In contrast, predictive unsupervised learning methods learn high-quality representations that can be used for multiple downstream RF-based sensing tasks. Our empirical results show that this approach outperforms state-of-the-art RF-based human sensing on various tasks, opening the possibility of unsupervised representation learning from this novel modality.

96.3 · CL · May 27 · Code
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng et al.

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

ML · Aug 18, 2023
A Two-Part Machine Learning Approach to Characterizing Network Interference in A/B Testing

Yuan Yuan, Kristen M. Altenburger · mit

The reliability of controlled experiments, commonly referred to as "A/B tests," is often compromised by network interference, where the outcomes of individual units are influenced by interactions with others. Significant challenges in this domain include the lack of accounting for complex social network structures and the difficulty in suitably characterizing network interference. To address these challenges, we propose a machine learning-based method. We introduce "causal network motifs" and utilize transparent machine learning models to characterize network interference patterns underlying an A/B test on networks. Our method's performance has been demonstrated through simulations on both a synthetic experiment and a large-scale test on Instagram. Our experiments show that our approach outperforms conventional methods such as design-based cluster randomization and conventional analysis-based neighborhood exposure mapping. Our approach provides a comprehensive and automated solution to address network interference for A/B testing practitioners. This aids in informing strategic business decisions in areas such as marketing effectiveness and product customization.
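
For context, the conventional neighborhood exposure mapping that this abstract compares against can be sketched as the fraction of a unit's neighbors that received treatment. This is only an illustration of that baseline idea, not the paper's causal network motifs; the function name and the dictionary-based graph representation are assumptions.

```python
def treated_neighbor_fraction(adjacency, treatment):
    """Compute a simple exposure value per unit.

    adjacency: dict mapping each unit to an iterable of its neighbors.
    treatment: dict mapping each unit to 1 (treated) or 0 (control).
    Returns a dict of unit -> fraction of treated neighbors, or None
    for isolated units, where the fraction is undefined.
    """
    exposure = {}
    for unit, neighbors in adjacency.items():
        neighbors = list(neighbors)
        if not neighbors:
            exposure[unit] = None
            continue
        exposure[unit] = sum(treatment[n] for n in neighbors) / len(neighbors)
    return exposure
```

Under interference, two units with the same own assignment can have different exposure values, which is why outcomes cannot be compared on the assignment alone.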

LG · Aug 20, 2023
Adaptive pruning-based Newton's method for distributed learning

Shuzhen Chen, Yuan Yuan, Youming Tao et al. · mit

Newton's method leverages curvature information to boost performance, and thus outperforms first-order methods for distributed learning problems. However, Newton's method is not practical in large-scale and heterogeneous learning environments, due to obstacles such as high computation and communication costs of the Hessian matrix, sub-model diversity, staleness of training, and data heterogeneity. To overcome these obstacles, this paper presents a novel and efficient algorithm named Distributed Adaptive Newton Learning (DANL), which addresses the drawbacks of Newton's method by using a simple Hessian initialization and adaptive allocation of training regions. The algorithm exhibits remarkable convergence properties, which are rigorously examined under standard assumptions in stochastic optimization. The theoretical analysis proves that DANL attains a linear convergence rate while efficiently adapting to available resources and keeping high efficiency. Furthermore, DANL shows notable independence from the condition number of the problem and removes the necessity for complex parameter tuning. Experiments demonstrate that DANL achieves linear convergence with efficient communication and strong performance across different datasets.

SY · Jun 17, 2023
Multi-Scale Simulation of Complex Systems: A Perspective of Integrating Knowledge and Data

Huandong Wang, Huan Yan, Can Rong et al.

Complex system simulation has been playing an irreplaceable role in understanding, predicting, and controlling diverse complex systems. In the past few decades, the multi-scale simulation technique has drawn increasing attention for its remarkable ability to overcome the challenges of complex system simulation with unknown mechanisms and expensive computational costs. In this survey, we systematically review the literature on multi-scale simulation of complex systems from the perspective of knowledge and data. First, we present background knowledge about complex system simulation and the scales in complex systems. Then, we divide the main objectives of multi-scale modeling and simulation into five categories by considering scenarios with clear scale and scenarios with unclear scale, respectively. After summarizing the general methods for multi-scale simulation based on the clues of knowledge and data, we introduce the methods adopted to achieve different objectives. Finally, we introduce the applications of multi-scale simulation in typical matter systems and social systems.

LG · Feb 10, 2023
Near-Optimal Experimental Design Under the Budget Constraint in Online Platforms

Yongkang Guo, Yuan Yuan, Jinshan Zhang et al. · mit

A/B testing, or controlled experiments, is the gold standard approach to causally compare the performance of algorithms on online platforms. However, conventional Bernoulli randomization in A/B testing faces many challenges such as spillover and carryover effects. Our study focuses on another challenge, especially for A/B testing on two-sided platforms -- budget constraints. Buyers on two-sided platforms often have limited budgets, so conventional A/B testing may be infeasible, partly because two variants of allocation algorithms may conflict and lead some buyers to exceed their budgets if they are implemented simultaneously. We develop a model to describe two-sided platforms where buyers have limited budgets. We then provide an optimal experimental design that guarantees small bias and minimum variance. Bias is lower when there is more budget and a higher supply-demand rate. We test our experimental design on both synthetic data and real-world data, which verifies the theoretical results and shows our advantage compared to Bernoulli randomization.
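
The Bernoulli randomization baseline mentioned in this abstract can be sketched as independent coin flips per unit followed by a difference-in-means estimate. This is a generic illustration of the baseline only, not the paper's budget-aware design; the function names, the seed, and the data layout are assumptions.

```python
import random

def bernoulli_assign(units, p=0.5, seed=0):
    """Assign each unit to treatment (1) with probability p, independently."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {u: 1 if rng.random() < p else 0 for u in units}

def difference_in_means(outcomes, assignment):
    """Estimate the treatment effect as mean(treated) - mean(control)."""
    treated = [outcomes[u] for u in outcomes if assignment[u] == 1]
    control = [outcomes[u] for u in outcomes if assignment[u] == 0]
    if not treated or not control:
        raise ValueError("both arms need at least one unit")
    return sum(treated) / len(treated) - sum(control) / len(control)
```

Nothing in this baseline tracks how much budget a buyer spends across the two variants, which is the gap the abstract points to.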

CV · Jun 12, 2022
Crowd Localization from Gaussian Mixture Scoped Knowledge and Scoped Teacher

Juncheng Wang, Junyu Gao, Yuan Yuan et al.

Crowd localization aims to predict the head position of each instance in crowd scenes. Because the distances from instances to the camera vary, there are tremendous gaps among the scales of instances within an image, which is called intrinsic scale shift. Intrinsic scale shift is one of the most essential issues in crowd localization because it is ubiquitous in crowd scenes and makes the scale distribution chaotic. To this end, the paper focuses on tackling the chaos of the scale distribution caused by intrinsic scale shift. We propose Gaussian Mixture Scope (GMS) to regularize the chaotic scale distribution. Concretely, GMS uses a Gaussian mixture distribution to adapt to the scale distribution and decouples the mixture model into sub-normal distributions to regularize the chaos within the sub-distributions. Then, an alignment is introduced to regularize the chaos among sub-distributions. However, although GMS is effective in regularizing the data distribution, it amounts to dislodging the hard samples in the training set, which incurs overfitting. We attribute this to a blockage in transferring the latent knowledge exploited by GMS from data to model. Therefore, a Scoped Teacher, which acts as a bridge in knowledge transfer, is proposed. Moreover, consistency regularization is introduced to implement the knowledge transfer. To that end, further constraints are imposed on the Scoped Teacher to enforce feature consistency between the teacher and student ends. With the proposed GMS and Scoped Teacher applied to five mainstream crowd localization datasets, extensive experiments demonstrate the superiority of our work. Moreover, compared with existing crowd locators, our work achieves state-of-the-art F1-measure on all five datasets.

LG · Dec 6, 2022
Contactless Oxygen Monitoring with Gated Transformer

Hao He, Yuan Yuan, Ying-Cong Chen et al.

With the increasing popularity of telehealth, it becomes critical to ensure that basic physiological signals can be monitored accurately at home, with minimal patient overhead. In this paper, we propose a contactless approach for monitoring patients' blood oxygen at home, simply by analyzing the radio signals in the room, without any wearable devices. We extract the patients' respiration from the radio signals that bounce off their bodies and devise a novel neural network that infers a patient's oxygen estimates from their breathing signal. Our model, called Gated BERT-UNet, is designed to adapt to the patient's medical indices (e.g., gender, sleep stages). It has multiple predictive heads and selects the most suitable head via a gate controlled by the person's physiological indices. Extensive empirical results show that our model achieves high accuracy on both medical and radio datasets.
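
The head-selection idea in this abstract can be sketched as a lookup from physiological indices to one of several predictors. This is a toy illustration, not the Gated BERT-UNet architecture: the linear heads, the `(gender, sleep_stage)` key, and all numeric values are hypothetical placeholders.

```python
def make_linear_head(weight, bias):
    """Build a toy predictive head: bias + weight * mean(signal)."""
    def head(signal):
        return bias + weight * (sum(signal) / len(signal))
    return head

def gate(indices):
    """Map physiological indices to a head key (hypothetical keying scheme)."""
    return (indices["gender"], indices["sleep_stage"])

def gated_predict(signal, indices, heads):
    """Route the breathing signal through the head chosen by the gate."""
    return heads[gate(indices)](signal)

# Hypothetical heads, one per combination of indices.
HEADS = {
    ("female", "awake"): make_linear_head(2.0, 90.0),
    ("female", "asleep"): make_linear_head(1.0, 92.0),
    ("male", "awake"): make_linear_head(0.5, 94.0),
    ("male", "asleep"): make_linear_head(0.0, 95.0),
}
```

For example, `gated_predict([1.0, 3.0], {"gender": "female", "sleep_stage": "awake"}, HEADS)` uses only the first head, so the other heads never see that input.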

CV · Aug 24, 2023
Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

Yuan Yuan, Yang Zhan, Zhitong Xiong

Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a significant environmental impact. Moreover, as remote sensing (RS) data is constantly being updated, full fine-tuning may not be practical for real-world applications. To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task. To this end, we make the following contributions. 1) We construct a novel and sophisticated PETL framework for the RS image-text retrieval (RSITR) task, which includes the pretrained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with the problem of high intra-modal similarity in RS data, we design a simple yet effective HMMC loss; 3) We provide comprehensive empirical studies for PETL-based RS image-text retrieval. Our results demonstrate that the proposed method is promising and of great potential for practical applications. 4) We benchmark extensive state-of-the-art PETL methods on the RSITR task. Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning, resulting in substantial savings in training costs. Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning. This work can provide new ideas and useful insights for RS vision-language tasks.

CV · Aug 15, 2024 · Code
Quantum-inspired Interpretable Deep Learning Architecture for Text Sentiment Analysis

Bingyu Li, Da Zhang, Zhiyuan Zhao et al.

Text has become the predominant form of communication on social media, embedding a wealth of emotional nuances. Consequently, the extraction of emotional information from text is of paramount importance. Despite previous research making some progress, existing text sentiment analysis models still face challenges in integrating diverse semantic information and lack interpretability. To address these issues, we propose a quantum-inspired deep learning architecture that combines fundamental principles of quantum mechanics (QM principles) with deep learning models for text sentiment analysis. Specifically, we analyze the commonalities between text representation and QM principles to design a quantum-inspired text representation method and further develop a quantum-inspired text embedding layer. Additionally, we design a feature extraction layer based on long short-term memory (LSTM) networks and self-attention mechanisms (SAMs). Finally, we calculate the text density matrix using the quantum complex numbers principle and apply 2D-convolution neural networks (CNNs) for feature condensation and dimensionality reduction. Through a series of visualization, comparative, and ablation experiments, we demonstrate that our model not only shows significant advantages in accuracy and efficiency compared to previous related models but also achieves a certain level of interpretability by integrating QM principles. Our code is available at QISA.

IV · Dec 12, 2022
Z-SSMNet: Zonal-aware Self-supervised Mesh Network for Prostate Cancer Detection and Diagnosis with Bi-parametric MRI

Yuan Yuan, Euijoon Ahn, Dagan Feng et al.

Bi-parametric magnetic resonance imaging (bpMRI) has become a pivotal modality in the detection and diagnosis of clinically significant prostate cancer (csPCa). Developing AI-based systems to identify csPCa using bpMRI can transform PCa management by improving efficiency and cost-effectiveness. However, current state-of-the-art methods using convolutional neural networks (CNNs) are limited in learning in-plane and three-dimensional spatial information from anisotropic images. Their performance also depends on the availability of large, diverse, and well-annotated bpMRI datasets. We propose a Zonal-aware Self-supervised Mesh Network (Z-SSMNet) that adaptively integrates multi-dimensional (2D/2.5D/3D) convolutions to learn dense intra-slice information and sparse inter-slice information of the anisotropic bpMRI in a balanced manner. A self-supervised learning (SSL) technique is proposed to pre-train our network using large-scale unlabeled data to learn the appearance, texture, and structure semantics of bpMRI. It aims to capture both intra-slice and inter-slice information during the pre-training stage. Furthermore, we constrain our network to focus on the zonal anatomical regions to further improve the detection and diagnosis capability of csPCa. We conducted extensive experiments on the PI-CAI dataset comprising 10,000+ multi-center and multi-scanner data. Our Z-SSMNet excelled in both lesion-level detection (AP score of 0.633) and patient-level diagnosis (AUROC score of 0.881), securing the top position in the Open Development Phase of the PI-CAI challenge. It maintained strong performance in the Closed Testing Phase, achieving an AP score of 0.690 and an AUROC score of 0.909 and securing the second-place ranking.

CV · Dec 2, 2022
Counting Like Human: Anthropoid Crowd Counting on Modeling the Similarity of Objects

Qi Wang, Juncheng Wang, Junyu Gao et al.

Mainstream crowd counting methods regress a density map and integrate it to obtain counting results. Since the density representation of one head depends on its adjacent distribution, it embeds objects of the same category with varying values, whereas human counting models invariant features, namely similarity among objects. Inspired by this, we propose a rational and anthropoid crowd counting framework. To begin with, we leverage the counting scalar as the supervision signal, which provides global and implicit guidance to similar matters. Then, a large-kernel CNN is used to imitate the human paradigm, which first models invariant knowledge and then slides to compare similarity. Later, re-parameterization on pre-trained parallel parameters is presented to cater to the intra-class variance in similarity comparison. Finally, Random Scaling patches Yield (RSY) is proposed to facilitate similarity modeling of long-distance dependencies. Extensive experiments on five challenging crowd counting benchmarks show that the proposed framework achieves state-of-the-art performance.

71.7 · CE · May 4
Generative Adaptation of Dynamics to Environmental Shifts via Weight-space Diffusion

Ruikun Li, Huandong Wang, Jingtao Ding et al.

Data-driven dynamics prediction often fails under environmental shifts, while traditional fine-tuning remains computationally prohibitive for hardware-constrained or data-scarce applications. We propose DynaDiff, a generative meta-learning framework that transitions the paradigm from gradient-based tuning or modulation to direct weight-space generation. Specifically, we first abstract expert weights as novel weight graphs, utilizing multi-head attention to explicitly capture topological coupling within weights. Subsequently, we design a functional loss to ensure that the generated models achieve consistency with expert models in physical behavior. Finally, we develop a dynamics-informed prompter that extracts cross-domain physical and spectral features from observation sequences to condition the diffusion model. Experiments demonstrate that DynaDiff boosts average prediction accuracy by 10.78% over competitive baselines. Furthermore, by pre-constructing a model zoo of expert predictors, we amortize the fine-tuning overhead into a one-time offline cost, significantly boosting deployment efficiency in new environments.

LGOct 9, 2023
Continuous Invariance Learning

Yong Lin, Fan Zhou, Lu Tan et al.

Invariance learning methods aim to learn invariant features in the hope that they generalize under distributional shifts. Although many tasks are naturally characterized by continuous domains, current invariance learning techniques generally assume categorically indexed domains. For example, auto-scaling in cloud computing often needs a CPU utilization prediction model that generalizes across different times (e.g., time of day and date of year), where "time" is a continuous domain index. In this paper, we start by theoretically showing that existing invariance learning methods can fail for continuous domain problems. Specifically, the naive solution of splitting continuous domains into discrete ones ignores the underlying relationship among domains, and therefore potentially leads to suboptimal performance. To address this challenge, we then propose Continuous Invariance Learning (CIL), which extracts invariant features across continuously indexed domains. CIL is a novel adversarial procedure that measures and controls the conditional independence between the labels and continuous domain indices given the extracted features. Our theoretical analysis demonstrates the superiority of CIL over existing invariance learning methods. Empirical results on both synthetic and real-world datasets (including data collected from production systems) show that CIL consistently outperforms strong baselines across all tasks.

77.8AIMay 28
Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

Shijie Cao, Yuan Yuan, Jing Liu

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.
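The validate-then-swap deployment described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the RACE-Sched implementation: `RuleSlot`, the job dictionaries, and the sandbox check (comparing the chosen job against an expected id) are all hypothetical.

```python
import threading

def shortest_processing_time(job):
    # Baseline dispatching rule: a lower score means higher priority.
    return job["proc_time"]

class RuleSlot:
    """Holds the active dispatching rule; readers never wait on validation."""

    def __init__(self, rule):
        self._rule = rule
        self._write_lock = threading.Lock()  # serializes writers only

    def dispatch(self, queue):
        rule = self._rule  # single reference read on the reactive path
        return min(queue, key=rule)

    def deploy(self, candidate, sandbox_cases):
        # Deliberative path: validate first, then swap the reference in one step.
        for queue, expected_id in sandbox_cases:
            try:
                if min(queue, key=candidate)["id"] != expected_id:
                    return False
            except Exception:
                return False  # a crashing rule is never deployed
        with self._write_lock:
            self._rule = candidate
        return True

jobs = [{"id": "a", "proc_time": 5, "due": 9},
        {"id": "b", "proc_time": 2, "due": 20}]
slot = RuleSlot(shortest_processing_time)
first_choice = slot.dispatch(jobs)["id"]

earliest_due = lambda job: job["due"]
accepted = slot.deploy(earliest_due, sandbox_cases=[(jobs, "a")])
second_choice = slot.dispatch(jobs)["id"]

broken_rule = lambda job: job["missing_field"]
rejected = slot.deploy(broken_rule, sandbox_cases=[(jobs, "a")])
third_choice = slot.dispatch(jobs)["id"]
```

The design point is that `dispatch` only ever reads one reference, so a slow or failing validation in `deploy` cannot stall the dispatching loop, and a rejected candidate leaves the active rule untouched.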

LGAug 6, 2024
Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method

Jinghui Yuan, Weijin Jiang, Zhe Cao et al.

Ensemble learning is a method that leverages weak learners to produce a strong learner. However, obtaining a large number of base learners requires substantial time and computational resources. Therefore, it is meaningful to study how to achieve the performance typically obtained with many base learners using only a few. We argue that to achieve this, it is essential to enhance both classification performance and generalization ability during the ensemble process. To increase model accuracy, each weak base learner needs to be more efficiently integrated. It is observed that different base learners exhibit varying levels of accuracy in predicting different classes. To capitalize on this, we introduce a confidence tensor $\tilde{\mathbf{\Theta}}$, where $\tilde{\mathbf{\Theta}}_{rst}$ signifies the degree of confidence that the $t$-th base classifier assigns the sample to class $r$ while it actually belongs to class $s$. To the best of our knowledge, this is the first time an evaluation of the performance of base classifiers across different classes has been proposed. The proposed confidence tensor compensates for the strengths and weaknesses of each base classifier in different classes, enabling the method to achieve superior results with a smaller number of base learners. To enhance generalization performance, we design a smooth and convex objective function that leverages the concept of margin, making the strong learner more discriminative. Furthermore, it is proved that in the gradient matrix of the loss function, the sum of each column's elements is zero, allowing us to solve a constrained optimization problem using gradient-based methods. We then compare our algorithm with random forests of ten times the size and other classical methods across numerous datasets, demonstrating the superiority of our approach.

LGAug 6, 2024
Doubly Stochastic Adaptive Neighbors Clustering via the Marcus Mapping

Jinghui Yuan, Chusheng Zeng, Fangyuan Xie et al.

Clustering is a fundamental task in machine learning and data science, and similarity graph-based clustering is an important approach within this domain. Doubly stochastic symmetric similarity graphs provide numerous benefits for clustering problems and downstream tasks, yet learning such graphs remains a significant challenge. Marcus theorem states that a strictly positive symmetric matrix can be transformed into a doubly stochastic symmetric matrix by diagonal matrices. However, in clustering, learning sparse matrices is crucial for computational efficiency. We extend Marcus theorem by proposing the Marcus mapping, which indicates that certain sparse matrices can also be transformed into doubly stochastic symmetric matrices via diagonal matrices. Additionally, we introduce rank constraints into the clustering problem and propose the Doubly Stochastic Adaptive Neighbors Clustering algorithm based on the Marcus Mapping (ANCMM). This ensures that the learned graph naturally divides into the desired number of clusters. We validate the effectiveness of our algorithm through extensive comparisons with state-of-the-art algorithms. Finally, we explore the relationship between the Marcus mapping and optimal transport. We prove that the Marcus mapping solves a specific type of optimal transport problem and demonstrate that solving this problem through Marcus mapping is more efficient than directly applying optimal transport methods.
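The classical statement this abstract builds on, that a strictly positive symmetric matrix can be diagonally scaled into a doubly stochastic symmetric one, can be checked numerically with a small sketch. It uses a damped fixed-point iteration for the scaling vector, which is a standard numerical device and not the paper's Marcus mapping or ANCMM algorithm; the function name and the example matrix are hypothetical.

```python
import math

def symmetric_scale(A, iters=500, tol=1e-12):
    """Find d > 0 such that diag(d) @ A @ diag(d) has unit row and column sums.

    A must be symmetric with strictly positive entries (list of lists).
    The fixed point satisfies d[i] * (A d)[i] == 1 for every i.
    """
    n = len(A)
    d = [1.0] * n
    for _ in range(iters):
        Ad = [sum(A[i][j] * d[j] for j in range(n)) for i in range(n)]
        # Geometric-mean damping of the plain update d <- 1 / (A d).
        new_d = [math.sqrt(d[i] / Ad[i]) for i in range(n)]
        converged = max(abs(new_d[i] - d[i]) for i in range(n)) < tol
        d = new_d
        if converged:
            break
    P = [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)]
    return P, d

A = [[2.0, 1.0],
     [1.0, 3.0]]
P, d = symmetric_scale(A)
row_sums = [sum(row) for row in P]
```

Because the same diagonal scaling is applied on both sides, `P` stays symmetric, so unit row sums imply unit column sums.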

CLDec 7, 2025
PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Bowen Jiang, Yuan Yuan, Maohao Shen et al. · uw

Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time. In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperform GPT-5, reaching 53% accuracy in implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.

81.5AIMay 26
DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Shijie Cao, Yuan Yuan, Jing Liu

Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

CVAug 14, 2022
MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting

Pengyu Chen, Junyu Gao, Yuan Yuan et al.

RGB-Thermal (RGB-T) crowd counting is a challenging task, which uses thermal images as complementary information to RGB images to deal with the decreased performance of unimodal RGB-based methods in scenes with low illumination or similar backgrounds. Most existing methods propose well-designed structures for cross-modal fusion in RGB-T crowd counting. However, these methods have difficulty in encoding cross-modal contextual semantic information in RGB-T image pairs. Considering the aforementioned problem, we propose a two-stream RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet), which aims to fully capture long-range contextual information from the RGB and thermal modalities based on the attention mechanism. Specifically, in the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion at the global level. In addition, a Multi-modal Multi-scale Aggregation (MMA) regression head is introduced to make full use of the multi-scale and contextual information across modalities to generate high-quality crowd density maps. Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting and achieves state-of-the-art performance.

LGFeb 19, 2024Code
UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction

Yuan Yuan, Jingtao Ding, Jie Feng et al.

Urban spatio-temporal prediction is crucial for informed decision-making, such as traffic management, resource optimization, and emergency response. Despite remarkable breakthroughs in pretrained natural language models that enable one model to handle diverse tasks, a universal solution for spatio-temporal prediction remains challenging. Existing prediction approaches are typically tailored for specific spatio-temporal scenarios, requiring task-specific model designs and extensive domain-specific training data. In this study, we introduce UniST, a universal model designed for general urban spatio-temporal prediction across a wide range of scenarios. Inspired by large language models, UniST achieves success through: (i) utilizing diverse spatio-temporal data from different scenarios, (ii) effective pre-training to capture complex spatio-temporal dynamics, (iii) knowledge-guided prompts to enhance generalization capabilities. These designs together unlock the potential of building a universal model for various scenarios. Extensive experiments on more than 20 spatio-temporal scenarios demonstrate UniST's efficacy in advancing state-of-the-art performance, especially in few-shot and zero-shot prediction. The datasets and code implementation are released on https://github.com/tsinghua-fib-lab/UniST.

63.0CEMay 24
Samudra 2: Scaling Ocean Emulators across Resolutions

Yuan Yuan, Jesse Rusak, Alexander Merose et al.

Ocean general circulation models (OGCMs) are essential to climate science but computationally expensive, limiting ensemble size and forcing scenarios. Neural emulators promise orders-of-magnitude speedups, yet existing ocean emulators have not combined fine spatial resolution with multi-year autoregressive rollouts. Samudra, the first autoregressive neural ocean emulator to produce multi-decade global rollouts, is limited to $1^\circ$ resolution and exhibits two long-horizon failure modes: variance collapse, the loss of temporal variability, and imprinting artifacts, in which velocity patterns leak into deep-ocean fields. We present Samudra 2, which introduces a wider U-Net backbone with modified ConvNeXt-style blocks and a reduced block-internal expansion factor, together with a dynamic loss that reweights output channels according to their prediction errors, strengthening gradients for slow-evolving deep-ocean fields. At $1^\circ$, Samudra 2 increases upper-ocean global-mean temperature $R^2$ from 0.56 to 0.87 and reduces deep-ocean temperature error by roughly sevenfold. The same architecture scales to $1/2^\circ$ and $1/4^\circ$ over approximately 8-year autoregressive rollouts, recovering mesoscale eddies and sharp western boundary currents. Running on a single GPU, Samudra 2 enables larger ensembles for sea-level projections, ocean heat uptake, and climate variability studies. We provide code, documentation, and benchmark resources at https://openathena.ai/Ocean_Emulator/.

CLNov 21, 2024Code
Understanding World or Predicting Future? A Comprehensive Survey of World Models

Jingtao Ding, Yunke Zhang, Yu Shang et al.

The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in https://github.com/tsinghua-fib-lab/World-Model.

LGJul 19, 2024
Improving GBDT Performance on Imbalanced Datasets: An Empirical Study of Class-Balanced Loss Functions

Jiaqi Luo, Yuan Yuan, Shixin Xu

Class imbalance remains a significant challenge in machine learning, particularly for tabular data classification tasks. While Gradient Boosting Decision Trees (GBDT) models have proven highly effective for such tasks, their performance can be compromised when dealing with imbalanced datasets. This paper presents the first comprehensive study on adapting class-balanced loss functions to three GBDT algorithms across various tabular classification tasks, including binary, multi-class, and multi-label classification. We conduct extensive experiments on multiple datasets to evaluate the impact of class-balanced losses on different GBDT models, establishing a valuable benchmark. Our results demonstrate the potential of class-balanced loss functions to enhance GBDT performance on imbalanced datasets, offering a robust approach for practitioners facing class imbalance challenges in real-world applications. Additionally, we introduce a Python package that facilitates the integration of class-balanced loss functions into GBDT workflows, making these advanced techniques accessible to a wider audience.
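As a rough illustration of how a class-weighted loss plugs into gradient boosting, the sketch below computes the per-sample gradient and Hessian of a weighted binary log loss with respect to the raw score, which is the pair of quantities GBDT libraries typically request from a custom objective. It assumes simple inverse-frequency weights; the specific loss functions studied in the paper and the API of its Python package are not shown here, and all names are hypothetical.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights, scaled so the mean sample weight is 1."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

def weighted_logloss_grad_hess(raw_scores, labels, weights):
    """Gradient and Hessian of w_y * logloss(sigmoid(score), y) per sample."""
    grads, hess = [], []
    for score, y in zip(raw_scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # predicted positive probability
        w = weights[y]
        grads.append(w * (p - y))
        hess.append(w * p * (1.0 - p))
    return grads, hess

# Toy imbalanced batch: 1 positive, 9 negatives, all raw scores at 0.
labels = [1] + [0] * 9
weights = class_weights(labels)
grads, hess = weighted_logloss_grad_hess([0.0] * 10, labels, weights)
```

With these weights the single positive sample contributes a gradient of magnitude 2.5, against roughly 0.28 for each negative, so the minority class is no longer drowned out when leaf values are fitted.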

CVDec 13, 2023Code
Mono3DVG: 3D Visual Grounding in Monocular Images

Yang Zhan, Yuan Yuan, Zhitong Xiong

We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. A depth predictor is designed to explicitly learn geometry features. A dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be publicly available at: https://github.com/ZhanYang-nwpu/Mono3DVG.

LGFeb 19, 2024Code
Spatio-Temporal Few-Shot Learning via Diffusive Neural Network Generation

Yuan Yuan, Chenyang Shao, Jingtao Ding et al.

Spatio-temporal modeling is foundational for smart city applications, yet it is often hindered by data scarcity in many cities and regions. To bridge this gap, we propose a novel generative pre-training framework, GPD, for spatio-temporal few-shot learning with urban knowledge transfer. Unlike conventional approaches that heavily rely on common feature extraction or intricate few-shot learning designs, our solution takes a novel approach by performing generative pre-training on a collection of neural network parameters optimized with data from source cities. We recast spatio-temporal few-shot learning as pre-training a generative diffusion model, which generates tailored neural networks guided by prompts, allowing for adaptability to diverse data distributions and city-specific characteristics. GPD employs a Transformer-based denoising diffusion model, which is model-agnostic to integrate with powerful spatio-temporal neural networks. By addressing challenges arising from data gaps and the complexity of generalizing knowledge across cities, our framework consistently outperforms state-of-the-art baselines on multiple real-world datasets for tasks such as traffic speed prediction and crowd flow prediction. The implementation of our approach is available: https://github.com/tsinghua-fib-lab/GPD.

CLJan 15
A Concise Agent is Less Expert: Revealing Side Effects of Using Style Features on Conversational Agents

Young-Min Cho, Yuan Yuan, Sharath Chandra Guntuku et al.

Style features such as friendly, helpful, or concise are widely used in prompts to steer the behavior of Large Language Model (LLM) conversational agents, yet their unintended side effects remain poorly understood. In this work, we present the first systematic study of cross-feature stylistic side effects. We conduct a comprehensive survey of 127 conversational agent papers from the ACL Anthology and identify 12 frequently used style features. Using controlled, synthetic dialogues across task-oriented and open-domain settings, we quantify how prompting for one style feature causally affects others via a pairwise LLM-as-a-Judge evaluation framework. Our results reveal consistent and structured side effects; for example, prompting for conciseness significantly reduces perceived expertise. They demonstrate that style features are deeply entangled rather than orthogonal. To support future research, we introduce CASSE (Conversational Agent Stylistic Side Effects), a dataset capturing these complex interactions. We further evaluate prompt-based and activation-steering-based mitigation strategies and find that while they can partially restore suppressed traits, they often degrade the primary intended style. These findings challenge the assumption of faithful style control in LLMs and highlight the need for multi-objective and more principled approaches to safe, targeted stylistic steering in conversational agents.

CVSep 25, 2024
Spotlight Text Detector: Spotlight on Candidate Regions Like a Camera

Xu Han, Junyu Gao, Chuang Yang et al.

The irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of geographically close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to restructure texts. However, the text kernel is an artificial object with incomplete semantic features that are prone to incorrect or missing detection. In addition, unlike general objects, the geometry features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes it difficult to detect them accurately. To address the above problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates efforts on the candidate kernel, like a camera focusing on the target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features for scene texts. It helps extract various spatial relationships to improve the model's ability to recognize kernel regions. Ablation studies prove the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.

CLApr 19, 2025Code
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

Bowen Jiang, Zhuoqun Hao, Young-Min Cho et al.

Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks -- from offering writing support to delivering tailored recommendations or consultations. Over time, the interaction history between a user and an LLM can provide extensive information about an individual's traits and preferences. However, open questions remain on how well LLMs today can effectively leverage such history to (1) internalize the user's inherent traits and preferences, (2) track how the user profiling and preferences evolve over time, and (3) generate personalized responses accordingly in new scenarios. In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories, each containing up to 60 sessions of multi-turn conversations across 15 real-world tasks that require personalization. Given an in-situ user query, i.e. query issued by the user from the first-person perspective, we evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile. We observe that current LLMs still struggle to recognize the dynamic evolution in users' profiles over time through direct prompting approaches. As a consequence, LLMs often fail to deliver responses that align with users' current situations and preferences, with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0 achieving only around 50% overall accuracy, suggesting room for improvement. We hope that PERSONAMEM, along with the user profile and conversation simulation pipeline, can facilitate future research in the development of truly user-aware chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.

AIFeb 15, 2024Code
Chain-of-Planned-Behaviour Workflow Elicits Few-Shot Mobility Generation in LLMs

Chenyang Shao, Fengli Xu, Bingbing Fan et al.

The powerful reasoning capabilities of large language models (LLMs) have brought revolutionary changes to many fields, but their performance in human behaviour generation has not yet been extensively explored. This gap likely emerges because the internal processes governing behavioural intentions cannot be solely explained by abstract reasoning. Instead, they are also influenced by a multitude of factors, including social norms and personal preference. Inspired by the Theory of Planned Behaviour (TPB), we develop an LLM workflow named Chain-of-Planned-Behaviour (CoPB) for mobility behaviour generation, which reflects the important spatio-temporal dynamics of human activities. By exploiting the cognitive structures of attitude, subjective norms, and perceived behaviour control in TPB, CoPB significantly enhances the ability of LLMs to reason about the intention of the next movement. Specifically, CoPB substantially reduces the error rate of mobility intention generation from 57.8% to 19.4%. To improve the scalability of the proposed CoPB workflow, we further explore the synergy between LLMs and mechanistic models. We find that mechanistic mobility models, such as the gravity model, can effectively map mobility intentions to physical mobility behaviours. The strategy of integrating CoPB with the gravity model can reduce the token cost by 97.7% and achieve better performance simultaneously. Besides, the proposed CoPB workflow can facilitate GPT-4-turbo to automatically generate high-quality labels for mobility behaviour reasoning. We show such labels can be leveraged to fine-tune the smaller-scale, open-source LLaMA 3-8B, which significantly reduces usage costs without sacrificing the quality of the generated behaviours.
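The idea of mapping an intention to a physical destination with a gravity model can be sketched as below. It uses the textbook form in which the probability of a destination is proportional to its attractiveness divided by distance raised to a power `beta`; how CoPB actually parameterizes its gravity model is not stated in the abstract, so the function, the venue names, and `beta=2.0` are all illustrative assumptions.

```python
import math

def gravity_probabilities(origin, candidates, beta=2.0):
    """Destination probabilities proportional to attractiveness / distance**beta.

    candidates maps a venue name to (x, y, attractiveness).
    """
    scores = {}
    for name, (x, y, attractiveness) in candidates.items():
        # Guard against a zero distance when a venue sits at the origin.
        dist = max(math.hypot(x - origin[0], y - origin[1]), 1e-6)
        scores[name] = attractiveness / dist ** beta
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

# Hypothetical venues matching an intention such as "eat" (produced upstream).
venues = {
    "cafe_near": (1.0, 0.0, 10.0),
    "cafe_far": (4.0, 0.0, 40.0),
}
probs = gravity_probabilities(origin=(0.0, 0.0), candidates=venues)
```

In this toy case the far venue is four times as attractive but sixteen times more penalized by distance, so the near venue receives probability 0.8. Choosing the location this way needs no further LLM call, which is consistent with the token savings the abstract reports.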

39.0CVApr 24
HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos

Xu Lu, Qianhong Peng, Qihao Zhou et al.

Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.

CVJan 12, 2024Code
SamLP: A Customized Segment Anything Model for License Plate Detection

Haoxuan Ding, Junyu Gao, Yuan Yuan et al.

With the emergence of foundation models, this novel paradigm of deep learning has encouraged many powerful achievements in natural language processing and computer vision. Foundation models have many advantages, such as excellent feature extraction power, mighty generalization ability, and great few-shot and zero-shot learning capacity, which are beneficial to vision tasks. As the unique identity of a vehicle, the license plate (LP) has diverse styles and appearances across countries and regions, and even different types of vehicles have different LPs. However, recent deep learning based license plate detectors are mainly trained on specific datasets, and these limited datasets constrain the effectiveness and robustness of LP detectors. To alleviate the negative impact of limited data, an attempt to exploit the advantages of foundation models is implemented in this paper. We customize a vision foundation model, i.e., the Segment Anything Model (SAM), for the LP detection task and propose the first LP detector based on a vision foundation model, named SamLP. Specifically, we design a Low-Rank Adaptation (LoRA) fine-tuning strategy to inject extra parameters into SAM and transfer SAM to the LP detection task. We then further propose a promptable fine-tuning step to provide SamLP with promptable segmentation capacity. The experiments show that our proposed SamLP achieves promising detection performance compared to other LP detectors. Meanwhile, the proposed SamLP has great few-shot and zero-shot learning ability, which shows the potential of transferring vision foundation models. The code is available at https://github.com/Dinghaoxuan/SamLP

CLDec 4, 2025
ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation

Yiming Xu, Yuan Yuan, Vijay Viswanathan et al.

Text clustering is a fundamental task in natural language processing, yet traditional clustering algorithms with pre-trained embeddings often struggle in domain-specific contexts without costly fine-tuning. Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that instead treats the LLM as the clustering core, guided by lightweight embedding methods. The framework proceeds in three stages: embedding-guided subset partition, LLM-driven topic summarization, and LLM-based topic assignment. This design enables direct incorporation of domain knowledge and user preferences, fully leveraging the contextual adaptability of LLMs. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion not only achieves state-of-the-art performance on standard tasks but also delivers substantial gains in specialized domains. To support future work, we release our newly constructed dataset and results on all benchmarks.

LGNov 19, 2024Code
Diffusion Transformers as Open-World Spatiotemporal Foundation Models

Yuan Yuan, Chonghua Han, Jingtao Ding et al.

The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format; 2) With task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines with training data. UrbanDiT sets up a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.

CLFeb 3
One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

Bowen Jiang, Taiwei Shi, Ryo Kamoi et al.

This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.

CVSep 25, 2024
Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection

Xu Han, Junyu Gao, Chuang Yang et al.

Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, each pixel is predicted in isolation, without pixel-feature interaction, which also degrades detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noise. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish texts of varying scale effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling texts of different scales and confirm that the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.

SDNov 13, 2023
Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation

Zhaojian Li, Bin Zhao, Yuan Yuan

Binaural stereo audio is recorded by imitating the way the human ear receives sound, which provides people with an immersive listening experience. Existing approaches leverage autoencoders and directly exploit visual spatial information to synthesize binaural stereo, resulting in a limited representation of visual guidance. For the first time, we propose a visually guided generative adversarial approach for generating binaural stereo audio from mono audio. Specifically, we develop a Stereo Audio Generation Model (SAGM), which utilizes shared spatio-temporal visual information to guide the generator and the discriminator to work separately. The shared visual information is updated alternately in the generative adversarial stage, allowing the generator and discriminator to deliver their respective guided knowledge while visually sharing. The proposed method learns bidirectional complementary visual information, which facilitates the expression of visual guidance in generation. In addition, spatial perception is a crucial attribute of binaural stereo audio, and thus the evaluation of stereo spatial perception is essential. However, previous metrics failed to measure the spatial perception of audio. To this end, a metric to measure the spatial perception of audio is proposed for the first time. The proposed metric is capable of measuring the magnitude and direction of spatial perception in the temporal dimension. Further, considering its function, it is feasible to utilize it instead of demanding user studies to some extent. The proposed method achieves state-of-the-art performance on 2 datasets and 5 evaluation metrics. Qualitative experiments and user studies demonstrate that the method generates space-realistic stereo audio.

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

94.6CVMar 15Code
UAVBench and UAVIT-1M: Benchmarking and Enhancing MLLMs for Low-Altitude UAV Vision-Language Understanding

Yang Zhan, Yuan Yuan

Multimodal Large Language Models (MLLMs) have made significant strides in natural images and satellite remote sensing images. However, understanding low-altitude drone scenarios remains a challenge. Existing datasets primarily focus on a few specific low-altitude visual tasks, which cannot fully assess the ability of MLLMs in real-world low-altitude UAV applications. Therefore, we introduce UAVBench, a comprehensive benchmark, and UAVIT-1M, a large-scale instruction tuning dataset, designed to evaluate and improve MLLMs' abilities in low-altitude vision-language tasks. UAVBench comprises 43 test units and 966k high-quality data samples across 10 tasks at the image-level and region-level. UAVIT-1M consists of approximately 1.24 million diverse instructions, covering 789k multi-scene images and about 2,000 types of spatial resolutions with 11 distinct tasks. UAVBench and UAVIT-1M feature pure real-world visual images and rich weather conditions, and involve manual verification to ensure high quality. Our in-depth analysis of 11 state-of-the-art MLLMs using UAVBench reveals that open-source MLLMs cannot generate accurate conversations about low-altitude visual content, lagging behind closed-source MLLMs. Extensive experiments demonstrate that fine-tuning open-source MLLMs on UAVIT-1M significantly addresses this gap. Our contributions pave the way for bridging the gap between current MLLMs and low-altitude UAV real-world application demands. (Project page: https://UAVBench.github.io/)

LGNov 20, 2024Code
UniFlow: A Foundation Model for Unified Urban Spatio-Temporal Flow Prediction

Yuan Yuan, Jingtao Ding, Chonghua Han et al.

Urban spatio-temporal flow prediction, encompassing traffic flows and crowd flows, is crucial for optimizing city infrastructure and managing traffic and emergency responses. Traditional approaches have relied on separate models tailored to either grid-based data, representing cities as uniform cells, or graph-based data, modeling cities as networks of nodes and edges. In this paper, we build UniFlow, a foundational model for general urban flow prediction that unifies both grid-based and graph-based data. We first design a multi-view spatio-temporal patching mechanism to standardize different data into a consistent sequential format and then introduce a spatio-temporal transformer architecture to capture complex correlations and dynamics. To leverage shared spatio-temporal patterns across different data types and facilitate effective cross-learning, we propose SpatioTemporal Memory Retrieval Augmentation (ST-MRA). By creating structured memory modules to store shared spatio-temporal patterns, ST-MRA enhances predictions through adaptive memory retrieval. Extensive experiments demonstrate that UniFlow outperforms existing models in both grid-based and graph-based flow prediction, excelling particularly in scenarios with limited data availability, showcasing its superior performance and broad applicability. The datasets and code implementation have been released at https://github.com/YuanYuan98/UniFlow.
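To make the idea of turning grid data into "a consistent sequential format" concrete, here is a minimal sketch that cuts a time x height x width grid into non-overlapping spatio-temporal patches and flattens each into one token. The abstract does not specify the patching scheme, so the non-overlapping layout, the function name `patchify_grid`, and the patch-size parameters are all assumptions for illustration.

```python
def patchify_grid(grid, pt, ph, pw):
    """Split a T x H x W nested list into a sequence of flattened patches.

    Each token covers pt time steps, ph rows and pw columns (assumed scheme).
    Tokens are ordered by time block, then row block, then column block.
    """
    T, H, W = len(grid), len(grid[0]), len(grid[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("grid dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                tokens.append([grid[t][h][w]
                               for t in range(t0, t0 + pt)
                               for h in range(h0, h0 + ph)
                               for w in range(w0, w0 + pw)])
    return tokens
```

A graph-based dataset would need its own grouping rule (for example, patches over node neighborhoods) before it could share the same token sequence interface; that part is not sketched here.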

CLDec 16, 2024Code
Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework

Xuanming Zhang, Yuxuan Chen, Yiming Zheng et al.

In real-world software development, improper or missing exception handling can severely impact the robustness and reliability of code. Exception handling mechanisms require developers to detect, capture, and manage exceptions according to high standards, but many developers struggle with these tasks, leading to fragile code. This problem is particularly evident in open-source projects and impacts the overall quality of the software ecosystem. To address this challenge, we explore the use of large language models (LLMs) to improve exception handling in code. Through extensive analysis, we identify three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Block, and Distorted Handling Solution. These problems are widespread across real-world repositories, suggesting that robust exception handling practices are often overlooked or mishandled. In response, we propose Seeker, a multi-agent framework inspired by expert developer strategies for exception handling. Seeker uses five agents (Scanner, Detector, Predator, Ranker, and Handler) to assist LLMs in detecting, capturing, and resolving exceptions more effectively. Our work is the first systematic study on leveraging LLMs to enhance exception handling practices in real development scenarios, providing valuable insights for future improvements in code reliability.

LGDec 15, 2024Code
Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model

Yujun Li, Hongyuan Zhang, Yuan Yuan

Graph contrastive learning (GCL) aims to learn representations from unlabeled graph data in a self-supervised manner and has developed rapidly in recent years. However, edge-level contrasts are not well explored by most existing GCL methods. Most studies in GCL only regard edges as auxiliary information while updating node features. One of the primary obstacles of edge-based GCL is the heavy computation burden. To tackle this issue, we propose a model that can efficiently learn edge features for GCL, namely Augmentation-Free Edge Contrastive Learning (AFECL), to achieve edge-edge contrast. AFECL requires no augmentation and consists of two parts. First, we design a novel edge feature generation method, where edge features are computed by concatenating the embeddings of their connected nodes. Second, an edge contrastive learning scheme is developed, where edges connecting the same nodes are defined as positive pairs, and other edges are defined as negative pairs. Experimental results show that, compared with recent state-of-the-art GCL methods or even some supervised GNNs, AFECL achieves SOTA performance on link prediction and on semi-supervised node classification with extremely scarce labels. The source code is available at https://github.com/YujunLi361/AFECL.
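The two parts described in this abstract can be illustrated with a small sketch: edge features built by concatenating endpoint embeddings, and edge pairs split into positives and negatives. This is not the AFECL code. In particular, reading "edges connecting the same nodes" as "edges that share at least one endpoint" is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def edge_features(node_emb, edges):
    """Concatenate the embeddings of the two endpoints of each edge."""
    return {(u, v): node_emb[u] + node_emb[v] for u, v in edges}


def split_pairs(edges):
    """Split all edge pairs into positives and negatives.

    Assumption: two edges form a positive pair when they share an endpoint;
    every other pair is treated as negative.
    """
    positives, negatives = [], []
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):
            positives.append((e1, e2))
        else:
            negatives.append((e1, e2))
    return positives, negatives
```

For a path graph 0-1-2-3, the edges (0, 1) and (1, 2) share node 1 and form a positive pair, while (0, 1) and (2, 3) share nothing and form a negative pair. Because the pairs come from the graph structure itself, no augmented views are needed.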