39 Papers

CV · Feb 18, 2023 · Code
StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning

Yuqian Fu, Yu Xie, Yanwei Fu et al.

Cross-Domain Few-Shot Learning (CD-FSL) is a recently emerging task that tackles few-shot learning across different domains. It aims at transferring prior knowledge learned on the source dataset to novel target datasets. The CD-FSL task is especially challenged by the huge domain gap between different datasets. Critically, such a domain gap actually comes from the changes of visual styles, and wave-SAN empirically shows that spanning the style distribution of the source data helps alleviate this issue. However, wave-SAN simply swaps styles of two images. Such a vanilla operation makes the generated styles "real" and "easy", which still fall into the original set of the source styles. Thus, inspired by vanilla adversarial learning, a novel model-agnostic meta Style Adversarial training (StyleAdv) method together with a novel style adversarial attack method is proposed for CD-FSL. Particularly, our style attack method synthesizes both "virtual" and "hard" adversarial styles for model training. This is achieved by perturbing the original style with the signed style gradients. By continually attacking styles and forcing the model to recognize these challenging adversarial styles, our model gradually becomes robust to the visual styles, thus boosting the generalization ability for novel target datasets. Besides the typical CNN-based backbone, we also employ our StyleAdv method on a large-scale pretrained vision transformer. Extensive experiments conducted on eight target datasets show the effectiveness of our method. Whether built upon ResNet or ViT, we achieve the new state of the art for CD-FSL. Code is available at https://github.com/lovelyqian/StyleAdv-CDFSL.
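
The idea of perturbing a style with its signed gradient can be illustrated with a minimal sketch. It assumes that the "style" of one feature channel is its mean and standard deviation, and it replaces the real classification loss with a hypothetical linear toy loss whose gradients are written out by hand. The names `channel_style`, `toy_loss`, `style_attack`, the weights `w`, and the step size `eps` are illustrative assumptions and are not taken from the StyleAdv code.

```python
import math

def channel_style(x):
    """Return (mean, std) of one channel's activations, used here as its 'style'."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var + 1e-5)

def sign(v):
    """Sign of v as -1, 0, or 1."""
    return (v > 0) - (v < 0)

def toy_loss(features, w):
    """Hypothetical stand-in for a task loss: a weighted sum of the features."""
    return sum(wi * fi for wi, fi in zip(w, features))

def style_attack(x, w, eps):
    """Take one signed-gradient step on the style (mean, std) of channel x."""
    mu, sigma = channel_style(x)
    x_norm = [(v - mu) / sigma for v in x]          # style-free content
    # For L = sum_i w_i * (sigma * x_norm_i + mu) the gradients are analytic:
    grad_mu = sum(w)
    grad_sigma = sum(wi * xi for wi, xi in zip(w, x_norm))
    # Move each style statistic by eps in the direction that increases the loss.
    mu_adv = mu + eps * sign(grad_mu)
    sigma_adv = sigma + eps * sign(grad_sigma)
    x_adv = [sigma_adv * xi + mu_adv for xi in x_norm]  # re-apply the attacked style
    return (mu, sigma), (mu_adv, sigma_adv), x_adv

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    w = [0.5, -0.2, 0.1, 0.3]
    original, attacked, x_adv = style_attack(x, w, eps=0.1)
    print("original style:", original)
    print("attacked style:", attacked)
    print("loss before/after:", toy_loss(x, w), toy_loss(x_adv, w))
```

In a real model the gradients would come from automatic differentiation through the network rather than from a closed-form expression, and the statistics would be computed per channel over a feature map.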

CV · Oct 11, 2022 · Code
ME-D2N: Multi-Expert Domain Decompositional Network for Cross-Domain Few-Shot Learning

Yuqian Fu, Yu Xie, Yanwei Fu et al.

Recently, Cross-Domain Few-Shot Learning (CD-FSL), which aims at addressing the Few-Shot Learning (FSL) problem across different domains, has attracted rising attention. The core challenge of CD-FSL lies in the domain gap between the source and novel target datasets. Though many attempts have been made for CD-FSL without any target data during model training, the huge domain gap still makes it hard for existing CD-FSL methods to achieve very satisfactory results. Alternatively, learning CD-FSL models with a few labeled target-domain examples, which is more realistic and promising, is advocated in previous work \cite{fu2021meta}. Thus, in this paper, we stick to this setting and technically contribute a novel Multi-Expert Domain Decompositional Network (ME-D2N). Concretely, to solve the data imbalance problem between the source data with sufficient examples and the auxiliary target data with limited examples, we build our model under the umbrella of multi-expert learning. Two teacher models, which can be considered experts in their corresponding domains, are first trained on the source and the auxiliary target sets, respectively. Then, the knowledge distillation technique is introduced to transfer the knowledge from the two teachers to a unified student model. Taking a step further, to help our student model learn knowledge from different domain teachers simultaneously, we present a novel domain decomposition module that learns to decompose the student model into two domain-related sub-parts. This is achieved by a novel domain-specific gate that learns to assign each filter to only one specific domain in a learnable way. Extensive experiments demonstrate the effectiveness of our method. Codes and models are available at https://github.com/lovelyqian/ME-D2N_for_CDFSL.

CV · Mar 15, 2022
Wave-SAN: Wavelet based Style Augmentation Network for Cross-Domain Few-Shot Learning

Yuqian Fu, Yu Xie, Yanwei Fu et al.

Previous few-shot learning (FSL) works mostly are limited to natural images of general concepts and categories. These works assume very high visual similarity between the source and target classes. In contrast, the recently proposed cross-domain few-shot learning (CD-FSL) aims at transferring knowledge from general nature images of many labeled examples to novel domain-specific target categories of only a few labeled examples. The key challenge of CD-FSL lies in the huge data shift between source and target domains, which is typically in the form of totally different visual styles. This makes it very nontrivial to directly extend the classical FSL methods to address the CD-FSL task. To this end, this paper studies the problem of CD-FSL by spanning the style distributions of the source dataset. Particularly, wavelet transform is introduced to enable the decomposition of visual representations into low-frequency components such as shape and style and high-frequency components e.g., texture. To make our model robust to visual styles, the source images are augmented by swapping the styles of their low-frequency components with each other. We propose a novel Style Augmentation (StyleAug) module to implement this idea. Furthermore, we present a Self-Supervised Learning (SSL) module to ensure the predictions of style-augmented images are semantically similar to the unchanged ones. This avoids the potential semantic drift problem in exchanging the styles. Extensive experiments on two CD-FSL benchmarks show the effectiveness of our method. Our codes and models will be released.
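
A rough sketch of swapping low-frequency styles is given below. It makes several simplifying assumptions: signals are 1-D lists of even length instead of 2-D feature maps, a single-level Haar transform stands in for the wavelet transform, and "style" is taken to be the mean and standard deviation of the low-frequency band. The function names are hypothetical and do not come from the Wave-SAN implementation.

```python
import math

def haar(x):
    """Single-level 1-D Haar transform: returns (low, high) frequency bands."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def inverse_haar(low, high):
    """Invert the single-level Haar transform."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out += [(l + h) / s, (l - h) / s]
    return out

def mean_std(x):
    """Mean and (slightly regularized) standard deviation of a list."""
    mu = sum(x) / len(x)
    return mu, math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + 1e-8)

def swap_low_freq_style(a, b):
    """Exchange the low-band mean/std of a and b; high bands stay untouched."""
    low_a, high_a = haar(a)
    low_b, high_b = haar(b)
    mu_a, sd_a = mean_std(low_a)
    mu_b, sd_b = mean_std(low_b)
    new_low_a = [sd_b * (v - mu_a) / sd_a + mu_b for v in low_a]
    new_low_b = [sd_a * (v - mu_b) / sd_b + mu_a for v in low_b]
    return inverse_haar(new_low_a, high_a), inverse_haar(new_low_b, high_b)

if __name__ == "__main__":
    a_aug, b_aug = swap_low_freq_style([1.0, 3.0, 2.0, 6.0], [10.0, 14.0, 20.0, 28.0])
    print(a_aug)
    print(b_aug)
```

Because only the low band is renormalized, the high-frequency detail of each signal is preserved exactly, which is the property the augmentation relies on.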

CV · May 29
Cross-Modal Clinical Knowledge Integration for Mammography Report Generation

Jiayi Zhu, Fuxiang Huang, Yu Xie et al.

Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.

LG · Jul 10, 2024
Resource Allocation for Twin Maintenance and Computing Task Processing in Digital Twin Vehicular Edge Computing Network

Yu Xie, Qiong Wu, Pingyi Fan et al.

As a promising technology, vehicular edge computing (VEC) can provide computing and caching services by deploying VEC servers near vehicles. However, VEC networks still face challenges such as high vehicle mobility. Digital twin (DT), an emerging technology, can predict, estimate, and analyze real-time states by digitally modeling objects in the physical world. By integrating DT with VEC, a virtual vehicle DT can be created in the VEC server to monitor the real-time operating status of vehicles. However, maintaining the vehicle DT model requires ongoing attention from the VEC server, which also needs to offer computing services for the vehicles. Therefore, effective allocation and scheduling of VEC server resources are crucial. This study focuses on a general VEC network with a single VEC service and multiple vehicles, examining the two types of delays caused by twin maintenance and computational processing within the network. By transforming the problem using satisfaction functions, we propose an optimization problem aimed at maximizing each vehicle's resource utility to determine the optimal resource allocation strategy. Given the non-convex nature of the issue, we employ multi-agent Markov decision processes to reformulate the problem. Subsequently, we propose the twin maintenance and computing task processing resource collaborative scheduling (MADRL-CSTC) algorithm, which leverages multi-agent deep reinforcement learning. Through experimental comparisons with alternative algorithms, it demonstrates that our proposed approach is effective in terms of resource allocation.

CY · Aug 27, 2024
Measuring Human Contribution in AI-Assisted Content Generation

Yueqi Xie, Tao Qi, Jingwei Yi et al.

With the growing prevalence of generative artificial intelligence (AI), an increasing amount of content is no longer exclusively generated by humans but by generative AI models with human guidance. This shift presents notable challenges for the delineation of originality due to the varying degrees of human contribution in AI-assisted works. This study raises the research question of measuring human contribution in AI-assisted content generation and introduces a framework to address this question that is grounded in information theory. By calculating mutual information between human input and AI-assisted output relative to self-information of AI-assisted output, we quantify the proportional information contribution of humans in content generation. Our experimental results demonstrate that the proposed measure effectively discriminates between varying degrees of human contribution across multiple creative domains. We hope that this work lays a foundation for measuring human contributions in AI-assisted content generation in the era of generative AI.
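
The ratio of mutual information to output information can be sketched on a small discrete example. This is only an illustration under strong assumptions: the human input X and the AI-assisted output Y are treated as discrete random variables with a fully known joint distribution, and the entropy H(Y) is used as the expected self-information of the output. The function name `human_contribution` is hypothetical.

```python
import math

def human_contribution(joint):
    """Return I(X;Y) / H(Y) for a joint table where joint[x][y] = P(X=x, Y=y)."""
    px = [sum(row) for row in joint]            # marginal of the human input
    py = [sum(col) for col in zip(*joint)]      # marginal of the output
    mutual_info = sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    output_entropy = -sum(p * math.log2(p) for p in py if p > 0)
    if output_entropy == 0:
        raise ValueError("output is deterministic; the ratio is undefined")
    return mutual_info / output_entropy

if __name__ == "__main__":
    # Output fully determined by the human input.
    print(human_contribution([[0.5, 0.0], [0.0, 0.5]]))
    # Output independent of the human input.
    print(human_contribution([[0.25, 0.25], [0.25, 0.25]]))
```

The ratio lies between 0 (the output carries no information about the human input) and 1 (the output is fully determined by it).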

LG · Jul 16, 2024
Digital Twin Vehicular Edge Computing Network: Task Offloading and Resource Allocation

Yu Xie, Qiong Wu, Pingyi Fan

With the increasing demand for multiple applications on the Internet of Vehicles, vehicles are required to carry out multiple computing tasks in real time. However, due to the insufficient computing capability of the vehicles themselves, offloading tasks to vehicular edge computing (VEC) servers and allocating computing resources to tasks become a challenge. In this paper, a multi-task digital twin (DT) VEC network is established. By using DT to develop offloading strategies and resource allocation strategies for multiple tasks of each vehicle in a single slot, an optimization problem is constructed. To solve it, we propose a multi-agent reinforcement learning method for task offloading and resource allocation. Numerous experiments demonstrate that our method is effective compared to other benchmark algorithms.

CV · Aug 1, 2024
DNTextSpotter: Arbitrary-Shaped Scene Text Spotting via Improved Denoising Training

Yu Xie, Qian Qiao, Jun Gao et al.

More and more end-to-end text spotting methods based on Transformer architecture have demonstrated superior performance. These methods utilize a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and actual objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, thereby affecting the training performance of the model. Existing literature applies denoising training to solve the problem of bipartite graph matching instability in object detection tasks. Unfortunately, this denoising training method cannot be directly applied to text spotting tasks, as these tasks need to perform irregular shape detection tasks and more complex text recognition tasks than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text spotting. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, considering that the output of the text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize noised content queries, thereby assisting in the alignment of text content and position. To improve the model's perception of the background, we further utilize an additional loss function for background characters classification in the denoising training part. Although DNTextSpotter is conceptually simple, it outperforms the state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), especially yielding an improvement of 11.3% against the best approach in Inverse-Text dataset.

CL · Jan 19, 2025 · Code
InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models

Jing Ding, Kai Feng, Binbin Lin et al.

The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering tasks. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at https://github.com/HaileyFamo/InsQABench.git.

CR · Jun 11, 2024 · Code
VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Yu Liu, Lang Gao, Mingxin Yang et al.

Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program vulnerabilities, a more specific task related to code, and evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.

CL · Mar 6, 2025 · Code
Uncovering inequalities in new knowledge learning by large language models across different languages

Chenglong Wang, Haoyu Tang, Xiyuan Yang et al.

As large language models (LLMs) gradually become integral tools for problem solving in daily life worldwide, understanding linguistic inequality is becoming increasingly important. Existing research has primarily focused on static analyses that assess the disparities in the existing knowledge and capabilities of LLMs across languages. However, LLMs are continuously evolving, acquiring new knowledge to generate up-to-date, domain-specific responses. Investigating linguistic inequalities within this dynamic process is, therefore, also essential. In this paper, we explore inequalities in new knowledge learning by LLMs across different languages and four key dimensions: effectiveness, transferability, prioritization, and robustness. Through extensive experiments under two settings (in-context learning and fine-tuning) using both proprietary and open-source models, we demonstrate that low-resource languages consistently face disadvantages across all four dimensions. By shedding light on these disparities, we aim to raise awareness of linguistic inequalities in LLMs' new knowledge learning, fostering the development of more inclusive and equitable future LLMs.

ML · Feb 18
Enhanced Diffusion Sampling: Efficient Rare Event Sampling and Free Energy Calculation with Diffusion Models

Yu Xie, Ludwig Winkler, Lixin Sun et al.

The rare-event sampling problem has long been the central limiting factor in molecular dynamics (MD), especially in biomolecular simulation. Recently, diffusion models such as BioEmu have emerged as powerful equilibrium samplers that generate independent samples from complex molecular distributions, eliminating the cost of sampling rare transition events. However, a sampling problem remains when computing observables that rely on states which are rare in equilibrium, for example folding free energies. Here, we introduce enhanced diffusion sampling, enabling efficient exploration of rare-event regions while preserving unbiased thermodynamic estimators. The key idea is to perform quantitatively accurate steering protocols to generate biased ensembles and subsequently recover equilibrium statistics via exact reweighting. We instantiate our framework in three algorithms: UmbrellaDiff (umbrella sampling with diffusion models), $Δ$G-Diff (free-energy differences via tilted ensembles), and MetaDiff (a batchwise analogue for metadynamics). Across toy systems, protein folding landscapes and folding free energies, our methods achieve fast, accurate, and scalable estimation of equilibrium properties within GPU-minutes to hours per system -- closing the rare-event sampling gap that remained after the advent of diffusion-model equilibrium samplers.
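
The step of recovering equilibrium statistics from a biased ensemble by reweighting can be illustrated on a toy system. The sketch below is generic importance reweighting and not the paper's steering protocol: it assumes a small set of discrete states with known energies, draws samples from a distribution biased by a known potential V, and undoes the bias with weights exp(+beta * V). The function names and the two-state example are hypothetical.

```python
import math
import random

def reweighted_probabilities(energies, bias, n_samples, beta=1.0, seed=0):
    """Sample states from exp(-beta * (E + V)) and reweight back to exp(-beta * E)."""
    rng = random.Random(seed)
    biased = [math.exp(-beta * (e + v)) for e, v in zip(energies, bias)]
    states = rng.choices(range(len(energies)), weights=biased, k=n_samples)
    weight_sum = [0.0] * len(energies)
    for s in states:
        # Each biased sample carries weight exp(+beta * V) to cancel the bias.
        weight_sum[s] += math.exp(beta * bias[s])
    total = sum(weight_sum)
    return [w / total for w in weight_sum]

def free_energy_difference(probs, a, b, beta=1.0):
    """Free energy of state b relative to state a, in units of 1/beta."""
    return -math.log(probs[b] / probs[a]) / beta

if __name__ == "__main__":
    energies = [0.0, 5.0]   # state 1 is rare at equilibrium
    bias = [0.0, -5.0]      # bias makes both states equally likely to be sampled
    probs = reweighted_probabilities(energies, bias, n_samples=20000)
    print(free_energy_difference(probs, 0, 1))
```

With the bias, the rare state is visited about half the time, so its free energy is estimated from thousands of samples instead of the handful an unbiased run of the same length would produce.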

CV · Aug 12, 2024
A Simple Task-aware Contrastive Local Descriptor Selection Strategy for Few-shot Learning between inter class and intra class

Qian Qiao, Yu Xie, Shaoyao Huang et al.

Few-shot image classification aims to classify novel classes with few labeled samples. Recent research indicates that deep local descriptors have better representational capabilities. These studies recognize the impact of background noise on classification performance. They typically filter query descriptors using all local descriptors in the support classes or engage in bidirectional selection between local descriptors in support and query sets. However, they ignore the fact that background features may be useful for the classification performance of specific tasks. This paper proposes a novel task-aware contrastive local descriptor selection network (TCDSNet). First, we calculate the contrastive discriminative score for each local descriptor in the support class, and select discriminative local descriptors to form a support descriptor subset. Finally, we leverage support descriptor subsets to adaptively select discriminative query descriptors for specific tasks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on both general and fine-grained datasets.

LG · May 6
Hierarchical Multi-Fidelity Learning for Predicting Three-Dimensional Flame Wrinkling and Turbulent Burning Velocity

Saghar Zolfaghari, Yu Xie, Junfeng Yang et al.

High-fidelity experimental characterization of turbulent premixed flames remains limited by the cost and complexity of advanced diagnostics, particularly under elevated pressures and intense turbulence where measurements of coupled flame morphology and burning dynamics are sparse. Here, we develop a hierarchical multi-fidelity neural network framework (MuFiNNs) to address this challenge by integrating sparse high-fidelity experimental data with structured low-fidelity representations encoding dominant physical trends. The framework combines hierarchical low-fidelity construction with nonlinear multi-fidelity correction to learn coupled geometric and reactive flame behavior while recovering discrepancies that simplified models alone cannot capture. The methodology is applied to expanding turbulent premixed flames to predict three-dimensional flame wrinkling dynamics and turbulent mass burning velocity across varying fuels, pressures, and turbulence intensities. Using experimentally informed low-fidelity trend models with sparse high-fidelity measurements, MuFiNNs accurately reconstruct observed flame behavior, enable interpolation across unseen operating conditions, and demonstrate robust extrapolation beyond the training domain. Importantly, the framework remains effective in noisy, weakly structured, or experimentally inaccessible regimes where conventional data-driven approaches often fail. These results show that hierarchical multi-fidelity learning provides a scalable and physically grounded strategy for predictive combustion modeling in data-limited regimes. More broadly, this work establishes multi-fidelity scientific machine learning as a practical framework for extracting physically meaningful predictive models from sparse experiments, particularly for instability-dominated and turbulence-sensitive reactive flows where high-fidelity data acquisition is demanding.

LG · Jan 29
SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation

Yu Xie, Xing Kai Ren, Ying Qi et al.

Reinforcement learning-based preference optimization is increasingly used to align list-wise generative recommenders with complex, multi-objective user feedback, yet existing optimizers such as Gradient-Bounded Policy Optimization (GBPO) exhibit structural limitations in recommendation settings. We identify a Symmetric Conservatism failure mode in which symmetric update bounds suppress learning from rare positive signals (e.g., cold-start items), static negative-sample constraints fail to prevent diversity collapse under rejection-dominated feedback, and group-normalized multi-objective rewards lead to low-resolution training signals. To address these issues, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimizer designed for list-wise generative recommendation. SAGE introduces sequence-level signal alignment via a geometric-mean importance ratio and a decoupled multi-objective advantage estimator to reduce token-level variance and mitigate reward collapse, together with asymmetric adaptive bounding that applies positive Boost updates to successful slates and an entropy-aware penalty to discourage low-diversity failures. Experiments on Amazon Product Reviews and the large-scale RecIF-Bench demonstrate consistent improvements in top-K accuracy, cold-start recall, and diversity across both Semantic-ID and native-text action spaces, while preserving numerical stability during training. These results suggest that asymmetric, sequence-aware policy optimization provides a principled and effective framework for addressing optimization failures in generative recommendation.
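
A geometric-mean importance ratio can be sketched as follows, assuming per-token log-probabilities of one generated sequence are available under the new and the old policy. Only this ratio is illustrated; the adaptive bounds and the entropy-aware penalty are not reproduced. The function names are hypothetical, and the product ratio is included purely for comparison.

```python
import math

def geometric_mean_ratio(logp_new, logp_old):
    """Sequence-level ratio: geometric mean of per-token ratios pi_new / pi_old."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def product_ratio(logp_new, logp_old):
    """Plain sequence ratio: product of per-token ratios, which grows with length."""
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))

if __name__ == "__main__":
    # Ten tokens, each 10% more likely under the new policy.
    logp_old = [math.log(0.5)] * 10
    logp_new = [math.log(0.55)] * 10
    print(geometric_mean_ratio(logp_new, logp_old))
    print(product_ratio(logp_new, logp_old))
```

Averaging in log space keeps the ratio on the same scale regardless of sequence length, whereas the product compounds small per-token changes into a large value for long sequences.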

CR · Mar 14
IdentityGuard: Context-Aware Restriction and Provenance for Personalized Synthesis

Lingyun Zhang, Yu Xie, Ping Chen

The nature of personalized text-to-image models poses a unique safety challenge that generic context-blind methods are ill-equipped to handle. Such global filters create a dilemma: to prevent misuse, they are forced to damage the model's broader utility by erasing concepts entirely, causing unacceptable collateral damage. Our work presents a more precisely targeted approach, built on the principle that security should be as context-aware as the threat itself, intrinsically bound to the personalized concept. We present IDENTITYGUARD, which realizes this principle through a conditional restriction that blocks harmful content only when combined with the personalized identity, and a concept-specific watermark for precise traceability. Experiments show our approach prevents misuse while preserving the model's utility and enabling robust traceability. By moving beyond blunt, global filters, our work demonstrates a more effective and responsible path toward AI safety.

CY · Oct 11, 2024
The Social Impact of Generative LLM-Based AI

Yu Xie, Sofia Avila

Liking it or not, ready or not, we are likely to enter a new phase of human history in which Artificial Intelligence (AI) will dominate economic production and social life -- the AI Revolution. Before the actual arrival of the AI Revolution, it is time for us to speculate on how AI will impact the social world. In this article, we focus on the social impact of generative LLM-based AI (GELLMAI), discussing societal factors that contribute to its technological development and its potential roles in enhancing both between-country and within-country social inequality. There are good indications that the US and China will lead the field and will be the main competitors for domination of AI in the world. We conjecture the AI Revolution will likely give rise to a post-knowledge society in which knowledge per se will become less important than in today's world. Instead, individual relationships and social identity will become more important. So will soft skills.

CV · Dec 2, 2024
Concept Replacer: Replacing Sensitive Concepts in Diffusion Models via Precision Localization

Lingyun Zhang, Yu Xie, Yanwei Fu et al.

As large-scale diffusion models continue to advance, they excel at producing high-quality images but often generate unwanted content, such as sexually explicit or violent content. Existing methods for concept removal generally guide the image generation process but can unintentionally modify unrelated regions, leading to inconsistencies with the original model. We propose a novel approach for targeted concept replacing in diffusion models, enabling specific concepts to be removed without affecting non-target areas. Our method introduces a dedicated concept localizer for precisely identifying the target concept during the denoising process, trained with few-shot learning to require minimal labeled data. Within the identified region, we introduce a training-free Dual Prompts Cross-Attention (DPCA) module to substitute the target concept, ensuring minimal disruption to surrounding content. We evaluate our method on concept localization precision and replacement efficiency. Experimental results demonstrate that our method achieves superior precision in localizing target concepts and performs coherent concept replacement with minimal impact on non-target areas, outperforming existing approaches.

CV · May 23, 2025
TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

Yu Xie, Jielei Zhang, Pengyu Chen et al.

Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.

LG · Mar 9, 2025
UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials

Gongbo Zhang, Yanting Li, Renqian Luo et al. · microsoft-research

Function in natural systems arises from one-dimensional sequences forming three-dimensional structures with specific properties. However, current generative models suffer from critical limitations: training objectives seldom target function directly, discrete sequences and continuous coordinates are optimized in isolation, and conformational ensembles are under-modeled. We present UniGenX, a unified generative foundation model that addresses these gaps by co-generating sequences and coordinates under direct functional and property objectives across proteins, molecules, and materials. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens, where a decoder-only autoregressive transformer provides global context and a conditional diffusion head generates numeric fields steered by task-specific tokens. Besides the new high SOTAs on structure prediction tasks, the model demonstrates state-of-the-art or competitive performance for the function-aware generation across domains: in materials, it achieves "conflicted" multi-property conditional generation, yielding 436 crystal candidates meeting triple constraints, including 11 with novel compositions; in chemistry, it sets new benchmarks on five property targets and conformer ensemble generation on GEOM; and in biology, it improves success in modeling protein induced fit (RMSD < 2 Å) by over 23-fold and enhances EC-conditioned enzyme design. Ablation studies and cross-domain transfer substantiate the benefits of joint discrete-continuous training, establishing UniGenX as a significant advance from prediction to controllable, function-aware generation.

CV · Dec 9, 2023
TALDS-Net: Task-Aware Adaptive Local Descriptors Selection for Few-shot Image Classification

Qian Qiao, Yu Xie, Ziyin Zeng et al.

Few-shot image classification aims to classify images from unseen novel classes with few samples. Recent works demonstrate that deep local descriptors exhibit enhanced representational capabilities compared to image-level features. However, most existing methods solely rely on either employing all local descriptors or directly utilizing partial descriptors, potentially resulting in the loss of crucial information. Moreover, these methods primarily emphasize the selection of query descriptors while overlooking support descriptors. In this paper, we propose a novel Task-Aware Adaptive Local Descriptors Selection Network (TALDS-Net), which exhibits the capacity for adaptive selection of task-aware support descriptors and query descriptors. Specifically, we compare the similarity of each local support descriptor with other local support descriptors to obtain the optimal support descriptor subset and then compare the query descriptors with the optimal support subset to obtain discriminative query descriptors. Extensive experiments demonstrate that our TALDS-Net outperforms state-of-the-art methods on both general and fine-grained datasets.

LGJun 25, 2025
Directed Link Prediction using GNN with Local and Global Feature Fusion

Yuyang Zhang, Xu Shen, Yu Xie et al.

Link prediction is a classical problem in graph analysis with many practical applications. For directed graphs, recently developed deep learning approaches typically analyze node similarities through contrastive learning and aggregate neighborhood information through graph convolutions. In this work, we propose a novel graph neural network (GNN) framework to fuse feature embedding with community information. We theoretically demonstrate that such hybrid features can improve the performance of directed link prediction. To utilize such features efficiently, we also propose an approach to transform input graphs into directed line graphs so that nodes in the transformed graph can aggregate more information during graph convolutions. Experiments on benchmark datasets show that our approach outperforms the state-of-the-art in most cases when 30%, 40%, 50%, and 60% of the connected links are used as training data, respectively.
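
As a rough illustration of the line-graph transformation mentioned above, the sketch below builds the standard directed line graph, where every edge (u, v) of the input becomes a node and is connected to every edge (v, w) that continues from its head. Whether the paper uses exactly this construction is an assumption; the function name and the toy edge list are hypothetical.

```python
def directed_line_graph(edges):
    """Map each directed edge (u, v) to the list of edges (v, w) that follow it."""
    out_edges = {}
    for u, v in edges:
        out_edges.setdefault(u, []).append((u, v))
    # Node (u, v) of the line graph points to every edge leaving v.
    return {(u, v): list(out_edges.get(v, [])) for u, v in edges}

# Toy directed graph: a -> b -> c -> a, plus b -> d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("b", "d")]
line_graph = directed_line_graph(edges)

for node, successors in line_graph.items():
    print(node, "->", successors)
```

Because each line-graph node carries a (source, target) pair, a graph convolution on the transformed graph aggregates over edge-to-edge paths rather than node neighborhoods.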

CVApr 5
SafeCtrl: Region-Aware Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

Lingyun Zhang, Yu Xie, Zhongli Fang et al.

The widespread deployment of text-to-image diffusion models is significantly challenged by the generation of visually harmful content, such as sexually explicit content, violence, and horror imagery. Common safety interventions, ranging from input filtering to model concept erasure, often suffer from two critical limitations: (1) a severe trade-off between safety and context preservation, where removing unsafe concepts degrades the fidelity of the safe content, and (2) vulnerability to adversarial attacks, where safety mechanisms are easily bypassed. To address these challenges, we propose SafeCtrl, a Region-Aware safety control framework operating on a Detect-Then-Suppress paradigm. Unlike global safety interventions, SafeCtrl first employs an attention-guided Detect module to precisely localize specific risk regions. Subsequently, a localized Suppress module, optimized via image-level Direct Preference Optimization (DPO), neutralizes harmful semantics only within the detected areas, effectively transforming unsafe objects into safe alternatives while leaving the surrounding context intact. Extensive experiments across multiple risk categories demonstrate that SafeCtrl achieves a superior trade-off between safety and fidelity compared to state-of-the-art methods. Crucially, our approach exhibits improved resilience against adversarial prompt attacks, offering a precise and robust solution for responsible generation.

AIJun 24, 2025
RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1

Yu Xie, Xingkai Ren, Ying Qi et al.

Traditional recommendation systems often grapple with "filter bubbles", underutilization of external knowledge, and a disconnect between model optimization and business policy iteration. To address these limitations, this paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs) and drawing inspiration from the DeepSeek R1 methodology. The framework begins by transforming user profiles, historical interactions, and multi-faceted item attributes into LLM-interpretable natural language prompts through a carefully engineered data construction process. Subsequently, a two-stage training paradigm is employed: the initial stage involves Supervised Fine-Tuning (SFT) to imbue the LLM with fundamental recommendation capabilities. The subsequent stage utilizes Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented with a Chain-of-Thought (CoT) mechanism. This stage guides the model through multi-step reasoning and holistic decision-making via a flexibly defined reward function, aiming to concurrently optimize recommendation accuracy, diversity, and other bespoke business objectives. Empirical evaluations on a real-world user behavior dataset from a large-scale social media platform demonstrate that RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty. It effectively mitigates the filter bubble effect and presents a promising avenue for the integrated optimization of recommendation models and policies under intricate business goals.
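
The group-relative step of GRPO can be sketched as follows: several responses are sampled for one prompt, each receives a scalar reward, and rewards are normalized by the group's mean and standard deviation to give advantages. The weighted accuracy/diversity reward is purely hypothetical (the abstract only says the reward is flexibly defined), as are the weights and the toy scores; implementations also differ on details such as population versus sample standard deviation.

```python
import statistics

def combined_reward(accuracy, diversity, w_acc=0.7, w_div=0.3):
    """Hypothetical scalar reward mixing two objectives; weights are assumptions."""
    return w_acc * accuracy + w_div * diversity

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some codebases use the sample std
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled recommendation lists for the same prompt, scored as
# (accuracy, diversity). These numbers are made up for illustration.
candidates = [(0.9, 0.2), (0.6, 0.8), (0.3, 0.5), (0.8, 0.6)]
rewards = [combined_reward(acc, div) for acc, div in candidates]
advantages = group_relative_advantages(rewards)

for cand, adv in zip(candidates, advantages):
    print(cand, round(adv, 3))
```

Responses with above-average reward in their group get positive advantages and are reinforced; no separate value network is needed for the baseline.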

MTRL-SCIJun 12, 2025
Coupled reaction and diffusion governing interface evolution in solid-state batteries

Jingxuan Ding, Laura Zichi, Matteo Carli et al.

Understanding and controlling the atomistic-level reactions governing the formation of the solid-electrolyte interphase (SEI) is crucial for the viability of next-generation solid-state batteries. However, challenges persist due to difficulties in experimentally characterizing buried interfaces and limits in simulation speed and accuracy. We conduct large-scale explicit reactive simulations with quantum accuracy for a symmetric battery cell, {\symcell}, enabled by active learning and deep equivariant neural network interatomic potentials. To automatically characterize the coupled reactions and interdiffusion at the interface, we formulate and use unsupervised classification techniques based on clustering in the space of local atomic environments. Our analysis reveals the formation of a previously unreported crystalline disordered phase, Li$_2$S$_{0.72}$P$_{0.14}$Cl$_{0.14}$, in the SEI, that evaded previous predictions based purely on thermodynamics, underscoring the importance of explicit modeling of full reaction and transport kinetics. Our simulations agree with and explain experimental observations of SEI formation and elucidate the Li creep mechanisms, critical to dendrite initiation, characterized by significant Li motion along the interface. Our approach is to create a digital twin from first principles, without adjustable parameters fitted to experiment. As such, it offers capabilities to gain insights into atomistic dynamics governing complex heterogeneous processes in solid-state synthesis and electrochemistry.

CYMar 2, 2025
Variance reduction in output from generative AI

Yu Xie, Yueqi Xie

Generative AI models, such as ChatGPT, will increasingly replace humans in producing output for a variety of important tasks. While prior work has mostly focused on the improvement in the average performance of generative AI models relative to humans' performance, much less attention has been paid to the significant reduction of variance in output produced by generative AI models. In this Perspective, we demonstrate that generative AI models are inherently prone to the phenomenon of "regression toward the mean" whereby variance in output tends to shrink relative to that in real-world distributions. We discuss potential social implications of this phenomenon across three levels (societal, group, and individual) and two dimensions (material and non-material). Finally, we discuss interventions to mitigate negative effects, considering the roles of both service providers and users. Overall, this Perspective aims to raise awareness of the importance of output variance in generative AI and to foster collaborative efforts to meet the challenges posed by the reduction of variance in output generated by AI models.

HCMar 30
From Passive Feeds to Guided Discovery: AI-Initiated Interaction for Vague Intent in Content Exploration

Yu Xie, Ying Qi

Recommendation feeds work well when people are simply browsing, and search works well when they can formulate a query. Between these two cases is a common but poorly supported state: users feel that their feed has become repetitive, yet cannot clearly specify what they want instead. We refer to this state as vague intent. We present Red-Rec, an AI-supported exploration interface for this middle ground. After a period of browsing, the system summarizes patterns in the current feed (e.g., dominant content categories and possible latent interests), offers clickable exploration options, asks at most one follow-up question, and then gradually blends new content into the feed. The design is motivated by a formative study which found that users often recognize feed staleness but struggle to articulate alternatives, suggesting the need for proactive and low-effort interaction. We evaluated Red-Rec in a mixed-design lab study against three comparison conditions: a passive feed, search, and a user-initiated chat interface. Compared with user-initiated chat, Red-Rec led to broader exploration, higher serendipity ratings, and lower interaction effort. Participants in the AI-initiated condition typed very little, relying mainly on option selection, whereas participants in the user-initiated chat condition typed substantially more. We discuss how proactive, option-based AI support can help users move beyond repetitive feeds without undermining their sense of control, and we outline design implications for recommendation interfaces that support open-ended exploration.

CVSep 24, 2025
A Versatile Foundation Model for AI-enabled Mammogram Interpretation

Fuxiang Huang, Jiayi Zhu, Yunfang Yu et al.

Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in training data, limited model generalizability, and a lack of comprehensive evaluation across clinically relevant tasks. Here, we introduce VersaMammo, a versatile foundation model for mammograms, designed to overcome these limitations. We curated the largest multi-institutional mammogram dataset to date, comprising 706,239 images from 21 sources. To improve generalization, we propose a two-stage pre-training strategy to develop VersaMammo, a mammogram foundation model. First, a teacher model is trained via self-supervised learning to extract transferable features from unlabeled mammograms. Then, supervised learning combined with knowledge distillation transfers both features and clinical knowledge into VersaMammo. To ensure a comprehensive evaluation, we established a benchmark comprising 92 specific tasks, including 68 internal tasks and 24 external validation tasks, spanning 5 major clinical task categories: lesion detection, segmentation, classification, image retrieval, and visual question answering. VersaMammo achieves state-of-the-art performance, ranking first in 50 out of 68 specific internal tasks and 20 out of 24 external validation tasks, with average ranks of 1.5 and 1.2, respectively. These results demonstrate its superior generalization and clinical utility, offering a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.

CVAug 16, 2025
SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress

Lingyun Zhang, Yu Xie, Yanwei Fu et al.

The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introduce a trade-off between safety and fidelity. Recent localization-based approaches have shown promise, yet their reliance on explicit "concept replacement" can sometimes lead to semantic incongruity. To address these limitations, we explore a more flexible detect-then-suppress paradigm. We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content. Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative. A key aspect of our work is a novel training strategy using Direct Preference Optimization (DPO). We leverage readily available, image-level preference data to train our module, enabling it to learn nuanced suppression behaviors and perform region-guided interventions at inference without requiring costly, pixel-level annotations. Extensive experiments show that SafeCtrl significantly outperforms state-of-the-art methods in both safety efficacy and fidelity preservation. Our findings suggest that decoupled, suppression-based control is a highly effective and scalable direction for building more responsible generative models.

CVJun 23, 2025
NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation

Yu Xie, Chengjie Zeng, Lingyun Zhang et al.

The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.
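
The iterative identify-and-replace loop of the PromptSan-Modify variant might look roughly like the sketch below. Everything concrete here is an assumption: the keyword-weight scorer stands in for a trained text NSFW classifier, and the attribution rule (score drop when a token is replaced), the threshold, and the replacement token are hypothetical choices, not details taken from the paper.

```python
# Hypothetical stand-in for a learned text NSFW classifier.
TOY_UNSAFE_WEIGHTS = {"gore": 0.6, "bloody": 0.5}

def toy_nsfw_score(tokens):
    """Return a score in [0, 1]; higher means more likely unsafe."""
    return min(1.0, sum(TOY_UNSAFE_WEIGHTS.get(t.lower(), 0.0) for t in tokens))

def sanitize(prompt, threshold=0.3, replacement="quiet", max_steps=10):
    """Repeatedly replace the token that contributes most to the unsafe score."""
    tokens = prompt.split()
    for _ in range(max_steps):
        score = toy_nsfw_score(tokens)
        if score < threshold:
            break
        # Attribute the score to tokens by measuring the drop when each is replaced.
        drops = [
            score - toy_nsfw_score(tokens[:i] + [replacement] + tokens[i + 1:])
            for i in range(len(tokens))
        ]
        worst = max(range(len(tokens)), key=lambda i: drops[i])
        tokens[worst] = replacement
    return " ".join(tokens)

sanitized = sanitize("a bloody battlefield at dawn")
print(sanitized)
```

The loop leaves the generator untouched and only edits the prompt, which matches the abstract's claim of not altering the model architecture.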

MTRL-SCIMay 13, 2025
Self-Optimizing Machine Learning Potential Assisted Automated Workflow for Highly Efficient Complex Systems Material Design

Jiaxiang Li, Junwei Feng, Jie Luo et al.

Machine learning interatomic potentials have revolutionized complex materials design by enabling rapid exploration of material configurational spaces via crystal structure prediction with ab initio accuracy. However, critical challenges persist in ensuring robust generalization to unknown structures and minimizing the requirement for substantial expert knowledge and time-consuming manual interventions. Here, we propose an automated crystal structure prediction framework built upon the attention-coupled neural networks potential to address these limitations. The generalizability of the potential is achieved by sampling regions across the local minima of the potential energy surface, where the self-evolving pipeline autonomously refines the potential iteratively while minimizing human intervention. The workflow is validated on Mg-Ca-H ternary and Be-P-N-O quaternary systems by exploring nearly 10 million configurations, demonstrating substantial speedup compared to first-principles calculations. These results underscore the effectiveness of our approach in accelerating the exploration and discovery of complex multi-component functional materials.

SIAug 25, 2021
How COVID-19 has Impacted American Attitudes Toward China: A Study on Twitter

Gavin Cook, Junming Huang, Yu Xie

Past research has studied social determinants of attitudes toward foreign countries. Confounded by potential endogeneity biases due to unobserved factors or reverse causality, the causal impact of these factors on public opinion is usually difficult to establish. Using social media data, we leverage the suddenness of the COVID-19 pandemic to examine whether a major global event has causally changed American views of another country. We collate a database of more than 297 million posts on the social media platform Twitter about China or COVID-19 up to June 2020, and we treat tweeting about COVID-19 as a proxy for individual awareness of COVID-19. Using regression discontinuity and difference-in-difference estimation, we find that awareness of COVID-19 causes a sharp rise in anti-China attitudes. Our work has implications for understanding how self-interest affects policy preference and how Americans view migrant communities.
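
For readers unfamiliar with difference-in-difference estimation, the arithmetic can be sketched in a few lines: the change in the treated group minus the change in the control group over the same period. The numbers below are synthetic and chosen only to make the arithmetic visible; they are not results from the study, and the function name is hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(Change in treated group) minus (change in control group)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Synthetic shares of negative posts per user group (illustrative only).
treated_pre, treated_post = [0.20, 0.22], [0.30, 0.32]
control_pre, control_post = [0.20, 0.20], [0.22, 0.24]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(round(effect, 4))
```

Subtracting the control group's change removes trends shared by both groups, under the usual parallel-trends assumption.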

LGApr 14, 2021
Multi-Party Dual Learning

Maoguo Gong, Yuan Gao, Yu Xie et al.

The performance of machine learning algorithms heavily relies on the availability of a large amount of training data. However, in reality, data usually reside in distributed parties such as different institutions and may not be directly gathered and integrated due to various data policy constraints. As a result, some parties may suffer from insufficient data available for training machine learning models. In this paper, we propose a multi-party dual learning (MPDL) framework to alleviate the problem of limited data with poor quality in an isolated party. Since the knowledge sharing processes for multiple parties always emerge in dual forms, we show that dual learning is naturally suitable to handle the challenge of missing data, and explicitly exploits the probabilistic correlation and structural relationship between dual tasks to regularize the training process. We introduce a feature-oriented differential privacy with mathematical proof, in order to avoid possible privacy leakage of raw features in the dual inference process. The approach requires minimal modifications to the existing multi-party learning structure, and each party can build flexible and powerful models separately, whose accuracy is no less than non-distributed self-learning approaches. The MPDL framework achieves significant improvement compared with state-of-the-art multi-party learning methods, as we demonstrated through simulations on real-world datasets.

LGApr 14, 2021
Towards Explainable Multi-Party Learning: A Contrastive Knowledge Sharing Framework

Yuan Gao, Jiawei Li, Maoguo Gong et al.

Multi-party learning provides solutions for training joint models with decentralized data under legal and practical constraints. However, traditional multi-party learning approaches are confronted with obstacles such as system heterogeneity, statistical heterogeneity, and incentive design. How to deal with these challenges and further improve the efficiency and performance of multi-party learning has become an urgent problem to be solved. In this paper, we propose a novel contrastive multi-party learning framework for knowledge refinement and sharing with an accountable incentive mechanism. Since the existing naive model parameter averaging method is contradictory to the learning paradigm of neural networks, we simulate the process of human cognition and communication, and frame multi-party learning as a many-to-one knowledge sharing problem. The approach is capable of integrating the acquired explicit knowledge of each client in a transparent manner without privacy disclosure, and it reduces the dependence on data distribution and communication environments. The proposed scheme achieves significant improvement in model performance in a variety of scenarios, as we demonstrated through experiments on several real-world datasets.

MMApr 2, 2021
An attention-based unsupervised adversarial model for movie review spam detection

Yuan Gao, Maoguo Gong, Yu Xie et al.

With the prevalence of the Internet, online reviews have become a valuable information resource for people. However, the authenticity of online reviews remains a concern, and deceptive reviews have become one of the most urgent network security problems to be solved. Review spam misleads users into making suboptimal choices and erodes their trust in online reviews. Most existing research relies on manually extracted features and labeled training samples, which are usually complicated and time-consuming to obtain. This paper focuses primarily on a neglected emerging domain, movie reviews, and develops a novel unsupervised spam detection model with an attention mechanism. By extracting the statistical features of reviews, it is revealed that users express their sentiments on different aspects of movies in reviews. An attention mechanism is introduced in the review embedding, and the conditional generative adversarial network is exploited to learn users' review style for different genres of movies. The proposed model is evaluated on movie reviews crawled from Douban, a Chinese online community where people can express their feelings about movies. The experimental results demonstrate the superior performance of the proposed approach.

CLDec 8, 2020
Large-scale Quantitative Evidence of Media Impact on Public Opinion toward China

Junming Huang, Gavin Cook, Yu Xie

Do mass media influence people's opinion of other countries? Using BERT, a deep neural network-based natural language processing model, we analyze a large corpus of 267,907 China-related articles published by The New York Times since 1970. We then compare our output from The New York Times to a longitudinal data set constructed from 101 cross-sectional surveys of the American public's views on China. We find that the reporting of The New York Times on China in one year explains 54% of the variance in American public opinion on China in the next. Our result confirms hypothesized links between media and public opinion and helps shed light on how mass media can influence public opinion of foreign countries.

COMP-PHDec 7, 2020
Multitask machine learning of collective variables for enhanced sampling of rare events

Lixin Sun, Jonathan Vandermause, Simon Batzner et al.

Computing accurate reaction rates is a central challenge in computational chemistry and biology because of the high cost of free energy estimation with unbiased molecular dynamics. In this work, a data-driven machine learning algorithm is devised to learn collective variables with a multitask neural network, where a common upstream part reduces the high dimensionality of atomic configurations to a low dimensional latent space, and separate downstream parts map the latent space to predictions of basin class labels and potential energies. The resulting latent space is shown to be an effective low-dimensional representation, capturing the reaction progress and guiding effective umbrella sampling to obtain accurate free energy landscapes. This approach is successfully applied to model systems including a 5D Müller Brown model, a 5D three-well model, and alanine dipeptide in vacuum. This approach enables automated dimensionality reduction for energy controlled reactions in complex systems, offers a unified framework that can be trained with limited data, and outperforms single-task learning approaches, including autoencoders.

COMP-PHAug 26, 2020
Bayesian Force Fields from Active Learning for Simulation of Inter-Dimensional Transformation of Stanene

Yu Xie, Jonathan Vandermause, Lixin Sun et al.

We present a way to dramatically accelerate Gaussian process models for interatomic force fields based on many-body kernels by mapping both forces and uncertainties onto functions of low-dimensional features. This allows for automated active learning of models combining near-quantum accuracy, built-in uncertainty, and constant cost of evaluation that is comparable to classical analytical models, capable of simulating millions of atoms. Using this approach, we perform large scale molecular dynamics simulations of the stability of the stanene monolayer. We discover an unusual phase transformation mechanism of 2D stanene, where ripples lead to nucleation of bilayer defects, densification into a disordered multilayer structure, followed by formation of bulk liquid at high temperature or nucleation and growth of the 3D bcc crystal at low temperature. The presented method opens possibilities for rapid development of fast accurate uncertainty-aware models for simulating long-time large-scale dynamics of complex materials.

LGNov 27, 2019
Multi-Range Attentive Bicomponent Graph Convolutional Network for Traffic Forecasting

Weiqi Chen, Ling Chen, Yu Xie et al.

Traffic forecasting is of great importance to transportation management and public safety, and is very challenging due to the complicated spatial-temporal dependency and essential uncertainty brought about by the road network and traffic conditions. Recent studies mainly focus on modeling the spatial dependency by utilizing graph convolutional networks (GCNs) throughout a fixed weighted graph. However, edges, i.e., the correlations between pair-wise nodes, are much more complicated and interact with each other. In this paper, we propose the Multi-Range Attentive Bicomponent GCN (MRA-BGCN), a novel deep learning model for traffic forecasting. We first build the node-wise graph according to the road network distance and the edge-wise graph according to various edge interaction patterns. Then, we implement the interactions of both nodes and edges using bicomponent graph convolution. The multi-range attention mechanism is introduced to aggregate information in different neighborhood ranges and automatically learn the importance of different ranges. Extensive experiments on two real-world road network traffic datasets, METR-LA and PEMS-BAY, show that our MRA-BGCN achieves state-of-the-art results.