Xin Yan

CV
h-index25
18papers
916citations
Novelty57%
AI Score60

18 Papers

56.5LGJun 3
STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models

Xin Yan, Aqiang Wang, Zhenglin Wan et al.

Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.

CVMar 8, 2023
Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Xin Yan, Zuchao Li, Lefei Zhang

Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT). Previous works can be pixel-based or token-based, using original pixels or discrete visual tokens from parametric tokenizer models, respectively. Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of the tokenizer model, which only takes seconds to create. This non-parametric centroid tokenizer only takes seconds to create and is faster for token inference. The centroids can represent both patch pixels and index tokens with the property of local invariance. Specifically, we adopt patch masking and centroid replacing strategies to construct corrupted inputs, and two stacked encoder blocks to predict corrupted patch tokens and reconstruct original patch pixels. Experiments show that our CCViT achieves 84.4% top-1 accuracy on ImageNet-1K classification with ViT-B and 86.0% with ViT-L. We also transfer our pre-trained model to other downstream tasks. Our approach achieves competitive results with recent baselines without external supervision and distillation training from other models.

IRAug 24, 2023
Out of the Box Thinking: Improving Customer Lifetime Value Modelling via Expert Routing and Game Whale Detection

Shijie Zhang, Xin Yan, Xuejiao Yang et al.

Customer lifetime value (LTV) prediction is essential for mobile game publishers trying to optimize the advertising investment for each user acquisition based on the estimated worth. In mobile games, deploying microtransactions is a simple yet effective monetization strategy, which attracts a tiny group of game whales who splurge on in-game purchases. The presence of such game whales may impede the practicality of existing LTV prediction models, since game whales' purchase behaviours always exhibit varied distribution from general users. Consequently, identifying game whales can open up new opportunities to improve the accuracy of LTV prediction models. However, little attention has been paid to applying game whale detection in LTV prediction, and existing works are mainly specialized for the long-term LTV prediction with the assumption that the high-quality user features are available, which is not applicable in the UA stage. In this paper, we propose ExpLTV, a novel multi-task framework to perform LTV prediction and game whale detection in a unified way. In ExpLTV, we first innovatively design a deep neural network-based game whale detector that can not only infer the intrinsic order in accordance with monetary value, but also precisely identify high spenders (i.e., game whales) and low spenders. Then, by treating the game whale detector as a gating network to decide the different mixture patterns of LTV experts assembling, we can thoroughly leverage the shared information and scenario-specific information (i.e., game whales modelling and low spenders modelling). Finally, instead of separately designing a purchase rate estimator for two tasks, we design a shared estimator that can preserve the inner task relationships. The superiority of ExpLTV is further validated via extensive experiments on three industrial datasets.

92.5IRApr 14
R3-VAE: Reference Vector-Guided Rating Residual Quantization VAE for Generative Recommendation

Qiang Wan, Ze Yang, Dawei Yang et al.

Generative Recommendation (GR) has gained traction for its merits of superior performance and cold-start capability. As the vital role in GR, Semantic Identifiers (SIDs) represent item semantics through discrete tokens. However, current techniques for SID generation based on vector quantization face two main challenges: (i) training instability, stemming from insufficient gradient propagation through the straight-through estimator and sensitivity to initialization; and (ii) inefficient SID quality assessment, where industrial practice still depends on costly GR training and A/B testing. To address these challenges, we propose Reference Vector-Guided Rating Residual Quantization VAE (R3-VAE). This framework incorporates three key innovations: (i) a reference vector that functions as a semantic anchor for the initial features, thereby mitigating sensitivity to initialization; (ii) a dot product-based rating mechanism designed to stabilize the training process and prevent codebook collapse; and (iii) two SID evaluation metrics, Semantic Cohesion and Preference Discrimination, serving as regularization terms during training. Empirical results on six benchmarks demonstrate that R3-VAE outperforms state-of-the-art methods, achieving an average improvement of 14.2% in Recall@10 and 15.5% in NDCG@10 across three Amazon datasets. Furthermore, we perform GR training and online A/B tests on Toutiao. Our method achieves a 1.62% improvement in MRR and a 0.83% gain in StayTime/U versus baselines. Additionally, we employ R3-VAE to replace the item ID of CTR model, resulting in significant improvements in content cold start by 15.36%, corroborating the strong applicability and business value in industry-scale recommendation scenarios.

60.4AIMar 28
AutoMS: Multi-Agent Evolutionary Search for Cross-Physics Inverse Microstructure Design

Zhenyuan Zhao, Yu Xing, Tianyang Xue et al.

Designing microstructures that satisfy coupled cross-physics objectives is a fundamental challenge in material science. This inverse design problem involves a vast, discontinuous search space where traditional topology optimization is computationally prohibitive, and deep generative models often suffer from "physical hallucinations," lacking the capability to ensure rigorous validity. To address this limitation, we introduce AutoMS, a multi-agent neuro-symbolic framework that reformulates inverse design as an LLM-driven evolutionary search. Unlike methods that treat LLMs merely as interfaces, AutoMS integrates them as "semantic navigators" to initialize search spaces and break local optima, while our novel Simulation-Aware Evolutionary Search (SAES) addresses the "blindness" of traditional evolutionary strategies. Specifically, SAES utilizes simulation feedback to perform local gradient approximation and directed parameter updates, effectively guiding the search toward physically valid Pareto frontiers. Orchestrating specialized agents (Manager, Parser, Generator, and Simulator), AutoMS achieves a state-of-the-art 83.8\% success rate on 17 diverse cross-physics tasks, nearly doubling the performance of traditional NSGA-II (43.7\%) and significantly outperforming ReAct-based LLM baselines (53.3\%). Furthermore, our hierarchical architecture reduces total execution time by 23.3\%. AutoMS demonstrates that autonomous agent systems can effectively navigate complex physical landscapes, bridging the gap between semantic design intent and rigorous physical validity.

CVJan 1
ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition

Feng-Qi Cui, Jinyang Huang, Sirui Zhao et al.

Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which finally weaken intra-class distributional structure and representation drift with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. Firstly, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Then, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.

AIJan 5
Towards Privacy-Preserving Mental Health Support with Large Language Models

Dong Xue, Jicheng Tu, Ming Wang et al.

Large language models (LLMs) have shown promise for mental health support, yet training such models is constrained by the scarcity and sensitivity of real counseling dialogues. In this article, we present MindChat, a privacy-preserving LLM for mental health support, together with MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework. To synthesize high-quality counseling data, the developed dialogue-construction framework employs a dual closed-loop feedback design to integrate psychological expertise and counseling techniques through role-playing: (i) turn-level critique-and-revision to improve coherence and counseling appropriateness within a session, and (ii) session-level strategy refinement to progressively enrich counselor behaviors across sessions. To mitigate privacy risks under decentralized data ownership, we fine-tune the base model using federated learning with parameter-efficient LoRA adapters and incorporate differentially private optimization to reduce membership and memorization risks. Experiments on synthetic-data quality assessment and counseling capability evaluation show that MindCorpus improves training effectiveness and that MindChat is competitive with existing general and counseling-oriented LLM baselines under both automatic LLM-judge and human evaluation protocols, while exhibiting reduced privacy leakage under membership inference attacks.

CVNov 14, 2025
Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis

Feng-Qi Cui, Jinyang Huang, Ziyu Jia et al.

Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

CVSep 24, 2025
Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao et al.

We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.

CVFeb 9, 2024
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Zhicheng Zheng, Xin Yan, Zhenfang Chen et al.

We introduce the Continuum Physical Dataset (ContPhy), a novel benchmark for assessing machine physical commonsense. ContPhy complements existing physical reasoning benchmarks by encompassing the inference of diverse physical properties, such as mass and density, across various scenarios and predicting corresponding dynamics. We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy, which shows that the current AI models still lack physical commonsense for the continuum, especially soft-bodies, and illustrates the value of the proposed dataset. We also introduce an oracle model (ContPRO) that marries the particle-based physical dynamic models with the recent large language models, which enjoy the advantages of both models, precise dynamic predictions, and interpretable reasoning. ContPhy aims to spur progress in perception and reasoning within diverse physical settings, narrowing the divide between human and machine intelligence in understanding the physical world. Project page: https://physical-reasoning-project.github.io

CVDec 2, 2024
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Xin Yan, Yuxuan Cai, Qiuyue Wang et al.

We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: https://presto-video.github.io/.

CVMay 2, 2024
Estimate the building height at a 10-meter resolution based on Sentinel data

Xin Yan

Building height is an important indicator for scientific research and practical application. However, building height products with a high spatial resolution (10m) are still very scarce. To meet the needs of high-resolution building height estimation models, this study established a set of spatial-spectral-temporal feature databases, combining SAR data provided by Sentinel-1, optical data provided by Sentinel-2, and shape data provided by building footprints. The statistical indicators on the time scale are extracted to form a rich database of 160 features. This study combined with permutation feature importance, Shapley Additive Explanations, and Random Forest variable importance, and the final stable features are obtained through an expert scoring system. This study took 12 large, medium, and small cities in the United States as the training data. It used moving windows to aggregate the pixels to solve the impact of SAR image displacement and building shadows. This study built a building height model based on a random forest model and compared three model ensemble methods of bagging, boosting, and stacking. To evaluate the accuracy of the prediction results, this study collected Lidar data in the test area, and the evaluation results showed that its R-Square reached 0.78, which can prove that the building height can be obtained effectively. The fast production of high-resolution building height data can support large-scale scientific research and application in many fields.

OCMar 2, 2025
DualMS: Implicit Dual-Channel Minimal Surface Optimization for Heat Exchanger Design

Weizheng Zhang, Hao Pan, Lin Lu et al.

Heat exchangers are critical components in a wide range of engineering applications, from energy systems to chemical processing, where efficient thermal management is essential. The design objectives for heat exchangers include maximizing the heat exchange rate while minimizing the pressure drop, requiring both a large interface area and a smooth internal structure. State-of-the-art designs, such as triply periodic minimal surfaces (TPMS), have proven effective in optimizing heat exchange efficiency. However, TPMS designs are constrained by predefined mathematical equations, limiting their adaptability to freeform boundary shapes. Additionally, TPMS structures do not inherently control flow directions, which can lead to flow stagnation and undesirable pressure drops. This paper presents DualMS, a novel computational framework for optimizing dual-channel minimal surfaces specifically for heat exchanger designs in freeform shapes. To the best of our knowledge, this is the first attempt to directly optimize minimal surfaces for two-fluid heat exchangers, rather than relying on TPMS. Our approach formulates the heat exchange maximization problem as a constrained connected maximum cut problem on a graph, with flow constraints guiding the optimization process. To address undesirable pressure drops, we model the minimal surface as a classification boundary separating the two fluids, incorporating an additional regularization term for area minimization. We employ a neural network that maps spatial points to binary flow types, enabling it to classify flow skeletons and automatically determine the surface boundary. DualMS demonstrates greater flexibility in surface topology compared to TPMS and achieves superior thermal performance, with lower pressure drops while maintaining a similar heat exchange rate under the same material cost.

HCOct 17, 2025
MedBuild AI: An Agent-Based Hybrid Intelligence Framework for Reshaping Agency in Healthcare Infrastructure Planning through Generative Design for Medical Architecture

Yiming Zhang, Yuejia Xu, Ziyao Wang et al.

Globally, disparities in healthcare infrastructure remain stark, leaving countless communities without access to even basic services. Traditional infrastructure planning is often slow and inaccessible, and although many architects are actively delivering humanitarian and aid-driven hospital projects worldwide, these vital efforts still fall far short of the sheer scale and urgency of demand. This paper introduces MedBuild AI, a hybrid-intelligence framework that integrates large language models (LLMs) with deterministic expert systems to rebalance the early design and conceptual planning stages. As a web-based platform, it enables any region with satellite internet access to obtain guidance on modular, low-tech, low-cost medical building designs. The system operates through three agents: the first gathers local health intelligence via conversational interaction; the second translates this input into an architectural functional program through rule-based computation; and the third generates layouts and 3D models. By embedding computational negotiation into the design process, MedBuild AI fosters a reciprocal, inclusive, and equitable approach to healthcare planning, empowering communities and redefining agency in global healthcare architecture.

LGApr 9, 2025
Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

Wei Chen, Xin Yan, Bin Wen et al.

Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO's hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.

CVMar 14, 2024
3D-VLA: A 3D Vision-Language-Action Generative World Model

Haoyu Zhen, Xiaowen Qiu, Peihao Chen et al.

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

LGNov 25, 2021
Clustering Effect of (Linearized) Adversarial Robust Models

Yang Bai, Xin Yan, Yong Jiang et al.

Adversarial robustness has received increasing attention along with the study of adversarial examples. So far, existing works show that robust models not only obtain robustness against various adversarial attacks but also boost the performance in some downstream tasks. However, the underlying mechanism of adversarial robustness is still not clear. In this paper, we interpret adversarial robustness from the perspective of linear components, and find that there exist some statistical properties for comprehensively robust models. Specifically, robust models show obvious hierarchical clustering effect on their linearized sub-networks, when removing or replacing all non-linear components (e.g., batch normalization, maximum pooling, or activation layers). Based on these observations, we propose a novel understanding of adversarial robustness and apply it on more tasks including domain adaption and robustness boosting. Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.

CVMar 14, 2020
Counterfactual Samples Synthesizing for Robust Visual Question Answering

Long Chen, Xin Yan, Jun Xiao et al.

Despite Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to the test set with different QA distributions. To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of targeted VQA model, and achieve dominating performance on VQA-CP. However, since the complexity of design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions. 2) question-sensitive: the model should be sensitive to the linguistic variations in question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. The CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (ie, the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. Particularly, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.