CHEM-PHJun 21, 2022Code
DeePKS+ABACUS as a Bridge between Expensive Quantum Mechanical Models and Machine Learning PotentialsWenfei Li, Qi Ou, Yixiao Chen et al.
Recently, the development of machine learning (ML) potentials has made it possible to perform large-scale and long-time molecular simulations with the accuracy of quantum mechanical (QM) models. However, for high-level QM methods, such as density functional theory (DFT) at the meta-GGA level and/or with exact exchange, quantum Monte Carlo, etc., generating a sufficient amount of data for training a ML potential has remained computationally challenging due to their high cost. In this work, we demonstrate that this issue can be largely alleviated with Deep Kohn-Sham (DeePKS), a ML-based DFT model. DeePKS employs a computationally efficient neural network-based functional model to construct a correction term added upon a cheap DFT model. Upon training, DeePKS offers closely-matched energies and forces compared with high-level QM method, but the number of training data required is orders of magnitude less than that required for training a reliable ML potential. As such, DeePKS can serve as a bridge between expensive QM models and ML potentials: one can generate a decent amount of high-accuracy QM data to train a DeePKS model, and then use the DeePKS model to label a much larger amount of configurations to train a ML potential. This scheme for periodic systems is implemented in a DFT package ABACUS, which is open-source and ready for use in various applications.
97.0LGJun 3
Flash-WAM: Modality-Aware Distillation for World Action ModelsArman Akbari, Ci Zhang, Arash Akbari et al.
World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce \textbf{Flash-WAM}, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\%$ RoboTwin 2.0, $95.7\%$ LIBERO) and substantially recovers real-world performance ($60\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\%$ at the same step budget.
96.2CVMay 25
STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language ModelsYiming Liang, Yixiao Chen, Yiyang Zhou et al.
Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORMS (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORMS aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORMS performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORMS improves video reasoning accuracy while substantially reducing inference overhead compared with tool or video-generation-based reasoning pipelines.
86.5ROApr 17
Human Cognition in Machines: A Unified Perspective of World ModelsTimothy Rupprecht, Pu Zhao, Amir Taherin et al.
This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human-like" cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.
LGOct 26, 2023
Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient DescentYixiao Chen, Hao Xie, Han Wang
We propose a new algorithm for efficiently solving the damped Fisher matrix in large-scale scenarios where the number of parameters significantly exceeds the number of available samples. This problem is fundamental for natural gradient descent and stochastic reconfiguration. Our algorithm is based on Cholesky decomposition and is generally applicable. Benchmark results show that the algorithm is significantly faster than existing methods.
CLDec 2, 2025
DeepSeek-V3.2: Pushing the Frontier of Open Large Language ModelsDeepSeek-AI, Aixin Liu, Aoxue Mei et al.
We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
SEDec 24, 2025
AInsteinBench: Benchmarking Coding Agents on Scientific RepositoriesTitouan Duston, Shuo Xin, Yang Sun et al.
We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks which focus on conceptual knowledge, or software engineering benchmarks that emphasize generic feature implementation and issue resolving, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All benchmark tasks are carefully curated through multi-stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well-calibrated difficulty. By leveraging evaluation in executable environments, scientifically meaningful failure modes, and test-driven verification, AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
CLFeb 4, 2025Code
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level RoutingWenhao Zheng, Yixiao Chen, Weitong Zhang et al.
Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs \& LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut which significantly reduces the costs of the reward estimation and improving the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications. Our data and code are available at https://github.com/aiming-lab/CITER.
75.6CVMay 19
ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action ModelsArash Akbari, Arman Akbari, Masih Eskandar et al.
Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute make deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on $π_{0.5}$. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $π_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5$\times$.
CHEM-PHNov 3, 2025
Spin-Adapted Neural Network Wavefunctions in Real SpaceRuichen Li, Yuzhi Liu, Du Jiang et al.
Spin plays a fundamental role in understanding electronic structure, yet many real-space wavefunction methods fail to adequately consider it. We introduce the Spin-Adapted Antisymmetrization Method (SAAM), a general procedure that enforces exact total spin symmetry for antisymmetric many-electron wavefunctions in real space. In the context of neural network-based quantum Monte Carlo (NNQMC), SAAM leverages the expressiveness of deep neural networks to capture electron correlation while enforcing exact spin adaptation via group representation theory. This framework provides a principled route to embed physical priors into otherwise black-box neural network wavefunctions, yielding a compact representation of correlated system with neural network orbitals. Compared with existing treatments of spin in NNQMC, SAAM is more accurate and efficient, achieving exact spin purity without any additional tunable hyperparameters. To demonstrate its effectiveness, we apply SAAM to study the spin ladder of iron-sulfur clusters, a long-standing challenge for many-body methods due to their dense spectrum of nearly degenerate spin states. Our results reveal accurate resolution of low-lying spin states and spin gaps in [Fe$_2$S$_2$] and [Fe$_4$S$_4$] clusters, offering new insights into their electronic structures. In sum, these findings establish SAAM as a robust, hyperparameter-free standard for spin-adapted NNQMC, particularly for strongly correlated systems.
45.8AIMay 6
Agentic Discovery of Exchange-Correlation Density FunctionalsTitouan Duston, Jiashu Liang, Yuanheng Wang et al.
The development of accurate exchange-correlation (XC) functionals remains a longstanding challenge in density functional theory (DFT). The vast majority of XC functionals have been hand designed by human researchers combining physical insight, exact constraints, and empirical fitting. Recent advances in large language models enable a systematic, automated alternative to this human-driven design loop. This report presents an agentic search system in which an LLM proposes structured functional-form changes guided by evolutionary history. The system attempts to improve functional performance through an iterative plan-execute-summarize loop, where improvements are measurable by optimizing functional parameters against a standard thermochemistry dataset, then evaluating performance on a held-out subset. The strongest discovered functional, SAFS26-a (Seed Agentic Functional Search 2026), improves upon the gold-standard ωB97M-V baseline by ~9%. These results also surface a cautionary lesson for AI-assisted science: models powerful enough to discover genuine improvements are equally capable of exploiting unphysical shortcuts to game the benchmark; domain expertise translated into explicitly enforced constraints remains essential to keeping results scientifically grounded.
CVFeb 4, 2025Code
Dual-Flow: Transferable Multi-Target, Instance-Agnostic Attacks via In-the-wild Cascading Flow OptimizationYixiao Chen, Shikun Sun, Jianshu Li et al.
Adversarial attacks are widely used to evaluate model robustness, and in black-box scenarios, the transferability of these attacks becomes crucial. Existing generator-based attacks have excellent generalization and transferability due to their instance-agnostic nature. However, when training generators for multi-target tasks, the success rate of transfer attacks is relatively low due to the limitations of the model's capacity. To address these challenges, we propose a novel Dual-Flow framework for multi-target instance-agnostic adversarial attacks, utilizing Cascading Distribution Shift Training to develop an adversarial velocity function. Extensive experiments demonstrate that Dual-Flow significantly improves transferability over previous multi-target generative attacks. For example, it increases the success rate from Inception-v3 to ResNet-152 by 34.58\%. Furthermore, our attack method shows substantially stronger robustness against defense mechanisms, such as adversarially trained models. The code of Dual-Flow is available at: $\href{https://github.com/Chyxx/Dual-Flow}{https://github.com/Chyxx/Dual-Flow}$.
CVSep 20, 2025Code
Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?Xin Chen, Jia He, Maozheng Li et al.
Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs' capabilities in road topology understanding. Specifically, multi-view images are projected into unified ground-plane coordinate system and fused into bird's-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy in some tasks, they still fail in some temporal questions that humans can answer (e.g., GPT-4o achieve only 67.8% in vector, a two-class classification problem). Furthermore, we find open-source VLMs, even at 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that the model's capability is positively correlated with model size, length of reasoning tokens and shots provided as examples, showing direction for future research.
GRDec 1, 2025
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and TrackingHanzhi Guo, Dongdong Weng, Mo Su et al.
Topology-consistent dynamic model sequences are essential for applications such as animation and model editing. However, existing 4D reconstruction methods face challenges in generating high-quality topology-consistent meshes. To address this, we propose a topology-aware dynamic reconstruction framework based on Gaussian Splatting. We introduce a Gaussian topological structure that explicitly encodes spatial connectivity. This structure enables topology-aware densification and pruning, preserving the manifold consistency of the Gaussian representation. Temporal regularization terms further ensure topological coherence over time, while differentiable mesh rasterization improves mesh quality. Experimental results demonstrate that our method reconstructs topology-consistent mesh sequences with significantly higher accuracy than existing approaches. Moreover, the resulting meshes enable precise 3D keypoint tracking. Project page: https://haza628.github.io/tagSplat/
LGMar 3
From Complex Dynamics to DynFormer: Rethinking Transformers for PDEsPengyu Lai, Yixiao Chen, Dewu Yang et al.
Partial differential equations (PDEs) are fundamental for modeling complex physical systems, yet classical numerical solvers face prohibitive computational costs in high-dimensional and multi-scale regimes. While Transformer-based neural operators have emerged as powerful data-driven alternatives, they conventionally treat all discretized spatial points as uniform, independent tokens. This monolithic approach ignores the intrinsic scale separation of physical fields, applying computationally prohibitive global attention that redundantly mixes smooth large-scale dynamics with high-frequency fluctuations. Rethinking Transformers through the lens of complex dynamics, we propose DynFormer, a novel dynamics-informed neural operator. Rather than applying a uniform attention mechanism across all scales, DynFormer explicitly assigns specialized network modules to distinct physical scales. It leverages a Spectral Embedding to isolate low-frequency modes, enabling a Kronecker-structured attention mechanism to efficiently capture large-scale global interactions with reduced complexity. Concurrently, we introduce a Local-Global-Mixing transformation. This module utilizes nonlinear multiplicative frequency mixing to implicitly reconstruct the small-scale, fast-varying turbulent cascades that are slaved to the macroscopic state, without incurring the cost of global attention. Integrating these modules into a hybrid evolutionary architecture ensures robust long-term temporal stability. Extensive memory-aligned evaluations across four PDE benchmarks demonstrate that DynFormer achieves up to a 95% reduction in relative error compared to state-of-the-art baselines, while significantly reducing GPU memory consumption. Our results establish that embedding first-principles physical dynamics into Transformer architectures yields a highly scalable, theoretically grounded blueprint for PDE surrogate modeling.
LGFeb 10
Effective MoE-based LLM Compression by Exploiting Heterogeneous Inter-Group Experts Routing Frequency and Information DensityZhendong Mi, Yixiao Chen, Pu Zhao et al.
Mixture-of-Experts (MoE) based Large Language Models (LLMs) have achieved superior performance, yet the massive memory overhead caused by storing multiple expert networks severely hinders their practical deployment. Singular Value Decomposition (SVD)-based compression has emerged as a promising post-training technique; however, most existing methods apply uniform rank allocation or rely solely on static weight properties. This overlooks the substantial heterogeneity in expert utilization observed in MoE models, where frequent routing patterns and intrinsic information density vary significantly across experts. In this work, we propose RFID-MoE, an effective framework for MoE compression by exploiting heterogeneous Routing Frequency and Information Density. We first introduce a fused metric that combines expert activation frequency with effective rank to measure expert importance, adaptively allocating higher ranks to critical expert groups under a fixed budget. Moreover, instead of discarding compression residuals, we reconstruct them via a parameter-efficient sparse projection mechanism to recover lost information with minimal parameter overhead. Extensive experiments on representative MoE LLMs (e.g., Qwen3, DeepSeekMoE) across multiple compression ratios demonstrate that RFID-MoE consistently outperforms state-of-the-art methods like MoBE and D2-MoE. Notably, RFID-MoE achieves a perplexity of 16.92 on PTB with the Qwen3-30B model at a 60% compression ratio, reducing perplexity by over 8.0 compared to baselines, and improves zero-shot accuracy on HellaSwag by approximately 8%.
CVFeb 21, 2025
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion TransformersMin Zhao, Guande He, Yixiao Chen et al.
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation by minimal fine-tuning without long videos. Project page and codes: https://riflex-video.github.io/.
CVJan 5, 2025
Unsupervised Search for Ethnic Minorities' Medical Segmentation Training SetYixiao Chen, Yue Yao, Ruining Yang et al.
This article investigates the critical issue of dataset bias in medical imaging, with a particular emphasis on racial disparities caused by uneven population distribution in dataset collection. Our analysis reveals that medical segmentation datasets are significantly biased, primarily influenced by the demographic composition of their collection sites. For instance, Scanning Laser Ophthalmoscopy (SLO) fundus datasets collected in the United States predominantly feature images of White individuals, with minority racial groups underrepresented. This imbalance can result in biased model performance and inequitable clinical outcomes, particularly for minority populations. To address this challenge, we propose a novel training set search strategy aimed at reducing these biases by focusing on underrepresented racial groups. Our approach utilizes existing datasets and employs a simple greedy algorithm to identify source images that closely match the target domain distribution. By selecting training data that aligns more closely with the characteristics of minority populations, our strategy improves the accuracy of medical segmentation models on specific minorities, i.e., Black. Our experimental results demonstrate the effectiveness of this approach in mitigating bias. We also discuss the broader societal implications, highlighting how addressing these disparities can contribute to more equitable healthcare outcomes.
CVMay 18, 2025
Context-Aware Autoregressive Models for Multi-Conditional Image GenerationYixiao Chen, Zhiyuan Ma, Guoli Jia et al.
Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence--offering a concise solution for multi-conditional image generation tasks. In this work, we propose $\textbf{ContextAR}$, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduces computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions during inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, and show that the competitive perpormance than diffusion-based multi-conditional control approaches the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: $\href{https://context-ar.github.io/}{https://context-ar.github.io/.}$
IVApr 5, 2025
Performance Analysis of Deep Learning Models for Femur Segmentation in MRI ScanMengyuan Liu, Yixiao Chen, Anning Tian et al.
Convolutional neural networks like U-Net excel in medical image segmentation, while attention mechanisms and KAN enhance feature extraction. Meta's SAM 2 uses Vision Transformers for prompt-based segmentation without fine-tuning. However, biases in these models impact generalization with limited data. In this study, we systematically evaluate and compare the performance of three CNN-based models, i.e., U-Net, Attention U-Net, and U-KAN, and one transformer-based model, i.e., SAM 2 for segmenting femur bone structures in MRI scan. The dataset comprises 11,164 MRI scans with detailed annotations of femoral regions. Performance is assessed using the Dice Similarity Coefficient, which ranges from 0.932 to 0.954. Attention U-Net achieves the highest overall scores, while U-KAN demonstrated superior performance in anatomical regions with a smaller region of interest, leveraging its enhanced learning capacity to improve segmentation accuracy.
ROOct 3, 2025
A Trajectory Generator for High-Density Traffic and Diverse Agent-Interaction ScenariosRuining Yang, Yi Xu, Yixiao Chen et al.
Accurate trajectory prediction is fundamental to autonomous driving, as it underpins safe motion planning and collision avoidance in complex environments. However, existing benchmark datasets suffer from a pronounced long-tail distribution problem, with most samples drawn from low-density scenarios and simple straight-driving behaviors. This underrepresentation of high-density scenarios and safety critical maneuvers such as lane changes, overtaking and turning is an obstacle to model generalization and leads to overly optimistic evaluations. To address these challenges, we propose a novel trajectory generation framework that simultaneously enhances scenarios density and enriches behavioral diversity. Specifically, our approach converts continuous road environments into a structured grid representation that supports fine-grained path planning, explicit conflict detection, and multi-agent coordination. Built upon this representation, we introduce behavior-aware generation mechanisms that combine rule-based decision triggers with Frenet-based trajectory smoothing and dynamic feasibility constraints. This design allows us to synthesize realistic high-density scenarios and rare behaviors with complex interactions that are often missing in real data. Extensive experiments on the large-scale Argoverse 1 and Argoverse 2 datasets demonstrate that our method significantly improves both agent density and behavior diversity, while preserving motion realism and scenario-level safety. Our synthetic data also benefits downstream trajectory prediction models and enhances performance in challenging high-density scenarios.
AISep 30, 2025
Collaborative Compression for Large-Scale MoE Deployment on EdgeYixiao Chen, Yanyue Xie, Ruining Yang et al.
The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs). It increases model capacity while keeping computation cost low. However, the ultra-large MoE models still have hundreds of billions of parameters, requiring massive memory/storage and leading to difficulties for deployment on resource-constrained edge platforms. Pruning or quantization alone can hardly address the issue, because of the super-aggressive compression ratio with significantly degraded accuracy and output quality. To facilitate the deployment of ultra-large MoEs on edge platforms, we propose a collaborative compression framework by combining expert pruning, mixed-precision quantization, and activation optimization. It can effectively reduce the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3TB to 103GB, while preserving high output quality with better accuracy than traditional uniform low-bit quantization methods. To the best of our knowledge, we are the first to deploy a compressed model from the ultra-large DeepSeek-V3 on the platform with a strict 128GB total memory limit. Our comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of our method with smaller model sizes and higher accuracy than uniform low-bit quantization methods.
CVSep 28, 2025
From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous DrivingYixiao Chen, Ruining Yang, Xin Chen et al.
The key to achieving autonomous driving lies in topology-aware perception, the structured understanding of the driving environment with an emphasis on lane topology and road semantics. This survey systematically reviews four core research directions under this theme: vectorized map construction, topological structure modeling, prior knowledge fusion, and language model-based perception. Across these directions, we observe a unifying trend: a paradigm shift from static, pre-built maps to dynamic, sensor-driven perception. Specifically, traditional static maps have provided semantic context for autonomous systems. However, they are costly to construct, difficult to update in real time, and lack generalization across regions, limiting their scalability. In contrast, dynamic representations leverage on-board sensor data for real-time map construction and topology reasoning. Each of the four research directions contributes to this shift through compact spatial modeling, semantic relational reasoning, robust domain knowledge integration, and multimodal scene understanding powered by pre-trained language models. Together, they pave the way for more adaptive, scalable, and explainable autonomous driving systems.
LGAug 19, 2025
DyMixOp: Guiding Neural Operator Design for PDEs from a Complex Dynamics Perspective with Local-Global-MixingPengyu Lai, Yixiao Chen, Hui Xu
A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) is transforming these systems into a suitable format, especially when dealing with non-linearizable dynamics or the need for infinite-dimensional spaces for linearization. This paper introduces DyMixOp, a novel neural operator framework for PDEs that integrates insights from complex dynamical systems to address this challenge. Grounded in inertial manifold theory, DyMixOp transforms infinite-dimensional nonlinear PDE dynamics into a finite-dimensional latent space, establishing a structured foundation that maintains essential nonlinear interactions and enhances physical interpretability. A key innovation is the Local-Global-Mixing (LGM) transformation, inspired by convection dynamics in turbulence. This transformation effectively captures both fine-scale details and nonlinear interactions, while mitigating spectral bias commonly found in existing neural operators. The framework is further strengthened by a dynamics-informed architecture that connects multiple LGM layers to approximate linear and nonlinear dynamics, reflecting the temporal evolution of dynamical systems. Experimental results across diverse PDE benchmarks demonstrate that DyMixOp achieves state-of-the-art performance, significantly reducing prediction errors, particularly in convection-dominated scenarios reaching up to 86.7\%, while maintaining computational efficiency and scalability.
GRJul 24, 2025
PS-GS: Gaussian Splatting for Multi-View Photometric StereoYixiao Chen, Bin Liang, Hanzhi Guo et al.
Integrating inverse rendering with multi-view photometric stereo (MVPS) yields more accurate 3D reconstructions than the inverse rendering approaches that rely on fixed environment illumination. However, efficient inverse rendering with MVPS remains challenging. To fill this gap, we introduce the Gaussian Splatting for Multi-view Photometric Stereo (PS-GS), which efficiently and jointly estimates the geometry, materials, and lighting of the object that is illuminated by diverse directional lights (multi-light). Our method first reconstructs a standard 2D Gaussian splatting model as the initial geometry. Based on the initialization model, it then proceeds with the deferred inverse rendering by the full rendering equation containing a lighting-computing multi-layer perceptron. During the whole optimization, we regularize the rendered normal maps by the uncalibrated photometric stereo estimated normals. We also propose the 2D Gaussian ray-tracing for single directional light to refine the incident lighting. The regularizations and the use of multi-view and multi-light images mitigate the ill-posed problem of inverse rendering. After optimization, the reconstructed object can be used for novel-view synthesis, relighting, and material and shape editing. Experiments on both synthetic and real datasets demonstrate that our method outperforms prior works in terms of reconstruction accuracy and computational efficiency.
CLMay 30, 2025
Verbal Werewolf: Engage Users with Verbalized Agentic Werewolf Game FrameworkQihui Fan, Wenbo Li, Enfu Nan et al.
The growing popularity of social deduction games has created an increasing need for intelligent frameworks where humans can collaborate with AI agents, particularly in post-pandemic contexts with heightened psychological and social pressures. Social deduction games like Werewolf, traditionally played through verbal communication, present an ideal application for Large Language Models (LLMs) given their advanced reasoning and conversational capabilities. Prior studies have shown that LLMs can outperform humans in Werewolf games, but their reliance on external modules introduces latency that left their contribution in academic domain only, and omit such game should be user-facing. We propose \textbf{Verbal Werewolf}, a novel LLM-based Werewolf game system that optimizes two parallel pipelines: gameplay powered by state-of-the-art LLMs and a fine-tuned Text-to-Speech (TTS) module that brings text output to life. Our system operates in near real-time without external decision-making modules, leveraging the enhanced reasoning capabilities of modern LLMs like DeepSeek V3 to create a more engaging and anthropomorphic gaming experience that significantly improves user engagement compared to existing text-only frameworks.
GRMar 7, 2025
STGA: Selective-Training Gaussian Head AvatarsHanzhi Guo, Yixiao Chen, Dongye Xiaonuo et al.
We propose selective-training Gaussian head avatars (STGA) to enhance the details of dynamic head Gaussian. The dynamic head Gaussian model is trained based on the FLAME parameterized model. Each Gaussian splat is embedded within the FLAME mesh to achieve mesh-based animation of the Gaussian model. Before training, our selection strategy calculates the 3D Gaussian splat to be optimized in each frame. The parameters of these 3D Gaussian splats are optimized in the training of each frame, while those of the other splats are frozen. This means that the splats participating in the optimization process differ in each frame, to improve the realism of fine details. Compared with network-based methods, our method achieves better results with shorter training time. Compared with mesh-based methods, our method produces more realistic details within the same training time. Additionally, the ablation experiment confirms that our method effectively enhances the quality of details.
COMP-PHAug 1, 2020
DeePKS: a comprehensive data-driven approach towards chemically accurate density functional theoryYixiao Chen, Linfeng Zhang, Han Wang et al.
We propose a general machine learning-based framework for building an accurate and widely-applicable energy functional within the framework of generalized Kohn-Sham density functional theory. To this end, we develop a way of training self-consistent models that are capable of taking large datasets from different systems and different kinds of labels. We demonstrate that the functional that results from this training procedure gives chemically accurate predictions on energy, force, dipole, and electron density for a large class of molecules. It can be continuously improved when more and more data are available.
COMP-PHNov 27, 2019
Deep Density: circumventing the Kohn-Sham equations via symmetry preserving neural networksLeonardo Zepeda-Núñez, Yixiao Chen, Jiefu Zhang et al.
The recently developed Deep Potential [Phys. Rev. Lett. 120, 143001, 2018] is a powerful method to represent general inter-atomic potentials using deep neural networks. The success of Deep Potential rests on the proper treatment of locality and symmetry properties of each component of the network. In this paper, we leverage its network structure to effectively represent the mapping from the atomic configuration to the electron density in Kohn-Sham density function theory (KS-DFT). By directly targeting at the self-consistent electron density, we demonstrate that the adapted network architecture, called the Deep Density, can effectively represent the electron density as the linear combination of contributions from many local clusters. The network is constructed to satisfy the translation, rotation, and permutation symmetries, and is designed to be transferable to different system sizes. We demonstrate that using a relatively small number of training snapshots, Deep Density achieves excellent performance for one-dimensional insulating and metallic systems, as well as systems with mixed insulating and metallic characters. We also demonstrate its performance for real three-dimensional systems, including small organic molecules, as well as extended systems such as water (up to $512$ molecules) and aluminum (up to $256$ atoms).