Chenxi Song

CV
h-index: 15
9 papers
85 citations
Novelty: 51%
AI Score: 55

9 Papers

94.7 · CV · Mar 26 · Code
Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction

Jiahao Tian, Chenxi Song, Wei Cheng et al.

Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose FreeLOC, a novel training-free, layer-adaptive framework that introduces two core techniques: Video-based Relative Position Re-encoding (VRPR) for frame-level relative position O.O.D, a multi-granularity strategy that hierarchically re-encodes temporal relative positions to align with the model's pre-trained distribution, and Tiered Sparse Attention (TSA) for context-length O.O.D, which preserves both local detail and long-range dependencies by structuring attention density across different temporal scales. Crucially, we introduce a layer-adaptive probing mechanism that identifies the sensitivity of each transformer layer to these O.O.D issues, allowing for the selective and efficient application of our methods. Extensive experiments demonstrate that our approach significantly outperforms existing training-free methods, achieving state-of-the-art results in both temporal consistency and visual quality. Code is available at https://github.com/Westlake-AGI-Lab/FreeLOC.

86.7 · CV · Mar 12 · Code
DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling

Tong Zhao, Mingkun Lei, Liangyu Yuan et al.

Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight
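The idea of reusing historical gradients with per-step weights can be illustrated with a generic two-step linear multistep rule. This is a minimal sketch under stated assumptions, not DyWeight's learned parameterization: the function name `multistep_solve` and the fixed `weights` tuple are hypothetical, and the weights here are constants rather than learned time-varying parameters. It shows only that classical Adams-Bashforth weights sum to 1, while unconstrained weights that sum to some other value scale the effective step size.

```python
import math

def multistep_solve(f, x0, t0, t1, steps, weights):
    """Integrate dx/dt = f(t, x) with a 2-step linear multistep rule.

    weights = (w0, w1) multiply the current and the previous gradient.
    Classical Adams-Bashforth 2 uses (1.5, -0.5), which sum to 1.
    Hypothetical helper for illustration only.
    """
    h = (t1 - t0) / steps
    x, t = x0, t0
    prev_g = None
    for _ in range(steps):
        g = f(t, x)
        if prev_g is None:
            # No history yet: fall back to a plain Euler step.
            x_next = x + h * g
        else:
            w0, w1 = weights
            # Aggregate the current and previous gradient.
            x_next = x + h * (w0 * g + w1 * prev_g)
        prev_g = g
        x, t = x_next, t + h
    return x

# Classical coefficients on dx/dt = -x track exp(-t) closely.
ab2 = multistep_solve(lambda t, x: -x, 1.0, 0.0, 1.0, 10, (1.5, -0.5))

# With a constant gradient, weights summing to 0.8 shrink every
# multistep update to 0.8 * h, i.e. a smaller effective step size.
scaled = multistep_solve(lambda t, x: 1.0, 0.0, 0.0, 1.0, 10, (1.2, -0.4))
```

In the second call the first step is a full Euler step of size 0.1 and the remaining nine steps each advance by 0.08, which is why relaxing the sum-to-one constraint acts like an implicit rescaling of time.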

AI · Mar 4, 2025 · Code
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Wenjia Jiang, Yangyang Zhuang, Chenxi Song et al.

Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM-based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step-by-step reasoning in LLM-based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule-based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high-level actions that act as shortcuts, replacing these low-level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open-sourced to support further research.
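The step of finding repetitive action sequences in an execution history and replacing them with a shortcut can be sketched with simple n-gram counting. This is an illustrative assumption about how such mining might look, not the AppAgentX implementation: the function names, the action strings, and the fixed sequence length are all hypothetical.

```python
from collections import Counter

def frequent_sequences(history, length, min_count):
    """Return action n-grams of the given length seen at least min_count times.

    history is a list of episodes; each episode is a list of action names.
    """
    counts = Counter()
    for episode in history:
        for i in range(len(episode) - length + 1):
            counts[tuple(episode[i:i + length])] += 1
    return {seq for seq, n in counts.items() if n >= min_count}

def compress(episode, shortcuts):
    """Replace each occurrence of a known sequence with its shortcut name.

    shortcuts maps a tuple of low-level actions to one high-level action.
    """
    out, i = [], 0
    while i < len(episode):
        for seq, name in shortcuts.items():
            if tuple(episode[i:i + len(seq)]) == seq:
                out.append(name)
                i += len(seq)
                break
        else:
            # No shortcut starts here: keep the low-level action.
            out.append(episode[i])
            i += 1
    return out

# Hypothetical execution history with one recurring three-step pattern.
history = [
    ["tap_search", "type_query", "tap_go", "scroll"],
    ["open_app", "tap_search", "type_query", "tap_go"],
]
repeated = frequent_sequences(history, length=3, min_count=2)
shortcuts = {seq: "search" for seq in repeated}
shortened = compress(history[0], shortcuts)
```

Here the first episode shrinks from four low-level actions to the high-level `search` action followed by `scroll`.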

CL · Sep 20, 2024
Transfer Learning with Clinical Concept Embeddings from Large Language Models

Yuhe Gao, Runxue Bao, Yuelyu Ji et al.

Knowledge sharing is crucial in healthcare, especially when leveraging data from multiple clinical sites to address data scarcity, reduce costs, and enable timely interventions. Transfer learning can facilitate cross-site knowledge transfer, but a major challenge is heterogeneity in clinical concepts across different sites. Large Language Models (LLMs) show significant potential for capturing the semantic meaning of clinical concepts and reducing heterogeneity. This study analyzed electronic health records from two large healthcare systems to assess the impact of semantic embeddings from LLMs on local, shared, and transfer learning models. Results indicate that domain-specific LLMs, such as Med-BERT, consistently outperform in local and direct transfer scenarios, while generic models like OpenAI embeddings require fine-tuning for optimal performance. However, excessive tuning of models with biomedical embeddings may reduce effectiveness, emphasizing the need for balance. This study highlights the importance of domain-specific embeddings and careful model tuning for effective knowledge transfer in healthcare.

15.7 · CV · Apr 24
Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals

Zhiwei Wei, Chenxi Song, Tazhu Wang et al.

Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.

CV · Jun 5, 2025
FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing

Guangzhao Li, Yanming Yang, Chenxi Song et al.

Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.

GR · Sep 18, 2025
WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Chenxi Song, Yanming Yang, Tong Zhao et al.

Recent video diffusion models show immense potential for spatial intelligence tasks due to their rich world priors, but this is undermined by limited controllability, poor spatial-temporal consistency, and entangled scene-camera dynamics. Existing solutions, such as model fine-tuning and warping-based repainting, struggle with scalability, generalization, and robustness against artifacts. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. 1) Intra-Step Recursive Refinement injects fine-grained trajectory guidance at denoising steps through a recursive correction loop, ensuring motion remains aligned with the target path. 2) Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. 3) Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Our framework is plug-and-play and model-agnostic, enabling broad applicability across various 3D/4D tasks. Extensive experiments demonstrate that our method achieves state-of-the-art performance in trajectory adherence, geometric consistency, and perceptual quality, outperforming both training-intensive and inference-only baselines.

LGFeb 3, 2024
Online Transfer Learning for RSV Case Detection

Yiming Sun, Yuhe Gao, Runxue Bao et al.

Transfer learning has become a pivotal technique in machine learning and has proven to be effective in various real-world applications. However, utilizing this technique for classification tasks with sequential data often faces challenges, primarily attributed to the scarcity of class labels. To address this challenge, we introduce Multi-Source Adaptive Weighting (MSAW), an online multi-source transfer learning method. MSAW integrates a dynamic weighting mechanism into an ensemble framework, enabling automatic adjustment of weights based on the relevance and contribution of each source (representing historical knowledge) and target model (learning from newly acquired data). We demonstrate the effectiveness of MSAW by applying it to detect Respiratory Syncytial Virus cases within Emergency Department visits, utilizing multiple years of electronic health records from the University of Pittsburgh Medical Center. Our method demonstrates performance improvements over many baselines, including refining pre-trained models with online learning as well as three static weighting approaches, showing MSAW's capacity to integrate historical knowledge with progressively accumulated new data. This study indicates the potential of online transfer learning in healthcare, particularly for developing machine learning models that dynamically adapt to evolving situations where new data is incrementally accumulated.
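A dynamic weighting mechanism over source and target models can be illustrated with a standard multiplicative-weights (Hedge-style) update. This is a swapped-in generic technique used as a sketch, not the MSAW weighting rule itself, which the abstract does not specify. The function names, the learning rate `eta`, and the example probabilities are assumptions.

```python
import math

def update_weights(weights, probs, label, eta=1.0):
    """Multiplicative-weights update after observing one labeled example.

    probs[i] is model i's predicted probability of the positive class;
    label is 0 or 1. Models with higher log loss lose relative weight.
    """
    eps = 1e-12
    new = []
    for w, p in zip(weights, probs):
        p_true = p if label == 1 else 1.0 - p
        loss = -math.log(max(p_true, eps))
        new.append(w * math.exp(-eta * loss))
    total = sum(new)
    # Renormalize so the weights remain a convex combination.
    return [w / total for w in new]

def ensemble_predict(weights, probs):
    """Weighted average of the individual model probabilities."""
    return sum(w * p for w, p in zip(weights, probs))

# Two hypothetical models: a source model and a target model.
weights = [0.5, 0.5]
probs = [0.9, 0.2]           # each model's P(positive) for one visit
before = ensemble_predict(weights, probs)
weights = update_weights(weights, probs, label=1)
after = ensemble_predict(weights, probs)
```

After a positive label arrives, the model that predicted 0.9 gains weight, so the ensemble prediction for the same input moves toward it.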

CV · Nov 27, 2025
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

Mengyu Yang, Yanming Yang, Chenyi Xu et al.

Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).
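The two geometry metrics named above can be sketched in pure Python for small point sets. Conventions vary across papers (squared versus unsquared distances, sum versus mean), so this sketch assumes one common form: unsquared Euclidean distance, averaged in each direction and summed. The function names, the threshold `tau`, and the sample points are assumptions, and the abstract does not state which variant Fast3Dcache reports.

```python
import math

def _nearest_dists(src, dst):
    """Distance from each point in src to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance, both ways."""
    d_ab = _nearest_dists(a, b)
    d_ba = _nearest_dists(b, a)
    return sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba)

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d_pred = _nearest_dists(pred, gt)
    d_gt = _nearest_dists(gt, pred)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground-truth and predicted point sets.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
cd = chamfer_distance(pred, gt)
fs = f_score(pred, gt, tau=0.05)
```

This brute-force version is quadratic in the number of points; real evaluations on dense point clouds would typically use a spatial index for the nearest-neighbor search.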