Xin Chen

h-index1

3papers

Novelty42%

AI Score40

Ranked #74,692 of 194,257 authors (top 38%)#25,362 in CV (top 43%)

3 Papers

3.6CVDec 29, 2025Code

Bridging Your Imagination with Audio-Video Generation via a Unified Director

Jiaxu Zhang, Tianshu Hu, Yuan Zhang et al.

Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a ``first interleaving, then disentangling'' training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.

2.7CLDec 27, 2025

Accounting Reasoning in Large Language Models: Concepts, Evaluation, and Empirical Analysis

Jie Zhou, Xin Chen, Jie Zhang et al.

Large language models (LLMs) are increasingly reshaping learning paradigms, cognitive processes, and research methodologies across diverse domains. As their adoption expands, effectively integrating LLMs into professional fields and clarifying their role in domain-specific applications has become a key challenge for enterprise digital transformation and broader societal development. In the accounting domain, successful integration requires a systematic understanding of LLMs' domain-specific reasoning capabilities. In this study, we introduce the concept of accounting reasoning and propose a set of evaluation criteria grounded in an analysis of the training data characteristics of representative GLM-series models. These criteria establish a foundation for studying accounting-oriented reasoning paradigms and provide benchmarks for assessing and improving model performance. Building on this framework, we evaluate several representative LLMs, including GLM-6B, GLM-130B, GLM-4, and GPT-4, across a range of accounting reasoning tasks. Our experimental results show that prompt engineering strategies can yield varying degrees of performance improvement across models, with GPT-4 demonstrating the strongest overall accounting reasoning capability. Nevertheless, the results indicate that current LLMs remain insufficient for real-world accounting applications. In particular, further optimization is required for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.

3.6CVDec 27, 2025

SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration

Xin Chen, Kang Luo, Yangyi Xiao et al.

Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off world environments due to poor feature alignment, limited multimodal synergy, and weak small object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor like obstacles.