Wei He

CL
h-index34
4papers
89citations
Novelty60%
AI Score49

4 Papers

16.4CVDec 2, 2025Code
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Wei Chen, Chaoqun Du, Feng Gu et al.

We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be recently open-sourced to support the community's development of MLLMs.

16.4CLFeb 16, 2024Code
LongHeads: Multi-Head Attention is Secretly a Long Context Processor

Yi Lu, Xin Zhou, Wei He et al.

Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention's quadratic computational demands. Many sought to mitigate this by restricting the attention window within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLM's long context ability by unlocking multi-head attention's untapped potential. Instead of allowing each head to attend to the full sentence, which struggles with generalizing to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head ensures it can effectively process attended tokens within the trained length, while different heads in different layers can collectively process longer contexts. LongHeads works efficiently in linear time, fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on passkey retrieval task, verifying LongHeads's efficacy in extending the usable context window for existing models. We release our code at https://github.com/LuLuLuyi/LongHeads .

16.2CLFeb 18, 2024Code
LongAgent: Scaling Language Models to 128k Context through Multi-Agent Collaboration

Jun Zhao, Can Zu, Hao Xu et al.

Large language models (LLMs) have demonstrated impressive performance in understanding language and executing complex reasoning tasks. However, LLMs with long context windows have been notorious for their expensive training costs and high inference latency. Even the most advanced models such as GPT-4 and Claude2 often make mistakes when processing inputs of over $100k$ tokens, a phenomenon also known as \textit{lost in the middle}. In this paper, we propose \textsc{LongAgent}, a method based on multi-agent collaboration, which scales LLMs (e.g., LLaMA) to a context of 128K and demonstrates potential superiority in long-text processing compared to GPT-4. In \textsc{LongAgent}, a leader is responsible for understanding user intent and directing team members to acquire information from documents. Due to members' hallucinations, it is non-trivial for a leader to obtain accurate information from the responses of dozens to hundreds of members. To address this, we develop an \textit{inter-member communication} mechanism to resolve response conflicts caused by hallucinations through information sharing. Our experimental results indicate that \textsc{LongAgent} offers a promising alternative for long-text processing. The agent team instantiated with LLaMA-7B achieves significant improvements in tasks such as 128k-long text retrieval, multi-hop question answering, compared to GPT-4.

4.1ROFeb 26, 2024
Online Efficient Safety-Critical Control for Mobile Robots in Unknown Dynamic Multi-Obstacle Environments

Yu Zhang, Guangyao Tian, Long Wen et al.

This paper proposes a LiDAR-based goal-seeking and exploration framework, addressing the efficiency of online obstacle avoidance in unstructured environments populated with static and moving obstacles. This framework addresses two significant challenges associated with traditional dynamic control barrier functions (D-CBFs): their online construction and the diminished real-time performance caused by utilizing multiple D-CBFs. To tackle the first challenge, the framework's perception component begins with clustering point clouds via the DBSCAN algorithm, followed by encapsulating these clusters with the minimum bounding ellipses (MBEs) algorithm to create elliptical representations. By comparing the current state of MBEs with those stored from previous moments, the differentiation between static and dynamic obstacles is realized, and the Kalman filter is utilized to predict the movements of the latter. Such analysis facilitates the D-CBF's online construction for each MBE. To tackle the second challenge, we introduce buffer zones, generating Type-II D-CBFs online for each identified obstacle. Utilizing these buffer zones as activation areas substantially reduces the number of D-CBFs that need to be activated. Upon entering these buffer zones, the system prioritizes safety, autonomously navigating safe paths, and hence referred to as the exploration mode. Exiting these buffer zones triggers the system's transition to goal-seeking mode. We demonstrate that the system's states under this framework achieve safety and asymptotic stabilization. Experimental results in simulated and real-world environments have validated our framework's capability, allowing a LiDAR-equipped mobile robot to efficiently and safely reach the desired location within dynamic environments containing multiple obstacles.