Mingxu Tao

CL
h-index14
12papers
400citations
Novelty42%
AI Score41

12 Papers

CLNov 14, 2023Code
MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

Chen Zhang, Mingxu Tao, Quzhe Huang et al. · pku

Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., Kazakh Arabic script and traditional Mongolian script, respectively, which have been long neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity. Furthermore, we underscore the importance of attending to the multiplicity of writing systems, which is closely related to the cultural awareness of the resulting models. The MC$^2$ corpus and related models are made public to the community.

CLMar 2, 2023
Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study

Mingxu Tao, Yansong Feng, Dongyan Zhao

Large pre-trained language models help to achieve state of the art on a variety of natural language processing (NLP) tasks, nevertheless, they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this problem, recent works enhance existing models by sparse experience replay and local adaption, which yield satisfactory performance. However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify the ability of BERT to maintain old knowledge, we adopt and re-finetune single-layer probe networks with the parameters of BERT fixed. We investigate the models on two types of NLP tasks, text classification and extractive question answering. Our experiments reveal that BERT can actually generate high quality representations for previously learned tasks in a long term, under extremely sparse replay or even no replay. We further introduce a series of novel methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task incremental learning, which bridges the gap between our new discovery and previous studies about catastrophic forgetting.

CLJul 4, 2024
Unlocking the Potential of Model Merging for Low-Resource Languages

Mingxu Tao, Chen Zhang, Quzhe Huang et al. · pku

Adapting large language models (LLMs) to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT). However, this CT-then-SFT approach struggles with limited data in the context of low-resource languages, failing to balance language modeling and task-solving capabilities. We thus propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. We use model merging to develop task-solving LLMs for low-resource languages without SFT data in the target languages. Our experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data. Observing performance saturation in model merging with more training tokens, we further analyze the merging process and introduce a slack variable to the model merging algorithm to mitigate the loss of important parameters, thereby enhancing performance. We hope that model merging can benefit more human languages suffering from data scarcity with its higher data efficiency.

LGMar 12, 2024Code
Harder Tasks Need More Experts: Dynamic Routing in MoE Models

Quzhe Huang, Zhenwei An, Nan Zhuang et al. · pku

In this paper, we introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models, aiming to enhance computational efficiency and model performance by adjusting the number of activated experts based on input difficulty. Unlike traditional MoE approaches that rely on fixed Top-K routing, which activates a predetermined number of experts regardless of the input's complexity, our method dynamically selects experts based on the confidence level in expert selection for each input. This allows for a more efficient utilization of computational resources, activating more experts for complex tasks requiring advanced reasoning and fewer for simpler tasks. Through extensive evaluations, our dynamic routing method demonstrates substantial improvements over conventional Top-2 routing across various benchmarks, achieving an average improvement of 0.7% with less than 90% activated parameters. Further analysis shows our model dispatches more experts to tasks requiring complex reasoning skills, like BBH, confirming its ability to dynamically allocate computational resources in alignment with the input's complexity. Our findings also highlight a variation in the number of experts needed across different layers of the transformer model, offering insights into the potential for designing heterogeneous MoE frameworks. The code and models are available at https://github.com/ZhenweiAn/Dynamic_MoE.

CLFeb 27, 2024Code
Probing Multimodal Large Language Models for Global and Local Semantic Representations

Mingxu Tao, Quzhe Huang, Kun Xu et al. · pku

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, there are few studies exploring which layers of MLLMs make the most effort to the global image information, which plays vital roles in multimodal comprehension and generation. In this study, we find that the intermediate layers of models can encode more global semantic information, whose representation vectors perform better on visual-language entailment tasks, rather than the topmost layers. We further probe models regarding local semantic representations through object recognition tasks. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information. Our code and data are released via https://github.com/kobayashikanna01/probing_MLLM_rep.

CLFeb 26, 2024Code
Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering

Mingxu Tao, Dongyan Zhao, Yansong Feng

Open-ended question answering requires models to find appropriate evidence to form wellreasoned, comprehensive and helpful answers. In practical applications, models also need to engage in extended discussions on potential scenarios closely relevant to the question. With augmentation of retrieval module, open-source Large Language Models (LLMs) can produce coherent answers often with different focuses, but are still sub-optimal in terms of reliable evidence selection and in-depth question analysis. In this paper, we propose a novel Chain-ofDiscussion framework to leverage the synergy among multiple open-source LLMs aiming to provide more correct and more comprehensive answers for open-ended QA, although they are not strong enough individually. Our experiments show that discussions among multiple LLMs play a vital role in enhancing the quality of answers.

CLMar 3, 2025Code
MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages

Chen Zhang, Mingxu Tao, Zhiyuan Liao et al. · pku

Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems. Its parallelism between tasks and languages can provide a faithful and fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that open-source LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.

CLJun 4, 2025
EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding

Mingxu Tao, Jie Hu, Mingchuan Yang et al.

The remarkable performance of Large language models (LLMs) relies heavily on the availability of abundant high-quality training data. However, the high cost of acquiring annotated data often prevents models from obtaining capabilities to tackle downstream tasks. In this paper, we introduce a novel method, EpiCoDe that boosts model performance in data-scarcity scenarios without extra training. We first employ model extrapolation to enhance a finetuned model with its inferior version, and then adopt contrastive decoding to further reduce predicted errors, by comparing the logit scores given by the extrapolated and the vanilla finetuned model. Experiments across three tasks over four different LLMs show that EpiCoDe consistently outperforms existing methods with significant and robust improvement. We also propose a new theoretical framework to reveal the mechanism behind contrastive decoding in data-scarcity scenarios, which further helps us better understand the effectiveness of EpiCoDe.

CLMay 13, 2025
ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval

Mingxu Tao, Bowen Tang, Mingxuan Ma et al.

The rise of Large Language Models~(LLMs) revolutionizes information retrieval, allowing users to obtain required answers through complex instructions within conversations. However, publicly available services remain inadequate in addressing the needs of faculty and students to search campus-specific information. It is primarily due to the LLM's lack of domain-specific knowledge and the limitation of search engines in supporting multilingual and timely scenarios. To tackle these challenges, we introduce ALOHA, a multilingual agent enhanced by hierarchical retrieval for university orientation. We also integrate external APIs into the front-end interface to provide interactive service. The human evaluation and case study show our proposed system has strong capabilities to yield correct, timely, and user-friendly responses to the queries in multiple languages, surpassing commercial chatbots and search engines. The system has been deployed and has provided service for more than 12,000 people.

CLMay 9, 2024
Can Perplexity Reflect Large Language Model's Ability in Long Text Understanding?

Yutong Hu, Quzhe Huang, Mingxu Tao et al.

Recent studies have shown that Large Language Models (LLMs) have the potential to process extremely long text. Many works only evaluate LLMs' long-text processing ability on the language modeling task, with perplexity (PPL) as the evaluation metric. However, in our study, we find that there is no correlation between PPL and LLMs' long-text understanding ability. Besides, PPL may only reflect the model's ability to model local information instead of catching long-range dependency. Therefore, only using PPL to prove the model could process long text is inappropriate. The local focus feature of PPL could also explain some existing phenomena, such as the great extrapolation ability of the position method ALiBi. When evaluating a model's ability in long text, we might pay more attention to PPL's limitation and avoid overly relying on it.

CLMay 24, 2023
Lawyer LLaMA Technical Report

Quzhe Huang, Mingxu Tao, Chen Zhang et al.

Large Language Models (LLMs), like LLaMA, have exhibited remarkable performance across various tasks. Nevertheless, when deployed to specific domains such as law or medicine, the models still confront the challenge of a deficiency in domain-specific knowledge and an inadequate capability to leverage that knowledge to resolve domain-related problems. In this paper, we propose a new framework to adapt LLMs to specific domains and build Lawyer LLaMA, a legal domain LLM, based on this framework. Specifically, we inject domain knowledge during the continual training stage and teach the model to learn professional skills using properly designed supervised fine-tuning tasks. Moreover, to alleviate the hallucination problem during the model's generation, we add a retrieval module and extract relevant legal articles before the model answers any queries. When learning domain-specific skills, we find that experts' experience is much more useful than experiences distilled from ChatGPT, where hundreds of expert-written data outperform tens of thousands of ChatGPT-generated ones. We will release our model and data.

CLMay 8, 2023
A Frustratingly Easy Improvement for Position Embeddings via Random Padding

Mingxu Tao, Yansong Feng, Dongyan Zhao

Position embeddings, encoding the positional relationships among tokens in text sequences, make great contributions to modeling local context features in Transformer-based pre-trained language models. However, in Extractive Question Answering, position embeddings trained with instances of varied context lengths may not perform well as we expect. Since the embeddings of rear positions are updated fewer times than the front position embeddings, the rear ones may not be properly trained. In this paper, we propose a simple but effective strategy, Random Padding, without any modifications to architectures of existing pre-trained language models. We adjust the token order of input sequences when fine-tuning, to balance the number of updating times of every position embedding. Experiments show that Random Padding can significantly improve model performance on the instances whose answers are located at rear positions, especially when models are trained on short contexts but evaluated on long contexts. Our code and data will be released for future research.