CL | Sep 19, 2023

Baichuan 2: Open Large-scale Language Models

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng

Peking U

arXiv:2309.10305v4 | 37.6 | 996 citations | h-index: 147 | Has Code

Originality: Incremental advance

AI Analysis

Baichuan 2 provides open-source, multilingual LLMs to benefit the research community, though the advance is incremental because it builds on existing model architectures.

The paper addresses the scarcity of open-source LLMs with strong multilingual capability by introducing Baichuan 2, a series of models with 7B and 13B parameters trained on 2.6 trillion tokens. The models match or outperform similar-sized open-source models on benchmarks like MMLU and excel in domains like medicine and law.

Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.

