CL, Nov 27, 2022

Understanding BLOOM: An empirical study on diverse NLP tasks

arXiv:2211.14865v2
AI Analysis

This study provides empirical insights into BLOOM's capabilities and limitations for NLP researchers and practitioners, though it is incremental as it evaluates an existing model.

This paper empirically evaluates the BLOOM large language model on diverse NLP tasks, finding that its performance doesn't scale with parameter size, it underperforms monolingual GPT-2 in cross-lingual settings, and it generates text that is at least 17% less toxic than GPT models.

We view the landscape of large language models (LLMs) through the lens of the recently released BLOOM model to understand the performance of BLOOM and other decoder-only LLMs compared to BERT-style encoder-only models. We achieve this by evaluating the smaller BLOOM model variants (350m/560m and 1b3/1b7) on several NLP benchmark datasets and popular leaderboards. We make the following observations: (1) BLOOM performance does not scale with parameter size, unlike other LLMs such as GPT and BERT; experiments fine-tuning BLOOM models show that the 560m variant performs similarly to or better than the 1b7 variant. (2) Zero-shot cross-lingual and multilingual fine-tuning experiments show that BLOOM is on par with or worse than monolingual GPT-2 models. (3) Toxicity analysis of prompt-based text generation using the RealToxicityPrompts dataset shows that the text generated by BLOOM is at least 17% less toxic than that of GPT-2 and GPT-3 models.
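To make the "at least 17% less toxic" claim concrete, here is a minimal sketch of how a relative reduction in toxicity could be computed. It assumes that "less toxic" means a relative drop in a mean per-generation toxicity score; the abstract does not state the exact metric. The function names and all scores below are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def relative_reduction(baseline: float, candidate: float) -> float:
    """Return the fractional drop of `candidate` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


# Hypothetical per-generation toxicity scores in [0, 1]; NOT the paper's data.
baseline_scores = [0.50, 0.40, 0.60]   # e.g. generations from a baseline model
candidate_scores = [0.40, 0.35, 0.45]  # e.g. generations from the model under test

reduction = relative_reduction(fmean(baseline_scores), fmean(candidate_scores))
print(f"Relative toxicity reduction: {reduction:.1%}")
```

Under this reading, a claim of "at least 17% less toxic" would correspond to `reduction >= 0.17` against every baseline model compared.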
