Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
This work addresses the cost and accessibility problem facing developers and organizations that need efficient language models. Its contribution is incremental, since it focuses on evaluation rather than new model development.
This work tackles the challenge of selecting small, open language models for practical applications. It proposes an evaluation framework that measures semantic correctness across task types, application domains, and reasoning types, and shows that appropriately chosen small models can outperform or compete with state-of-the-art large models like GPT-4o in specific scenarios.
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) large LMs (LLMs) is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging as smaller LMs do not perform well universally. This work tries to bridge this gap by proposing a framework to experimentally evaluate small, open LMs in practical settings by measuring the semantic correctness of outputs across three practical aspects: task types, application domains, and reasoning types, using diverse prompt styles. Using the proposed framework, it also conducts an in-depth comparison of 10 small, open LMs to identify the best LM and prompt style for specific application requirements. It further shows that, if selected appropriately, these small LMs can outperform SOTA LLMs like DeepSeek-v2, GPT-4o-mini, and Gemini-1.5-Pro, and even compete with GPT-4o.