CL | AI | Aug 30, 2024

OnlySportsLM: Optimizing Sports-Domain Language Models with SOTA Performance under Billion Parameters

arXiv:2409.00286v1 | 4 citations | h-index: 4
Originality: Incremental advance
AI Analysis

The work provides a replicable blueprint for efficient AI development in specialized fields, addressing the need for high-quality, domain-specific models that do not require massive parameter counts.

This paper tackles the problem of building efficient domain-specific language models by introducing OnlySportsLM, a 196M-parameter model trained on part of a 600-billion-token sports dataset. It achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models in the sports domain.

This paper explores the potential of a small, domain-specific language model trained exclusively on sports-related data. We investigate whether extensive training data combined with specially designed small model structures can overcome model size constraints. The study introduces the OnlySports collection, comprising OnlySportsLM, the OnlySports Dataset, and the OnlySports Benchmark. Our approach involves: 1) creating a massive 600-billion-token OnlySports Dataset from FineWeb, 2) optimizing the RWKV architecture for sports-related tasks, resulting in a 196M-parameter model with a 20-layer, 640-dimension structure, 3) training OnlySportsLM on part of the OnlySports Dataset, and 4) testing the resulting model on the OnlySports Benchmark. OnlySportsLM achieves a 37.62%/34.08% accuracy improvement over previous 135M/360M state-of-the-art models and matches the performance of larger models such as SmolLM 1.7B and Qwen 1.5B in the sports domain. Additionally, the OnlySports collection presents a comprehensive workflow for building high-quality, domain-specific language models, providing a replicable blueprint for efficient AI development across various specialized fields.
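Step 1 of the approach, extracting a sports subset from a general web corpus, can be illustrated with a minimal sketch. The abstract does not say how sports documents were selected, so the keyword heuristic below is purely an assumption for illustration. The keyword list, the `min_text_hits` threshold, and the function names are hypothetical, and records are assumed to be dicts with `url` and `text` fields.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; the source does not specify the selection criteria.
SPORTS_KEYWORDS = {"football", "soccer", "basketball", "tennis", "nba", "nfl", "sports"}


def looks_like_sports(record, keywords=SPORTS_KEYWORDS, min_text_hits=2):
    """Return True if the URL or the body text suggests sports content."""
    parsed = urlparse(record.get("url", "").lower())
    # A keyword anywhere in the host or path is treated as a strong signal.
    if any(k in parsed.netloc or k in parsed.path for k in keywords):
        return True
    # Otherwise require several distinct keywords among the body's words.
    words = set(record.get("text", "").lower().split())
    return len(words & keywords) >= min_text_hits


def filter_sports(records):
    """Lazily yield only the records that pass the heuristic."""
    return (r for r in records if looks_like_sports(r))
```

A substring heuristic like this is cheap enough to stream over a very large corpus, but it will produce false positives (for example, a host containing "transports"), so a real pipeline would likely pair it with a stricter second-stage filter.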

Code Implementations: 1 repo
