CLMar 16, 2025

Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?

arXiv:2503.22698v1h-index: 1
Originality Incremental advance
AI Analysis

This addresses the problem of balancing specialization and generalization for on-device language models on edge devices, with incremental architectural improvements.

The study investigated trade-offs between domain-specific optimization and cross-domain robustness in on-device language models, finding that conventional techniques reduce target task perplexity by 18-25% but cause general-task F1 scores to drop by 12-29%. The proposed Generalized Edge Model (GEM) achieved cross-domain F1 accuracy of 0.89 with <100ms latency and improved general-task performance by 7% compared to GPT-4 Lite while maintaining domain-specific parity.

The application of on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that strikes a fine balance between computational effectiveness, memory, power usage, and linguistic capacity across heterogeneous tasks. This holistic study conducts a thorough investigation of the trade-offs between domain-specific optimization and cross-domain robustness, culminating in the proposal of the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization in a harmonious manner. With a rigorous experimental approach testing 47 well-chosen benchmarks in eight domains--healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks--we show that conventional optimization techniques decrease target task perplexity by 18-25% but result in a precipitous decline in general-task performance with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation to a variable number of computing resources with a cross-domain F1 accuracy of 0.89 on less than 100ms latency across Raspberry Pi 4, Pixel 6, iPhone 13, and bespoke custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM enhances the general-task level by 7% with respect and parity in domain-specific performance. We propose three new measurement tools--Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)--which show strong correlation between model compression intensity and brittleness.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes