CL · Mar 10

Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models

arXiv:2603.09595v1 · 23.2 · h-index: 1

Predicted impact: top 99% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

This paper gives political scientists practical guidance on model selection based on class prevalence, error tolerance, and available resources, addressing an incremental empirical gap in the discipline.

The paper addresses how to choose NLP models for political science tasks, using conflict event classification as the test case. It compares Confli-mBERT, a general-purpose model (ModernBERT) fine-tuned on the Global Terrorism Database, against ConfliBERT, a domain-specific pretrained model. Confli-mBERT reaches 75.46% accuracy versus ConfliBERT's 79.34%, and the difference is concentrated in rare event categories.

Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
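The abstract's point that an overall accuracy gap can hide behind near-identical scores on frequent classes can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the labels are synthetic (the name `RareType` is hypothetical), the two "models" are hand-written prediction lists, and none of the numbers come from the GTD or from the paper's results. The sketch only shows why per-class F1 is worth inspecting alongside accuracy when one class is rare.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN) per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Synthetic, illustrative labels only (NOT GTD data, NOT the paper's results):
# 98 items from a common class and 2 items from a hypothetical rare class.
y_true = ["Bombing/Explosion"] * 98 + ["RareType"] * 2
model_a = ["Bombing/Explosion"] * 98 + ["RareType"] * 2  # gets the rare class right
model_b = ["Bombing/Explosion"] * 100                    # never predicts the rare class

acc_a, acc_b = accuracy(y_true, model_a), accuracy(y_true, model_b)
f1_a, f1_b = per_class_f1(y_true, model_a), per_class_f1(y_true, model_b)

for name, acc, f1 in [("model_a", acc_a, f1_a), ("model_b", acc_b, f1_b)]:
    print(name, f"accuracy={acc:.2f}", {k: round(v, 3) for k, v in sorted(f1.items())})
```

In this toy setup the two models differ by only two points of accuracy and score almost identically on the common class, while the second model's F1 on the rare class is zero. Whether that matters depends on whether the research question concerns the rare class, which is the trade-off the abstract describes.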
