Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
This work addresses the underexplored performance of LLMs in technical telecom domains, highlighting their limitations and the need for domain-adapted models to support trustworthy AI assistants in engineering. Its contribution is incremental, however, since it focuses on benchmarking existing models.
The paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on a benchmark of 105 technical telecom questions. It finds that Gemma performs better on semantic fidelity and LLM-rated correctness, while DeepSeek shows slightly higher lexical consistency.
Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.
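To make the evaluation setup more concrete, the following is a minimal sketch of two of the ingredients the abstract names: a lexical overlap score and a score-variance summary. The abstract does not state which lexical metric or variance estimator the paper uses, so token-level F1 and population variance are assumptions chosen for illustration, and the function names `token_f1` and `score_summary` are hypothetical. Semantic similarity and LLM-as-a-judge scoring require an embedding model and a judge model respectively, and are not shown.

```python
from collections import Counter
from statistics import mean, pvariance


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and a reference answer.

    Assumption: this stands in for the paper's unspecified lexical metrics.
    """
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_summary(scores: list[float]) -> tuple[float, float]:
    """Mean and population variance of repeated scores for one question.

    Assumption: low variance across repeated runs is read as higher
    consistency; the paper's exact estimator is not given in the abstract.
    """
    return mean(scores), pvariance(scores)
```

For example, `token_f1` could be applied to each of the 105 question-answer pairs for both models, and `score_summary` to the judge scores collected over repeated runs of the same question, so that the two models can be compared on both average quality and run-to-run stability.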