CL AIJun 2

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Atm Mizanur Rahman, Md Arid Hasan, Syed Ishtiaque Ahmed, Sharifa Sultana

arXiv:2606.033315.8h-index: 10

AI Analysis

This work provides a benchmark and evaluation for LLMs in a safety-critical, real-world domain, highlighting their limitations for practical deployment.

LLMs show potential for consumer device repair assistance but remain unreliable for high-risk tasks, with GPT-5.4 performing best overall; Bangla responses consistently underperform English ones.

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

View on arXiv PDF

Similar