RO · CL · May 4

Benchmarking Local Language Models for Social Robots using Edge Devices

Dorian Lamouille, Matevž B. Zorec, Farnaz Baksh, Karl Kruusamäe

arXiv:2605.03111 · 38.2 · Has Code

Predicted impact: top 57% in RO · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers deploying LLMs on resource-constrained social robots, this paper provides a systematic benchmark and a three-tier architecture to balance responsiveness and accuracy.

The paper benchmarks 25 open-source language models for local deployment on edge devices such as the Raspberry Pi 4, evaluating inference efficiency, general knowledge (MMLU accuracy up to 57.2%), and teaching effectiveness. It finds that Granite4 Tiny Hybrid (7B) achieves a strong overall balance with 2.5 tokens/s, 0.90 tokens/joule, and 54.6% MMLU accuracy, and that high MMLU accuracy does not appear necessary for strong teaching scores.
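The two efficiency figures quoted above are related by simple unit arithmetic: dividing tokens per second by tokens per joule gives joules per second, that is, watts. The sketch below is illustrative only; the function names are hypothetical and not taken from the paper's code, and the summary does not say whether the energy measurement covers the whole board or only the power drawn above idle, so the derived value is just the average power implied by the two reported numbers.

```python
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock generation time."""
    return n_tokens / seconds


def tokens_per_joule(n_tokens: int, energy_joules: float) -> float:
    """Energy efficiency: generated tokens divided by energy consumed."""
    return n_tokens / energy_joules


def implied_average_power_watts(tps: float, tpj: float) -> float:
    """(tokens/s) / (tokens/J) = J/s = W."""
    return tps / tpj


# Figures reported for Granite4 Tiny Hybrid (7B): 2.5 tokens/s, 0.90 tokens/J.
power = implied_average_power_watts(2.5, 0.90)
print(f"{power:.2f} W")  # prints "2.78 W"
```

Under these figures, a 100-token reply would take about 40 seconds to generate, which is the kind of latency the responsiveness side of the trade-off has to account for.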

Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, there is a gap in systematic benchmarking of language models for edge computing in pedagogical applications. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality, validated against five independent human raters). The Raspberry Pi (RPi) 4 serves as the primary platform, with additional comparisons on the RPi 5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
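The human-validation result combines two distinct checks: a Pearson correlation between automated and human scores, and agreement in rank order. A minimal sketch of both follows. The score lists are made-up placeholders, not the paper's data, and the function names are hypothetical. With only four paired values, a correlation estimate is coarse, so the rank-order check is shown alongside it.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (dev_x * dev_y)


def same_rank_order(xs: list[float], ys: list[float]) -> bool:
    """True if sorting by xs and sorting by ys order the items identically."""
    def order(values: list[float]) -> list[int]:
        return sorted(range(len(values)), key=lambda i: values[i])
    return order(xs) == order(ys)


# Hypothetical teaching scores for four models (placeholders, NOT paper data).
llm_scores = [3.1, 3.8, 4.2, 4.6]
human_scores = [2.9, 3.5, 4.3, 4.4]

print(f"r = {pearson_r(llm_scores, human_scores):.3f}")
print("rank order preserved:", same_rank_order(llm_scores, human_scores))
```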

