RO · CL · May 4

Benchmarking Local Language Models for Social Robots using Edge Devices

arXiv:2605.03111 · 38.2 · Has Code
Predicted impact top 57% in RO · last 90 days · Originality: Synthesis-oriented
AI Analysis

For researchers deploying LLMs on resource-constrained social robots, this provides a systematic benchmark and a three-tier architecture to balance responsiveness and accuracy.

The paper benchmarks 25 open-source language models for local deployment on edge devices such as the Raspberry Pi 4, evaluating inference efficiency, general knowledge (MMLU accuracy up to 57.2%), and teaching effectiveness. It finds that Granite4 Tiny Hybrid (7B) offers a strong overall balance at 2.5 tokens/s, 0.90 tokens/joule, and 54.6% MMLU accuracy, and that high MMLU accuracy does not appear necessary for strong teaching scores.

Social-educational robots designed for interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, there is a gap in systematic benchmarking of language models for edge computing in pedagogical applications. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality, validated against five independent human raters). The Raspberry Pi (RPi) 4 serves as the primary platform, with additional comparisons on the RPi 5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
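
The two efficiency figures quoted above are related by simple unit arithmetic: tokens per second divided by tokens per joule gives joules per second, i.e. watts. The following is a minimal sketch of that relationship, not the paper's measurement code. The function names are hypothetical, and it assumes that both figures come from the same run and that energy was integrated over the generation window only. How the authors measured energy is not stated here, so the implied power may not correspond to total board power.

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput: generated tokens divided by wall-clock generation time."""
    return n_tokens / elapsed_s


def tokens_per_joule(n_tokens: int, energy_j: float) -> float:
    """Energy efficiency: generated tokens divided by energy consumed."""
    return n_tokens / energy_j


def implied_power_watts(tok_per_s: float, tok_per_j: float) -> float:
    """(tokens/s) / (tokens/J) = J/s = W, assuming both rates share one run."""
    return tok_per_s / tok_per_j


# Figures reported for Granite4 Tiny Hybrid (7B) in the abstract.
power = implied_power_watts(2.5, 0.90)
print(f"{power:.2f} W")  # prints "2.78 W"
```

Under those assumptions, the reported 2.5 tokens per second and 0.90 tokens per joule imply an average draw of roughly 2.8 W attributable to inference.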

