SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
This work addresses the need for a quantifiable benchmark to assess LLMs as copilots in sensor processing for cyber-physical systems, though its contribution is incremental.
The authors evaluate Large Language Models (LLMs) on sensor data processing by constructing SensorBench, a comprehensive benchmark built on diverse real-world sensor datasets. They find that LLMs perform well on simpler tasks but struggle, relative to engineering experts, on compositional tasks that require parameter selection. Of the four prompting strategies investigated, self-verification outperforms all other baselines in 48% of tasks.
Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot.