SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection
This work addresses wildfire monitoring for safety-critical applications, but its contribution is incremental: it primarily benchmarks existing models and proposes no new methods.
The authors address early-stage wildfire smoke detection by introducing SmokeBench, a benchmark that evaluates multimodal large language models (MLLMs) on smoke recognition and localization tasks. They find that some models can classify smoke when it covers a large area, but all struggle with accurate localization, especially in the early stages, and that smoke volume correlates strongly with performance.
Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.
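To make the grid-based localization task and the notion of smoke volume more concrete, the following is a minimal Python sketch, not the benchmark's actual protocol. It assumes a binary ground-truth smoke mask, a 3x3 grid, cell-level F1 as the score, and "smoke volume" measured as the fraction of smoke pixels in the image. The abstract specifies none of these, and the function names `mask_to_cells`, `cell_f1`, and `smoke_volume` are hypothetical.

```python
import numpy as np


def mask_to_cells(mask, rows=3, cols=3):
    """Return the set of (row, col) grid cells that contain any smoke pixel.

    The 3x3 default is an illustrative assumption, not SmokeBench's grid.
    """
    h, w = mask.shape
    cells = set()
    for r in range(rows):
        for c in range(cols):
            # Integer cell boundaries that cover the whole image.
            r0, r1 = r * h // rows, (r + 1) * h // rows
            c0, c1 = c * w // cols, (c + 1) * w // cols
            if mask[r0:r1, c0:c1].any():
                cells.add((r, c))
    return cells


def cell_f1(predicted, truth):
    """F1 between predicted and ground-truth cell sets (assumed metric)."""
    if not predicted and not truth:
        return 1.0  # nothing to find and nothing predicted
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def smoke_volume(mask):
    """Fraction of image pixels labelled as smoke (assumed definition)."""
    return float(mask.mean())


# Toy example: a small plume in the top-left corner of a 6x6 image.
mask = np.zeros((6, 6), dtype=bool)
mask[0:2, 0:2] = True

truth_cells = mask_to_cells(mask)
model_cells = {(0, 0), (0, 1)}  # hypothetical model answer
score = cell_f1(model_cells, truth_cells)
volume = smoke_volume(mask)
```

Under these assumptions, early-stage smoke corresponds to a small `smoke_volume`, so a single misplaced cell costs a large share of the score. This is one plausible reading of why localization degrades as smoke volume shrinks.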