Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection
This work addresses accuracy and efficiency issues in object detection for edge-cloud applications. It is an incremental improvement that combines edge detectors with cloud-based semantic guidance.
The paper tackles performance degradation in object detection under complex scenarios such as low-light conditions and occlusion. It proposes an adaptive guidance method that uses Multimodal LLMs for semantic enhancement within an edge-cloud framework, and reports over 79% latency reduction and 70% computational cost savings while maintaining accuracy.
Traditional object detection methods suffer performance degradation in complex scenarios such as low-light conditions and heavy occlusion because they lack high-level semantic understanding. To address this, this paper proposes an edge-cloud collaborative object detection method with adaptive, guidance-based semantic enhancement that leverages Multimodal Large Language Models (MLLMs) to balance accuracy and efficiency. The method first employs instruction fine-tuning to enable the MLLM to generate structured scene descriptions. It then designs an adaptive mapping mechanism that dynamically converts this semantic information into parameter adjustment signals for edge detectors, achieving real-time semantic enhancement. Within an edge-cloud collaborative inference framework, the system uses confidence scores to automatically choose between invoking cloud-based semantic guidance and directly outputting the edge detection results. Experiments demonstrate that the proposed method improves detection accuracy and efficiency in complex scenes: in low-light and highly occluded scenes, it reduces latency by over 79% and computational cost by 70% while maintaining accuracy.
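The adaptive mapping and the confidence-based choice between edge and cloud described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Detection`, `THRESHOLD_OFFSETS`, `adjust_threshold`, `route`, and `cloud_guidance` are hypothetical, the scene attributes and offset values are made up for the example, and the gating rule (mean confidence against a fixed `gate`) is only one plausible way to use confidence scores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Detection:
    label: str
    confidence: float


# Hypothetical offsets applied to the edge detector's score threshold.
# Attribute names and values are illustrative assumptions, not from the paper.
THRESHOLD_OFFSETS: Dict[Tuple[str, str], float] = {
    ("lighting", "low"): -0.10,
    ("occlusion", "heavy"): -0.05,
}


def adjust_threshold(base: float, scene: Dict[str, str]) -> float:
    """Map a structured scene description to an adjusted score threshold."""
    offset = sum(THRESHOLD_OFFSETS.get(item, 0.0) for item in scene.items())
    return min(1.0, max(0.0, base + offset))  # clamp to [0, 1]


def route(
    edge_detections: List[Detection],
    cloud_guidance: Callable[[List[Detection]], List[Detection]],
    gate: float = 0.6,
) -> Tuple[List[Detection], bool]:
    """Return (detections, used_cloud).

    Assumption: the edge result is trusted when its mean confidence
    reaches `gate`; the paper's actual criterion may differ.
    """
    if edge_detections:
        mean_conf = sum(d.confidence for d in edge_detections) / len(edge_detections)
        if mean_conf >= gate:
            return edge_detections, False  # edge result is good enough
    # Empty or low-confidence result: fall back to cloud semantic guidance.
    return cloud_guidance(edge_detections), True
```

In this sketch, `cloud_guidance` stands in for the whole cloud path (the MLLM scene description followed by re-detection with adjusted parameters). Because it is only called when the edge result is empty or below the gate, confident frames never incur the cloud round trip, which is where latency and compute savings of the kind reported would come from.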