CV ROJun 24, 2025

Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue

arXiv:2506.19288v210.25 citationsh-index: 11IEEE transactions on circuits and systems for video technology (Print)

Originality Incremental advance

AI Analysis

This addresses the problem of automated scene understanding for unmanned surface vessels, enabling better monitoring and decision-making, though it is incremental as it builds on existing vision-language models.

The authors tackled the lack of global semantic understanding in waterway environments by introducing WaterCaption, the first captioning dataset for waterways with 20.2k image-text pairs, and Da Yu, an edge-deployable model that surpasses state-of-the-art on captioning benchmarks.

Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Exactly, it includes 20.2k image-text pair data with 1.8 million vocabulary size. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, where we propose a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.

View on arXiv PDF

Similar