CV · Mar 11, 2025

Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding

Chengzhi Ma, Kunqian Li, Shuaixin Liu, Han Mei

arXiv:2503.08152v13.62 citations · h-index: 2 · Has Code · IEEE Transactions on Circuits and Systems for Video Technology (Print)

Originality: Incremental advance

AI Analysis

This work addresses the challenge of counting objects in underwater environments where visibility is limited and the foreground closely resembles the background. The advance is incremental, as it builds on existing video-based counting methods.

The paper tackles the problem of counting indiscernible marine objects in underwater videos by proposing a depth-assisted network with adaptive motion-differentiated feature encoding, achieving state-of-the-art performance on a new dataset of 50 videos with around 40,800 labels and competitive results on three crowd counting datasets.

Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at https://github.com/OUCVisionGroup/VIMOC-Net.
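The abstract mentions point-wise object labels and a density estimation branch. As background, here is a minimal sketch of the common crowd-counting practice of turning point annotations into a density map whose sum equals the object count. This is not taken from the authors' code: the function names, the fixed `sigma`, and the border renormalization are all assumptions made for illustration.

```python
import math


def gaussian_kernel(radius, sigma):
    """Build a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [
        [
            math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, sigma=2.0):
    """Place one unit of mass per (x, y) point label, spread by a Gaussian.

    Hypothetical helper: sigma is fixed here, whereas real pipelines may
    adapt it per object. Points outside the frame are skipped.
    """
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(radius, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        cx, cy = int(round(px)), int(round(py))
        if not (0 <= cx < width and 0 <= cy < height):
            continue
        # Keep only the kernel cells that fall inside the frame.
        cells = []
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                x, y = cx + dx, cy + dy
                if 0 <= x < width and 0 <= y < height:
                    cells.append((y, x, kernel[dy + radius][dx + radius]))
        # Renormalize so a point near the border still contributes exactly 1.
        mass = sum(w for _, _, w in cells)
        for y, x, w in cells:
            dmap[y][x] += w / mass
    return dmap


def estimated_count(dmap):
    """The object count is the integral (sum) of the density map."""
    return sum(sum(row) for row in dmap)
```

Under these assumptions, a density-estimation network would be trained to regress such a map from each frame, and the predicted count is read off by summing the output.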
