CV · Nov 15, 2024

Y-MAP-Net: Real-time depth, normals, segmentation, multi-label captioning and 2D human pose in RGB images

Ammar Qammaz, Nikolaos Vasilikopoulos, Iason Oikonomidis, Antonis A. Argyros

arXiv:2411.10334v12.0 · h-index: 40 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for lightweight, real-time multi-task models in robotics and practical applications, though it is incremental as it builds on existing multi-teacher distillation methods.

The paper tackles real-time multi-task learning on RGB images by introducing Y-MAP-Net, a Y-shaped neural network that predicts depth, surface normals, human pose, semantic segmentation, and multi-label captions in a single network evaluation, with computational efficiency suited to robotics.

We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions, all from a single network evaluation. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the network's learning, enabling it to distill their capabilities into a lightweight architecture suitable for real-time applications. Y-MAP-Net exhibits strong generalization, simplicity, and computational efficiency, making it ideal for robotics and other practical scenarios. To support future research, we will release our code publicly.
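The multi-teacher, single-student paradigm described in the abstract can be illustrated with a minimal sketch. This is not the authors' training code: the L1 distance, the task weights, the flat-list representation of outputs, and the names `l1_loss` and `multi_teacher_loss` are all assumptions chosen for illustration. The only idea taken from the abstract is that each task head of the student is supervised by the output of a task-specific teacher, and the per-task terms are combined into one objective.

```python
from typing import Dict, List

Vector = List[float]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error between a student output and a teacher output."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def multi_teacher_loss(
    student: Dict[str, Vector],
    teachers: Dict[str, Vector],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task distillation losses (hypothetical formulation).

    Each key names a task; the student's output for that task is compared
    against the pseudo-label produced by that task's teacher. Tasks missing
    from `weights` default to a weight of 1.0.
    """
    total = 0.0
    for task, target in teachers.items():
        total += weights.get(task, 1.0) * l1_loss(student[task], target)
    return total


# Toy values standing in for flattened per-pixel predictions (assumed data).
teacher_outputs = {
    "depth": [0.5, 1.0],
    "normals": [0.0, 1.0],
    "segmentation": [1.0, 0.0],
}
student_outputs = {
    "depth": [0.5, 0.5],
    "normals": [0.0, 1.0],
    "segmentation": [0.5, 0.0],
}
task_weights = {"depth": 1.0, "normals": 1.0, "segmentation": 2.0}

loss = multi_teacher_loss(student_outputs, teacher_outputs, task_weights)
print(loss)  # 0.75
```

In a real setting the teacher outputs would be computed once per image by frozen foundation models, and the combined loss would be backpropagated through the shared student network; the choice of per-task loss and weighting is a design decision the abstract does not specify.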
