CV · Nov 15, 2024

Y-MAP-Net: Real-time depth, normals, segmentation, multi-label captioning and 2D human pose in RGB images

arXiv:2411.10334v1 · h-index: 40 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for lightweight, real-time multi-task models in robotics and practical applications, though it is incremental as it builds on existing multi-teacher distillation methods.

The paper tackles real-time multi-task learning on RGB images by introducing Y-MAP-Net, a Y-shaped neural network that predicts depth, surface normals, human pose, semantic segmentation, and multi-label captions in a single network evaluation, with computational efficiency suited to robotics.

We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions, all from a single network evaluation. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the network's learning, enabling it to distill their capabilities into a lightweight architecture suitable for real-time applications. Y-MAP-Net exhibits strong generalization, simplicity, and computational efficiency, making it well suited to robotics and other practical scenarios. To support future research, we will release our code publicly.
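The multi-teacher, single-student idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the loss functions or task weights, so `l1_loss`, `bce_loss`, `TASK_LOSSES`, and `distillation_loss` below are hypothetical names, and the choice of L1 for dense regression outputs and binary cross-entropy for label-style outputs is an assumption. Outputs are flattened to plain lists of floats to keep the example dependency-free.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error between two equal-length lists of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; pred holds probabilities, target holds 0/1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

# Assumed pairing of tasks (named in the abstract) with loss functions.
TASK_LOSSES = {
    "depth": l1_loss,
    "normals": l1_loss,
    "pose": l1_loss,
    "segmentation": bce_loss,
    "captions": bce_loss,
}

def distillation_loss(student_out, teacher_out, weights):
    """Weighted sum of per-task losses between student and teacher outputs.

    teacher_out maps a task name to the pseudo-label produced by that task's
    teacher model; student_out holds the single student's prediction for the
    same task. Tasks without an entry in weights default to a weight of 1.0.
    """
    total = 0.0
    for task, target in teacher_out.items():
        loss_fn = TASK_LOSSES[task]
        total += weights.get(task, 1.0) * loss_fn(student_out[task], target)
    return total

# Toy usage with two of the five tasks (values are made up for illustration).
teacher = {"depth": [1.5, 2.5], "captions": [1.0, 0.0]}
student = {"depth": [1.0, 2.0], "captions": [0.5, 0.5]}
loss = distillation_loss(student, teacher, weights={"depth": 2.0})
```

The point of the sketch is only that every teacher contributes one term to a single scalar objective, so one student network is trained on all tasks at once and inference needs a single evaluation.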

