CV · Nov 30, 2023

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

arXiv:2311.18835v1 · 18 citations · h-index: 64 · Has Code
Originality: Incremental advance
AI Analysis

InstructSeq offers a more intuitive and versatile interface for computer vision tasks, though the advance is incremental: it builds on existing multimodal and instruction-following approaches.

The paper tackles the problem of unifying diverse vision tasks through natural language instructions, resulting in a model that achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning without task-specific tuning.

Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.
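The data flow described in the abstract (a visual encoder, a text encoder, and an autoregressive decoder that fuses both and emits an output sequence) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`visual_encoder`, `text_encoder`, `decoder_step`, `generate`, `DIM`, `VOCAB`) is hypothetical, the "encoders" are deterministic pseudo-random stand-ins rather than trained networks, mean pooling is swapped in for transformer attention, and greedy decoding is an assumption. It only shows how an image and a free-form instruction could condition a single token-by-token generation loop.

```python
import random
import zlib
from typing import List

DIM = 8  # hypothetical feature width
# Toy output vocabulary (assumption); "<eos>" ends generation.
VOCAB = ["<eos>", "a", "dog", "on", "grass", "<loc_0>", "<loc_1>"]

_rng = random.Random(0)
# Untrained toy weights: one DIM-sized row per vocabulary entry.
OUTPUT_PROJ = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def _embed(key: str) -> List[float]:
    """Deterministic pseudo-embedding for a string (stand-in for a learned encoder)."""
    rng = random.Random(zlib.crc32(key.encode("utf-8")))
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def visual_encoder(image: List[List[float]]) -> List[List[float]]:
    """Stand-in visual encoder: one feature vector per image row."""
    return [_embed("row:" + ",".join(f"{p:.3f}" for p in row)) for row in image]


def text_encoder(instruction: str) -> List[List[float]]:
    """Stand-in text encoder: one feature vector per whitespace-separated token."""
    return [_embed("tok:" + tok.lower()) for tok in instruction.split()]


def _mean(vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length vectors (a crude substitute for attention)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def decoder_step(context: List[List[float]], generated: List[str]) -> List[float]:
    """Fuse image/instruction features with the tokens emitted so far; score the vocabulary."""
    history = [_embed(f"out:{i}:{tok}") for i, tok in enumerate(generated)]
    fused = _mean(context + history)
    return [sum(w * x for w, x in zip(row, fused)) for row in OUTPUT_PROJ]


def generate(image: List[List[float]], instruction: str, max_len: int = 6) -> List[str]:
    """Greedy autoregressive decoding conditioned on an image and an instruction."""
    context = visual_encoder(image) + text_encoder(instruction)
    output: List[str] = []
    for _ in range(max_len):
        logits = decoder_step(context, output)
        token = VOCAB[max(range(len(VOCAB)), key=logits.__getitem__)]
        if token == "<eos>":
            break
        output.append(token)
    return output


if __name__ == "__main__":
    toy_image = [[0.0, 0.5], [1.0, 0.25]]
    print(generate(toy_image, "describe the image"))
    print(generate(toy_image, "segment the dog"))
```

Because the weights are untrained, the emitted tokens are meaningless. The point is the interface: the same `generate` call serves any task, and only the instruction string changes.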

