CV, Oct 26, 2021

ViDA-MAN: Visual Dialog with Digital Humans

arXiv:2110.13384v1, 6 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for more human-like and immersive interactions in digital assistants, though it appears incremental because it integrates existing multi-modal techniques rather than introducing new ones.

The paper tackles the problem of creating a digital-human agent for multi-modal interaction, resulting in ViDA-MAN, which offers real-time audio-visual responses to speech inquiries with sub-second latency and high-quality videos.

We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers real-time audio-visual responses to instant speech inquiries. Compared to traditional text or voice-based systems, ViDA-MAN offers human-like interactions (e.g., vivid voice, natural facial expressions and body gestures). Given a speech request, the demonstration is able to respond with high-quality videos at sub-second latency. To deliver an immersive user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Acoustic Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), and talking-head video generation. Backed by a large knowledge base, ViDA-MAN is able to chat with users on a number of topics including chit-chat, weather, device control, news recommendations, and hotel booking, as well as answering questions via structured knowledge.
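The abstract describes a chain of four stages: ASR, multi-turn dialog, TTS, and talking-head video generation. The sketch below only illustrates how such stages could be composed for one dialog turn. It is not the authors' implementation: every function and class name here is hypothetical, and each stage is a trivial stand-in (the "audio" is plain UTF-8 text, the "frames" are byte slices) so that the example runs without any models.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DialogState:
    """Hypothetical multi-turn state: a list of (user_text, reply) pairs."""
    history: List[Tuple[str, str]] = field(default_factory=list)


def recognize_speech(audio: bytes) -> str:
    # Stand-in for ASR: the "audio" here is assumed to be UTF-8 text.
    return audio.decode("utf-8")


def dialog_reply(state: DialogState, user_text: str) -> str:
    # Stand-in for the dialog module: echo the request and record the turn.
    reply = f"You said: {user_text}"
    state.history.append((user_text, reply))
    return reply


def synthesize_speech(text: str) -> bytes:
    # Stand-in for TTS: encode the reply text as bytes.
    return text.encode("utf-8")


def render_talking_head(speech: bytes, chunk: int = 4) -> List[bytes]:
    # Stand-in for video generation: one placeholder "frame" per speech chunk.
    return [speech[i:i + chunk] for i in range(0, len(speech), chunk)]


def respond(state: DialogState, audio: bytes) -> Tuple[bytes, List[bytes]]:
    """Run one turn: speech in, (speech, video frames) out."""
    user_text = recognize_speech(audio)
    reply = dialog_reply(state, user_text)
    speech = synthesize_speech(reply)
    frames = render_talking_head(speech)
    return speech, frames


if __name__ == "__main__":
    state = DialogState()
    speech, frames = respond(state, b"hello")
    print(speech, len(frames), state.history)
```

Keeping the dialog state separate from the per-turn function is one simple way to support multi-turn conversation; a real system would presumably stream between stages to reach the sub-second latency the abstract reports.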

