Yunbo Xu

CV
h-index10
4papers
11citations
Novelty51%
AI Score36

4 Papers

CVSep 9, 2024Code
Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

Xuesong Zhang, Jia Li, Yunbo Xu et al.

Autonomous navigation guided by natural language instructions in embodied environments remains a challenge for vision-language navigation (VLN) agents. Although recent advancements in learning diverse and fine-grained visual environmental representations have shown promise, the fragile performance improvements may not conclusively attribute to enhanced visual grounding,a limitation also observed in related vision-language tasks. In this work, we preliminarily investigate whether advanced VLN models genuinely comprehend the visual content of their environments by introducing varying levels of visual perturbations. These perturbations include ground-truth depth images, perturbed views and random noise. Surprisingly, we experimentally find that simple branch expansion, even with noisy visual inputs, paradoxically improves the navigational efficacy. Inspired by these insights, we further present a versatile Multi-Branch Architecture (MBA) designed to delve into the impact of both the branch quantity and visual quality. The proposed MBA extends a base agent into a multi-branch variant, where each branch processes a different visual input. This approach is embarrassingly simple yet agnostic to topology-based VLN agents. Extensive experiments on three VLN benchmarks (R2R, REVERIE, SOON) demonstrate that our method with optimal visual permutations matches or even surpasses state-of-the-art results. The source code is available at here.

HCOct 11, 2024
DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

Jia Li, Yangchen Yu, Yin Chen et al.

Engagement estimation plays a crucial role in understanding human social behaviors, attracting increasing research interests in fields such as affective computing and human-computer interaction. In this paper, we propose a Dialogue-Aware Transformer framework (DAT) with Modality-Group Fusion (MGF), which relies solely on audio-visual input and is language-independent, for estimating human engagement in conversations. Specifically, our method employs a modality-group fusion strategy that independently fuses audio and visual features within each modality for each person before inferring the entire audio-visual content. This strategy significantly enhances the model's performance and robustness. Additionally, to better estimate the target participant's engagement levels, the introduced Dialogue-Aware Transformer considers both the participant's behavior and cues from their conversational partners. Our method was rigorously tested in the Multi-Domain Engagement Estimation Challenge held by MultiMediate'24, demonstrating notable improvements in engagement-level regression precision over the baseline model. Notably, our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.

CVOct 1, 2025
Disentangling Foreground and Background for vision-Language Navigation via Online Augmentation

Yunbo Xu, Xuesong Zhang, Jia Li et al.

Following language instructions, vision-language navigation (VLN) agents are tasked with navigating unseen environments. While augmenting multifaceted visual representations has propelled advancements in VLN, the significance of foreground and background in visual observations remains underexplored. Intuitively, foreground regions provide semantic cues, whereas the background encompasses spatial connectivity information. Inspired on this insight, we propose a Consensus-driven Online Feature Augmentation strategy (COFA) with alternative foreground and background features to facilitate the navigable generalization. Specifically, we first leverage semantically-enhanced landmark identification to disentangle foreground and background as candidate augmented features. Subsequently, a consensus-driven online augmentation strategy encourages the agent to consolidate two-stage voting results on feature preferences according to diverse instructions and navigational locations. Experiments on REVERIE and R2R demonstrate that our online foreground-background augmentation boosts the generalization of baseline and attains state-of-the-art performance.

CVDec 9, 2024
Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

Xuesong Zhang, Yunbo Xu, Jia Li et al.

Navigating unseen environments from natural language instructions remains challenging for egocentric agents in Vision-and-Language Navigation (VLN). Humans naturally ground concrete semantic knowledge within spatial layouts during indoor navigation. Although prior work has introduced diverse environment representations to improve reasoning, auxiliary modalities are often naively concatenated with RGB features, which underutilizes each modality's distinct contribution. We propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at multiple scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, capturing fine-grained semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth Enhanced Spatial Perception (DSP) module incrementally builds a trajectory-level depth exploration map, providing a coarse-grained representation of global spatial layout. Extensive experiments show that the hierarchical representation enrichment of SUSA significantly improves navigation performance over the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON) and generalizes better to the continuous R2R-CE benchmark.