Wenfang Xu

NA
h-index: 28
Papers: 4
Citations: 18
Novelty: 41%
AI Score: 39

4 Papers

CL | Dec 25, 2025 | Code
WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables

Zhaojiang Lin, Yong Xu, Kai Sun et al.

Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.

NA | Nov 16, 2017
Adaptive aggregation on graphs

Wenfang Xu, Ludmil T. Zikatanov

We generalize some of the functional (hyper-circle) a posteriori estimates from finite element settings to general graphs or Hilbert space settings. We provide several theoretical results in regard to the generalized a posteriori error estimators. We use these estimates to construct aggregation based coarse spaces for graph Laplacians. The estimator is used to assess the quality of an aggregation adaptively. Furthermore, a reshaping algorithm is tested on several numerical examples.

NA | Apr 5, 2018
Constructing Frequency Domains on Graphs in Near-Linear Time

John C. Urschel, Wenfang Xu, Ludmil T. Zikatanov

Analysis of big data has become an increasingly relevant area of research, with data often represented on discrete networks both constructed and organic. While for structured domains, there exist intuitive definitions of signals and frequencies, the definitions are much less obvious for data sets associated with a given network. Often, the eigenvectors of an induced graph Laplacian are used to construct an orthogonal set of low-frequency vectors. For larger graphs, however, the computational cost of creating such structures becomes untenable, and the quality of the approximation is adequate only for signals near the span of the set. We propose a construction of a full basis of frequencies with computational complexity that is near-linear in time and linear in storage. Using this frequency domain, we can compress data sets on unstructured graphs more robustly and accurately than spectral-based constructions.

CV | Oct 30, 2025
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun et al.

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.